Subadditivity property of relative entropy


Let $H(\mu|\nu)$ be the relative entropy (or Kullback-Leibler divergence) defined in the usual way. I am looking for a proof of, or reference for, the following fact: if $\mu,\nu$ are two-dimensional probability measures with marginals $\mu_1,\mu_2$ and $\nu_1,\nu_2$, respectively, then $$ H(\mu|\nu)\geq H(\mu_1|\nu_1) + H(\mu_2|\nu_2). $$

Any help appreciated. Thanks.


The answer below is false and needs more thought; most likely the claim is false for arbitrary reference measures. I know it to be true for suitably chosen Gaussians and uniforms.

This is an immediate consequence of the chain rule and the joint convexity of KL divergence. You can look at the proofs of the individual claims here: https://homes.cs.washington.edu/~anuprao/pubs/CSE533Autumn2010/lecture3.pdf; the chain rule follows from writing out the joint entropy, and the convexity of KL divergence follows from the log-sum inequality.

The chain rule says: $$D(\mu(x,y) \,\|\, \nu(x,y)) = D(\mu_1(x) \,\|\, \nu_1(x)) + E_{\mu_1(x)} \left[ D(\mu(y|x) \,\|\, \nu(y|x)) \right],$$ and convexity says that $$ E_{\mu_1(x)} \left[ D(\mu(y|x) \,\|\, \nu(y|x)) \right] \geq D\left( E_{\mu_1(x)} [\mu(y|x)] \,\big\|\, E_{\mu_1(x)} [\nu(y|x)] \right) = D(\mu_2(y) \,\|\, \nu_2(y)). $$ (The last equality is where the argument breaks, as flagged above: $E_{\mu_1(x)}[\mu(y|x)] = \mu_2(y)$, but $E_{\mu_1(x)}[\nu(y|x)] \neq \nu_2(y)$ in general, since $\nu_2$ is the average of $\nu(y|x)$ under $\nu_1$, not under $\mu_1$.)

P.S. This should be considered super-additivity? From this, you can derive the subadditivity of (differential) entropy by taking $\nu$ to be uniform on a set that contains the support of your $\mu$'s (Gaussian for differential entropy).
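To see concretely where the claimed equality fails, here is a small numerical sketch (numpy, with hypothetical $2\times 2$ distributions of my choosing): averaging $\mu(y|x)$ under $\mu_1$ recovers $\mu_2$, but averaging $\nu(y|x)$ under $\mu_1$ does not recover $\nu_2$.

```python
import numpy as np

# Hypothetical 2x2 joint laws mu, nu on (x, y); rows index x, columns index y.
mu = np.array([[0.4, 0.1],
               [0.1, 0.4]])
nu = np.array([[0.1, 0.2],
               [0.3, 0.4]])

mu1 = mu.sum(axis=1)                            # mu's x-marginal
nu2 = nu.sum(axis=0)                            # nu's y-marginal
mu_cond = mu / mu.sum(axis=1, keepdims=True)    # mu(y|x), one row per x
nu_cond = nu / nu.sum(axis=1, keepdims=True)    # nu(y|x), one row per x

# Averaging mu(y|x) under mu1 recovers mu2 ...
print(mu1 @ mu_cond, mu.sum(axis=0))            # equal
# ... but averaging nu(y|x) under mu1 does NOT recover nu2 in general:
print(mu1 @ nu_cond, nu2)                       # differ
```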


Let the laws be over random variables $X,Y$. By the chain rule of KL divergences, $$ D(P_{XY}\|Q_{XY}) = D(P_X\|Q_X) + D(P_{Y|X} \|Q_{Y|X}|P_X).$$
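The chain rule is easy to check numerically; a sketch in numpy with an arbitrary $2\times 2$ example of my choosing:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (arrays of the same shape)."""
    p, q = np.ravel(p), np.ravel(q)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical joint laws P_{XY}, Q_{XY} on a 2x2 alphabet; rows index x.
P = np.array([[0.3, 0.2],
              [0.1, 0.4]])
Q = np.array([[0.25, 0.25],
              [0.25, 0.25]])

Px, Qx = P.sum(axis=1), Q.sum(axis=1)
P_cond, Q_cond = P / Px[:, None], Q / Qx[:, None]

# Conditional divergence D(P_{Y|X} || Q_{Y|X} | P_X): average over x under P_X.
cond_term = sum(Px[i] * kl(P_cond[i], Q_cond[i]) for i in range(len(Px)))

print(kl(P, Q), kl(Px, Qx) + cond_term)  # the two sides of the chain rule agree
```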

Thus the question boils down to asking if $$ D(P_{Y|X} \|Q_{Y|X} |P_X) \ge D(P_Y\|Q_Y)$$ always holds.

This is not true: consider joint laws such that $P_{Y|X} = Q_{Y|X}$ but $P_X \neq Q_X$, in such a way that $P_Y \neq Q_Y$. Then the left-hand side is simply $0$ (since $p_{y|x}/q_{y|x} = 1$ for every $x,y$), but the right-hand side is positive because $P_Y \neq Q_Y$.
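This counterexample is easy to verify numerically; a sketch with a hypothetical binary channel $W$ shared by both laws:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions."""
    return float(np.sum(np.ravel(p) * np.log(np.ravel(p) / np.ravel(q))))

# Shared channel W(y|x) with different input laws: P_{Y|X} = Q_{Y|X} = W,
# P_X != Q_X, and hence P_Y != Q_Y.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # row x gives W(.|x)
Px = np.array([0.7, 0.3])
Qx = np.array([0.4, 0.6])

P = Px[:, None] * W               # joint P_{XY}
Q = Qx[:, None] * W               # joint Q_{XY}
Py, Qy = P.sum(axis=0), Q.sum(axis=0)

# The conditional term vanishes, so the chain rule gives D(P_XY||Q_XY) = D(P_X||Q_X),
# while D(P_Y||Q_Y) > 0 -- so the claimed subadditivity fails.
print(kl(P, Q), kl(Px, Qx), kl(Py, Qy))
```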


The simplest example is when $P_{Y|X} = \delta_{yx}$ (in a discrete setting, say, where $\delta$ is the Kronecker delta, i.e. $Y = X$). Then $D(P_{XY}\|Q_{XY}) = D(P_X\|Q_X) = D(P_Y\|Q_Y)$, so the claimed inequality would force $D(P_X\|Q_X) \geq 2\,D(P_X\|Q_X)$, which fails whenever $P_X \neq Q_X$.

To highlight that Gaussians aren't special (as was suggested in an alternate answer), take the common channel to be an AWGN channel with noise variance $1$ (i.e. $Y = X + Z$ with $Z \sim \mathcal{N}(0,1)$), and let $P_X$ and $Q_X$ be Gaussians of variance $1$ centered at $+1/2$ and $-1/2$ respectively. (This is also less pathological than the identity channel.)

Then $$ P_{XY} = \mathcal{N}\left( (1/2,1/2)', \Sigma\right), Q_{XY} = \mathcal{N}( (-1/2, -1/2)', \Sigma),$$

where $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2\end{pmatrix}, \Sigma^{-1} = \begin{pmatrix} 2 & -1 \\ -1 & 1\end{pmatrix}.$

Thus, $$ D(P_{XY}\|Q_{XY}) = \frac{1}{2} (1,1) \Sigma^{-1} (1,1)' = 1/2 \\ D(P_X \|Q_X) = 1/2 \\ D(P_Y\|Q_Y) = 1/4, $$ so $D(P_{XY}\|Q_{XY}) = 1/2 < 3/4 = D(P_X\|Q_X) + D(P_Y\|Q_Y)$, violating the claimed inequality.

(Here I've used the standard fact that $D(\mathcal{N}(\mu, \Sigma)\|\mathcal{N}(\nu, \Sigma)) = \frac{1}{2}(\mu - \nu)' \Sigma^{-1} (\mu - \nu)$.)
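A numerical sanity check of these three values; a numpy sketch where the hypothetical helper `gauss_kl` implements the same-covariance Gaussian KL formula just quoted:

```python
import numpy as np

def gauss_kl(m1, m2, S):
    """D(N(m1,S) || N(m2,S)) = (1/2) (m1-m2)' S^{-1} (m1-m2), shared covariance S."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    return 0.5 * float(d @ np.linalg.solve(np.atleast_2d(S), d))

Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])
D_joint = gauss_kl([0.5, 0.5], [-0.5, -0.5], Sigma)  # D(P_XY || Q_XY) = 1/2
D_x = gauss_kl([0.5], [-0.5], [[1.0]])               # D(P_X  || Q_X)  = 1/2
D_y = gauss_kl([0.5], [-0.5], [[2.0]])               # D(P_Y  || Q_Y)  = 1/4
print(D_joint, D_x, D_y)                             # joint < sum of marginals
```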


One operational way to see why the above is happening is using the Chernoff-Stein lemma. This says that if you're testing the hypothesis that an unknown distribution is $P$ or if it is $Q$, then the optimal rate of decay of missed detection error (as sample size increases) for any constant false alarm level is $D(P\|Q)$.

Now, the left hand side corresponds to the rate when testing $P_{XY}$ against $Q_{XY}$. On the other hand, the right hand side corresponds to the rate when testing $P_X \otimes P_Y$ against $Q_X \otimes Q_Y$. One way to interpret the second case is to think of each sample point as two independent samples from the relevant joint law, from which you're being given the $X$ from one, and the $Y$ from the other.

If the channels $P_{Y|X}$ and $Q_{Y|X}$ are indeed the same, then notice that in the first case, the $Y$ is not really helpful in the hypothesis test - all relevant information is already contained in the $X$. However, in the second case, since the $Y$ is independent of the $X$, you're essentially being given a second independent sample which is a little noisier. It should be clear that this helps the test, and the rate of decay of error should thus be larger.