Prove that the mutual information $I(X;Y)$ is convex in the conditional distribution $p(y \mid x)$


I'm trying to understand the proof that $I(X;Y)$ is convex in the conditional distribution $p(y \mid x)$, from Elements of Information Theory by Cover & Thomas, Theorem 2.7.4.

In the proof we fix $p(x)$ and consider two conditionals $p_1 (y \mid x)$ and $p_2 (y \mid x)$. The corresponding joints are $p_1(x, y) = p(x) \, p_1 (y \mid x)$ and $p_2(x, y) = p(x) \, p_2 (y \mid x)$, with marginals $p_1(y)$ and $p_2(y)$.

Then we consider the conditional $p^*(y \mid x) = \lambda \, p_1 (y \mid x) + (1 - \lambda) \, p_2 (y \mid x)$, the joint $p^*(x, y) = \lambda \, p_1 (x, y) + (1 - \lambda) \, p_2 (x, y) = \lambda \, p(x) \, p_1 (y \mid x) + (1 - \lambda) \, p(x) \, p_2 (y \mid x)$, and the marginal $p^*(y) = \lambda \, p_1 (y) + (1 - \lambda) \, p_2 (y)$.

If we let $q^*(x, y) = p(x) \, p^*(y) = \lambda \, p(x) \, p_1 (y) + (1 - \lambda) \, p(x) \, p_2 (y)$, then the KL divergence between $p^*(x, y)$ and $q^*(x, y)$ is $D \big( p^*(x, y) \ || \ q^*(x, y) \big) = D \big( p^*(x, y) \ || \ p(x) \, p^*(y) \big) = I(X; Y)$, where the mutual information is computed under the joint $p^*(x, y)$, whose marginals are $p(x)$ and $p^*(y)$.

Next, the book concludes the proof by saying that since $D( p \ || \ q)$ is convex in the pair $(p, q)$, so is $I(X;Y)$. What I don't understand is why this implies that $I(X; Y)$ is convex in $p(y \mid x)$.
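To make the construction concrete, here is a minimal numerical sketch of the setup above (the alphabet sizes, the randomly drawn distributions, and all variable names are my own choices, not from the book): it builds $p^*(x,y)$ and $q^*(x,y) = p(x)\,p^*(y)$ and checks that $D\big(p^* \,||\, q^*\big)$ coincides with the mutual information computed directly from the joint $p^*$.

```python
import numpy as np

rng = np.random.default_rng(0)

nx, ny = 3, 4                                   # alphabet sizes (arbitrary)
px  = rng.dirichlet(np.ones(nx))                # fixed marginal p(x)
p1  = rng.dirichlet(np.ones(ny), size=nx)       # conditional p1[x, y] = p1(y|x)
p2  = rng.dirichlet(np.ones(ny), size=nx)       # conditional p2[x, y] = p2(y|x)
lam = 0.3

p_star_cond  = lam * p1 + (1 - lam) * p2        # p*(y|x)
p_star_joint = px[:, None] * p_star_cond        # p*(x,y) = p(x) p*(y|x)
p_star_y     = p_star_joint.sum(axis=0)         # marginal p*(y)
q_star       = np.outer(px, p_star_y)           # q*(x,y) = p(x) p*(y)

def kl(p, q):
    """KL divergence D(p || q) between two distributions on the same finite support."""
    return float(np.sum(p * np.log(p / q)))

def mutual_information(pxy):
    """I(X;Y) computed directly from a joint distribution pxy[x, y]."""
    return kl(pxy, np.outer(pxy.sum(axis=1), pxy.sum(axis=0)))

print(kl(p_star_joint, q_star))                 # D(p*(x,y) || p(x) p*(y))
print(mutual_information(p_star_joint))         # I(X;Y) under p* -- the same number
```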


There are 3 answers below.

Best answer:

That $D(u||v)$ is convex in the pair $(u,v)$ means that

$$D(u_{\lambda}||v_{\lambda}) \le \lambda D(u_{1}||v_{1}) +(1-\lambda)D(u_{2}||v_{2}) $$

(here the $u$'s and $v$'s are probability distributions, with $u_\lambda = \lambda u_1 +(1-\lambda) u_2$ and $v_\lambda = \lambda v_1 +(1-\lambda) v_2$)

Then $$I(X_{\lambda};Y_{\lambda}) = D(p_{\lambda}(X,Y)||p_{\lambda}(X)p_\lambda(Y))$$ where (as shown in the text)

$p_{\lambda}(X,Y) = p(X) p_\lambda(Y|X)=\lambda p_1(x,y)+(1-\lambda)p_2(x,y)= p(x) (\lambda p_1(y|x)+(1-\lambda)p_2(y|x))$

I.e.: the "mixture" can be written in terms of the variable $p(y|x)$ as well as in terms of the joint $p(x,y)$ - and the same happens for the marginal $p(y)$. Then we can also write

$p_{\lambda}(x)\,p_\lambda(y) = p(x)\,p_\lambda(y) = \lambda p(x)p_1(y)+(1-\lambda)p(x)p_2(y)$, since $p_\lambda(x) = p(x)$ because $p(x)$ is held fixed.

Then, plugging this into the second equation and using the first inequality:

$$I(X_{\lambda};Y_{\lambda}) \le \lambda I(X_1;Y_1) + (1-\lambda) I(X_2;Y_2)$$

which means that $I(X;Y)$ is convex with respect to the variable being mixed, in this case $p(y|x)$.
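As a sanity check of this inequality, here is a small numerical sketch (arbitrary alphabet sizes and randomly drawn distributions of my own choosing): it fixes $p(x)$, draws two conditionals, and verifies $I(X_\lambda;Y_\lambda)\le \lambda I(X_1;Y_1)+(1-\lambda)I(X_2;Y_2)$ over many trials.

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_information(px, pyx):
    """I(X;Y) for a marginal px and a conditional pyx[x, y] = p(y|x)."""
    joint = px[:, None] * pyx
    py = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / (px[:, None] * py[None, :]))))

nx, ny = 3, 5
for trial in range(1000):
    px = rng.dirichlet(np.ones(nx))             # fixed marginal p(x)
    p1 = rng.dirichlet(np.ones(ny), size=nx)    # p1(y|x), one row per x
    p2 = rng.dirichlet(np.ones(ny), size=nx)    # p2(y|x)
    for lam in (0.1, 0.5, 0.9):
        lhs = mutual_information(px, lam * p1 + (1 - lam) * p2)
        rhs = lam * mutual_information(px, p1) + (1 - lam) * mutual_information(px, p2)
        assert lhs <= rhs + 1e-12, (lhs, rhs)
print("convexity inequality held in all trials")
```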

Another answer:

First, let's state the claim.

Claim: $I(X;Y)=D(p(x,y)||p(x)p(y))$ is concave in $p(x)$ for fixed $p(y|x)$ and convex in $p(y|x)$ for fixed $p(x)$.

One can write the following:

$I(X;Y)=H(Y)-H(Y|X)$, where $H$ is the entropy function. These entropies can further be written as

$$H(Y)=-\sum_y p(y)\log(p(y)) \ \text{ and } \ H(Y|X)=-\sum_x p(x)\sum_y p(y|x)\log (p(y|x))$$

What we know is that $H(Y)$ is a concave function of $p(y)$, because $x\log x$ is a convex function of $x$. We can write $p(y)=\sum_x p(y|x)p(x)$. Hence, when $p(y|x)$ is held fixed, $p(y)$ is a linear function of $p(x)$, and a concave function of a linear function is concave, so $H(Y)$ is also concave in $p(x)$. For fixed $p(y|x)$, $H(Y|X)$ is linear in $p(x)$. Now consider a function that is the sum of a linear and a concave function, $f(x)=c(x)+l(x)$. Since $f^{''}(x)=c^{''}(x)$, where $(\cdot)^{''}$ denotes the second derivative, $f$ is also concave. Therefore, $I(X;Y)$ is a concave function of $p(x)$.

Now let's come to the second part. When $p(x)$ is fixed, this time $p(y)=\sum_x p(y|x)p(x)$ is a linear function of $p(y|x)$. Hence $H(Y)$ is a concave function of $p(y|x)$, and for the same reason $H(Y|X)$ is also concave in $p(y|x)$. The difference of two concave functions can be concave, convex, or neither, so this decomposition alone is not conclusive. Going back to the definition $I(X;Y)=D(p(x,y)\,||\,p(x)p(y))$, we know that $D(\cdot\,||\,\cdot)$ is jointly convex in its two arguments. With $p(x)$ fixed, both arguments are linear in $p(y|x)$: the joint $p(x,y)=p(x)p(y|x)$ obviously so, and the product $p(x)p(y)$ because $p(y)$ is linear in $p(y|x)$. A jointly convex function composed with linear maps is convex, so $I(X;Y)$ is convex in $p(y|x)$.

Another idea would be to show that $I(X;Y)$ attains no maximum over $p(y|x)$ except on the boundary. I don't know, however, how this can be shown. What I do know is that $I(X;Y)$ is always upper-bounded by $\min[ H(X),H(Y)]$, simply from the symmetry of $I(X;Y)=H(Y)-H(Y|X)=H(X)-H(X|Y)$.
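To illustrate the first half of the claim numerically (a sketch with arbitrary sizes and randomly drawn distributions of my own choosing, not anything from the book): fix $p(y|x)$, move along a segment between two marginals $p(x)$, and check that $I$ lies above the chord, which is what concavity in $p(x)$ means.

```python
import numpy as np

rng = np.random.default_rng(2)

def mutual_information(px, pyx):
    """I(X;Y) for a marginal px and a conditional pyx[x, y] = p(y|x)."""
    joint = px[:, None] * pyx
    py = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / (px[:, None] * py[None, :]))))

nx, ny = 4, 3
pyx = rng.dirichlet(np.ones(ny), size=nx)       # fixed conditional p(y|x)
pa, pb = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))  # two marginals

for lam in np.linspace(0.0, 1.0, 11):
    mix = lam * pa + (1 - lam) * pb
    chord = lam * mutual_information(pa, pyx) + (1 - lam) * mutual_information(pb, pyx)
    assert mutual_information(mix, pyx) >= chord - 1e-12   # concavity in p(x)
print("I(X;Y) lies above the chord: concave in p(x) for this fixed p(y|x)")
```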

A third answer:

I'll give here a direct line of reasoning that doesn't involve relative entropies.

The mutual information (MI), generally written as $I(X:Y)$, should maybe more accurately be written as a function of the joint probability distribution $p:(x,y)\mapsto p(x,y)$, as
$$I(X:Y)\equiv I(p)\equiv \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x)p_Y(y)},\tag1$$
where $p_X(x)\equiv \sum_y p(x,y)$ and $p_Y(y)\equiv\sum_x p(x,y)$ are the marginal distributions (which, we can already note, are linear functions of $p$).

When discussing concavity/convexity with respect to the conditional distribution, what we are doing is writing the joint distribution as $p(x,y)=p_X(x)p_C(y|x)$, where I defined the conditional distribution as $p_C(y|x)\equiv \frac{p(x,y)}{p_X(x)}$. But more generally, one can imagine $p_X$ and $p_C$ as independent variables, and study the dependence of $I$ on both. In other words, we are essentially writing the MI as
$$I(p) \equiv I(p_X,p_C).\tag2$$

This is useful because if we now use the standard identity $I(X:Y)=H(Y)-H(Y|X)$, we can more precisely write the associated functional relations as
$$I(p_X,p_C) = H(Y;p_X,p_C) - H(Y|X; p_X,p_C),\tag3$$
where $H(Y;p_X,p_C)\equiv H(p_Y)$ is the entropy of the induced marginal $p_Y(y)=\sum_x p_X(x)p_C(y|x)$, and $H(Y|X;p_X,p_C)$ denotes the conditional entropy written as a function of the marginal $p_X$ and the conditional $p_C$. Explicitly, the latter reads
$$H(Y|X;p_X,p_C) \equiv \sum_x p_X(x) H(Y|X=x; p_C), \qquad H(Y|X=x; p_C)\equiv H(p_C(\bullet|x)),\tag4$$
where $p_C(\bullet|x):y\mapsto p_C(y|x)$ denotes the conditional distribution $p_C$ thought of as a function of $y$, for a given value of $x$.

With these relations, we've got everything in place to show that $I(p_X,p_C)$ is convex wrt $p_C$, and concave wrt $p_X$:

  1. We know that the Shannon entropy is concave in its argument, and the induced marginal $p_Y$ is linear in $p_X$ (for fixed $p_C$), so $H(Y;p_X,p_C)$ is concave wrt $p_X$. We also see from (4) that $H(Y|X;p_X,p_C)$ is linear in $p_X$. Hence, looking at (3), we see that $I(p_X,p_C)$ is a difference between a concave (wrt $p_X$) function and a linear (wrt $p_X$) function, and is thus also concave wrt $p_X$.
  2. For fixed $p_X$, both $p_C(y|x)$ and the induced marginal $p_Y(y)=\sum_x p_X(x)p_C(y|x)$ are linear functions of $p_C$, and (1) expresses the MI as $I(p_X,p_C)=\sum_{x,y} p_X(x)\,p_C(y|x)\log\frac{p_C(y|x)}{p_Y(y)}$. Each summand has the form $a\log(a/b)$ with $a$ and $b$ linear in $p_C$; since $(a,b)\mapsto a\log(a/b)$ is jointly convex, each summand is convex in $p_C$, and hence so is their combination with nonnegative weights $p_X(x)$, i.e. $I(p_X,p_C)$ is convex wrt $p_C$. Note that the decomposition (3) alone is not conclusive here: from (4), $H(Y|X;p_X,p_C)$ is a convex combination of the functions $H(Y|X=x;p_C)$, which are all concave in $p_C$, so it is concave in $p_C$, but $H(Y;p_X,p_C)$ is also concave in $p_C$, and a difference of concave functions need not be convex.
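As a quick consistency check of relations (1), (3) and (4), here is a short numerical sketch (the sizes and the randomly drawn $p_X$, $p_C$ are arbitrary choices of mine): it computes $I$ from the definition (1) and from the decomposition $H(p_Y)-\sum_x p_X(x)H(p_C(\bullet|x))$, and the two values agree.

```python
import numpy as np

rng = np.random.default_rng(3)

def entropy(p):
    """Shannon entropy of a distribution p (natural log)."""
    return float(-np.sum(p * np.log(p)))

nx, ny = 3, 4
p_x = rng.dirichlet(np.ones(nx))                # marginal p_X
p_c = rng.dirichlet(np.ones(ny), size=nx)       # conditional p_c[x, y] = p_C(y|x)

joint = p_x[:, None] * p_c                      # p(x,y) = p_X(x) p_C(y|x)
p_y = joint.sum(axis=0)                         # induced marginal p_Y

# Definition (1): I(p) = sum_{x,y} p(x,y) log[ p(x,y) / (p_X(x) p_Y(y)) ]
i_def = float(np.sum(joint * np.log(joint / np.outer(p_x, p_y))))

# Decomposition (3)+(4): I = H(p_Y) - sum_x p_X(x) H(p_C(.|x))
i_dec = entropy(p_y) - sum(p_x[x] * entropy(p_c[x]) for x in range(nx))

print(i_def, i_dec)   # the two values agree up to floating-point error
```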