When is Conditional Mutual Information greater than Mutual Information and what does it represent?


I am struggling to find the cases for which $I(X;Y|Z)>I(X;Y)$. The only mathematical example I could find for such a case is the following: $$ I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z). $$ This makes sense since they are both definitions of $I(X;Y,Z)$. So, if we assume $X$ and $Z$ to be independent such that $I(X;Z) = 0$, then $$ I(X;Y|Z) - I(X;Y) = I(X;Z|Y) \geq0 $$ such that $$ I(X;Y|Z) \geq I(X;Y). $$ The issue I have with this example is that if we considered $X$ and $Z$ to be independent, I also would expect $I(X;Z|Y)$ to be equal to $0$ and not greater than $0$. If it was $0$ then the MI and CMI would be equal which I can understand, but I do not get how this can be achieved and how to interpret it properly. In other words, how can conditioning a third random variable increase the mutual information between two other random variables mathematically and how can this be interpreted?


BEST ANSWER

Actually, the case you consider, that is, with $X$ and $Z$ being independent, is a well-known case where conditioning increases the mutual information.

To provide some intuition/interpretation of this result, consider a communication channel, where $X$ represents the "message" sent by a transmitter, $Z$ is the additive "noise" introduced by the channel, and $Y$ is what the receiver observes. In addition to $X$ and $Z$ being independent, the observation is modeled as $$ Y = X + Z. $$ The result $I(X;Y|Z)\geq I(X;Y)$ essentially states that knowledge at the receiver of the noise realization $Z$ (in addition to $Y$) can only increase the information about $X$. This is, of course, intuitive. (Actually, knowledge of $Y$ and $Z$ determines $X$ exactly, therefore the inequality between the mutual informations is, here, strict.)
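This channel example can be checked numerically. The sketch below (my own illustration, not from the original answer) takes $X$ and $Z$ to be independent fair bits and $Y = X + Z$ an ordinary integer sum, then computes both quantities from entropies of the joint distribution:

```python
from itertools import product
from math import log2

# X, Z independent fair bits; Y = X + Z (integer sum, so Y in {0, 1, 2}).
# The joint distribution is uniform over the four (x, z) pairs.
p = {(x, z, x + z): 0.25 for x, z in product([0, 1], repeat=2)}

def H(keyfn):
    """Entropy in bits of the variable selected by keyfn from (x, z, y)."""
    dist = {}
    for (x, z, y), pr in p.items():
        k = keyfn(x, z, y)
        dist[k] = dist.get(k, 0.0) + pr
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_xy = H(lambda x, z, y: x) + H(lambda x, z, y: y) - H(lambda x, z, y: (x, y))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_xy_given_z = (H(lambda x, z, y: (x, z)) + H(lambda x, z, y: (y, z))
                - H(lambda x, z, y: (x, y, z)) - H(lambda x, z, y: z))
print(I_xy, I_xy_given_z)  # 0.5 1.0
```

Here $I(X;Y) = 0.5$ bits while $I(X;Y|Z) = 1$ bit: once the noise is known, $Y$ reveals $X$ completely, so the inequality is strict, as the answer notes.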

One issue you have with the proof of this result is how it can be that $I(X;Z|Y)> 0$ (strict inequality) when $I(X;Z)=0$. This question can be posed more generally: how can $p(x,z|y)\neq p(x|y) p(z|y)$ (i.e., $X$ and $Z$ are not independent conditioned on $Y$), even though $p(x,z)=p(x) p(z)$ ($X$ and $Z$ are independent when no conditioning is imposed)?

Note that this is indeed the case in the communication channel: given $Y$, knowledge of $Z$ provides information about $X$; therefore, $X$ and $Z$ are not independent when conditioned on $Y$. In summary, one can state the following:

Two independent variables $X$ and $Z$ can become dependent when conditioned on an appropriate third variable $Y$ (which, obviously, should depend on both $X$ and $Z$).

ANSWER

The classical example: let $X,Y$ be independent fair Bernoulli variables (taking values in $\{0,1\}$ with equal probability), and let $Z=X+Y \pmod 2$ (in Boolean logic, $Z= X \oplus Y$, where $\oplus$ is the XOR operator).

It's easy to see that $X,Y,Z$ each have 1 bit of entropy and that they are pairwise independent (e.g., $P(Z|X)=P(Z)$); hence $I(X;Y)=I(X;Z)=0$.

However, it's also obvious that knowing any two of the variables determines the third, so, for example, $H(X | Y,Z)=0$ and

$$I(X;Y|Z) = H(X|Z) - H(X | Y,Z)=1 - 0 = 1 $$

The way to understand $I(X;Y|Z) > I(X;Y)$ in this example is: $Y$, by itself, gives us no information about $X$. However, if we are also given $Z$ (the conditioning!), things change: $Y$ now tells us everything about $X$.
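The XOR example above can be verified with a few lines of code. This is my own sanity check, not part of the original answer; it enumerates the joint distribution of $(X, Y, Z)$ with $Z = X \oplus Y$ and computes both quantities from entropies:

```python
from itertools import product
from math import log2

# X, Y independent fair bits; Z = X xor Y. Uniform over the four (x, y) pairs.
p = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

def H(keyfn):
    """Entropy in bits of the variable selected by keyfn from (x, y, z)."""
    dist = {}
    for (x, y, z), pr in p.items():
        k = keyfn(x, y, z)
        dist[k] = dist.get(k, 0.0) + pr
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y): zero, since X and Y are independent
I_xy = H(lambda x, y, z: x) + H(lambda x, y, z: y) - H(lambda x, y, z: (x, y))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z): one full bit
I_xy_given_z = (H(lambda x, y, z: (x, z)) + H(lambda x, y, z: (y, z))
                - H(lambda x, y, z: (x, y, z)) - H(lambda x, y, z: z))
print(I_xy, I_xy_given_z)  # 0.0 1.0
```

The output matches the derivation: $I(X;Y)=0$ while $I(X;Y|Z)=1$ bit.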

ANSWER

In terms of interpretation, $I(X;Y|Z)>I(X;Y)$ is an indication that $X$ and $Y$ convey, at least to some degree, synergistic information about $Z$ (even though a certain degree of redundancy could still be present). The difference $I(X;Y)-I(X;Y|Z)$ can be decomposed as $R-S$, where $R$ denotes the redundant component and $S$ the synergistic one. Only when $R-S<0$ can you conclude that there has to be some amount of synergy in the way $X$ and $Y$ encode $Z$.