What are the meaning and intuition behind I(X, Y) in information theory?


In a course which touches information theory, I(X, Y) is mentioned and defined with a formula:

$I(X, Y) = \log_2\frac{P(X \mid Y)}{P(X)}$

where $P(X \mid Y)$ is the conditional probability of event X happening given that event Y has already happened.

I think I've seen the notation I(X, Y) used for mutual information in a couple of places. However, mutual information is always non-negative, while the definition above allows negative values, i.e. whenever $P(X \mid Y) < P(X)$. What, then, does I(X, Y) defined in this way mean?

Also, how would I connect that to the intuition behind the amount of information $I(X)$ ("the minimal number of Yes/No questions needed to...")?


There are 3 best solutions below


It is called mutual information, and its form is $I(X;Y)=\sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}$ (or an integral for continuous variables).

Notice that it is symmetric: you can swap $x$ and $y$.

It is a measure of how much information is shared by $X$ and $Y$; see the famous Venn diagram on the Wikipedia page on mutual information.
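As a concrete illustration (not part of the original answer), here is a minimal Python sketch that evaluates the sum above for a discrete joint distribution given as a table of probabilities; the two example distributions are made up:

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y) in bits.

    joint: dict mapping (x, y) pairs to probabilities P(X=x, Y=y).
    """
    # Marginals p(x) and p(y) from the joint table.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent fair bits share no information:
fair = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(fair))  # → 0.0 bits

# Perfectly correlated fair bits share one full bit:
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(copy))  # → 1.0 bit
```

The two extreme cases bracket the general behavior: independence gives zero, and a deterministic one-to-one relationship gives the full entropy of either variable.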


I have only ever seen mutual information defined for random variables, and not events as in your post. The fact "mutual information is nonnegative" is for the version for random variables.

There is a connection between your definition and the standard definition, namely: if $X$ and $Y$ are events, and $1_X$ and $1_Y$ are indicators for these events (random variables), then $$I(1_X,1_Y) = P(X \cap Y) \log \frac{P(X \mid Y)}{P(X)} + P(X \cap Y^c) \log \frac{P(X \mid Y^c)}{P(X)} + P(X^c \cap Y) \log \frac{P(X^c \mid Y)}{P(X^c)} + P(X^c \cap Y^c) \log \frac{P(X^c \mid Y^c)}{P(X^c)}. $$ That is, the mutual information between the two indicator random variables is a weighted average of your "event mutual informations" for each pair of events in $\{X, X^c\}\times\{Y,Y^c\}$.
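This also resolves the sign puzzle from the question. A small Python sketch (the joint probabilities are made up for illustration): individual terms $\log_2\frac{P(X \mid Y)}{P(X)}$ can indeed be negative, but their weighted average, the mutual information of the indicators, is not:

```python
import math

# Illustrative joint distribution for the two indicator variables;
# (1, 1) means "X and Y both happened", etc. Numbers are made up.
pxy = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}
px = {x: pxy[(x, 1)] + pxy[(x, 0)] for x in (0, 1)}
py = {y: pxy[(1, y)] + pxy[(0, y)] for y in (0, 1)}

# Per-event terms log2(P(x|y)/P(x)) -- these can be negative...
pointwise = {(x, y): math.log2(pxy[(x, y)] / py[y] / px[x])
             for (x, y) in pxy}

# ...but their p(x,y)-weighted average, the mutual information, cannot.
mi = sum(pxy[k] * pointwise[k] for k in pxy)
print(any(v < 0 for v in pointwise.values()))  # True: some terms negative
print(mi >= 0)                                 # True: the average is not
```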


The standard definition of mutual information refers to random variables, not to events:

$$\begin{align} I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \\ &= \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)} \\ &= E\left[ \log \frac{p(X \mid Y)}{p(X)}\right] \\ \end{align}$$

So, it seems that your definition is missing the expectation (or average).

Yes, the mutual information is non-negative, and, yes, this is not obvious from this definition. It is actually a relatively deep result, a consequence of the non-negativity of the relative entropy (another non-obvious result).

Also, how would I connect that to the intuition behind the amount of information $I(X)$ ("the minimal number of Yes/No questions needed to...")?

It's easy to show that $I(X;Y)=H(X)-H(X \mid Y)$. But the entropy $H(X)$, when expressed in bits ($\log_2$), relates to "the average (not minimal!) number of Yes/No questions (bits) needed to guess the value using the optimal strategy".

Analogously, $H(X \mid Y)$ is the average number of questions needed to guess $X$ when we already know $Y$.

Hence the mutual information $I(X;Y)$ (which, mind you, always relates two variables) tells you how many questions (bits) you save when you are given the value of $Y$ before guessing $X$ (roughly, how much information $Y$ gives you about $X$). Or vice versa.
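The identity $I(X;Y)=H(X)-H(X \mid Y)$ can be checked numerically; here is a short Python sketch using a made-up joint distribution (an illustration, not a definitive implementation):

```python
import math

def H(dist):
    """Shannon entropy in bits of a distribution given as {value: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative joint distribution p(x, y); the numbers are made up.
pxy = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}
px = {x: pxy[(x, 1)] + pxy[(x, 0)] for x in (0, 1)}
py = {y: pxy[(1, y)] + pxy[(0, y)] for y in (0, 1)}

# Conditional entropy H(X|Y) = sum_y p(y) * H(X | Y=y).
HxGivenY = sum(py[y] * H({x: pxy[(x, y)] / py[y] for x in (0, 1)})
               for y in (0, 1))

# Mutual information directly from the definition.
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())

# The two routes agree: I(X;Y) = H(X) - H(X|Y).
print(abs(mi - (H(px) - HxGivenY)) < 1e-12)  # True
```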