What are the meaning and intuition behind I(X, Y) in information theory?


In a course which touches information theory, I(X, Y) is mentioned and defined with a formula:

$I(X, Y) = \log_2\frac{P(X \mid Y)}{P(X)}$

where $P(X \mid Y)$ is the conditional probability of event X happening given that event Y has already happened.

I think I've seen the notation I(X, Y) used for mutual information in a couple of places. However, mutual information is always non-negative, while the definition above allows negative values, i.e. whenever $P(X \mid Y) < P(X)$. What, then, does I(X, Y) defined in this way mean?

Also, how would I connect that to the intuition behind the amount of information $I(X)$ ("the minimal number of Yes/No questions needed to...")?


There are 3 best solutions below


It is called mutual information, and its form is $I(X;Y)=\sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}$ (or an integral for continuous variables).

Notice that it is symmetric: you can swap $x$ and $y$.

It is a measure of how much information is shared by $X$ and $Y$; see the famous Venn diagram on the Wikipedia page on mutual information.
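As a concrete illustration (not part of the original answer), here is a minimal Python sketch that evaluates the sum above for a discrete joint distribution given as a table of probabilities; the two example distributions are made up:

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y) in bits.

    joint: dict mapping (x, y) pairs to probabilities P(X=x, Y=y).
    """
    # Marginals p(x) and p(y) from the joint table.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent fair bits share no information:
fair = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(fair))  # → 0.0 bits

# Perfectly correlated fair bits share one full bit:
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(copy))  # → 1.0 bit
```

The two extreme cases bracket the general behavior: independence gives zero, and a deterministic one-to-one relationship gives the full entropy of either variable.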


I have only ever seen mutual information defined for random variables, and not events as in your post. The fact "mutual information is nonnegative" is for the version for random variables.

There is a connection between your definition and the standard definition, namely: if $X$ and $Y$ are events, and $1_X$ and $1_Y$ are indicators for these events (random variables), then $$I(1_X,1_Y) = P(X \cap Y) \log \frac{P(X \mid Y)}{P(X)} + P(X \cap Y^c) \log \frac{P(X \mid Y^c)}{P(X)} + P(X^c \cap Y) \log \frac{P(X^c \mid Y)}{P(X^c)} + P(X^c \cap Y^c) \log \frac{P(X^c \mid Y^c)}{P(X^c)}. $$ That is, the mutual information between the two indicator random variables is a weighted average of your "event mutual informations" for each pair of events in $\{X, X^c\}\times\{Y,Y^c\}$.
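This also resolves the sign puzzle from the question. A small Python sketch (the joint probabilities are made up for illustration): individual terms $\log_2\frac{P(X \mid Y)}{P(X)}$ can indeed be negative, but their weighted average, the mutual information of the indicators, is not:

```python
import math

# Illustrative joint distribution for the two indicator variables;
# (1, 1) means "X and Y both happened", etc. Numbers are made up.
pxy = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}
px = {x: pxy[(x, 1)] + pxy[(x, 0)] for x in (0, 1)}
py = {y: pxy[(1, y)] + pxy[(0, y)] for y in (0, 1)}

# Per-event terms log2(P(x|y)/P(x)) -- these can be negative...
pointwise = {(x, y): math.log2(pxy[(x, y)] / py[y] / px[x])
             for (x, y) in pxy}

# ...but their p(x,y)-weighted average, the mutual information, cannot.
mi = sum(pxy[k] * pointwise[k] for k in pxy)
print(any(v < 0 for v in pointwise.values()))  # True: some terms negative
print(mi >= 0)                                 # True: the average is not
```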


The standard definition of mutual information refers to random variables, not to events:

$$\begin{align} I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \\ &= \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)} \\ &= E\left[ \log \frac{p(X \mid Y)}{p(X)}\right] \\ \end{align}$$

So, it seems that your definition is missing the expectation (or average).

Yes, the mutual information is non-negative, and, yes, this is not obvious from this definition. It is actually a relatively deep result, a consequence of the non-negativity of the relative entropy (another non-obvious result).

Also, how would I connect that to the intuition behind the amount of information $I(X)$ ("the minimal number of Yes/No questions needed to...")?

It's easy to show that $I(X;Y)=H(X)-H(X \mid Y)$. But the entropy $H(X)$, when expressed in bits ($\log_2$), relates to "the average (not minimal!) number of Yes/No questions (bits) needed to guess the value using the optimal strategy".

Analogously, $H(X \mid Y)$ is the average number of questions needed to guess $X$ when we already know $Y$.

Hence the mutual information $I(X;Y)$ (which, mind you, always relates two variables) tells you how many questions (bits) you save when you are given the value of $Y$ before guessing $X$ (roughly, how much information $Y$ gives you about $X$). Or vice versa.
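The identity $I(X;Y)=H(X)-H(X \mid Y)$ can be checked numerically; here is a short Python sketch using a made-up joint distribution (an illustration, not a definitive implementation):

```python
import math

def H(dist):
    """Shannon entropy in bits of a distribution given as {value: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative joint distribution p(x, y); the numbers are made up.
pxy = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}
px = {x: pxy[(x, 1)] + pxy[(x, 0)] for x in (0, 1)}
py = {y: pxy[(1, y)] + pxy[(0, y)] for y in (0, 1)}

# Conditional entropy H(X|Y) = sum_y p(y) * H(X | Y=y).
HxGivenY = sum(py[y] * H({x: pxy[(x, y)] / py[y] for x in (0, 1)})
               for y in (0, 1))

# Mutual information directly from the definition.
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())

# The two routes agree: I(X;Y) = H(X) - H(X|Y).
print(abs(mi - (H(px) - HxGivenY)) < 1e-12)  # True
```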