Why does $H(X|Y)$ equal the "missing information" of $Y$ about $X$?


I've seen mentioned in (Horodecki, Oppenheim, Winter 2005) the fact that the conditional entropy equals the amount of information that Alice needs to send Bob in order for him to fully reconstruct a source, provided Bob already holds some side information about it.

More precisely, say Alice holds a random variable $X$, Bob holds $Y$, and there is some correlation between them. Because of the correlation, Bob has some amount of (generally incomplete) information about $X$. The statement is that, in order for Bob to have full knowledge about $X$, it is sufficient that Alice sends him $H(X|Y)$ bits of information.

Is there some general intuition to understand where this result comes from? How does Alice decide what information to send Bob? This result seems to have been first presented by Slepian and Wolf in 1973, but I haven't been very lucky in finding more modern references for this particular protocol.

As a simple example, consider a pair of binary random variables, with possible equiprobable outcomes being $00, 11, 01$. In other words, we have the joint probability distribution $$p(0,0) = p(0,1) = p(1,1) = \frac13.$$ In this case, one can see that $H(X|Y)=\frac23$, meaning Alice should just need to send (on average) 2/3 of a bit to give Bob full knowledge of her side. How would this work?
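For concreteness, the value $H(X|Y)=2/3$ can be checked numerically from the joint distribution above (a minimal Python sketch; the distribution is the one in the question, everything else is bookkeeping):

```python
from math import log2

# Joint distribution from the example: p(0,0) = p(0,1) = p(1,1) = 1/3.
p = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}

# Marginal of Y.
p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0.0) + pr

# Conditional entropy H(X|Y) = sum_y p(y) H(X | Y=y).
h_x_given_y = 0.0
for y, py in p_y.items():
    for (x, yy), pr in p.items():
        if yy == y:
            cond = pr / py          # p(x|y), always > 0 here
            h_x_given_y -= py * cond * log2(cond)

print(h_x_given_y)  # ≈ 0.6667 = 2/3
```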

Answer 1

Remember that entropy is average information: maybe Alice needs to send $0$ bits $1/3$ of the time (if Bob can already work out the information) and $1$ bit $2/3$ of the time. Maybe she needs to send $0$ bits $13/15$ of the time and $5$ bits $2/15$ of the time etc. Or maybe neither of these is true. From entropy alone, we cannot tell: this information would come from the distribution of the random variable $X|Y$.

In the case you have given, it is the former. If Bob, who knows $Y$, sees $Y=0$ then he concludes that $(X,Y)=(0,0)$, because $(1,0)$ is not a possible value. That is, $1/3$ of the time, Bob already knows all the information. If he sees $Y=1$, however, then he needs Alice to send $1$ bit of information: the value of $X$. That tells him whether it is $(1,1)$ or $(0,1)$. Each of the three outcomes is equally likely (so $Y=0$ w.p. $1/3$ and $Y=1$ w.p. $2/3$), so Alice sends on average $2/3$ of a bit of information.
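This protocol is simple enough to simulate directly (a sketch, not from the original answer: Alice sends nothing when $Y=0$ and sends the single bit $X$ when $Y=1$):

```python
import random

random.seed(0)
outcomes = [(0, 0), (0, 1), (1, 1)]  # equiprobable, as in the question

n_trials = 100_000
bits_sent = 0
for _ in range(n_trials):
    x, y = random.choice(outcomes)
    if y == 1:
        # Bob cannot tell (0,1) from (1,1): Alice sends the one bit x.
        bits_sent += 1
    # If y == 0, Bob already knows x = 0: zero bits needed.

print(bits_sent / n_trials)  # close to 2/3
```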

Answer 2

Denote by $X^n,Y^n$ the registers holding sequences of $n$ symbols sampled from the sources $X,Y$, respectively. Suppose Bob holds the sequence $y^n\equiv(y_1,...,y_n)$, while Alice holds $x^n\equiv (x_1,...,x_n)$. By the joint AEP theorem, $(x^n,y^n)$ is a jointly typical sequence with probability approaching one as $n\to\infty$.

We want to show that there is a way for Alice to send $nH(X|Y)$ bits of information to Bob, which Bob can use, together with his knowledge of $y^n$, to infer $x^n$ (again, with asymptotically vanishing error probability). Crucially, Alice should not need to know Bob's $y^n$ to decide what to send. Rather, we're looking for a protocol in which Alice observes her $x^n$ and, based on that alone, decides which bits to send Bob. Bob, upon receiving these bits and looking at his $y^n$, will correctly deduce $x^n$.

Easy solution if Alice knows Bob's outcome

Let's start by working out a simpler case, where Alice is told Bob's outcome before deciding what bits to send him. In this case, we need only remember that the number of typical sequences $x^n$ that are jointly typical with a given $y^n$ is (asymptotically close to) $2^{n H(X|Y)}$.

In fact, again by the joint AEP theorem, there are $\sim 2^{n H(X,Y)}$ jointly typical sequences, while only $\sim 2^{nH(Y)}$ typical sequences for the marginal distribution of $Y$. In the limit $n\to\infty$, all typical sequences are essentially equiprobable, so in particular every typical $y^n$ is paired with (essentially) the same number of typical sequences $x^n$ such that $(x^n,y^n)$ is jointly typical. We can thus simply divide the total number of jointly typical sequences by the number of typical sequences for $Y$, concluding that for each typical $y^n$ there are $$2^{nH(X,Y)}/2^{nH(Y)}=2^{nH(X|Y)}$$ typical sequences $x^n$ such that $(x^n,y^n)$ is jointly typical.
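For the question's example distribution this identity can be checked numerically: $H(X,Y)=\log_2 3$, $H(Y)$ is the binary entropy of $1/3$, and their difference is exactly the $2/3$ of a bit computed in the question (a small Python check, assuming nothing beyond the distribution given above):

```python
from math import log2

# H(X,Y): three equiprobable outcomes (0,0), (0,1), (1,1).
h_xy = -sum(pr * log2(pr) for pr in [1/3, 1/3, 1/3])   # = log2(3)

# H(Y): Y = 0 with probability 1/3, Y = 1 with probability 2/3.
h_y = -sum(pr * log2(pr) for pr in [1/3, 2/3])

print(h_xy - h_y)  # H(X|Y) = H(X,Y) - H(Y) ≈ 0.6667
```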

Alice, knowing $y^n$, also knows exactly which $2^{n H(X|Y)}$ typical sequences for $X$ form jointly typical pairs with $y^n$. She can thus simply enumerate these possible sequences and send the associated index to Bob. This requires sending $n H(X|Y)$ bits. Bob, who also knows which sequences are compatible with his observation, upon receiving Alice's message can simply look up which compatible typical sequence is the correct one.
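A toy implementation of this enumeration scheme for the question's source (a sketch: here the sequences compatible with $y^n$ are exactly those with $x_i=0$ wherever $y_i=0$, so the index needs one bit per position with $y_i=1$, i.e. $\approx 2n/3 = nH(X|Y)$ bits on average):

```python
import random

random.seed(2)
OUTCOMES = [(0, 0), (0, 1), (1, 1)]  # equiprobable, as in the question

n = 30
pairs = [random.choice(OUTCOMES) for _ in range(n)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

# Positions where y_i = 1: only there is x_i undetermined by y^n.
free = [i for i, y in enumerate(ys) if y == 1]

# Alice (who here knows y^n) enumerates the compatible x-sequences in
# lexicographic order of the free bits and sends the index of hers.
index = 0
for i in free:
    index = (index << 1) | xs[i]
bits_needed = len(free)          # ≈ n * H(X|Y) = 2n/3 on average

# Bob inverts the enumeration to recover x^n exactly
# (positions with y_i = 0 force x_i = 0, since (1,0) never occurs).
decoded = [0] * n
for j, i in enumerate(free):
    decoded[i] = (index >> (len(free) - 1 - j)) & 1

print(decoded == xs, bits_needed / n)  # True, fraction near 2/3
```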

Solution when Alice doesn't know $y^n$

Let's now consider the more interesting situation where Alice only uses $x^n$ to decide what to send Bob.

In this situation, consider the following scheme: Alice and Bob agree beforehand on a partition of the set of typical sequences for $X$. More precisely, the set of all such typical sequences is partitioned at random into blocks of size $2^{n I(X:Y)}$. There will be $2^{n[H(X)-I(X:Y)]}=2^{n H(X|Y)}$ such blocks. When Alice sees that she got the sequence $x^n$, she tells Bob which block contains $x^n$. This requires her to send $nH(X|Y)$ bits of information.

So, in summary, the message Alice sent allows Bob to restrict his attention to only $2^{nI(X:Y)}$ possible typical input sequences. There isn't any additional information to help Bob decide on Alice's $x^n$, so the question becomes: what's the probability that this set of $2^{n I(X:Y)}$ typical sequences contains more than one $x^n$ such that $(x^n,y^n)$ is jointly typical? Note that we are guaranteed that one such $x^n$ is in there, by construction of the partition.
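The whole binning protocol can be simulated for the question's source at modest $n$ (a sketch under simplifying assumptions: a salted hash plays the role of the shared random partition, and the number of bin bits is taken slightly above $nH(X|Y)$ to suppress collisions at finite $n$):

```python
import random

random.seed(1)
OUTCOMES = [(0, 0), (0, 1), (1, 1)]  # equiprobable, as in the question

def run_trial(n, bin_bits):
    # Sample the correlated sequences x^n (Alice) and y^n (Bob).
    pairs = [random.choice(OUTCOMES) for _ in range(n)]
    xs = tuple(p[0] for p in pairs)
    ys = tuple(p[1] for p in pairs)

    # Shared randomness: a random binning (hash) of x-sequences.
    salt = random.getrandbits(64)
    def bin_of(seq):
        return hash((salt, seq)) % (1 << bin_bits)

    # Alice sends only the bin index of her sequence: bin_bits bits.
    msg = bin_of(xs)

    # Bob enumerates every x^n compatible with y^n (here: x_i = 0 is
    # forced when y_i = 0, x_i is free when y_i = 1) and keeps those
    # whose bin matches Alice's message.
    free = [i for i, y in enumerate(ys) if y == 1]
    candidates = []
    for mask in range(1 << len(free)):
        cand = [0] * n
        for j, i in enumerate(free):
            cand[i] = (mask >> j) & 1
        cand = tuple(cand)
        if bin_of(cand) == msg:
            candidates.append(cand)
    # Decoding succeeds iff the true x^n is the unique match.
    return candidates == [xs]

n = 15
# Rate slightly above n * H(X|Y) = 10 bits to suppress collisions.
successes = sum(run_trial(n, bin_bits=14) for _ in range(200))
print(successes / 200)  # decoding succeeds in most trials
```

At these small block lengths the residual error comes from hash collisions within Bob's compatible set; it vanishes as $n$ grows with the rate held slightly above $H(X|Y)$.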

We thus have $2^{n H(X)}$ typical sequences for $X$, partitioned into $2^{nH(X|Y)}$ random subsets of equal size, and we ask how the $2^{nH(X|Y)}$ sequences compatible with $y^n$ (which are essentially random with respect to the partition) are distributed across these subsets.

To answer this question, consider the more general combinatorial problem: given $n$ objects partitioned into $k=pn$ equal subsets, pick $k$ of these objects at random. The probability of choosing exactly one element from each subset is, using Stirling's approximation for the binomial, $$\frac{\left(\frac{n}{np}\right)^{pn}}{\binom{n}{pn}} \sim 2^{-np\log p}2^{np\log p+n(1-p)\log(1-p)}=2^{n(1-p)\log(1-p)}.$$ In our case we have $p=2^{-n I(X:Y)}$, and this quantity thus goes to $1$ for $n\to\infty$. We can thus conclude that the decoding strategy works with vanishingly small error probability.
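One can check numerically how quickly the exact ratio approaches this asymptotic exponent (a sketch; `exact_exponent` is a hypothetical helper name, and the comparison is done per symbol, i.e. dividing base-2 logarithms by $n$):

```python
from math import comb, log2

def exact_exponent(n, p):
    """Per-symbol exponent of (1/p)^{pn} / C(n, pn), i.e. log2(...) / n."""
    k = round(p * n)                 # number of subsets
    log_numer = k * log2(1 / p)      # log2 of (1/p)^{pn}
    log_denom = log2(comb(n, k))     # log2 of binom(n, pn)
    return (log_numer - log_denom) / n

p = 0.1
asymptotic = (1 - p) * log2(1 - p)   # predicted exponent, ≈ -0.137

for n in (100, 1000, 10000):
    print(n, exact_exponent(n, p), asymptotic)
```

The exact per-symbol exponent converges to $(1-p)\log(1-p)$ as $n$ grows, with the gap shrinking like the usual $O(\log n / n)$ Stirling correction.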

An (I think) somewhat similar argument is presented in chapter 10 of Preskill's lecture notes.