Can conditional entropy $H(Y\mid X)$ be expressed by Kullback-Leibler divergence as $-D_{KL}\left(p(X,Y) \parallel p(X)\right)$?


I'm going through "Elements of Information Theory" by Cover and Thomas and there the conditional entropy is proven to be equal to: \begin{align}H(Y\mid X)=-\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y) \log_2p(y \mid x).\end{align}

Another well known formula is the formula for mutual information: \begin{align}I(X;Y)&=\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)} \\ &=E_{p(x,y)}\log_2\frac{p(X,Y)}{p(X)p(Y)} \\ &=D_{KL}(p(x,y)\parallel p(x)p(y)).\end{align}

Following the same reasoning I thought that we could also write (since $p(y\mid x)=p(x,y)/p(x)$): \begin{align}H(Y\mid X)&=-\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y) \log_2 \frac{p(x,y)}{p(x)}\\ &=-E_{p(x,y)}\log_2\frac{p(X,Y)}{p(X)}\\ &=-D_{KL}(p(x,y)\parallel p(x)).\end{align}

Is my reasoning correct?

I can't seem to find that equality listed anywhere. Wikipedia lists a couple of other identities relating Kullback-Leibler divergence to conditional entropy, but not this one, so I suspect that I am mistaken somewhere.
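As a numerical sanity check on the first two lines of that derivation (a sketch only; the toy joint pmf and the variable names are my own choices, not from any reference):

```python
import math

# Hypothetical toy joint pmf p(x, y) over X = {0, 1}, Y = {0, 1, 2}.
p_xy = [[0.10, 0.25, 0.05],
        [0.30, 0.20, 0.10]]

p_x = [sum(row) for row in p_xy]  # marginal p(x)

# H(Y|X) from the definition: -sum_{x,y} p(x,y) log2 p(y|x),
# using p(y|x) = p(x,y) / p(x).
h_def = -sum(p_xy[x][y] * math.log2(p_xy[x][y] / p_x[x])
             for x in range(2) for y in range(3))

# H(Y|X) as the average of per-x entropies: sum_x p(x) H(Y | X = x).
h_avg = sum(p_x[x] * -sum((p_xy[x][y] / p_x[x]) * math.log2(p_xy[x][y] / p_x[x])
                          for y in range(3))
            for x in range(2))

assert abs(h_def - h_avg) < 1e-12  # the two expressions agree
assert h_def >= 0                  # conditional entropy is nonnegative
```

The nonnegativity check already hints at the problem with the proposed identity, since a KL divergence is nonnegative as well.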

Accepted answer:

The last equality in your derivation is not correct. Note that the KL divergence $D_\text{KL}(p\|q)$ is only meaningful when the two distributions involved, $p$ and $q$, are defined over the same space. A quantity such as $D_\text{KL}(p(x,y)\|p(x))$ makes no sense: it compares the distribution $p(x,y)$, which is defined over $\mathcal{X}\times \mathcal{Y}$, with the distribution $p(x)$, which is defined over $\mathcal{X}$.

Second answer:

Your proposed identity must be false, since conditional Shannon entropy is nonnegative and so is KL divergence, so you can't expect one to be, nontrivially, the negative of the other. You have also misidentified the last line as a KL divergence when it is not. This is fixable by factoring out a distribution: say $\mathcal{Y}$ is a finite alphabet and $q$ is some distribution on $\mathcal{Y}$. Now:

\begin{align} H(Y\mid X)&=-\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)}\\ &=-\mathbb{E}_{(X,Y)\sim p(x,y)}\left[\log\frac{p(X,Y)}{p(X)}\right]\\ &= -\mathbb{E}_{(X,Y)\sim p(x,y)}\left[\log q(Y) + \log \frac{p(X,Y)}{p(X)q(Y)}\right]\\ &=H(p,q)-D(p(x,y)\| p(x)q(y)). \end{align}

Here $H(p,q)=-\sum_y p(y)\log q(y)$ is the cross entropy of the marginal of $Y$ against $q$. You can make sense of this formula if you recall that cross entropy measures how many bits it takes to encode a source pretending it came from distribution $q$ when it really came from $p$, and KL divergence measures how many extra bits that pretense costs.
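That decomposition can be checked numerically (a sketch; the joint pmf and the reference distribution $q$ below are arbitrary choices of mine, and I use natural logs to match the answer's $\log$):

```python
import math

# Hypothetical toy joint pmf p(x, y) and an arbitrary full-support pmf q on Y.
p_xy = [[0.10, 0.25, 0.05],
        [0.30, 0.20, 0.10]]
q = [0.2, 0.5, 0.3]

p_x = [sum(row) for row in p_xy]                              # marginal p(x)
p_y = [sum(p_xy[x][y] for x in range(2)) for y in range(3)]   # marginal p(y)

# H(Y|X) = -E[log p(X,Y)/p(X)]  (in nats)
h_cond = -sum(p_xy[x][y] * math.log(p_xy[x][y] / p_x[x])
              for x in range(2) for y in range(3))

# Cross entropy H(p, q) = -sum_y p(y) log q(y), on the Y-marginal.
h_cross = -sum(p_y[y] * math.log(q[y]) for y in range(3))

# D(p(x,y) || p(x) q(y)) -- both arguments now live on X x Y.
d_kl = sum(p_xy[x][y] * math.log(p_xy[x][y] / (p_x[x] * q[y]))
           for x in range(2) for y in range(3))

assert abs(h_cond - (h_cross - d_kl)) < 1e-12  # H(Y|X) = H(p,q) - D(...)
```

Note that $D(p(x,y)\|p(x)q(y))$ is well defined precisely because factoring in $q$ puts both arguments on the same space $\mathcal{X}\times\mathcal{Y}$, which is what the accepted answer said was missing.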

Going in another direction, $H(Y\mid X)=\mathbb{E}_{X\sim p(x)}\big[D\big(\mathbf{1}_{\{y_1=y_2\}}\,p(y_1\mid X=x)\,\big\|\,p(y_1\mid X=x)\,p(y_2\mid X=x)\big)\big]$. Here the inner divergence compares the "diagonal" coupling of $p(\cdot\mid X=x)$ with two independent copies of it; the off-diagonal terms vanish, so the sum collapses to $\sum_y p(y\mid x)\log\frac{1}{p(y\mid x)}=H(Y\mid X=x)$.
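This last identity can also be verified on a toy example (a sketch; the joint pmf is my own arbitrary choice):

```python
import math

# Hypothetical toy joint pmf p(x, y) over X = {0, 1}, Y = {0, 1, 2}.
p_xy = [[0.10, 0.25, 0.05],
        [0.30, 0.20, 0.10]]
p_x = [sum(row) for row in p_xy]

total = 0.0
for x in range(2):
    p_y_given_x = [p_xy[x][y] / p_x[x] for y in range(3)]
    # Inner KL over pairs (y1, y2): the diagonal distribution
    # 1{y1 = y2} p(y1|x) versus the product p(y1|x) p(y2|x).
    # Off-diagonal terms contribute 0, so only y1 = y2 = y survives:
    d = sum(p * math.log(p / (p * p)) for p in p_y_given_x)
    total += p_x[x] * d  # E_{X ~ p(x)} of the inner divergence

# Compare with H(Y|X) = -E[log p(X,Y)/p(X)]  (in nats).
h_cond = -sum(p_xy[x][y] * math.log(p_xy[x][y] / p_x[x])
              for x in range(2) for y in range(3))
assert abs(total - h_cond) < 1e-12
```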