Zero KL-divergence $\Rightarrow$ same distribution?


I am looking at relative entropy (KL-divergence) and trying to prove, at least in the case of probability mass functions $q,\,p$ with possible states $x\in A$, that if:

$$ D(p \,\lVert\,q)=\sum_{x\in A}p\left(x\right)\,\log\left[\frac{p\left(x\right)}{q\left(x\right)}\right] = 0 $$

then $p\left(x\right)=q\left(x\right)\: \forall x \in A$.

I can see that the converse is true, and my book (Cover & Thomas, Elements of Information Theory) does state that the above result is also valid, but the proof there is, in my opinion, incomplete. The authors use concavity of $\log$ to prove that $D(p \,\lVert\,q)\ge 0$ (happy there). They then state that the only way to have equality, as opposed to strict inequality, is for $\log\left[\frac{p\left(x\right)}{q\left(x\right)}\right]$ to be not strictly concave. I am sort of OK with that (not too happy), but then the authors state that the only way to get that is to have $\frac{p\left(x\right)}{q\left(x\right)}=\text{const}$. I am not sure how to make that last jump.

Can anyone help, or suggest an alternative proof? Thanks.
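For concreteness, here is a small numerical check of the claim (a sketch using NumPy; the helper `kl_divergence` and the example distributions are my own, not from the book):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with the convention 0*log(0/q) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]
print(kl_divergence(p, p))                # 0.0     (identical distributions)
print(kl_divergence(p, [0.4, 0.4, 0.2]))  # ~0.025  (> 0 for distinct distributions)
```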

BEST ANSWER

"The only way to have equality is to have $\log \frac{p(x)}{q(x)}$ being not strictly concave" is a weird way to say something here, so you may have misread the proof, or it may be badly worded. However, the idea is to figure out the equality case of the inequality we used to prove that KL-divergence is nonnegative.

To begin with, here's a proof that KL-divergence is nonnegative, which is probably essentially the same as the one you read. Let $f(x) = -\log x = \log \frac1x$; this is a strictly convex function on the positive reals. Then by Jensen's inequality $$ D(p\,\|\,q) = \sum_{x \in A} p(x) f\left(\frac{q(x)}{p(x)}\right) \ge f\left(\sum_{x \in A} p(x) \frac{q(x)}{p(x)}\right) = f(1) = 0. $$ For a strictly convex function like $f(x)$, assuming that the weights $p(x)$ are all positive, equality holds if and only if the inputs to $f$ are all equal, which directly implies $\frac{q(x)}{p(x)}$ is constant and therefore $p(x)=q(x)$ for all $x$.
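As a quick sanity check of the equality case (again a sketch; the helper `kl` and the random trials are mine, not from the book): strictly positive $p$ and $q$ whose ratio is non-constant always give a strictly positive divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Random strictly positive distributions: the ratio q(x)/p(x) is (almost surely)
# non-constant, so Jensen's inequality is strict and D(p||q) > 0 in every trial.
for _ in range(5):
    p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    print(kl(p, q))
```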

We have to be careful if $p(x)=0$ for some inputs $x$. Such values of $x$ are defined to contribute nothing to the KL-divergence, so essentially we have a sum over a different set $A' = \{x \in A : p(x) > 0\}$. The weights $p(x)$ still sum to $1$ over $A'$, so Jensen's inequality still applies, but on the right-hand side we now get $$ f\left(\sum_{x \in A'} p(x) \frac{q(x)}{p(x)}\right) = f\left(\sum_{x \in A'} q(x)\right). $$ But now, if the sum over $A'$ is some value $s<1$, then $f(s) > 0$, and we conclude $D(p\,\|\,q) > 0$. So the sum of $q(x)$ over $A'$ must be $1$. We conclude that when the KL-divergence is $0$, $p(x)=0 \implies q(x)=0$ for all $x \in A$. In the other direction, for the KL-divergence to even be finite, we need $q(x)=0 \implies p(x)=0$ for all $x\in A$. So we can throw away any values where $p(x)=q(x)=0$, and now we're back in the simple case, where all the weights in Jensen's inequality are positive.
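To illustrate the zero-mass case (hypothetical numbers of my own choosing, not from the answer): if $p$ assigns probability $0$ to an outcome that $q$ does not, then $\sum_{x\in A'} q(x) < 1$ and the Jensen bound is already strictly positive.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# p puts no mass on the third outcome, but q does, so q sums to 0.9 over
# the support A' of p.  The Jensen bound is f(0.9) = -log(0.9) > 0.
p = [0.6, 0.4, 0.0]
q = [0.5, 0.4, 0.1]
print(kl(p, q))        # ~0.109, strictly positive
print(-np.log(0.9))    # ~0.105, the lower bound from Jensen's inequality
```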