KL divergence as information required to update a distribution?


I feel like the KL divergence should have an interpretation along the lines of,

"$D_\mathrm{KL}(q \Vert p)$ is the amount of information you need to update the distribution $p(x)$ to the distribution $q(x)$,"

but I don't know how to make this clear and demonstrate this in terms of things like bits per message.

For instance, if $p$ and $q$ are the same, you don't need to transmit any bits to someone who knows $p(x)$ to teach them $q(x)$, whereas if $p$ and $q$ are very different you need to send many bits to tell them how to update $p$ to $q$. But this is just vague intuition.

Is there a thought experiment that makes this more precise — some result from information theory about "updating" a distribution $p$ to a distribution $q$?


p.s. I know the usual interpretations of KL divergence as "expected log likelihood ratio when testing $p$ vs. $q$" and the "expected excess number of bits used to transmit a sample from $q$ when using a code optimized for $p$ instead of $q$". But I don't quite see how either of these is equivalent to "the number of bits needed to update $p$ to $q$".
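The second interpretation mentioned above is easy to verify numerically. The sketch below (with two arbitrary example distributions I chose for illustration) checks that the expected excess code length from using a code optimized for $p$ on samples from $q$ equals $D_\mathrm{KL}(q \Vert p)$ exactly:

```python
import math

# Coding interpretation of KL divergence: if symbols are drawn from q
# but encoded with an (idealized) code optimized for p, with codeword
# lengths -log2 p(x), the expected excess length per symbol over the
# optimal code for q is exactly D_KL(q || p).
q = [0.5, 0.25, 0.25]   # true distribution (example values)
p = [0.8, 0.1, 0.1]     # distribution the code was built for

kl = sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p))

# Expected code lengths under q, in bits per symbol.
len_with_p_code = sum(qi * -math.log2(pi) for qi, pi in zip(q, p))
len_with_q_code = sum(qi * -math.log2(qi) for qi in q)  # = H(q)

excess = len_with_p_code - len_with_q_code
print(round(kl, 6), round(excess, 6))  # the two values agree
```

Note this measures the cost of *ignoring* the difference between $p$ and $q$, which is not obviously the same thing as the cost of *communicating* that difference — the gap the question is asking about.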

1 Answer

I don't think there is such an interpretation that arises naturally.

Even for the entropy $H(X)$ there is no such natural interpretation. Entropy tells you how many bits you need, on average, to encode outputs drawn from a given distribution — not how many bits you need to specify the distribution itself. After all, you could have a distribution whose specification depends on a real parameter (say, a Poisson distribution); that parameter requires an infinite number of bits to specify exactly, yet the entropy is finite, say 2.5 bits or whatever.
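The Poisson example can be checked directly. A minimal sketch, using an irrational rate parameter as a stand-in for "infinitely many bits to specify" (the cutoff of 200 terms is my own truncation choice; the tail contribution is negligible for small rates):

```python
import math

def poisson_entropy_bits(lam, cutoff=200):
    """Entropy in bits of a Poisson(lam) distribution, by summing
    -p(k) * log2 p(k) over k = 0..cutoff-1."""
    h = 0.0
    log_p = -lam  # natural log of the pmf at k = 0
    for k in range(cutoff):
        p = math.exp(log_p)
        if p > 0.0:
            h -= p * math.log2(p)
        log_p += math.log(lam) - math.log(k + 1)  # advance to k + 1
    return h

# lam = pi cannot be written down in finitely many bits,
# yet the entropy of the resulting distribution is a small finite number.
print(poisson_entropy_bits(math.pi))
```

So "bits needed to describe the distribution" and "entropy of the distribution" come apart completely, which is the answer's point.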