KL divergence as information required to update a distribution?


I feel like the KL divergence should have an interpretation along the lines of,

"$D_\mathrm{KL}(q \Vert p)$ is the amount of information you need to update the distribution $p(x)$ to the distribution $q(x)$,"

but I don't know how to make this clear and demonstrate this in terms of things like bits per message.

For instance, if $p$ and $q$ are the same, you don't need to transmit any bits to someone who knows $p(x)$ to teach them $q(x)$, whereas if $p$ and $q$ are very different you need to send many bits to tell them how to update $p$ to $q$. But this is just vague intuition.

Is there a thought experiment that makes this more precise — some result from information theory about "updating" a distribution $p$ to a distribution $q$?


p.s. I know the usual interpretations of KL divergence as "expected log likelihood ratio when testing $p$ vs. $q$" and the "expected excess number of bits used to transmit a sample from $q$ when using a code optimized for $p$ instead of $q$". But I don't quite see how either of these is equivalent to "the number of bits needed to update $p$ to $q$".
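The second interpretation mentioned above is easy to verify numerically. The sketch below (with two arbitrary example distributions I chose for illustration) checks that the expected excess code length from using a code optimized for $p$ on samples from $q$ equals $D_\mathrm{KL}(q \Vert p)$ exactly:

```python
import math

# Coding interpretation of KL divergence: if symbols are drawn from q
# but encoded with an (idealized) code optimized for p, with codeword
# lengths -log2 p(x), the expected excess length per symbol over the
# optimal code for q is exactly D_KL(q || p).
q = [0.5, 0.25, 0.25]   # true distribution (example values)
p = [0.8, 0.1, 0.1]     # distribution the code was built for

kl = sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p))

# Expected code lengths under q, in bits per symbol.
len_with_p_code = sum(qi * -math.log2(pi) for qi, pi in zip(q, p))
len_with_q_code = sum(qi * -math.log2(qi) for qi in q)  # = H(q)

excess = len_with_p_code - len_with_q_code
print(round(kl, 6), round(excess, 6))  # the two values agree
```

Note this measures the cost of *ignoring* the difference between $p$ and $q$, which is not obviously the same thing as the cost of *communicating* that difference — the gap the question is asking about.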

1 Answer

I don't think there is such an interpretation that arises naturally.

Even for the entropy $H(X)$ there is no such natural interpretation. Entropy tells you how many bits you need, on average, to encode outputs drawn from a given distribution — not how many bits you need to specify the distribution itself. After all, you could have a distribution whose specification depends on a real parameter (say, a Poisson distribution); that parameter requires an infinite number of bits to specify exactly, yet the entropy is finite, say 2.5 bits or whatever.
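The Poisson example can be checked directly. A minimal sketch, using an irrational rate parameter as a stand-in for "infinitely many bits to specify" (the cutoff of 200 terms is my own truncation choice; the tail contribution is negligible for small rates):

```python
import math

def poisson_entropy_bits(lam, cutoff=200):
    """Entropy in bits of a Poisson(lam) distribution, by summing
    -p(k) * log2 p(k) over k = 0..cutoff-1."""
    h = 0.0
    log_p = -lam  # natural log of the pmf at k = 0
    for k in range(cutoff):
        p = math.exp(log_p)
        if p > 0.0:
            h -= p * math.log2(p)
        log_p += math.log(lam) - math.log(k + 1)  # advance to k + 1
    return h

# lam = pi cannot be written down in finitely many bits,
# yet the entropy of the resulting distribution is a small finite number.
print(poisson_entropy_bits(math.pi))
```

So "bits needed to describe the distribution" and "entropy of the distribution" come apart completely, which is the answer's point.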