Is the classical interpretation of relative entropy wrong?


The relative entropy between two probability distributions $p=\{p_i\}$ and $q=\{q_i\}$ (also called KL divergence) is given by $$D_{KL}(p\ ||\ q)=\sum_i p_i \log\left(\frac{p_i}{q_i}\right)$$
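As a concrete sketch (with made-up example distributions), the sum above can be computed directly, using the usual conventions that $0\log(0/q_i)=0$ and that the divergence is infinite whenever some $q_i=0$ while $p_i>0$:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; inf if some q_i = 0 while p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # convention: 0 * log(0 / q_i) = 0
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

# Hypothetical distributions over a 3-symbol alphabet.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))  # 0.25
print(kl_divergence(p, p))  # 0.0
```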

The classical interpretation of $D_{KL}(p\ ||\ q)$ is that it gives the expected number of extra bits per symbol needed to encode samples from $p$ when the average code length is optimized for $q$, compared to when it is optimized for $p$ itself.

This is clearly wrong, since $D_{KL}$ can be infinite (e.g. when some $q_i=0$ while $p_i>0$), whereas no actual code assigns any symbol an infinite codeword. I have thought of a correct interpretation, but I would like to know if there are alternatives to it. Here is my suggestion:

Given a probability distribution, an algorithm that creates codes whose average length approaches the entropy lower bound (with an error of at most 1 bit) is Shannon coding. It is perhaps the first algorithm that comes to mind, and the nice thing about it is that if a symbol has probability $p_i$, then its code length will be $\left\lceil \log_2\left(\frac{1}{p_i}\right)\right\rceil$.
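A short sketch of this property, for a hypothetical distribution: the Shannon code lengths $\lceil\log_2(1/p_i)\rceil$ satisfy the Kraft inequality (so a prefix code with those lengths exists), and the resulting expected length lands within 1 bit of the entropy:

```python
import math

p = [0.4, 0.3, 0.3]  # hypothetical distribution

# Shannon code lengths: symbol i gets ceil(log2(1/p_i)) bits.
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]

entropy = sum(pi * math.log2(1 / pi) for pi in p)
expected_len = sum(pi * li for pi, li in zip(p, lengths))

# Kraft inequality holds, so a prefix code with these lengths exists...
assert sum(2 ** -l for l in lengths) <= 1
# ...and the expected length is within 1 bit of the entropy lower bound.
assert entropy <= expected_len < entropy + 1
```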

Now, a correct (but not fully satisfying) interpretation is that $D_{KL}(p\ ||\ q)$ is, within a maximum error of 2 bits to account for the ceilings, the expected number of extra bits we incur if we run Shannon coding for data from $p$ using the code lengths derived from $q$, compared to running it using the lengths derived from $p$. I.e., going from

$$\sum_i p_i \left\lceil\log_2\left(\frac{1}{q_i}\right)\right\rceil- \sum_i p_i \left\lceil\log_2\left(\frac{1}{p_i}\right)\right\rceil$$

to

$$\sum_i p_i \log_2\left(\frac{1}{q_i}\right)- \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
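One can check this numerically, for assumed example distributions with full support (so every log is finite): each ceiling adds less than 1 bit to its sum, so the two displayed quantities differ by strictly less than 2 bits.

```python
import math

# Hypothetical distributions with full support.
p = [0.4, 0.35, 0.25]
q = [0.2, 0.3, 0.5]

# Expected Shannon code lengths built from q's and from p's probabilities.
ceil_cost_q = sum(pi * math.ceil(math.log2(1 / qi)) for pi, qi in zip(p, q))
ceil_cost_p = sum(pi * math.ceil(math.log2(1 / pi)) for pi in p)
ceil_diff = ceil_cost_q - ceil_cost_p  # extra bits under Shannon coding

# Exact KL divergence in bits.
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# Each ceiling contributes less than 1 bit, so the ceiling-based
# difference stays within 2 bits of the exact divergence.
assert abs(ceil_diff - kl) < 2
```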