What is the reasoning behind the definition of relative entropy?


I do not understand the notion of relative entropy.

Relative entropy: $D_{KL}(P\|Q) = \sum_{i} P(i)\log \frac{P(i)}{Q(i)}$.

I am trying to get some intuition for why it looks the way it does. I can see that it behaves sensibly: if I take $Q=P$, then $D_{KL}(P\|Q)=0$, so the distance between identical distributions is 0.
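A minimal numerical sketch of the definition (Python with NumPy, using base-2 logarithms and two made-up distributions) that checks this property, and also shows that the quantity is not symmetric in $P$ and $Q$:

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) in bits for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(i) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.75, 0.25]
q = [0.5, 0.5]

print(kl_divergence(p, p))   # 0.0    -- identical distributions give zero
print(kl_divergence(p, q))   # ~0.189 bits
print(kl_divergence(q, p))   # ~0.208 bits -- not symmetric, so not a metric
```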

I tried to find some intuition on Wikipedia: "KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$."

A very confusing description, and it gives no clue why the summand is actually $P(i)\log \frac{P(i)}{Q(i)}$.

I would appreciate it if someone could explain the reasoning behind the definition of relative entropy.


There are 2 answers below.

BEST ANSWER

In information theory, relative entropy $D(P\|Q)$ is the average number of extra bits per letter needed to encode a source using a code optimized for the distribution $Q=(q_1,\dots,q_n)$ when the true underlying distribution is $P=(p_1,\dots,p_n)$. $\quad (*)$

The codeword lengths that minimize the expected length subject to unique decodability (which is equivalent to the Kraft inequality $\sum_i 2^{-l_i}\le 1$) under the model $Q$ are given by $$l_i^*=\log \frac{1}{q_i}.$$

Hence the expected code length (per letter) when coding with $Q$ while the source actually follows $P$ is \begin{eqnarray} \sum p_i\log \frac{1}{q_i} &=& \sum p_i\log \frac{p_i}{p_i\, q_i}\\ &=& \sum p_i\log \frac{1}{p_i}+\sum p_i\log \frac{p_i}{q_i}\\ &=& H(P)+D(P\|Q), \end{eqnarray} i.e. the optimal length $H(P)$ plus an overhead of $D(P\|Q)$ bits. Hence the statement $(*)$. Relative entropy also has other interpretations in probability theory and statistics.
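To make this concrete, here is a small numerical check of the identity above (Python with NumPy; the two distributions are made up, and $Q$ is chosen dyadic so the code lengths $\log \frac{1}{q_i}$ come out as whole bits):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution (made up)
q = np.array([0.125, 0.125, 0.25, 0.5])   # wrong model used to design the code

lengths = np.log2(1.0 / q)                # optimal lengths for q: [3, 3, 2, 1] bits
print(np.sum(2.0 ** -lengths))            # Kraft sum = 1.0, so such a code exists

expected_len = np.sum(p * lengths)            # average bits per letter with the wrong code
entropy_p    = np.sum(p * np.log2(1 / p))     # H(P): the best achievable average
kl           = np.sum(p * np.log2(p / q))     # D(P || Q)

print(expected_len)            # 2.625
print(entropy_p + kl)          # 2.625 as well: expected length = H(P) + D(P||Q)
```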

ANOTHER ANSWER

As you can see in this video, the cross entropy $H(p,q)$ can be interpreted as the average message length sent to the receiver when the code is based on the predicted distribution $q$. In information theory, to compress a given message as effectively as possible, we look at its probability distribution. So basically, the entropy $H(p)$ of the true distribution is our ultimate limit. For example, if I was sending 4 bits of data on average to the receiver about the weather in Istanbul, but I knew the entropy of my message words was actually 3.2 bits, it would mean I could compress my messages further.

Thus, the difference between the average message length that was sent and the actual entropy of the message is equal to the KL divergence: $H(p,q) - H(p) = D_{KL}(p\|q)$.
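A small numerical sketch of this gap (Python with NumPy; the four weather states and both distributions are made up for illustration and are not the 4-bit/3.2-bit figures above):

```python
import numpy as np

# Four hypothetical weather states; p is the (made-up) true distribution,
# q is the uniform model behind a naive fixed 2-bit code.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

avg_sent = np.sum(p * np.log2(1 / q))   # average message length sent: 2.0 bits
limit    = np.sum(p * np.log2(1 / p))   # entropy H(p), the ultimate limit: 1.75 bits
gap      = np.sum(p * np.log2(p / q))   # D_KL(p || q): 0.25 bits

print(avg_sent - limit, gap)            # both 0.25 -- the overhead is the KL divergence
```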
