I do not understand the notion of relative entropy.
Relative Entropy. $D_{KL}(P||Q) = \sum_{i}^{}P(i)\log \frac{P(i)}{Q(i)}$.
I am trying to get some intuition for why it looks the way it does. I see one sanity check that works: if I take $Q=P$ then $D_{KL}(P||Q)=0$, so the divergence between identical distributions is 0.
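To make the definition concrete, here is a minimal numeric sketch of the formula (using base-2 logarithms, so the result is in bits; the distributions `p` and `q` are made-up examples):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) in bits: sum of p_i * log2(p_i / q_i)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # positive when Q differs from P
print(kl_divergence(q, p))  # generally differs: KL is not symmetric
```

Note that the asymmetry already shows $D_{KL}$ is not a distance in the metric sense, even though it vanishes for identical distributions.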
I tried to find some intuition in wikipedia: KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P.
A very confusing description, and it gives no clue why the summand is actually $P(i)\log \frac{P(i)}{Q(i)}$.
I would appreciate it if someone could give the reasoning behind the definition of relative entropy.
In information theory, the relative entropy $D(P\|Q)$ is the average number of extra bits per letter needed to encode a source when using a code optimized for a distribution $Q=(q_1,\dots,q_n)$ while the true underlying distribution is $P=(p_1,\dots,p_n)$. $\qquad(*)$
The codeword lengths that minimize the expected length under $Q$, subject to unique decodability (equivalently, subject to the Kraft inequality $\sum_i 2^{-l_i}\le 1$, and ignoring the constraint that lengths be integers), are $$l_i^*=\log \frac{1}{q_i}.$$
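For actual codes the lengths must be integers; rounding up to $l_i=\lceil \log_2(1/q_i)\rceil$ (the Shannon code) still satisfies the Kraft inequality, so a uniquely decodable code with those lengths exists. A small sketch, with a made-up distribution $Q$:

```python
import math

# Assumed example distribution Q; Shannon code lengths l_i = ceil(log2(1/q_i)).
q = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(math.log2(1 / qi)) for qi in q]
print(lengths)  # [2, 2, 3, 4]

# Kraft inequality: sum of 2^(-l_i) <= 1 guarantees such a code exists.
kraft_sum = sum(2 ** -l for l in lengths)
print(kraft_sum)  # 0.6875 <= 1
```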
Hence the expected code length per letter when coding $P$-distributed data with the code optimized for $Q$ is \begin{eqnarray} \sum_i p_i\log \frac{1}{q_i} &=&\sum_i p_i\log \frac{p_i}{p_i\cdot q_i}\\ &=& \sum_i p_i\log \frac{1}{p_i}+\sum_i p_i\log \frac{p_i}{q_i}\\ &=& H(P)+D(P\|Q), \end{eqnarray} i.e. the unavoidable $H(P)$ bits plus $D(P\|Q)$ extra bits per letter. Hence the statement $(*)$. Relative entropy has other interpretations in probability theory and statistics as well.
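The identity in the derivation above, cross-entropy $=H(P)+D(P\|Q)$, can be checked numerically (the distributions are made-up examples; logs are base 2, so everything is in bits):

```python
import math

def entropy(p):
    """H(P) = sum of p_i * log2(1/p_i), the optimal rate for P."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected length when P-distributed data uses lengths log2(1/q_i)."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Relative entropy D(P || Q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# cross_entropy(p, q) == entropy(p) + kl(p, q), up to float rounding
print(cross_entropy(p, q), entropy(p) + kl(p, q))
```

In particular the extra cost $D(P\|Q)$ is zero exactly when the code matches the source, i.e. $Q=P$.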