What is the reasoning behind the definition of relative entropy?


I do not understand the notion of relative entropy.

Relative entropy: $D_{KL}(P\|Q) = \sum_{i} P(i)\log \frac{P(i)}{Q(i)}$.

I am trying to get some intuition for why it looks the way it does. I can see that it behaves sensibly: if I take $Q=P$, then $D_{KL}(P\|Q)=0$, so the distance between identical distributions is 0.
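A minimal numerical sketch of the definition (Python with NumPy, using base-2 logarithms and two made-up distributions) that checks this property, and also shows that the quantity is not symmetric in $P$ and $Q$:

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(P || Q) in bits for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(i) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.75, 0.25]
q = [0.5, 0.5]

print(kl_divergence(p, p))   # 0.0    -- identical distributions give zero
print(kl_divergence(p, q))   # ~0.189 bits
print(kl_divergence(q, p))   # ~0.208 bits -- not symmetric, so not a metric
```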

I tried to find some intuition on Wikipedia: "KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$."

A very confusing description, and it gives no clue why the summand is actually $P(i)\log \frac{P(i)}{Q(i)}$.

I would appreciate it if someone could explain the reasoning behind the definition of relative entropy.


There are 2 answers below.

BEST ANSWER

In information theory, relative entropy $D(P\|Q)$ is the average number of extra bits per letter needed to encode a source using a code optimized for the distribution $Q=(q_1,\dots,q_n)$ when the true underlying distribution is $P=(p_1,\dots,p_n)$. $\quad (*)$

The codeword lengths that minimize the expected length subject to unique decodability (which is equivalent to the Kraft inequality $\sum_i 2^{-l_i}\le 1$) under the model $Q$ are given by $$l_i^*=\log \frac{1}{q_i}.$$

Hence the expected code length (per letter) when coding with $Q$ while the source actually follows $P$ is \begin{eqnarray} \sum p_i\log \frac{1}{q_i} &=& \sum p_i\log \frac{p_i}{p_i\, q_i}\\ &=& \sum p_i\log \frac{1}{p_i}+\sum p_i\log \frac{p_i}{q_i}\\ &=& H(P)+D(P\|Q), \end{eqnarray} i.e. the optimal length $H(P)$ plus an overhead of $D(P\|Q)$ bits. Hence the statement $(*)$. Relative entropy also has other interpretations in probability theory and statistics.
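To make this concrete, here is a small numerical check of the identity above (Python with NumPy; the two distributions are made up, and $Q$ is chosen dyadic so the code lengths $\log \frac{1}{q_i}$ come out as whole bits):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution (made up)
q = np.array([0.125, 0.125, 0.25, 0.5])   # wrong model used to design the code

lengths = np.log2(1.0 / q)                # optimal lengths for q: [3, 3, 2, 1] bits
print(np.sum(2.0 ** -lengths))            # Kraft sum = 1.0, so such a code exists

expected_len = np.sum(p * lengths)            # average bits per letter with the wrong code
entropy_p    = np.sum(p * np.log2(1 / p))     # H(P): the best achievable average
kl           = np.sum(p * np.log2(p / q))     # D(P || Q)

print(expected_len)            # 2.625
print(entropy_p + kl)          # 2.625 as well: expected length = H(P) + D(P||Q)
```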

ANOTHER ANSWER

As you can see in this video, the cross entropy $H(p,q)$ can be interpreted as the average message length sent to the receiver when the code is based on the predicted distribution $q$. In information theory, to compress a given message as effectively as possible, we look at its probability distribution. So basically, the entropy $H(p)$ of the true distribution is our ultimate limit. For example, if I was sending 4 bits of data on average to the receiver about the weather in Istanbul, but I knew the entropy of my message words was actually 3.2 bits, it would mean I could compress my messages further.

Thus, the difference between the average message length that was sent and the actual entropy of the message is equal to the KL divergence: $H(p,q) - H(p) = D_{KL}(p\|q)$.
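A small numerical sketch of this gap (Python with NumPy; the four weather states and both distributions are made up for illustration and are not the 4-bit/3.2-bit figures above):

```python
import numpy as np

# Four hypothetical weather states; p is the (made-up) true distribution,
# q is the uniform model behind a naive fixed 2-bit code.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

avg_sent = np.sum(p * np.log2(1 / q))   # average message length sent: 2.0 bits
limit    = np.sum(p * np.log2(1 / p))   # entropy H(p), the ultimate limit: 1.75 bits
gap      = np.sum(p * np.log2(p / q))   # D_KL(p || q): 0.25 bits

print(avg_sent - limit, gap)            # both 0.25 -- the overhead is the KL divergence
```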
