Operational Meaning of Relative Entropy


Is there an operational meaning to understand the non-negativity of relative entropy between two probability distributions? I understand the mathematical argument/proof. But I want to know if there is an intuitive way to remember that relative entropy cannot be negative through some operational task.


On doing a web search for "relative entropy interpretation", these lecture notes come up. They suggest that the relative entropy $D(p\|q)$, for probability measures $p$ and $q$, is the "information we gain about a random variable $X$ if we originally thought that $X \sim q$ and then learn that $X \sim p$". Getting more data/samples/information always means we know more, and so our "knowledge"/"information" can't go down---hence the non-negativity. This is still rather abstract, for me at least, so let me try to elaborate in a helpful way...
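As a quick numerical sanity check of the non-negativity, here is a minimal Python sketch (my own illustration, not from the lecture notes): a direct computation of $D(p\|q) = \sum_i p_i \log(p_i/q_i)$ for two small hand-picked distributions.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats, assuming supp(p) is contained in supp(q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Gibbs' inequality: D(p || q) >= 0, with equality iff p == q
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(kl_divergence(p, q))   # strictly positive, since p != q
print(kl_divergence(p, p))   # exactly 0
```

Note the asymmetry: $D(p\|q) \ne D(q\|p)$ in general, which is consistent with the "information gained" reading (updating from $q$ to $p$ is not the same as updating from $p$ to $q$).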


Consider a simple random walk $(X_t)_{t\ge0}$ on the cycle $[n] = \{1,...,n\}$ (with $i$ and $j$ connected if and only if $|i-j| \equiv 1 \pmod n$); take $n$ odd so that the walk is aperiodic. This has as its invariant distribution the uniform distribution on $[n]$, which I'll denote $\pi_n$. Hence (it can be shown that) $D(\mathcal L(X_t)\|\pi_n) \to 0$ as $t \to \infty$ (with $n$ fixed), where $\mathcal L(X_t)$ is the law/distribution of $X_t$.
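This convergence is easy to see numerically. Below is a small Python sketch (my own, with states relabelled $0,\dots,n-1$ for indexing): we build the transition matrix of the walk, push forward a point mass, and watch $D(\mathcal L(X_t)\|\pi_n)$ shrink towards $0$.

```python
import numpy as np

n = 7                            # odd n, so the walk on the cycle is aperiodic
P = np.zeros((n, n))
for i in range(n):               # transition matrix of the simple random walk
    P[i, (i - 1) % n] = 0.5
    P[i, (i + 1) % n] = 0.5

pi = np.full(n, 1.0 / n)         # uniform invariant distribution pi_n
mu0 = np.zeros(n); mu0[0] = 1.0  # start the walk at a fixed state

def kl(p, q):
    """D(p || q) in nats; zero-mass terms of p contribute 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

divs = []
for t in [0, 1, 10, 100, 1000]:
    law_t = mu0 @ np.linalg.matrix_power(P, t)   # law of X_t
    divs.append(kl(law_t, pi))

print(divs)   # decreases towards 0 as t grows
```

At $t = 0$ the divergence is $\log n$ (a point mass against the uniform distribution), and by the data-processing inequality it is non-increasing in $t$.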

Now consider the following "thought-experiment". Suppose the walk has run for some large time $t$, so large that I believe $X_t$ is exactly uniform (ie $\mathcal L(X_t) = \pi_n$). You, though, are a better probabilist: you know that while $\mathcal L(X_t)$ converges to $\pi_n$, in various senses, for every fixed $t$ we have $\mathcal L(X_t) \ne \pi_n$.

Now, I sample $X_t$ (for this large $t$) lots of times and see, lo and behold, that its distribution is not uniform, though it is pretty close. I have "learnt" some information; the amount is precisely $D(\mathcal L(X_t)\|\pi_n)$, and so this number must be non-negative.
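The sampling step of this thought-experiment can also be sketched in Python (a toy illustration with sizes I've made up; states again relabelled $0,\dots,n-1$). The empirical distribution of many independent copies of $X_t$ is close to, but not exactly, uniform, and its relative entropy against the uniform distribution is a small non-negative number. (It is a biased stand-in for $D(\mathcal L(X_t)\|\pi_n)$, but the non-negativity is exactly the point.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, samples = 7, 50, 100_000   # toy sizes for illustration

# Simulate X_t many times: each run is t steps of +/-1 from state 0, mod n.
steps = rng.choice([-1, 1], size=(samples, t))
X_t = steps.sum(axis=1) % n

p_hat = np.bincount(X_t, minlength=n) / samples   # the "learnt" distribution
pi = np.full(n, 1.0 / n)

# Information gained on updating from pi to p_hat: D(p_hat || pi) >= 0
mask = p_hat > 0
D = float(np.sum(p_hat[mask] * np.log(p_hat[mask] / pi[mask])))
print(p_hat)   # close to, but not exactly, uniform
print(D)       # small but non-negative
```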


Wikipedia's article on relative entropy also has a section called "Interpretations", which may help.