Intuitive interpretation of entropy


I'm trying to understand entropy and KL divergence. While they make sense in a simple case such as a coin flip, I struggle to wrap my head around more complicated cases where the information content is not a whole number of bits. I am trying to imagine it in the form of a binary tree, where $$ \log_2\left(\frac{1}{p(x)}\right) $$ is the depth of the leaf in the binary tree, i.e. the number of moves we would have to take to reach the leaf from the root. However, if we have something like $$ p(x_1) = \frac{7}{8}, \qquad p(x_2) = \frac{1}{8}, $$ I struggle to visualize the meaning beyond the purely functional. How can I interpret having $$ \log_2\left(\frac{8}{7}\right) \approx 0.193 $$ "bits" of information? Is there a way to visualize this, preferably in the style of binary tree codes?

Perhaps focusing on the definition of entropy as an expected value may help you.

Remember that a continuous random variable (RV) $X$ with a distribution $p(x)$ has an expected value given by $$ \langle X\rangle = \int{x\, p(x)\, dx}. $$

By analogy, and referring to the definition of entropy (here I'm using the continuous case), one has that $$ H = \langle -\log(p(x)) \rangle = -\int{p(x)\log(p(x))dx} $$
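In the discrete case the same expectation is just a sum over the outcomes. A minimal sketch (the function name `entropy_bits` is my own label, using base-2 logs so the result is in bits):

```python
import math

def entropy_bits(probs):
    """Discrete analogue of H = <-log p(x)>: sum over x of p(x) * -log2(p(x))."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy_bits([1.0]))       # certain event: 0.0 bits
```

Note that a zero-probability event contributes nothing to the sum, which matches the convention $0 \log 0 = 0$.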

Now, what's the meaning of this unusual RV, $-\log(p(x))$? First, note that the minus sign and the log allow us to express $H$ as $$ H = \left\langle \log\left(\frac{1}{p(x)}\right)\right\rangle $$

Look at the expression above and think about the magnitude of $p$ in two extreme cases:

  • A very common event $x$, which roughly leads to $p(x) \approx 1$, and then $\log(1/p(x)) \approx 0$; and
  • A very rare event $x$, which roughly leads to $p(x) \approx 0$, making $\log(1/p(x))$ grow significantly.
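A quick numerical check of these two extremes (a sketch; the name `surprisal` is my own label for $-\log_2 p$):

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(surprisal(0.999))  # common event: close to 0 bits
print(surprisal(0.001))  # rare event: about 9.97 bits
```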

Well, think about the amount of information these two extreme cases carry: which one of them is more informative than the other? The common event, which is conceptually ordinary; or the rare event, which by its very definition tells us that something unusual has happened?

Conceptually speaking, then, $-\log(p(x))$ may be seen as the amount of information carried by the event $x$. Therefore, $H$ corresponds to the average amount of information carried by the system, since the integral performs a sum over all events, each weighted by its probability.
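Applied to the distribution in your question, this reads as a probability-weighted average of the surprisals (a sketch, assuming base-2 logs so the answer comes out in bits):

```python
import math

# The question's distribution: p(x1) = 7/8, p(x2) = 1/8
probs = [7/8, 1/8]

# Surprisal of each event, log2(1/p): about 0.193 bits and exactly 3 bits
surprisals = [math.log2(1 / p) for p in probs]

# Entropy is the probability-weighted average of those surprisals
H = sum(p * s for p, s in zip(probs, surprisals))
print(H)  # about 0.544 bits
```

So the puzzling 0.193 bits never stands alone: it only enters the entropy weighted by the 7/8 chance of seeing that event, averaged together with the 3 bits carried by the rare one.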