Intuitive explanation of entropy


I have bumped into entropy many times, but it has never been clear to me why we use this formula:

If $X$ is random variable then its entropy is:

$$H(X) = -\displaystyle\sum_{x} p(x)\log p(x).$$

Why are we using this formula? Where did this formula come from? I'm looking for the intuition. Is it because this function just happens to have some good analytical and practical properties? Is it just because it works? Where did Shannon get this from? Did he sit under a tree and entropy fell to his head like the apple did for Newton? How do you interpret this quantity in the real physical world?


Accepted answer:

We want to define a measure of the amount of information a discrete random variable produces. Our basic setup consists of an information source and a recipient. We can think of our recipient as being in some state. When the information source sends a message, the arrival of the message causes the recipient to go to a different state. This "change" is exactly what we want to measure.

Suppose we have a set of $n$ events with respectively the following probabilities

$$p_1,p_2,...,p_n.$$

We want a measure of how much choice is involved in the outcome, i.e. how uncertain we are.

Intuitively, it should satisfy the following four conditions.

Let $H$ be our "measure".

  1. $H$ is continuous in the $p_i$.

  2. If $p_i = 1$ for some $i$, then $H$ is at its minimum value of $0$: there is no uncertainty.

  3. If $p_1 = p_2= \dots = p_n$, i.e. $p_i=\frac{1}{n}$, then $H$ is maximum. In other words, when every outcome is equally likely, the uncertainty is greatest, and hence so is the entropy.

  4. If a choice is broken down into two successive choices, the value of the original $H$ should be the weighted sum of the value of the two new ones.

    An example of this condition $4$ is that $$H\left(\frac1{2}, \frac1{3}, \frac{1}{6} \right) = H\left(\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} H\left(1 \right) + \frac{1}{2} H\left(\frac{2}{3}, \frac{1}{3} \right)$$

Here we decided to either take the a) first element or b) one of the other two elements. Then in a) we had no further decision, but for b) we had to decide which of those two to take.

The only $H$ satisfying the conditions above is of the form:

$$H = -K\sum^n_{i=1}p_i \log(p_i),$$

where $K$ is a positive constant (a choice of units).
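As a quick numerical sanity check (not part of the original answer; the helper name `H` is mine), here is a short Python sketch that verifies condition 4's decomposition example with $K = 1$ and $\log$ base $2$:

```python
import math

def H(probs, base=2):
    """Shannon entropy -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Condition 4: the entropy of the one-shot choice equals the weighted sum
# over the two-stage choice (first pick a branch, then choose within it).
lhs = H([1/2, 1/3, 1/6])
rhs = H([1/2, 1/2]) + 1/2 * H([1]) + 1/2 * H([2/3, 1/3])
print(abs(lhs - rhs) < 1e-12)  # True: the two decompositions agree
```

Note that $H(1) = 0$ (a certain outcome carries no uncertainty), so the first branch contributes nothing beyond the initial binary choice.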

To see that this definition gives what we intuitively would expect from a "measure" of information, we state the following properties of $H$.

  1. $H = 0 \iff p_i = 1$ for some $i$ (and $p_j = 0, \forall j \neq i$)
  2. $\forall n \in \mathbb{N}$, $H$ is maximum when $p_1 = \cdots = p_n = \frac{1}{n}$
  3. Suppose $x$ and $y$ are two random variables with $n$ and $m$ possible outcomes respectively, and let $p(i, j)$ be the probability that $x = i$ and $y = j$ jointly occur (i.e. occur at the same time).

    • $H(x, y) = −\sum_{i, j} p(i, j) \log(p(i, j))$

    • $H(x, y) \leq H(x) + H(y)$.

      With equality only if the occurrences are independent.

    • $H_x(y) = −\sum_{i, j} p(i, j) \log(p_i(j)) = H(x, y) − H(x),$ where $p_i(j)$ is the conditional probability of $y = j$ given $x = i$.

      The entropy of $y$ when $x$ is known.

    • $H(y) \geq H_x(y)$.

      The entropy of $y$ is never increased by knowing $x$.

  4. Any change towards equalization of the probabilities increases $H$. Greater uncertainty $\Rightarrow$ greater entropy.
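Properties 3 and 4 can be checked on a small example. The following Python sketch (my own illustration, with made-up joint probabilities) verifies subadditivity and the chain rule for a dependent pair $(x, y)$:

```python
import math

def H(probs, base=2):
    """Shannon entropy -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A joint distribution p(i, j) over x in {0, 1} and y in {0, 1}
# (arbitrary numbers chosen so that x and y are dependent).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.35}

px = [0.5, 0.5]        # marginal of x: sum of p(i, j) over j
py = [0.55, 0.45]      # marginal of y: sum of p(i, j) over i

Hxy = H(joint.values())
Hx, Hy = H(px), H(py)
Hx_y = Hxy - Hx        # conditional entropy H_x(y) via the chain rule

print(Hxy <= Hx + Hy)  # True: subadditivity (strict here, x and y dependent)
print(Hx_y <= Hy)      # True: knowing x never increases the entropy of y
```

With independent $x$ and $y$ the first inequality becomes an equality, matching the "equality only if independent" clause above.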


Another answer:

The postulates in the accepted answer are essentially the ones used in Shannon's original 1948 paper. If you skip over to Appendix 2 of that paper, you can find the remainder of the derivation.

  1. Derive the expression for $H \left(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n} \right)$ as $K \log n$.

  2. If all the $p_i$'s are rational, we can find an $m$ such that $m p_i \in \mathbb{N}, \forall i$. Now, use the grouping postulate (the decomposition condition) to derive the entropy formula.

  3. Using the continuity postulate (first postulate), you can directly extend the formula to the case where the $p_i$'s are not necessarily rational.
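Step 2 can be checked numerically. In this Python sketch (my own; the helper name `H` is an assumption, not from the answer), a uniform choice among $m = 6$ items is split into a choice among groups of sizes $m p_i$, so that $\log m = H(p) + \sum_i p_i \log(m p_i)$:

```python
import math

def H(probs, base=2):
    """Shannon entropy -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Rational probabilities p = (1/2, 1/3, 1/6): take m = 6, so every m*p_i
# is an integer.  The grouping postulate splits a uniform choice among m
# items into (a) picking a group with probability p_i, then (b) a uniform
# choice among the m*p_i items inside that group.
p = [1/2, 1/3, 1/6]
m = 6
lhs = math.log2(m)                                   # flat uniform choice
rhs = H(p) + sum(pi * math.log2(m * pi) for pi in p) # two-stage choice
print(abs(lhs - rhs) < 1e-12)  # True
```

Solving this identity for $H(p)$ yields exactly $-\sum_i p_i \log p_i$, which is how the formula falls out for rational probabilities.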

Another answer:

In my view, the most intuitive way to look at entropy, and I think also the most correct one, is simply as follows:

  1. If kilograms measure mass, seconds measure time, and meters measure distance, then what measures information? Say, how much information is inside a closed letter? In other words, if you open the letter and read it, how much information will you obtain?
  2. Shannon's entropy basically tries to answer that question. The answer is very simple and extremely intuitive: we measure the quantity of information by the number of questions that must be asked in order to fully discover all the unknowns hidden in that thing. So in the case of the letter, the amount of information it contains is the number of questions we need answered in order to fully know its contents.

But obviously there is a problem with (2), which is: how can the number of questions be meaningful? Someone may ask a single broad question like "what is inside the letter?" and fully obtain the letter. Or someone may ask really stupid questions like "what is the 1st letter?" then "what is the 2nd letter?", …, ending up with thousands of silly questions for a tiny letter!

Shannon's entropy solves this problem by standardising the questions, so that their number is stable and meaningful, and it does so by defining the qualities the questions must have. It turns out this is very easy!

Here is how the questions are standardised so that their quantity is stable:

The average number of perfect $n$-ary questions that, if answered, would fully resolve all unknowns about the subject.

An $n$-ary question is "perfect" if it splits the search space evenly into $n$ equal sub-spaces. So if a question is binary ($n=2$), then every perfect binary question must split the search space in half.

Wait, that looks like a balanced tree! A balanced $n$-ary tree splits the search space into $n$ equal parts every time we move down by a single level.

In a balanced $n$-ary tree, we have a root on top, then branches, until we finally reach the terminal leaf nodes. The number of levels you must descend to reach a leaf is $\log_n m$, where $m$ is the total number of items stored in the tree that you are looking up — this is where the $\log$ part in entropy comes from!
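Here is a toy Python illustration (my own, not from the answer) of that claim for binary questions: pinning down one item out of $m$ equally likely items with perfect "is it in the left half?" questions takes exactly $\log_2 m$ questions:

```python
import math

def questions_needed(items, target):
    """Count perfect binary questions needed to isolate target in items."""
    lo, hi = 0, len(items)          # target is somewhere in items[lo:hi]
    count = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        count += 1                  # one perfect binary question
        if items.index(target) < mid:
            hi = mid                # answer: "yes, left half"
        else:
            lo = mid                # answer: "no, right half"
    return count

items = list(range(16))             # m = 16 equally likely possibilities
print(questions_needed(items, 11))  # -> 4
print(math.log2(len(items)))        # -> 4.0, i.e. log2(m)
```

Each question halves the search space, which is exactly the "descend one level of a balanced binary tree" picture above.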

Now, let's have a closer look at entropy:

$$ H(\mathcal{X}) = -\sum_{x\in\mathcal{X}} p_x \log_n(p_x) $$

Now let's unwrap it piece by piece. 1st let's get rid of that negative:

$$ H(\mathcal{X}) = \sum_{x\in\mathcal{X}} p_x \log_n(\frac{1}{p_x}) $$

That's it for the negative! It was added there only because $\log_n(\frac{1}{p_x})$ doesn't look neat enough. Basically $\log_n(\frac{1}{p_x}) = -\log_n(p_x)$.

HOMEWORK 1: The $\sum_{x\in\mathcal{X}} p_x \ldots$ is just a weighted sum, i.e. an average. I leave it to you to figure out why we need it.

But what about $\log_n(\frac{1}{p_x})$? First let's remind ourselves what $p_x$ is:

$$ p_x = \frac{\text{number of times $x$ happened}}{\text{number of times everything happened}} $$

And what does $\frac{1}{p_x}$ mean?

$$\begin{split} \frac{1}{p_x} &= \frac{1}{\frac{\text{number of times $x$ happened}}{\text{number of times everything happened}}}\\ &= \frac{\text{number of times everything happened}}{\text{number of times $x$ happened}}\\ \end{split}$$

Wow! You see! It's getting very similar to the $m$ in the balanced $n$-ary tree example (i.e. the $\log_n m$ part above).

HOMEWORK 2: Now, I leave it to you to figure out how $\frac{\text{number of times everything happened}}{\text{number of times $x$ happened}}$ corresponds to $m$ in my balanced $n$-ary tree example earlier.

Now, once you solve HOMEWORK 1 and HOMEWORK 2, you'll fully understand Shannon's entropy! It's actually very simple and intuitive.