I have bumped into entropy many times, but it has never been clear to me why we use this formula:
If $X$ is a random variable, then its entropy is:
$$H(X) = -\displaystyle\sum_{x} p(x)\log p(x).$$
Why are we using this formula? Where did this formula come from? I'm looking for the intuition. Is it because this function just happens to have some good analytical and practical properties? Is it just because it works? Where did Shannon get this from? Did he sit under a tree and entropy fell to his head like the apple did for Newton? How do you interpret this quantity in the real physical world?
We want to define a measure of the amount of information a discrete random variable produces. Our basic setup consists of an information source and a recipient. We can think of our recipient as being in some state. When the information source sends a message, the arrival of the message causes the recipient to go to a different state. This "change" is exactly what we want to measure.
Suppose we have a set of $n$ events with respectively the following probabilities
$$p_1,p_2,...,p_n.$$
We want a measure of how much choice is involved in the selection of an event, or of how uncertain we are about the outcome.
Intuitively, it should satisfy the following four conditions.
Let $H$ be our "measure".
$H$ is continuous in the $p_i$.
If some $p_i = 1$, then $H$ is at its minimum, with a value of $0$: there is no uncertainty.
If $p_1 = p_2= \dots = p_n$, i.e. $p_i=\frac{1}{n}$, then $H$ is maximum. In other words, when every outcome is equally likely, the uncertainty is greatest, and hence so is the entropy.
If a choice is broken down into two successive choices, the value of the original $H$ should be the weighted sum of the value of the two new ones.
An example of this condition $4$ is that $$H\left(\frac1{2}, \frac1{3}, \frac{1}{6} \right) = H\left(\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} H\left(1 \right) + \frac{1}{2} H\left(\frac{2}{3}, \frac{1}{3} \right)$$
Here we decide either to a) take the first element or b) take one of the other two elements. In case a) there is no further decision to make, but in case b) we must then decide which of the two remaining elements to take.
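This decomposition can be checked numerically. The sketch below (not from the original post) defines the entropy function locally, using the natural logarithm and $K = 1$:

```python
import math

def H(*probs):
    """Shannon entropy of a probability vector (natural log, K = 1)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Left-hand side: a single choice among three outcomes.
lhs = H(1/2, 1/3, 1/6)

# Right-hand side: first choose between {first element} and {the rest},
# each with probability 1/2; then, with probability 1/2, choose within
# the rest, where the conditional probabilities are 2/3 and 1/3.
rhs = H(1/2, 1/2) + 1/2 * H(1) + 1/2 * H(2/3, 1/3)

print(abs(lhs - rhs) < 1e-12)  # True: both sides agree
```

Note that $H(1) = 0$, so the middle term contributes nothing; it is kept to mirror the weighted sum in the formula.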
The only $H$ satisfying the conditions above is:
$$H = -K\sum^n_{i=1}p_i \log(p_i)$$
where $K$ is a positive constant that merely amounts to a choice of unit (e.g. $K = 1/\log 2$ gives the entropy in bits).
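As a small illustrative sketch (assumptions: base-2 logarithm, i.e. entropy in bits), the formula behaves as conditions $2$ and $3$ demand:

```python
import math

def entropy(probs, base=2):
    """H = -K * sum p_i log p_i; using log base 2 (K = 1/ln 2) gives bits."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of uncertainty.
print(entropy([0.5, 0.5]))  # 1.0

# A certain outcome carries no uncertainty (condition 2).
print(entropy([1.0]) == 0)  # True

# The uniform distribution maximizes H for fixed n (condition 3).
print(entropy([0.25] * 4) > entropy([0.7, 0.1, 0.1, 0.1]))  # True
```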
To see that this definition gives what we intuitively would expect from a "measure" of information, we state the following properties of $H$.
Suppose $x$ and $y$ are two events with $n$ and $m$ possible outcomes respectively, and let $p(i, j)$ be the probability of the joint occurrence of outcome $i$ for $x$ and outcome $j$ for $y$ (i.e. both happening together).
$H(x, y) = −\sum_{i, j} p(i, j) \log(p(i, j))$
$H(x, y) \leq H(x) + H(y)$.
With equality if and only if the events are independent.
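A quick numerical check of this subadditivity on a small joint distribution (a sketch, not from the original post; the $2\times 2$ distribution below is an arbitrary dependent example):

```python
import math

def H(probs):
    """Shannon entropy, natural log."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A correlated joint distribution p(i, j) on a 2x2 grid.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = [0.5, 0.5]  # marginal of x: sum over j
py = [0.5, 0.5]  # marginal of y: sum over i

Hxy = H(p.values())
print(Hxy <= H(px) + H(py))  # True, and strictly so: x and y are dependent
```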
$H_x(y) = −\sum_{i, j} p(i, j) \log(p_i(j)) = H(x, y) − H(x)$, where $p_i(j) = p(i,j)\big/\sum_j p(i,j)$ is the conditional probability of $j$ given $i$.
This is the conditional entropy: the entropy of $y$ when $x$ is known.
$H(y) \geq H_x(y)$.
The entropy of $y$ is never increased by knowing $x$.
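Both identities can be verified on the same kind of small joint distribution (a sketch; the distribution is an arbitrary dependent example, not from the original post):

```python
import math

def H(probs):
    """Shannon entropy, natural log."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A correlated joint distribution p(i, j) on a 2x2 grid, with marginals.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {i: sum(v for (a, _), v in p.items() if a == i) for i in (0, 1)}
py = {j: sum(v for (_, b), v in p.items() if b == j) for j in (0, 1)}

# Conditional entropy directly: H_x(y) = -sum_{i,j} p(i,j) log p_i(j),
# with p_i(j) = p(i,j) / p(i).
Hx_y = -sum(v * math.log(v / px[i]) for (i, j), v in p.items())

# ...and via the chain rule H_x(y) = H(x, y) - H(x).
print(abs(Hx_y - (H(p.values()) - H(px.values()))) < 1e-12)  # True

# Knowing x never increases the uncertainty about y.
print(Hx_y <= H(py.values()))  # True
```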
Any change towards equalization of the probabilities increases $H$. Greater uncertainty $\Rightarrow$ greater entropy.
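For instance, averaging two unequal probabilities toward equality strictly increases $H$ (a minimal sketch, natural log):

```python
import math

def H(probs):
    """Shannon entropy, natural log."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Moving probability mass from the likelier outcome to the less likely one
# makes the distribution more uniform, and H strictly increases.
print(H([0.9, 0.1]) < H([0.8, 0.2]) < H([0.6, 0.4]) < H([0.5, 0.5]))  # True
```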