Why is the entropy of a binary digit maximised if both outcomes are equiprobable?


One elementary result of Information Theory is that a binary digit communicates the most information when used to distinguish between two equally probable events, e.g. a toss of an unbiased coin.

Another way to think about this is that each binary digit must halve the set of possible outcomes for a code to be efficient, i.e.

  • 0 enables you to eliminate 50% of the search space
  • 1 enables you to eliminate 50% of the search space

where both 0 and 1 occur with equal probability.

What if you had a binary code where

  • 0 enables you to eliminate 30% of the search space
  • 1 enables you to eliminate 70% of the search space

where 0 occurs with probability 0.7 and 1 occurs with probability 0.3.

Isn't this, on average, just as efficient as the binary code in which 0 and 1 are equally probable and each eliminates 50% of the search space?



It is not equally efficient on average. Averaging the percentages like that can't work. To see why, replace 30% with a much smaller figure, say 1%: eliminating 1% of the search space 99% of the time and eliminating 99% of the search space 1% of the time clearly isn't that great, because the second, highly informative case just doesn't occur very often.
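To make this concrete, here is a small Python sketch (function name mine). Seeing a symbol that occurs with probability $p$ keeps exactly the fraction $p$ of the search space, so even before taking logarithms you can see that the skewed code shrinks the space less on average:

```python
def surviving_fraction(p):
    """Expected fraction of the search space left after one symbol,
    when one symbol occurs with probability p (keeping a fraction p)
    and the other with probability 1 - p (keeping a fraction 1 - p)."""
    return p * p + (1 - p) * (1 - p)

print(surviving_fraction(0.5))   # 0.5     -- the fair code halves the space
print(surviving_fraction(0.7))   # ~0.58   -- the 70/30 code keeps more
print(surviving_fraction(0.99))  # ~0.9802 -- the extreme case barely shrinks it
```

The fair code leaves half the space on average; the 70/30 code from the question leaves 58%, and the 99/1 code leaves about 98%.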

The correct thing to do is to average the number of bits you've gained, which is equivalent to taking a geometric mean of the surviving fractions instead of an arithmetic mean. If you eliminate a proportion $1 - p$ of the search space with probability $p$, the size of the search space has been reduced by a factor of $p$, so you've gained $\log_{1/2} p$ bits of information with probability $p$. In the other case you eliminate a proportion $p$ of the search space with probability $1 - p$, gaining $\log_{1/2} (1 - p)$ bits of information. So on average the number of bits you've gained is

$$p \log_{1/2} p + (1 - p) \log_{1/2} (1 - p) = H(p)$$

which is exactly the (binary) entropy $H(p)$, maximized at $p = \frac{1}{2}$ as usual (e.g. by calculus, although other approaches are possible, such as Jensen's inequality).
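As a numerical sanity check (a minimal Python sketch, function name mine), evaluating $H(p)$ at a few points shows the maximum at $p = \frac{1}{2}$ and shows the question's 70/30 code gaining fewer bits per symbol:

```python
import math

def binary_entropy(p):
    """Expected bits gained per symbol: p*log2(1/p) + (1-p)*log2(1/(1-p))."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

print(binary_entropy(0.5))  # 1.0 bit -- the maximum
print(binary_entropy(0.7))  # ~0.8813 bits
print(binary_entropy(0.3))  # ~0.8813 bits (H is symmetric about 1/2)
```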

One way you can check that taking the logarithm is a sensible thing to do is that it behaves correctly with respect to gaining more information: if you reduce the search space by a factor of $p$ and then by a factor of $q$, you've overall reduced the search space by a factor of $pq$, which corresponds to a gain of $\log_{1/2} (pq)$ bits being equal to a gain of $\log_{1/2} p$ bits plus a gain of $\log_{1/2} q$ bits.
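This additivity is easy to verify numerically (a minimal sketch; the particular values of $p$ and $q$ are arbitrary):

```python
import math

def bits(p):
    """Bits gained by shrinking the search space by a factor of p
    (log base 1/2 of p, i.e. log2 of 1/p)."""
    return math.log2(1 / p)

p, q = 0.7, 0.3
# Shrinking by p and then by q shrinks by p*q overall, and the bits add up:
assert abs(bits(p * q) - (bits(p) + bits(q))) < 1e-12
```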

The best piece of writing I've ever read for gaining intuition about this idea of taking logarithms as a measure of gaining bits of information is actually this essay by Gwern analyzing how the protagonist of Death Note leaks different numbers of bits of information about himself over the course of the anime. Fun stuff.