Why does a less probable event yield more information?


I'm trying to wrap my head around entropy as defined in information theory, which states the following axiom for Shannon information:

The less probable an event is, the more surprising it is and the more information it yields.

I know this is an axiom, so it's perhaps inappropriate to ask for a proof, but do we just have to take it as given? Or is there an intuition for why it's true? What "information" am I gaining by witnessing a less probable outcome? If I had a magic coin that could be assigned different probabilities of landing on each face for each toss, why would the toss with $p_{heads} = p_{tails} = 0.5$ provide me the "most information"?



@DonThousand's comments offer a good point of view. I would like to offer another, which has the advantage that both the information granted by a single outcome and the Shannon entropy make intuitive sense.

Consider a population with some characteristics (age, gender, hair color, etc.). You want to find one individual based on his characteristics (assume they single him out uniquely).

When you ask a yes/no question, learning that the person has the less common trait is more useful, because it shrinks the pool of possible candidates more: you have just gained more information.
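The point above can be put in numbers. Below is a small sketch (the pool size and trait frequencies are made up for illustration): learning a trait with frequency $p$ leaves a fraction $p$ of the pool, so the number of bits gained is $-\log_2 p$, which is larger for rarer traits.

```python
import math

# Hypothetical pool of 100 candidates: 10 have red hair, 50 are female.
pool_size = 100
red_hair = 10   # a rare trait
female = 50     # a common trait

def bits_gained(count, total):
    """Information (in bits) from learning a trait shared by `count` of `total` people.

    Learning a trait with frequency p leaves p * total candidates, so the
    information gained is log2(total) - log2(p * total) = -log2(p).
    """
    p = count / total
    return -math.log2(p)

print(bits_gained(red_hair, pool_size))  # rare answer: about 3.32 bits
print(bits_gained(female, pool_size))    # common answer: exactly 1 bit
```

The rarer answer narrows 100 candidates down to 10 (worth about 3.32 bits), while the common one only narrows it to 50 (worth 1 bit).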

Now, when you are choosing the next question, to gain the most out of it, it is pretty clear you need to choose a question whose two outcomes are equally likely.

Edit: some more explanation of that last point.

When you ask a question and get an answer that had probability (here, frequency) $p$, you are multiplying the number of possible individuals by $p$. To compute an expectation, it makes sense to take the $\log$: the information associated with this outcome is $-\log p$, and the expected information is the Shannon entropy: $$S = -p\log p - (1-p)\log(1-p)$$

This is maximal for $p = 1/2$. If you want to find that person as quickly as possible, the best strategy is dichotomy: find a question which splits the remaining population in $2$.
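A quick numerical check of that claim, sketched in Python: the binary entropy $S(p)$ from the formula above peaks at $p = 1/2$, where one question yields a full bit.

```python
import math

def binary_entropy(p):
    """Expected information, in bits, from a yes/no question answered 'yes'
    with probability p: S = -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan p over (0, 1) and find where the entropy is maximal.
probs = [i / 100 for i in range(1, 100)]
best = max(probs, key=binary_entropy)
print(best, binary_entropy(best))  # 0.5 1.0
```

This is also why the fair coin in the question is the most informative one: any bias toward heads or tails makes the outcome partly predictable and drops the expected information below one bit per toss.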


The definition of information is a little tricky here. A 10-bit number is said to have more information than a 5-bit number. If that number is, say, the number of watermelons in a basket, both numbers may represent the same quantity, but different values carry different amounts of "information".

So when you read the term "information" think of it as the number of bits used to represent that piece of information.

If some event keeps occurring with high probability (say, people breathing in and out), you will want to assign a small number of bits to that event. All likely events will be assigned fewer bits because that makes communication cheaper. A more unlikely event (a volcano eruption, for instance) will be assigned a larger chunk of bits, because all the shorter bit combinations have already been taken up by the more likely events.

Now check out Shannon coding and the example there. The best way to code data is essentially to assign fewer bits to events that have higher probability and more bits to events that have lower probability.
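The idea can be sketched with Huffman coding (a close relative of Shannon coding that is easier to implement); the event names and probabilities below are invented for illustration. Frequent events end up with short codewords, rare events with long ones.

```python
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman code and return the codeword length (in bits) per symbol.

    Repeatedly merge the two least probable groups; every merge adds one bit
    to the codewords of all symbols inside the merged groups.
    """
    # Heap entries: (probability, tiebreaker, {symbol: code_length_so_far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {sym: length + 1 for sym, length in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical events: breathing is common, a volcano eruption is rare.
lengths = huffman_code_lengths({"breath": 0.90, "rain": 0.07, "eruption": 0.03})
print(lengths)  # {'eruption': 2, 'rain': 2, 'breath': 1}
```

The common event gets a 1-bit codeword while the rare ones need 2 bits each, matching the rule of thumb that an event of probability $p$ deserves roughly $-\log_2 p$ bits.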