I'm trying to wrap my head around entropy as defined in information theory, which states as an axiom of Shannon information:
The less probable an event is, the more surprising it is and the more information it yields.
I know this is an axiom, so it's somewhat inappropriate to ask for a proof, but do we just have to take it as given? Or is there an intuition for why it's true? What "information" am I gaining from witnessing a less probable outcome? And if I had a magic coin whose face probabilities could be set differently for each toss, why would the setting $p_{\text{heads}} = p_{\text{tails}} = 0.5$ provide the "most information"?
@DonThousand's comments offer a good point of view. I would like to offer another, which has the advantage that both the information granted by a single outcome and the Shannon entropy make intuitive sense.
Consider a population with some characteristics (age, gender, hair color, etc.). You want to find one individual based on his characteristics (assume they single him out uniquely).
When you ask a yes/no question, an answer revealing the less common trait is more useful, because it shrinks the pool of possible candidates more: you have just gained more information.
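To make "shrinks the pool more" concrete, here is a small sketch (the pool size of 1000 and the trait frequencies are made-up numbers for illustration) showing how a rarer answer leaves fewer candidates and therefore carries more bits of information, measured as $-\log_2 p$:

```python
import math

def surprisal(p):
    """Information, in bits, gained from an answer that had probability p."""
    return -math.log2(p)

# Hypothetical pool of 1000 candidates; a "yes" to a rarer trait
# shrinks the pool more and yields more bits.
pool = 1000
for p in (0.5, 0.1, 0.01):
    remaining = pool * p
    print(f"p = {p}: pool shrinks to {remaining:.0f}, "
          f"information gained = {surprisal(p):.2f} bits")
```

Note that a trait with frequency $1/2$ yields exactly one bit, the information content of one fair yes/no answer.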
Now, when you are choosing the next question, it is pretty clear that to gain the most from it you should pick a question whose two outcomes are equally likely.
Edit: some more explanation of that last point.
When you ask a question and get an answer that had probability (here, frequency) $p$, you multiply the number of possible individuals by $p$. Since successive answers multiply the pool size, it makes sense to measure information on a $\log$ scale, so that it adds up across questions: the information associated with this outcome is $-\log p$, and the expected information is the Shannon entropy $$S = -p\log p - (1-p)\log(1-p).$$
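A quick numerical check of the formula above (a sketch; `binary_entropy` is just a name I've given the function $S(p)$) confirms that the expected information of a yes/no question peaks when both answers are equally likely:

```python
import math

def binary_entropy(p):
    """Shannon entropy S(p) = -p log2 p - (1-p) log2 (1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0  # a certain answer carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan p over (0, 1): the maximum sits at p = 1/2, worth exactly 1 bit.
ps = [i / 100 for i in range(1, 100)]
best = max(ps, key=binary_entropy)
print(best, binary_entropy(best))
```

Lopsided questions still pay off handsomely on the rare answer, but that answer comes rarely; the expectation is what the entropy captures.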
This is maximal for $p = 1/2$. So if you want to find that person as quickly as possible, the best strategy is dichotomy: at each step, try to find a question that splits the remaining population in $2$.
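The dichotomy strategy can be simulated. The sketch below (with a hypothetical pool of 1024 people, and `split` standing for the fraction of the pool that a typical answer keeps) counts how many questions it takes to isolate one individual; halving each time needs only $\log_2 n$ questions, while lopsided questions need far more:

```python
import math

def questions_needed(n, split):
    """Count the yes/no questions needed to isolate one of n individuals,
    assuming each answer keeps a fraction `split` of the current pool."""
    count = 0
    while n > 1:
        n = max(1, math.floor(n * split))  # pool after one more answer
        count += 1
    return count

print(questions_needed(1024, 0.5))  # halving: log2(1024) = 10 questions
print(questions_needed(1024, 0.9))  # lopsided questions: many more
```

This is exactly why entropy is often described as the average number of well-chosen yes/no questions needed to pin down an outcome.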