This question comes from here. Suppose messages $m_1, m_2, \ldots$ can be sent (through a channel) to a receiver with probabilities $p_1, p_2, \ldots$. The amount of information transferred when a message $m_k$ is successfully received is defined as $$I_k = \log \frac{1}{p_k}.$$
The authors of the document I linked make the following claims about $I_k$:
- It is intuitive: the occurrence of a highly probable event carries little information ($I_k = 0$ for $p_k = 1$).
- We gain more information when a less probable message is received ($I_k > I_l$ for $p_k < p_l$).
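Both claims can be checked numerically. Below is a minimal sketch (the function name `self_information` is my own choice, not from the linked document) that evaluates $I_k = \log(1/p_k)$ in base 2, so the result is in bits:

```python
import math

def self_information(p, base=2):
    """Information carried by a message received with probability p.

    With base=2 the result is in bits.
    """
    return math.log(1.0 / p, base)

# A certain message carries no information:
print(self_information(1.0))    # 0.0
# Less probable messages carry more information:
print(self_information(0.5))    # 1.0
print(self_information(0.25))   # 2.0
```

Note that the function is strictly decreasing in $p$, which is exactly the second claim: $I_k > I_l$ whenever $p_k < p_l$.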
So my question is:
Why should a more probable message carry less information? Is it simply because successful transmission of a longer message should have lower probability, and a longer message can contain more meaning? E.g., if $m_1$ is the message "You're gonna die" and $m_2$ is the message "You're gonna die tomorrow", then successfully transmitting $m_2$ should be at least as hard (i.e. at least as improbable) as transmitting $m_1$, and $m_2$ clearly contains more information.
You are getting there with the idea that if all cases of $m_2$ imply $m_1$ (but there are other ways for $m_1$ to be true), then $m_2$ transmits more information than $m_1$ — but you don't need that inclusion. Information received is a measure of how much better you know the state of the world. We define it as the log of the number of states in order to get the additivity we want.

Imagine that four coins are tossed. Before you are told anything about the result, there are 16 possible states of the world. If I tell you they all came up heads (a surprising message), the number of states has been reduced by a factor of 16. If I tell you they didn't all come up heads (not a surprising message), the number of states has been reduced only by a factor of 16/15. So in the first case there are many fewer possible states of the world; in the second, there are almost as many possible states after the message as before.