This question comes from here. Suppose messages $m_1, m_2, \ldots$ can be sent (through a channel) to a receiver with probabilities $p_1, p_2, \ldots$. The amount of information transferred when a message $m_k$ is successfully received is defined as $$I_k = \log \frac{1}{p_k}.$$
The authors of the document I linked make the following claims about $I_k$:
- It is intuitive: the occurrence of a highly probable event carries little information ($I_k = 0$ for $p_k = 1$).
- We gain more information when a less probable message is received ($I_k > I_l$ for $p_k < p_l$).
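Both claims can be checked numerically. Below is a minimal sketch (the function name `self_information` is my own choice, not from the linked document) that evaluates $I_k = \log(1/p_k)$ in base 2, so the result is in bits:

```python
import math

def self_information(p, base=2):
    """Information carried by a message received with probability p.

    With base=2 the result is in bits.
    """
    return math.log(1.0 / p, base)

# A certain message carries no information:
print(self_information(1.0))    # 0.0
# Less probable messages carry more information:
print(self_information(0.5))    # 1.0
print(self_information(0.25))   # 2.0
```

Note that the function is strictly decreasing in $p$, which is exactly the second claim: $I_k > I_l$ whenever $p_k < p_l$.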
So my question is:
Why should a more probable message carry less information? Is it simply because successful transmission of a longer message should have lower probability, and a longer message can contain more meaning? E.g., if $m_1$ is the message "You're gonna die" and $m_2$ is the message "You're gonna die tomorrow", then successfully transmitting $m_2$ should be at least as hard (i.e. at least as improbable) as transmitting $m_1$, and $m_2$ clearly contains more information.
You are getting there with the idea that if all cases of $m_2$ imply $m_1$ (but there are other ways for $m_1$ to be true), then $m_2$ transmits more information than $m_1$ — but you don't need that inclusion. Information received is a measure of how much better you know the state of the world. We define it as the log of the number of states in order to get the additivity we want.

Imagine that four coins are tossed. Before you are told anything about the result, there are 16 possible states of the world. If I tell you they all came up heads (a surprising message), the number of states has been reduced by a factor of 16. If I tell you they didn't all come up heads (not a surprising message), the number of states has been reduced only by a factor of 16/15. So in the first case there are many fewer possible states of the world; in the second, there are almost as many possible states after the message as before.