Problem
In a video I just watched (as can be seen here, between 7:10 and 7:30), the author claims that the number of binary digits one uses to encode a particular weather condition reflects one's prediction about the weather distribution. For example, if two digits are used, then one believes the probability of that weather condition is $\frac{1}{2^2}=0.25$. Similarly, if five digits are used, then the probability should be $\frac{1}{2^5}=0.03125$.
I understand this at a high level: if something happens quite often, then using fewer digits for it will surely reduce the transmission cost. However, I do not understand the mathematical reasoning behind this correspondence between the number of digits in a code and the probability of a particular event.
I have no previous exposure to information theory, so could someone provide some pointers for this question? Thank you in advance.
I don't find the example in the video very illuminating. Perhaps a better (and classical) example is the old Morse code. You need to encode the 26 letters using sequences of dots and dashes (which could be thought of as 0/1 bits), not necessarily of the same length. If you think a little about this, you'll probably realize the same fact that the designer (around 1837!) realized: the more frequent (more probable) letters should get the shorter codes.
Actually, this also happens in most natural languages: the most frequent words tend to be shorter. We quickly guess that this is more efficient.
Mathematically: if we have symbols $s_1, s_2, \ldots, s_n$ and we want to encode them with codes of lengths $\ell_1, \ell_2, \ldots, \ell_n$, then the average code length (per symbol) is
$$L= \sum_{i=1}^{n} p(s_i) \, \ell_i$$
where $p(s_i)$ is the probability of symbol $s_i$. We want $L$ to be as small as possible. For this, we'd want each $\ell_i$ to be small, but we cannot make all of them too small, because the code, to be usable (decodable), obviously imposes some restrictions (if you are curious about this, see Kraft's inequality). Then it makes (mathematical and intuitive) sense that the higher probabilities should be associated with the smaller lengths.
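To make this concrete, here is a small Python sketch that evaluates the average length $L$ for a few choices of lengths. The four-symbol distribution and the helper names `average_length` and `kraft_sum` are my own assumptions for illustration; they are not taken from the video.

```python
from math import log2

def average_length(probs, lengths):
    """Average code length L = sum_i p(s_i) * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))

def kraft_sum(lengths):
    """Sum of 2^(-l_i); Kraft's inequality requires this to be <= 1
    for a decodable (prefix-free) binary code."""
    return sum(2.0 ** -l for l in lengths)

# Hypothetical distribution over four weather conditions (an assumption).
probs = [0.5, 0.25, 0.125, 0.125]

# Lengths matched to the probabilities: l_i = log2(1 / p_i).
matched = [round(log2(1 / p)) for p in probs]   # [1, 2, 3, 3]
# The same lengths assigned the wrong way round (rare symbols get short codes).
mismatched = matched[::-1]                      # [3, 3, 2, 1]
# A fixed-length code: 2 bits for every symbol.
fixed = [2, 2, 2, 2]

for name, lengths in [("matched", matched),
                      ("mismatched", mismatched),
                      ("fixed", fixed)]:
    print(name, lengths,
          "Kraft sum =", kraft_sum(lengths),
          "L =", average_length(probs, lengths))
```

All three choices satisfy Kraft's inequality, but the matched lengths give the smallest average (1.75 bits, against 2 for the fixed-length code and 2.625 for the mismatched one). This is the correspondence the video alludes to: if one ignores the requirement that lengths be integers and minimizes $L$ subject to $\sum_i 2^{-\ell_i} \le 1$, the minimum is attained at $\ell_i = \log_2 \frac{1}{p(s_i)}$, i.e. $p(s_i) = \frac{1}{2^{\ell_i}}$.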