The inherent unpredictability, or randomness, of a probability distribution can be measured by the extent to which it is possible to compress data drawn from that distribution.
$$ \text{more compressible} ≡ \text{less random} ≡ \text{more predictable}$$
Suppose there are $n$ possible outcomes, with probabilities $p_1, p_2, . . . , p_n$. If a sequence of $m$ values is drawn from the distribution, then the $i^{th}$ outcome will pop up roughly $mp_i$ times (if $m$ is large). For simplicity, assume these are exactly the observed frequencies, and moreover that the $p_i$’s are all powers of $2$ (that is, of the form $\frac{1}{2^k}$). It can be seen by induction that the number of bits needed to encode the sequence is: $$\sum_{i=1}^{n} mp_i \log\left(\frac{1}{p_i}\right)\tag{1}$$ Thus the average number of bits needed to encode a single draw from the distribution is: $$\sum_{i=1}^{n} p_i \log\left(\frac{1}{p_i}\right)\tag{2}$$ This is the entropy of the distribution, a measure of how much randomness it contains. $(3)$
How are equations $(1)$ and $(2)$ derived (the excerpt mentions induction but does not provide further proof), and how does the transition from compressibility to entropy at $(3)$ follow? Note: where encoding is mentioned, the excerpt is referring to Huffman encoding.
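To make the excerpt concrete, here is a small numerical check (a Python sketch; the distribution is my own illustrative choice, with every $p_i$ a power of $2$ as the excerpt assumes). When the probabilities are dyadic, the Huffman codeword for outcome $i$ has length exactly $\log_2(1/p_i)$, and $(1)$ is just $m$ times $(2)$:

```python
import math

# Example distribution with probabilities that are powers of 2,
# as in the excerpt: p_i = 1/2^k.
p = [0.5, 0.25, 0.125, 0.125]
m = 8  # number of draws, chosen so every m*p_i is an integer

# Equation (2): entropy, the average number of bits per draw.
H = sum(pi * math.log2(1 / pi) for pi in p)

# Equation (1): total bits for the sequence. Outcome i occurs m*p_i
# times and costs log2(1/p_i) bits per occurrence (its Huffman
# codeword length when the p_i are dyadic).
total_bits = sum(m * pi * math.log2(1 / pi) for pi in p)

print(H)                     # 1.75 bits per draw
print(total_bits)            # 14.0 bits for the whole sequence
print(total_bits == m * H)   # True: (1) is exactly m times (2)
```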
First, define the entropy: $$H=-\sum_i p_i \log_2 p_i$$ For any discrete distribution, $\log_2(N)\ge H\ge 0$, where $N$ is the alphabet size, i.e., the cardinality of the set from which the values are drawn. For example, for a matrix $M$ with elements $m_i \in \{0,1,\dots,255\}$, the alphabet size is $256$, so $8=\log_2(256)\ge H\ge 0$. If every element equals the same constant, $m_i=C$ for all $i$, then $H=0$: the matrix contains no information, so the average information per element is $0$. This also means the matrix can be compressed down to a single value, namely $C$. If instead the matrix is filled with uniformly random values, you get $H\approx 8$, which means almost full average information. How many bits on average would you then need to represent one element of this matrix? About $H\approx 8$ bits, so in total you would need about $8\cdot N\cdot K$ bits to represent the whole matrix (where $N$ and $K$ are its dimensions). This is the worst case, implying incompressibility.
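The two extremes above can be checked empirically. This sketch (my own illustration, with an arbitrary constant and seed) computes the empirical entropy of a constant sequence and of uniformly random bytes:

```python
import math
import random
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy H = -sum p_i * log2(p_i), in bits per symbol."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Constant "matrix": every element equals the same C, so H = 0
# (a constant sequence carries no information).
constant = [42] * 10000

# Uniformly random bytes: H comes out close to log2(256) = 8 bits
# per element (slightly below, due to finite sampling).
random.seed(0)
noise = [random.randrange(256) for _ in range(100000)]

print(entropy(constant))  # 0 bits
print(entropy(noise))     # close to 8 bits
```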
Addendum: Huffman coding takes advantage of the relative frequencies of the symbols, assigning shorter codewords to more frequent symbols. Continuing the example above: when $H=0$, Huffman coding can describe the whole matrix with essentially a single codeword for the constant value, whereas when $H=8$ it needs at least $K\cdot N\cdot 8$ bits and therefore provides no compression. $H$ is the lower bound on compressibility: no (symbol-by-symbol) code can use fewer bits per symbol, on average, than $H$.
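To tie this back to the question: here is a minimal Huffman construction (my own sketch, tracking only codeword lengths, not the codewords themselves). For the dyadic distribution from the excerpt, the Huffman lengths come out as exactly $\log_2(1/p_i)$, so the average codeword length meets the entropy lower bound:

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from freqs."""
    # Heap entries: (weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:  # degenerate case: one symbol still needs 1 bit
        (_, _, d), = heap
        return {s: 1 for s in d}
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols gain one bit of depth.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Dyadic distribution: p = 1/2, 1/4, 1/8, 1/8 (frequencies 4, 2, 1, 1).
freqs = {'a': 4, 'b': 2, 'c': 1, 'd': 1}
lengths = huffman_code_lengths(freqs)
n = sum(freqs.values())
avg_len = sum(freqs[s] / n * lengths[s] for s in freqs)
H = -sum((w / n) * math.log2(w / n) for w in freqs.values())

print(lengths)     # lengths 1, 2, 3, 3: exactly log2(1/p_i)
print(avg_len, H)  # both 1.75: Huffman meets the entropy lower bound here
```

For non-dyadic probabilities, `avg_len` would exceed $H$ by a little (at most 1 bit per symbol), but it can never drop below it.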
I hope it helps ;)