The inherent unpredictability, or randomness, of a probability distribution can be measured by the extent to which it is possible to compress data drawn from that distribution.
$$ \text{more compressible} ≡ \text{less random} ≡ \text{more predictable}$$
Suppose there are $n$ possible outcomes, with probabilities $p_1, p_2, . . . , p_n$. If a sequence of $m$ values is drawn from the distribution, then the $i^{th}$ outcome will pop up roughly $mp_i$ times (if $m$ is large). For simplicity, assume these are exactly the observed frequencies, and moreover that the $p_i$’s are all powers of $2$ (that is, of the form $\frac{1}{2^k}$). It can be seen by induction that the number of bits needed to encode the sequence is: $$\sum_{i=1}^{n} mp_i \log\left(\frac{1}{p_i}\right)\tag{1}$$ Thus the average number of bits needed to encode a single draw from the distribution is: $$\sum_{i=1}^{n} p_i \log\left(\frac{1}{p_i}\right)\tag{2}$$ This is the entropy of the distribution, a measure of how much randomness it contains. $(3)$
How are equations $(1)$ and $(2)$ derived (the excerpt mentions induction but does not provide further proof), and how does the transition from compressibility to entropy at $(3)$ follow? Note: where encoding is mentioned, the excerpt is referring to Huffman encoding.
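To make the excerpt concrete, here is a small numerical check (a Python sketch; the distribution is my own illustrative choice, with every $p_i$ a power of $2$ as the excerpt assumes). When the probabilities are dyadic, the Huffman codeword for outcome $i$ has length exactly $\log_2(1/p_i)$, and $(1)$ is just $m$ times $(2)$:

```python
import math

# Example distribution with probabilities that are powers of 2,
# as in the excerpt: p_i = 1/2^k.
p = [0.5, 0.25, 0.125, 0.125]
m = 8  # number of draws, chosen so every m*p_i is an integer

# Equation (2): entropy, the average number of bits per draw.
H = sum(pi * math.log2(1 / pi) for pi in p)

# Equation (1): total bits for the sequence. Outcome i occurs m*p_i
# times and costs log2(1/p_i) bits per occurrence (its Huffman
# codeword length when the p_i are dyadic).
total_bits = sum(m * pi * math.log2(1 / pi) for pi in p)

print(H)                     # 1.75 bits per draw
print(total_bits)            # 14.0 bits for the whole sequence
print(total_bits == m * H)   # True: (1) is exactly m times (2)
```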
First, define the entropy: $$H=-\sum_i p_i \log_2 p_i$$ For any discrete distribution, $\log_2(N)\ge H\ge 0$, where $N$ is the alphabet size, i.e., the cardinality of the set from which the values are drawn. For example, for a matrix $M$ with elements $m_i \in \{0,1,\dots,255\}$, the alphabet size is $256$, so $8=\log_2(256)\ge H\ge 0$. If every element equals the same constant, $m_i=C$ for all $i$, then $H=0$: the matrix contains no information, so the average information per element is $0$. This also means the matrix can be compressed down to a single value, namely $C$. If instead the matrix is filled with uniformly random values, you get $H\approx 8$, which means almost full average information. How many bits on average would you then need to represent one element of this matrix? About $H\approx 8$ bits, so in total you would need about $8\cdot N\cdot K$ bits to represent the whole matrix (where $N$ and $K$ are its dimensions). This is the worst case, implying incompressibility.
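The two extremes above can be checked empirically. This sketch (my own illustration, with an arbitrary constant and seed) computes the empirical entropy of a constant sequence and of uniformly random bytes:

```python
import math
import random
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy H = -sum p_i * log2(p_i), in bits per symbol."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Constant "matrix": every element equals the same C, so H = 0
# (a constant sequence carries no information).
constant = [42] * 10000

# Uniformly random bytes: H comes out close to log2(256) = 8 bits
# per element (slightly below, due to finite sampling).
random.seed(0)
noise = [random.randrange(256) for _ in range(100000)]

print(entropy(constant))  # 0 bits
print(entropy(noise))     # close to 8 bits
```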
Addendum: Huffman coding takes advantage of the relative frequencies of the symbols, assigning shorter codewords to more frequent symbols. Continuing the example above: when $H=0$, Huffman coding can describe the whole matrix with essentially a single codeword for the constant value, whereas when $H=8$ it needs at least $K\cdot N\cdot 8$ bits and therefore provides no compression. $H$ is the lower bound on compressibility: no (symbol-by-symbol) code can use fewer bits per symbol, on average, than $H$.
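To tie this back to the question: here is a minimal Huffman construction (my own sketch, tracking only codeword lengths, not the codewords themselves). For the dyadic distribution from the excerpt, the Huffman lengths come out as exactly $\log_2(1/p_i)$, so the average codeword length meets the entropy lower bound:

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from freqs."""
    # Heap entries: (weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:  # degenerate case: one symbol still needs 1 bit
        (_, _, d), = heap
        return {s: 1 for s in d}
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols gain one bit of depth.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Dyadic distribution: p = 1/2, 1/4, 1/8, 1/8 (frequencies 4, 2, 1, 1).
freqs = {'a': 4, 'b': 2, 'c': 1, 'd': 1}
lengths = huffman_code_lengths(freqs)
n = sum(freqs.values())
avg_len = sum(freqs[s] / n * lengths[s] for s in freqs)
H = -sum((w / n) * math.log2(w / n) for w in freqs.values())

print(lengths)     # lengths 1, 2, 3, 3: exactly log2(1/p_i)
print(avg_len, H)  # both 1.75: Huffman meets the entropy lower bound here
```

For non-dyadic probabilities, `avg_len` would exceed $H$ by a little (at most 1 bit per symbol), but it can never drop below it.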
I hope it helps ;)