How can an encoding of these symbols need only $\sqrt{2}$ bits/symbol in theory?


I have the following encoding: A = 0, B = 10, C = 11, with probabilities $P(A) = 1/2$, $P(B) = 1/3$, $P(C) = 1/6$. I calculated the average length (in bits) per symbol of this encoding as $$\frac12 \cdot 1 + \frac13 \cdot 2 + \frac16 \cdot 2 = 1.5.$$

The book asks whether it is possible to achieve $\sqrt{2} \approx 1.414$. I thought about it, but making the code words longer would only increase the average, while making them shorter would make the code ambiguous. For instance, with the current code the string $011100100$ decodes uniquely as $ACBABA$; if I shortened a code word, I would get lost decoding it.
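A quick check in Python (not from the original post) confirms the 1.5 bits/symbol average and the unique decoding claimed above; the code table and probabilities are the ones in the question.

```python
# Verify the average code length and the decoding of 011100100,
# using the prefix code A=0, B=10, C=11 from the question.
from fractions import Fraction

code = {"A": "0", "B": "10", "C": "11"}
probs = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 6)}

avg_len = sum(probs[s] * len(code[s]) for s in code)
print(avg_len)  # 3/2, i.e. 1.5 bits/symbol

# Because the code is prefix-free, decoding is unambiguous:
# read bits until they match exactly one code word, emit, repeat.
def decode(bits: str) -> str:
    inv = {v: k for k, v in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)

print(decode("011100100"))  # ACBABA
```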

Is it possible, or am I right in saying it's not? If it's not, how can I argue that?



Best answer:

Given a finite probability distribution $p:=(p_i)_{1\leq i\leq n}$, its entropy is defined by $$H(p):=-\sum_{i=1}^n p_i \log_2(p_i)\ .$$ If $p$ models the frequencies of the letters of an alphabet, then $H(p)$ turns out to be the smallest achievable average number of bits per letter. This is the essential content of Shannon's source coding theorem, and cannot be explained in a few lines. In the case $p=\bigl({1\over2},{1\over3},{1\over6}\bigr)$ one obtains $H(p)=1.45915$. This is what you can reach "in the limit" through a clever encoding. But $\sqrt{2}\approx1.41421$ is definitely not attainable under the given circumstances.
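The entropy bound above is easy to evaluate numerically; a short sketch (mine, not the answerer's) for $p=\bigl(\frac12,\frac13,\frac16\bigr)$:

```python
# Compute the entropy H(p) and compare it with sqrt(2).
import math

p = [1/2, 1/3, 1/6]
H = -sum(pi * math.log2(pi) for pi in p)
print(round(H, 5))        # 1.45915
print(H > math.sqrt(2))   # True: sqrt(2) ~ 1.41421 lies below the entropy bound
```

Since every uniquely decodable code needs at least $H(p)$ bits per symbol on average, no code can reach $\sqrt{2}$ here.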

Another answer:

The Huffman code is the optimal prefix code for encoding single symbols from a given set. To achieve a better rate, you must encode blocks of several symbols at once.

For example, for two-symbol combinations, you get the probabilities: $$\begin{aligned} p(AA) &= \frac14 & p(AB) &= \frac16 & p(AC) &= \frac1{12}\\ p(BA) &= \frac16 & p(BB) &= \frac19 & p(BC) &= \frac1{18}\\ p(CA) &= \frac1{12} & p(CB) &= \frac1{18} & p(CC) &= \frac1{36} \end{aligned}$$

Applying the Huffman code to this, you can get (e.g. with an online Huffman tool): $$\begin{aligned} AA &\to 10 & AB &\to 111 & AC &\to 1100\\ BA &\to 00 & BB &\to 010 & BC &\to 01111\\ CA &\to 1101 & CB &\to 0110 & CC &\to 01110 \end{aligned}$$ The average length with this encoding is $$\frac12\left(\frac{2}{4} + \frac{3}{6} + \frac{4}{12} + \frac{2}{6} + \frac{3}{9} + \frac{5}{18} + \frac{4}{12} + \frac{4}{18} + \frac{5}{36}\right) \approx 1.486$$ which is already less than $1.5$.
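The pair construction can be reproduced programmatically. The sketch below (helper names are mine) builds a Huffman code over the nine two-symbol blocks and recomputes the average bits per original symbol; any valid Huffman tree gives the same total, $107/72 \approx 1.486$.

```python
# Huffman coding over two-symbol blocks of A, B, C.
import heapq
from fractions import Fraction
from itertools import count, product

probs = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 6)}
pairs = {a + b: probs[a] * probs[b] for a, b in product(probs, repeat=2)}

def huffman_lengths(weights):
    # Standard Huffman: repeatedly merge the two lightest subtrees;
    # each merge adds one bit to every symbol in the merged subtrees.
    tie = count()  # tie-breaker so the dicts are never compared
    heap = [(w, next(tie), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]  # symbol -> code length

lengths = huffman_lengths(pairs)
avg_per_symbol = sum(pairs[s] * lengths[s] for s in pairs) / 2
print(float(avg_per_symbol))  # 107/72 ~ 1.486, already below 1.5
```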

Encoding even longer blocks at a time, you get ever closer to the theoretical optimum, $-\sum_k p_k \log_2 p_k \approx 1.459$.