In Huffman coding, how do I choose the frequency to get the maximum average bit length?


First I want to give you a little summary of the Huffman code to avoid misunderstandings.

Summary begins

So we have an alphabet $A$ with $|A| > 1$. For example $A = \{a, b, c, d, e\}$. Now we number the letters: $a = 1, b = 2, c = 3, d = 4, e = 5$.

Every letter has a frequency, so we have a map $f : A \rightarrow [0,1]$ with $\sum_{a \in A} f(a) = 1$. For example $(1|0.2), (2|0.3), (3|0.1), (4|0.35), (5|0.05)$. Let me explain my notation: $(1|0.2)$ means that the letter $a$ has frequency $0.2$.

So the Huffman code tells us to take the two letters with the lowest frequency and combine them. Combining $(3|0.1)$ and $(5|0.05)$ gives a node of weight $0.15$, so we get $(1|0.2), (2|0.3), (3|0.15), (4|0.35)$:

[first tree diagram]

If we repeat this process we will get:

[second tree diagram]
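The repeated merging described above can be sketched in Python. This is my own minimal sketch, not part of the original question; the dictionary-based tree representation and tie-breaking between equal frequencies are choices of mine, so the individual codeword lengths may differ from the pictures, but the resulting average bit length is the same for every valid Huffman code:

```python
import heapq

def huffman_lengths(freqs):
    """Build a Huffman tree by repeatedly merging the two lowest-frequency
    nodes; return the depth (codeword length) of each symbol."""
    # Heap entries: (frequency, tiebreaker, {symbol: current depth}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

f = {"a": 0.2, "b": 0.3, "c": 0.1, "d": 0.35, "e": 0.05}
print(huffman_lengths(f))
```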

So we can compute the ABL (average bit length):

$$ ABL(\gamma) = \sum_{a \in A} f(a) \cdot |\gamma(a)| $$

where $\gamma(a)$ is the codeword of $a$ and $|\gamma(a)|$ is its length. For example $|\gamma(e)| = 4$, as you can see in my second picture.
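Evaluating this formula on the example is a one-liner. The codeword lengths below are my assumption: they are read off one valid Huffman tree for these frequencies that is consistent with $|\gamma(e)| = 4$, since the pictures are not reproduced here:

```python
def abl(freqs, lengths):
    """Average bit length: sum of f(a) * |gamma(a)| over the alphabet."""
    return sum(freqs[a] * lengths[a] for a in freqs)

f = {"a": 0.2, "b": 0.3, "c": 0.1, "d": 0.35, "e": 0.05}
# Assumed lengths of one valid Huffman code for f with |gamma(e)| = 4:
# the most frequent letter d gets the shortest codeword, e the longest.
lengths = {"a": 3, "b": 2, "c": 4, "d": 1, "e": 4}
print(abl(f, lengths))  # 0.6 + 0.6 + 0.4 + 0.35 + 0.2 = 2.15
```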

Any questions?

Summary ends

So I want to get the maximum average bit length. How do I have to choose $f$ for an alphabet $A$ with $|A| > 1$? I tried many things, like taking the uniform distribution $f(a) = \frac{1}{n}$ for all $a \in A$ with $|A| = n$, but this and my other ideas didn't seem to work. Can you help me out?

Update: thanks to the comments, I found out that I made a mistake about the uniform distribution, so it could be that the uniform distribution is the solution after all.


Best answer:

The maximum is attained at the uniform distribution.

We need the following lemma:

Lemma. Let $p_1, \ldots, p_n \in [0, 1]$ and $l_1, \ldots, l_n\in\mathbb{N}$ be such that $p_1 + \ldots + p_n = 1, p_1 \ge p_2 \ge \ldots \ge p_n$ and $l_1 \le l_2 \le \ldots \le l_n$. Then $$ \sum\limits_{i = 1}^n p_i l_i \le \sum\limits_{i = 1}^n \frac{1}{n} \cdot l_i.$$

Proof. Denote $S_{n + 1} = 0$ and $S_i = p_{i} + \ldots + p_n$ for $i = 1, \ldots, n$. Let us show that $S_i \le \frac{n - i + 1}{n}$. Indeed, otherwise the $n - i + 1$ numbers $p_i, \ldots, p_n$ would sum to more than $\frac{n - i + 1}{n}$, so there would exist $j\in\{i, i + 1, \ldots, n\}$ with $p_j > \frac{1}{n}$. Since the $p_i$ are sorted in decreasing order, this implies $p_1 \ge p_2 \ge \ldots \ge p_{i - 1} \ge p_j > \frac{1}{n}$, hence $p_1 + \ldots + p_n = 1 \ge (i - 1) \cdot \frac{1}{n} + S_i > (i - 1) \cdot \frac{1}{n} + \frac{n - i + 1}{n} = 1$, a contradiction.

Using this bound on $S_i$ we obtain: \begin{align*} \sum\limits_{i = 1}^n p_i l_i &= \sum\limits_{i = 1}^n (S_i - S_{i + 1}) l_i = \sum\limits_{i = 1}^n S_i l_i - \sum\limits_{i = 2}^{n + 1} S_i \cdot l_{i - 1} = \sum\limits_{i = 1}^n S_i l_i - \sum\limits_{i = 2}^{n} S_i \cdot l_{i - 1}\\ &= l_1 S_1 + (l_2 - l_1) S_2 + \ldots + (l_n - l_{n - 1}) S_n \\ &\le l_1 \frac{n}{n} + (l_2 - l_1) \frac{n - 1}{n} + \ldots + (l_n - l_{n - 1}) \cdot \frac{1}{n} \\ &= l_1 \left(\frac{n}{n} - \frac{n - 1}{n}\right) + l_2\left(\frac{n - 1}{n} - \frac{n - 2}{n}\right) + \ldots + l_n \cdot \frac{1}{n} \\ &= \sum\limits_{i = 1}^n \frac{1}{n} \cdot l_i. \end{align*} Lemma is proved.
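The lemma is a rearrangement-type inequality, and it is easy to sanity-check numerically. Here is a quick randomized check in Python (the helper name and test parameters are mine):

```python
import random

def check_lemma(n, trials=1000, seed=0):
    """Check: for p_1 >= ... >= p_n summing to 1 and l_1 <= ... <= l_n,
    sum p_i * l_i <= (1/n) * sum l_i."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = sorted((rng.random() for _ in range(n)), reverse=True)
        total = sum(p)
        p = [x / total for x in p]                       # normalize to a distribution
        l = sorted(rng.randint(1, 10) for _ in range(n))  # increasing lengths
        lhs = sum(pi * li for pi, li in zip(p, l))
        rhs = sum(l) / n
        assert lhs <= rhs + 1e-12, (p, l)
    return True

print(check_lemma(5))  # True
```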

Now, suppose that the size of an alphabet is $n$.

Let $c_1, \ldots, c_n$ be codewords of the Huffman code for the uniform distribution. Further, let $l_i$ be the length of $c_i$ and assume that $l_1 \le l_2 \le \ldots \le l_n$. Take any probability distribution $p_1, \ldots, p_n$ and assume that $p_1 \ge p_2 \ge \ldots \ge p_n$.

We can use $c_1, \ldots, c_n$ as a prefix code for the frequencies $p_1, \ldots, p_n$. Since the Huffman code is optimal among all prefix codes, the average length of the Huffman code for $p_1, \ldots, p_n$ is at most $\sum\limits_{i = 1}^n p_i l_i$. By the lemma, this quantity is at most $\sum\limits_{i = 1}^n \frac{1}{n} \cdot l_i$, which is exactly the average length of the Huffman code for the uniform distribution.
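To make the conclusion concrete, here is a small self-contained experiment (my own sketch, not part of the original answer) comparing the Huffman average bit length of random distributions against the uniform one. It uses the standard fact that the ABL equals the sum of the weights of all merges, since each merge deepens both subtrees by one level:

```python
import heapq
import random

def huffman_abl(probs):
    """Average bit length of a Huffman code: each merge of weights a and b
    contributes a + b to the total."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

n = 5
uniform = huffman_abl([1 / n] * n)   # 2.4 for n = 5 (lengths 2, 2, 2, 3, 3)
rng = random.Random(0)
for _ in range(1000):
    p = [rng.random() for _ in range(n)]
    s = sum(p)
    assert huffman_abl([x / s for x in p]) <= uniform + 1e-9
print(uniform)
```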