Expected codelength for Huffman-algorithm with probabilities

135 Views Asked by At

I am unable to solve the following problem:

Let $$n \geq 2$$ $$p_{1} \geq p_{2} \geq ... \geq p_{n}$$ $$p_{i} = 2^{-k_{i}}$$ with $k_{i} \in \mathbb{N}$ and $\sum_{i=1}^{n}p_{i} = 1$.

$p_{i}$ is the probability that the character $c_{i}$ appears. We then execute the Huffman-algorithm. Show: The expected codelength of the resulting Huffman-Code is $$-\sum_{i=1}^{n}p_{i}\log{p_{i}}$$

How does one show that? In the first part of the task I showed that $p_{n-1} = p_{n}$.

This is what I have so far: I think the expected value of the codelength is defined as $E(C) = \sum_{i=1}^{n}p_{i}h(c_{i})$ where $h(c_{i})$ is the height of character $c_{i}$ in the Huffman-Tree.

From that I followed that I have to show that $E(C) = \sum_{i=1}^{n}p_{i}h(c_{i}) = \sum_{i=1}^{n}p_{i}k_{i}$, so I thought that I'd show that $h(c_{i}) = k_{i}$. I was told that this could be done via induction. But how does one show this?

I'd greatly appreciate any help, tips, ideas or solutions/proofs. I hope that I made clear what the problem is, as I translated it from my native language into english. If there's something that is not understandable, please let me know. Thanks in advance and have a great day!

2

There are 2 best solutions below

0
On BEST ANSWER

HINT: The induction is on $n$. If $n=2$, the only possibility is that $k_1=k_2=1$: no other combination gives you $p_1+p_2=1$. You can easily verify that in that case $h(c_1)=h(c_2)=1$. For the induction step, assume the result for some $n\ge 2$, and prove it for $n+1$. Note that after you combine the the two least probable characters in the first step of the Huffman algorithm, you have in effect an alphabet of $n$ characters, one corresponding to the union of the two least probable characters, so you can apply the induction hypothesis.

0
On

You have the pieces you need. You will show by induction on the number of nodes that $h(c_i)=k_i$. The first step in constructing the Huffman tree $T$ will be to combine $c_{n-1}$ and $c_n$; since $p_{n-1}=p_n=2^{-k}$ the combined node has weight $2^{-(k-1)}$ -- in particular, $1$ over a power of two. So the induction hypothesis applies to the tree $T'$ with the combined node; in particular $h(c_i')=k_i'$ for the $T'$ code. All that remains is to show that when you now split that final node of $T'$ into $c_{n-1}$ and $c_{n}$ the probabilities and depths work out correctly.