A special type of one-symbol-escape Huffman code: which source(s) is it optimal for?


Background / Context:

The other day I implemented a special case of Huffman coding, constructed from blocks of $2^{N}-1$ valid code words plus one "escape" into a set of longer codes.

Since there is only one escape per block, no tree or other data structure is needed to represent the codes. Instead the code can be described as an ordered tuple of bit widths drawn from $\{1,2,3,4,\dots\}$: a block of width $N$ provides $2^N-1$ allocated code words (so $\{1,3,7,15,\dots\}$), with one specific code word reserved as an "escape" to the next block in the tuple.
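A minimal Python sketch of such an encoder (the function name, and the choice that in the last block the escape pattern itself serves as a final valid code word, are my own assumptions, not from the question):

```python
def encode(symbol, widths):
    """Encode a symbol index with a chain of escape blocks.

    widths[k] is the bit width of block k; block k offers
    2**widths[k] - 1 usable code words, and the all-ones pattern
    of that width escapes to block k+1.
    """
    bits = []
    for k, n in enumerate(widths):
        capacity = 2 ** n - 1  # code words available in this block
        if symbol < capacity:
            bits.append(format(symbol, f"0{n}b"))
            return "".join(bits)
        bits.append("1" * n)  # all-ones word = escape to next block
        symbol -= capacity
        if k == len(widths) - 1 and symbol == 0:
            # In a finite chain the last escape has nothing to
            # escape to, so it can itself be the final code word.
            return "".join(bits)
    raise ValueError("symbol exceeds the chain's capacity")
```

With all widths equal to 1, `encode` reproduces the sequence 0, 10, 110, 1110, 1111 discussed below.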

Actual question

This question concerns the specific instance where every block has width 1, which yields the code words $\{0, 10, 110, 1110, 1111\}$. If I recall my schooling correctly, this particular set of codes can be proven optimal for run-lengths drawn from some probability distribution.

Can you help me find it?

As I think there is a fair chance this particular example appears in a typical introductory textbook on data compression, answers with references to the literature are also welcome.

Best answer:

Let $p_1 \geq p_2 \geq \dots \geq p_n$ be the probabilities of the code words, in descending order. Then the code is optimal if for every $i=1,\dots, n$ $$ p_i \geq \sum_{j=i+2}^n p_j. $$

The idea is simple: we know that Huffman coding is optimal, and we look at when our code does something different. Recall how a Huffman code is built: it repeatedly merges the two groups with the lowest probabilities. In our code, everything goes into the same group (of $111\dots$). To produce the same code, Huffman must always merge the rest (which has probability $\sum_{j=i+2}^n p_j$) with the next element $p_{i+1}$. Since it always merges the two smallest groups, this requires $p_i \geq \sum_{j=i+2}^n p_j$; otherwise it would merge $p_i$ with $p_{i+1}$. By induction on the merge steps, our code coincides with the Huffman code and is thus optimal.
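The criterion is easy to verify numerically. Here is a sketch of my own (not from the answer), using a standard heap-based computation of Huffman code lengths:

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Optimal code-word lengths for the given probabilities."""
    counter = itertools.count()  # tie-breaker so tuples stay comparable
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1  # each merge adds one bit to every member
        heapq.heappush(heap, (p1 + p2, next(counter), a + b))
    return lengths

def chain_code_is_optimal(probs):
    """Check p_i >= sum_{j >= i+2} p_j for all i (probs descending)."""
    return all(probs[i] >= sum(probs[i + 2:]) for i in range(len(probs)))
```

For example, `[0.5, 0.25, 0.15, 0.1]` satisfies the criterion, and `huffman_lengths` indeed returns the chain-code lengths $1, 2, 3, 3$; the uniform distribution `[0.25, 0.25, 0.25, 0.25]` fails it already at $i=1$.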

That the code is otherwise not optimal can be seen as follows. If $p_i < \sum_{j=i+2}^n p_j$ for some $i$, group $p_i$ and $p_{i+1}$ together instead. The word for $p_i$ is now $1^{i-1}00$ instead of $1^{i-1}0$ (one bit longer), the word for $p_{i+1}$ is now $1^{i-1}01$ (same length), and the word for $p_j$ with $j \geq i+2$ is now $1^{j-2}0$, one bit shorter than $1^{j-1}0$. The $i$-th word gets one bit longer; the $i+2, i+3, \dots$-th words each get one bit shorter. The difference in expected length is thus $$ p_i - \sum_{j=i+2}^n p_j < 0, $$ so the modified code is shorter, and ours would not be optimal.
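To make the swap argument concrete (a numerical example of my own, not from the answer): for four equiprobable symbols the condition fails at $i=1$, and grouping $p_1$ with $p_2$ turns the chain code into the ordinary fixed-length code, which is shorter:

```python
# Four equiprobable symbols: p_1 = ... = p_4 = 1/4.
# The condition fails at i = 1, since 0.25 < p_3 + p_4 = 0.5.
probs = [0.25, 0.25, 0.25, 0.25]

# Chain code 0, 10, 110, 111 has lengths 1, 2, 3, 3.
chain_len = sum(p * l for p, l in zip(probs, [1, 2, 3, 3]))

# After the swap (group p_1 with p_2), the words become
# 00, 01, 10, 11 -- all of length 2.
swap_len = sum(p * l for p, l in zip(probs, [2, 2, 2, 2]))

# The change in expected length equals p_1 - (p_3 + p_4) = -0.25,
# exactly the difference the argument above predicts.
assert swap_len - chain_len == probs[0] - sum(probs[2:])
```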

As an example, our code is optimal for some geometric distributions, where $p_i = p(1-p)^{i-1}$ for some $p\in(0,1)$ over infinitely many symbols. As $\sum_{j=i+2}^\infty p_j = (1-p)^{i+1}$, the condition becomes $p \geq (1-p)^2$ independently of $i$, which holds exactly when $p \geq \frac{3-\sqrt 5}2 \approx 0.3820$. Heuristically, one might say that the code is optimal if the probabilities fall geometrically with ratio at most $1-\frac{3-\sqrt 5}2 = \frac{\sqrt 5-1}2$, which is the reciprocal of the golden ratio.
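A quick numerical check of this threshold (my own sketch; the function name is an assumption):

```python
import math

# Root of p = (1 - p)**2, i.e. the smallest p for which the
# condition p_i >= sum_{j >= i+2} p_j holds for a geometric source.
p_star = (3 - math.sqrt(5)) / 2

def condition_holds(p, i):
    """Check p_i >= sum_{j >= i+2} p_j for p_i = p(1-p)^{i-1}."""
    p_i = p * (1 - p) ** (i - 1)
    tail = (1 - p) ** (i + 1)  # geometric tail sum from j = i+2 on
    return p_i >= tail

# 1 - p_star is (sqrt(5) - 1) / 2, the reciprocal of the golden ratio.
```

For instance, `condition_holds(0.4, i)` is true for every $i$ (since $0.4 > p^\ast$), while `condition_holds(0.35, 1)` already fails.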