How can I design a non-integral-power efficient range coding that is statistically "fair" over the whole range?


Numbers are represented using binary digits ("bits"), which are integers in the set $\{0,1\}$.

Positive integers, for example, are usually encoded as a sequence of bits so that the position of each bit determines which power of 2 it contributes. For example, the number $a$ represented by a binary vector $$[b_{N-1},\cdots,b_1,b_0]$$ is:

$$a = \sum_{i=0}^{N-1} b_i \cdot 2^i$$

Now suppose we know beforehand the range of numbers to be stored, $[0,a_{max}]$. In many cases $a_{max}$ is not of the form $2^k-1$ for any $k$. In other words, to be able to store any such number we must choose a number of bits for which some combinations of bits go unused. From a data compression standpoint this is unfortunate, and it is especially bothersome when $a_{max}$ is a small number and we have very many numbers to store. Take for example $a_{max} = 2$: this requires us to use $2$ bits per number, yet $1/4$ of what each pair of bits can represent goes unused (the number 3, represented by the two bits 11, never occurs).
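As a sketch of one workaround for the $a_{max}=2$ case: since $3^5 = 243 \le 256 = 2^8$, five base-3 symbols can be grouped into a single byte, giving $8/5 = 1.6$ bits per symbol instead of 2 (the theoretical bound is $\log_2 3 \approx 1.585$). The helper names below are my own:

```python
def pack_trits(trits):
    """Pack five base-3 symbols (each in {0,1,2}) into one byte.
    3**5 = 243 <= 256, so the result always fits in 8 bits."""
    assert len(trits) == 5 and all(0 <= t <= 2 for t in trits)
    value = 0
    for t in trits:
        value = value * 3 + t  # positional base-3 encoding
    return value  # an integer in [0, 242]

def unpack_trits(value):
    """Inverse of pack_trits: recover the five base-3 symbols."""
    trits = []
    for _ in range(5):
        value, t = divmod(value, 3)
        trits.append(t)
    return trits[::-1]
```

Grouping more symbols per block pushes the rate arbitrarily close to $\log_2 3$, at the cost of larger block arithmetic.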

How can we in this case find a way to compress a set of numbers in the range $\{ 0,\cdots,a_{max} \}$ so that we "on average" (in some sense) don't need more bits than theoretically necessary?
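The theoretical minimum here is $\log_2(a_{max}+1)$ bits per symbol. One way to illustrate it is to view the whole sequence as a single number written in base $a_{max}+1$; a minimal sketch (function names are my own, using Python's arbitrary-precision integers):

```python
def encode(symbols, a_max):
    """Pack symbols from {0, ..., a_max} into one big integer
    via positional notation in base a_max + 1."""
    base = a_max + 1
    value = 0
    for s in symbols:
        assert 0 <= s <= a_max
        value = value * base + s
    # Bits needed for any sequence of this length; this approaches
    # len(symbols) * log2(base), the theoretical minimum.
    nbits = (base ** len(symbols) - 1).bit_length()
    return value, nbits

def decode(value, n_symbols, a_max):
    """Inverse of encode: recover the original symbol sequence."""
    base = a_max + 1
    out = []
    for _ in range(n_symbols):
        value, s = divmod(value, base)
        out.append(s)
    return out[::-1]
```

For $a_{max} = 2$ and six symbols this needs $\lceil 6 \log_2 3 \rceil = 10$ bits rather than 12, and streaming variants of the same idea (range/arithmetic coders) avoid the big-integer arithmetic.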

1 Answer

Use a variable-length prefix-free code such as a Huffman code; some numbers will then be assigned shorter codewords (see below). If the numbers arrive sequentially, all you need to do is keep track of how many of them fall in the interval $[2^{k-1},a_{max}]$, which tells you where to look for the $j^{th}$ stored number. The addressing scheme is of course an overhead; there is no free lunch. The example below encodes the interval $[0000,1010]=[0,10].$

(Figure: a prefix-free codeword assignment for the values $0$ through $10$; image not reproduced.)
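Since the image is unavailable, here is one concrete prefix-free code for the eleven values $0$ through $10$ (my own construction, not necessarily the one in the original figure): five 3-bit and six 4-bit codewords, with Kraft sum $5/8 + 6/16 = 1$.

```python
# One possible prefix-free code for the eleven values 0..10:
# five 3-bit codewords and six 4-bit codewords.
code = {
    0: "000", 1: "001", 2: "010", 3: "011", 4: "100",
    5: "1010", 6: "1011", 7: "1100", 8: "1101", 9: "1110", 10: "1111",
}

# Verify the prefix-free property: after lexicographic sorting,
# a prefix would appear immediately before a word it prefixes.
words = sorted(code.values())
assert all(not b.startswith(a) for a, b in zip(words, words[1:]))

# Average codeword length for equiprobable symbols:
avg = sum(len(w) for w in code.values()) / len(code)
```

The average is $(5 \cdot 3 + 6 \cdot 4)/11 = 39/11 \approx 3.545$ bits, between the entropy $\log_2 11 \approx 3.459$ and the fixed-length cost of 4 bits.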

You mentioned probability: if you know which numbers occur more frequently, you can map them to shorter codewords, the full-blown version of this being a Huffman code built from the complete probability distribution.
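A minimal sketch of that full-blown version, building a Huffman code from a given distribution (function name and representation are my own; a lone symbol would get the empty codeword, which real coders special-case):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code from {symbol: probability}.
    More probable symbols receive shorter codewords."""
    tiebreak = count()  # keeps heapq from comparing symbol lists
    heap = [(p, next(tiebreak), [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Prepend a bit to every codeword in each merged subtree.
        for s in syms1:
            codes[s] = "0" + codes[s]
        for s in syms2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return codes
```

For example, with the skewed distribution $\{0\!:\!0.5,\ 1\!:\!0.3,\ 2\!:\!0.2\}$ the most frequent symbol gets a 1-bit codeword and the other two get 2 bits, for an average of $1.5$ bits per symbol versus $2$ for a fixed-length code.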