Compressing binary numbers

2.2k Views Asked by At

If I have a arbitrarily long random binary number with the condition that the probability that a given digit is 0 and 1 is 1/4 and 3/4, respectively. What is the best way to compress this into a binary number, i.e. an algorithm with the smallest expected length of the compressed binary number.

Additionally, what is the information-theoretic bound, and is it achievable.

2

There are 2 best solutions below

0
On BEST ANSWER

So you have a Bernoulli$(1/4)$ source (which we presume to be i.i.d), which has entropy $-\frac{1}{4} \log_2 \frac{1}{4} - \frac{3}{4} \log_2 \frac{3}{4}$ which is a lower bound on bits per symbol for any compression scheme.

Now, if you use a universal source coder, such as arithmetic coding which maps infinite source sequences to incompressible binary sequences, you will get within at most 2 bits for any block length.

The Lempel-Ziv algorithms (also universal source codes; the basic variants are sliding window and tree structured) are known to have their asymptotic compression rate approach the entropy rate of the source (asymptotically) if the source is stationary and ergodic - that is, if the amount of symbols you're coding goes to infinity, you'll approach the entropy rate of the source.

For more details (and proofs), see Chapter 13 of Cover and Thomas' Elements of Information Theory 2e.

0
On

If the bits are independent, then, as Batman said above, the entropy is around 0.811 bits per input bit, which means that you can compress the bit string up to 81.1% of original size.

To attain that, you can do a Huffman code of some extension of the source.

For example, taking the 3 extension, we'd have eight symbols with these probabilities, and this Huffman tree and (possible) code:

         p     code       len   p x len
0 111  27/64    0          1     27/64  
1 011   9/64    100        3     27/64  
2 101   9/64    101        3     27/64  
3 110   9/64    110        3     27/64  
4 001   3/64    11100      5     15/64  
5 010   3/64    11101      5     15/64  
6 100   3/64    11110      5     15/64
7 000   1/64    11111      5      5/64

Average length= (158/64)/3 = 0.823 . To get closer to 0.811 we could use higher order extensions. Or use arithmetic coding.