Standard deviation of length of compressed file with arithmetic coding

112 Views Asked by At

I am working through the exercises in David Mackay's book "Information theory, inference and learning algorithms" (https://www.inference.org.uk/itprnn/book.pdf) and I am stuck with problem 6.9. In that exercise I am asked to estimate the mean and the standard deviation of the length of a compressed file consisting of 1000 samples from a Bernoulli distribution with $p=0.01$. I know that the mean of the length is 83 bits because the entropy of the random variable is $H(X)=0.081$ and so the length of a compressed file using arithmetic coding is the entropy times the number of samples plus 2. However I am stuck with defining the standard deviation. I know from page 128 that the answer is 21 bits but I don't understand why.

1

There are 1 best solutions below

2
On BEST ANSWER

Edited: improved following OP's comment

For an arithmetic binary code of length $N=1000$, since the arithmetic code corresponds esentially to an quasi optimal code of the extended (joint) sequence, we can expect the length of any input to be very close to the "ideal" length $-\log_2(p_i)$ plus an excess of $\epsilon \approx 2$ bits - here $p_i$ is the joint probability of the full sequence. That's why we get a mean value of $N H(X_i) +\epsilon$.

To compute the variance, we can disregard the $\epsilon$ term, and assume that the code length of each sequence is determined by its probability, and hence by the number of ones (which we call $k$).

$$ L =-\log p^{k}(1-p)^{N-k}=-N\log(1-p) + k \log(1/p-1) = a + b \, k \tag 1$$

where $a,b$ are given constants and $k$ and $L$ are random variables. In particular, $k$ is a Binomial with mean $Np$ and variance $N p (1-p)$.

Hence the variance of $L$ is $$ \sigma_L^2=b^2 \sigma_k^2=\log(1/p-1)^2 k p (1-p)=435.08\cdots$$

with a standard deviation around $21$.