Entropy determination regarding lossless data compression


Suppose I had many computer data files to compress losslessly and wanted to know the theoretical limit on the minimum possible file size for each one. How would a math person go about determining that without having the actual working compression and decompression programs? Is it possible to tell, just from properties of the data (randomness, run lengths, patterns of bytes, repeated words...), what the limit of compression is, or can it only be approximated? For example, if there were a contest to see which pair of programs could compress and decompress $10$ different computer data files the best, how would the programmers know when they are getting close to the limit of compression and should stop "tweaking" their algorithm? Let's assume a fair mix of files, such as $1$ text-only file, $1$ image file, $1$ spreadsheet file, $1$ sound file, $1$ video+sound file...

Can this lower limit on file size be calculated, or is it just determined by trial and "error"? ("Error" being a resulting file with non-minimal size because someone else's algorithm beat it.)
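One rough, easily computable estimate is the order-0 (byte-frequency) entropy of the file: it bounds what any code that treats bytes as independent symbols can achieve. This is only a sketch of the idea, not a true limit, since real compressors exploit context and repetition that order-0 entropy ignores:

```python
import math
from collections import Counter

def order0_entropy_bits(data: bytes) -> float:
    """Shannon entropy (bits per byte) of the empirical byte distribution."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def order0_size_estimate(data: bytes) -> float:
    """Rough size estimate (in bytes) for a memoryless byte-by-byte code."""
    return order0_entropy_bits(data) * len(data) / 8

sample = b"abababababababab"
print(order0_entropy_bits(sample))  # 1.0 bit/byte: two symbols, equally likely
```

Note that `sample` above is highly compressible by run-length or dictionary methods, yet its order-0 entropy is 1 bit/byte, which illustrates why this figure is only a crude upper bound on redundancy, not the true compression limit.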

This part was added about 4.5 years after the question was initially asked... The part I am having trouble with is encoding image files as compactly as possible. What might seem like a helpful tweak for one image file might actually "hurt" some other image file (as far as compression ratio goes). So it seems there is no way to "prove" one method is generically better than another, other than perhaps taking a set of unseen images (as in a contest), having all participants compress and decompress all of the files, and seeing who has the minimum total output size. However, even that is not conclusive, because with a different set of images, someone else might win. It is not an easy thing to get a "handle" on, so to speak. I suppose if they prepared a huge "suite" of test images of all different types (large, small, line art, typical "pics", random data, pseudorandom data such as an already-compressed file, text only...), that might be a fairer test. They would make the filename extension say what type of data each file contains (.TXT, .BMP, .ZIP) so the compressor could "prepare" and select the proper algorithm(s). There would also be no time limit to compress the files, but the "winner" would have to prove that the compression and decompression work in order to "win".

Suppose that Alice lives in Boston and Bob lives in New York. They would like to report the weather (which can be either fair or rainy) over a sequence of $n$ days in their cities to Carole, who lives in San Francisco. Alice and Bob are currently not on good terms and hence only send messages directly to Carole. On average, how many bits do they jointly need to send to ensure that Carole has a good chance of reconstructing the weather in both cities?

Suppose that the weather in Boston is modelled by an ergodic process with entropy $H(X)$ and the weather in New York by an ergodic process with entropy $H(Y)$. Then, to send the weather for $n$ days, it suffices to send $n(H(X) + H(Y))$ bits. In fact, if $X$ and $Y$ are independent, this is the best you can do! (For a single source you would need $nH(X)$ bits, by the Source Coding Theorem, so the same is true if you have $2$ sources that have nothing to do with each other.)
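For a concrete feel for these numbers, here is a minimal sketch of the separate-coding budget $n(H(X)+H(Y))$ for binary (fair/rainy) weather, using the binary entropy function. The rain probabilities below are hypothetical, not from the answer:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits for a Bernoulli(p) source (e.g. rainy with probability p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical rain probabilities for Boston and New York.
n = 365
H_X = binary_entropy(0.3)
H_Y = binary_entropy(0.35)

# Bits needed per year if Alice and Bob each code their own source optimally,
# ignoring any correlation between the two cities.
print(n * (H_X + H_Y))
```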

However, since Boston and New York are geographically close, we can expect observations of $X$ and $Y$ to be highly correlated. The Slepian-Wolf Theorem establishes the corresponding limits for the setting with $2$ encoders (Alice and Bob), where each encoder observes one random source and both send directly to a decoder (Carole). Let $R_X$ be the rate at which the encoder of $X$ sends to the decoder, and let $R_Y$ be the rate of the encoder of $Y$.

Let $H(X \mid Y)$ be the conditional entropy of $X$ given $Y$: that is, the uncertainty about $X$ that remains if you already know $Y$; and let $H(X,Y)$ be the joint entropy of $X$ and $Y$. Note that $$H(X,Y) = H(X) + H(Y \mid X) \le H(X) + H(Y),$$ where equality holds if and only if $X$ and $Y$ are independent.
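The chain rule and the inequality above can be checked numerically from any joint distribution. Here is a small sketch using a hypothetical $2 \times 2$ joint pmf for (Boston, New York) weather in which the two cities usually agree:

```python
import math

def H(probs):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y): rows = Boston (fair, rainy), cols = New York.
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]

H_XY = H([p for row in p_xy for p in row])     # joint entropy H(X,Y)
H_X = H([sum(row) for row in p_xy])            # marginal entropy H(X)
H_Y = H([sum(col) for col in zip(*p_xy)])      # marginal entropy H(Y)
H_Y_given_X = H_XY - H_X                       # chain rule: H(Y|X) = H(X,Y) - H(X)

print(H_XY)               # about 1.722 bits
print(H_X + H_Y >= H_XY)  # True: correlation can only reduce the joint entropy
```

With this pmf, $H(X) = H(Y) = 1$ bit but $H(X,Y) \approx 1.722 < 2$ bits, so coding the two sources jointly could save roughly $0.278$ bits per day.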

Then, on average (for long enough sequences), the following conditions on the encoding rates are necessary and sufficient to ensure that the decoder has a good chance to reconstruct the source observations:

  1. $R_X > H(X \mid Y)$
  2. $R_Y > H(Y \mid X)$
  3. $R_X + R_Y > H(X,Y)$
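The three conditions above describe a region of achievable rate pairs, which is easy to test numerically. Below is a minimal sketch of such a membership check, again using a hypothetical correlated-weather pmf (not a distribution from the answer):

```python
import math

def _H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def in_slepian_wolf_region(R_X, R_Y, p_xy):
    """True if (R_X, R_Y) satisfies the three Slepian-Wolf conditions for pmf p_xy."""
    H_XY = _H([p for row in p_xy for p in row])
    H_X = _H([sum(row) for row in p_xy])
    H_Y = _H([sum(col) for col in zip(*p_xy)])
    return (R_X > H_XY - H_Y          # R_X > H(X|Y)
            and R_Y > H_XY - H_X      # R_Y > H(Y|X)
            and R_X + R_Y > H_XY)     # R_X + R_Y > H(X,Y)

# Hypothetical pmf with H(X) = H(Y) = 1 and H(X|Y) = H(Y|X) ~ 0.722 bits.
p = [[0.4, 0.1], [0.1, 0.4]]
print(in_slepian_wolf_region(1.0, 1.0, p))  # True: both rates exceed 0.722, sum exceeds 1.722
print(in_slepian_wolf_region(0.5, 1.0, p))  # False: R_X falls below H(X|Y)
```

Note how the pair $(0.8, 0.8)$ would satisfy the two individual conditions yet fail the sum condition, which is why all three inequalities are needed.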

You can find an encoding/decoding scheme in [A] showing that these bounds can be achieved by time-sharing the $2$ schemes $(R_X,R_Y) = (H(X),H(Y\mid X))$ and $(R_X,R_Y) = (H(X\mid Y),H(Y))$. Surprisingly, in the first scheme, the encoder of $Y$ uses only $H(Y\mid X)$ bits (on average) to encode its observations of $Y$, without ever observing $X$!
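Time-sharing simply means running the first scheme for a fraction $t$ of the blocks and the second for the remaining $1-t$, which traces out the straight line between the two corner points. A sketch, using the hypothetical symmetric source from before (with $H(X)=H(Y)=1$ and $H(X\mid Y)=H(Y\mid X)\approx 0.722$ bits):

```python
# Corner points of the Slepian-Wolf region for a hypothetical symmetric source.
H_X, H_Y = 1.0, 1.0
H_X_given_Y = H_Y_given_X = 0.722

def time_share(t):
    """Rate pair obtained by using scheme 1 a fraction t of the time, scheme 2 otherwise."""
    R_X = t * H_X + (1 - t) * H_X_given_Y
    R_Y = t * H_Y_given_X + (1 - t) * H_Y
    return R_X, R_Y

# Every point on the line keeps the optimal sum rate R_X + R_Y = H(X,Y) ~ 1.722.
print(time_share(0.5))
```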

In [A], this theorem was extended to $N$ sources. (Also, I've adapted the weather example from [A].)

[A] T. M. Cover, "A Proof of the Data Compression Theorem of Slepian and Wolf for Ergodic Sources," IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 226-228, 1975.