How can we know it is better to compress some data with an algorithm based on our previous experience of a similar type of data?

35 Views Asked by At

In practice, people use the same type of compression algorithm for similar data from a similar source. Why can we compress different pictures with the same compression algorithm, and it works for all? Is it because of a statistical or information-theoretic property of the data, and if it is, what is that property?

1

There are 1 best solutions below

1
On BEST ANSWER

There's compresson and then there's compression. Say a compression algorithm is lossless if there exists a decompression routine that will return exactly te original file, for any file. Conversely, an algorithm is lossy if it is not lossless. Typical flavors of jpeg are lossy, for example: When you convert an image to jpeg in a graphics program you get to choose a compression ratio; more compression gives a smaller jpeg but it looks less like the original.

Never mind lossy compression for now; an explanation of why that works at all would involve talking about how visual perception works. Regarding lossless compression: How can we know it works on all files? We can't know that, in fact we do know just the opposite:

Obviousness. There does not exist a lossless compression algorithm that decreases the size of every non-empty file.

Proof: say $C(f)$ is the compressed version of the file $f$. Suppose $f$ and $g$ are one-byte files with $f\ne g$. Then $C(f)$ and $C(g)$ are both zero-byte files, so $C(f)=C(g)$; hence $C$ is not one-to-one.

So why do compression algorithms work in practice? They take advantage of redundancies common to the files being compressed.

For example, the sort of image files that are often converted to gif typically involve blocks of solid color. Which is to say there's a lot of repetition in thhe bit pattern, making "run-length encoding" useful:

Say the file contains the string "0000000000111110000000". One can easily concoct a format that says "ten zeroes, then five ones, then seven more zeroes" in less than 22 bytes.