Is compression possible on a uniform distribution of symbols?


I was reading about limits of data compression and I had this question that kept me thinking.

Alphabet set: Set of all the unique symbols in the input data

Q. If the distribution over the alphabet set is uniform, is it theoretically possible to compress the data? Standard compression methods (like Huffman coding) exploit bias in the frequency distribution to obtain codes with a shorter average length than the input text. Is it possible to compress without exploiting frequency?
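To illustrate the bias that Huffman coding exploits, here is a minimal sketch (not a production coder; the alphabet and frequencies are arbitrary choices) that builds a Huffman tree and reports each symbol's code length:

```python
import heapq

def huffman_code_lengths(freqs):
    # Each heap entry: (total frequency, unique tiebreaker, {symbol: depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**c1, **c2}.items()}
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

# Uniform frequencies: every symbol ends up with the same code length (2 bits).
print(huffman_code_lengths({"a": 1, "b": 1, "c": 1, "d": 1}))

# Skewed frequencies: the frequent symbol gets a shorter code than the rare one.
print(huffman_code_lengths({"a": 9, "b": 3, "c": 2, "d": 1}))
```

With uniform frequencies Huffman degenerates to a fixed-length code, which is exactly the point of the question.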

Idea 1: Since the distribution over the symbols is uniform, no assignment of codes can favour any symbol: shortening one symbol's code forces another symbol's code to grow, and with all symbols equally likely the savings and losses cancel out. Hence the average length of the compressed data should be greater than or equal to the original average length.

Idea 2: Another idea that comes to mind: if the distribution of single symbols is uniform, why not create a new symbol set S' consisting of all bi-grams of symbols? But this distribution must also be uniform, since the probability of each pair of symbols is the same for all possible pairs.
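Idea 2 can be checked empirically, at least for an independent source. A quick sketch (the alphabet, seed, and sample size are arbitrary choices): draw i.i.d. uniform symbols and count non-overlapping bigrams; every one of the 16 bigrams hovers around probability 1/16.

```python
import random
from collections import Counter

random.seed(0)
alphabet = "ABCD"
n = 200_000
data = [random.choice(alphabet) for _ in range(n)]

# Count non-overlapping bigrams: (data[0], data[1]), (data[2], data[3]), ...
bigrams = Counter(a + b for a, b in zip(data[::2], data[1::2]))
total = sum(bigrams.values())

# With i.i.d. uniform symbols every bigram should appear with
# probability 1/16 = 0.0625; empirical frequencies stay close to that.
for bg, cnt in sorted(bigrams.items()):
    print(bg, round(cnt / total, 4))
```

So for an independent source the bigram alphabet is again uniform, exactly as the idea predicts.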

These ideas force me to conclude that it is not possible to compress data if the distribution is uniform.

Please correct me if I am wrong at any point; I would also like to know your views on the problem.

Edit: If you can also provide a formal proof or a book for reference then it would be great.

3 Answers

Answer 1

Your intuitive arguments are correct, but only under an additional assumption (which, I guess, you also consider but don't state explicitly):

The source generates symbols that are not only uniformly distributed over the alphabet but also independent.

In that case, there is no (statistical) redundancy of the generated symbol sequence, hence, no room for compression.

Note that if the source symbols are correlated, then one could consider "super-symbols" (such as what you propose in idea 2) that would not be uniformly distributed, and, therefore, compressible.

Answer 2

Without loss of information, no data source can be compressed below its Shannon entropy, which is a measure of the uncertainty, or information, that you get with every new symbol.

The entropy is defined in bits as $H(X) = -\sum_i P[X=i] \log_2 P[X=i]$, where the sum runs over every symbol $i$.
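Plugging in some numbers makes the formula concrete (a small sketch; the 8-symbol alphabet and the skewed distribution are arbitrary choices):

```python
import math

def entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i), skipping zero-probability symbols
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over 8 symbols: H = log2(8) = 3 bits per symbol,
# the same as a plain fixed-length 3-bit encoding.
print(entropy([1/8] * 8))  # 3.0

# A skewed distribution over the same 8 symbols: entropy drops below
# 3 bits, so coding below 3 bits/symbol becomes possible.
print(entropy([0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]))
```

The uniform case achieves the maximum possible entropy for the alphabet size, which is why there is no slack left for compression.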

For a symbol source that produces independent and uniformly distributed symbols, the entropy equals the number of bits needed to encode the symbols directly, and no compression scheme can do any better. Your intuition about why this is so is correct.

Sometimes grouping symbols into n-grams (a "supersymbol") can help you compress your data more efficiently. This can work on a uniformly distributed source, but only if the source produces symbols that depend on each other; if they are independent, no luck. Note that this coding scheme also introduces algorithmic delay, since you need to see n symbols before producing any output.
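Here is a sketch of that situation (the binary alphabet and the 0.9 repeat probability are arbitrary choices): a source whose symbols are uniform one at a time but strongly correlated, so that pairing them exposes redundancy that single-symbol statistics cannot see.

```python
import math
import random
from collections import Counter

random.seed(1)

# A source that is uniform marginally ('0' and '1' equally likely overall)
# but correlated: each symbol repeats the previous one with probability 0.9.
n = 100_000
sym = random.choice("01")
data = []
for _ in range(n):
    if random.random() >= 0.9:
        sym = "1" if sym == "0" else "0"  # flip with probability 0.1
    data.append(sym)

def empirical_entropy(samples):
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

h1 = empirical_entropy(data)  # ~1 bit/symbol: singles look incompressible
# Entropy of non-overlapping pairs, divided by 2 to get bits per symbol:
h2 = empirical_entropy([a + b for a, b in zip(data[::2], data[1::2])]) / 2
print(h1, h2)  # h2 < h1: the bigrams are far from uniform
```

The pair distribution concentrates on "00" and "11", so a coder working on pairs can beat 1 bit per symbol even though the single-symbol distribution is uniform.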

In some types of coding you can, however, get away with using fewer bits, but then the information transmitted will not be identical to the original, only similar enough for your purpose. This is lossy compression, widely used in movies, TV and audio to exploit the fact that not every detail in the original signal is perceptible to humans. Since you can never perceive the extra information, it is as if it were redundant to begin with.

Answer 3

On a separate note, here are a few ideas that popped up in my head after reading the initial answers. Posting them here to keep the discussion along these lines separate from the original question.

Theoretically: A random symbol generator that draws symbols from the alphabet set, run repeatedly, produces data that has a uniform distribution and whose characters are independent. But since it is perfectly random, it must have maximum entropy and thus maximum information.

Practically: We use data as a way to send some message/information. Data of the above type does not help us convey information in any way, as it is just random noise. We cannot even send two identical messages using the above generator. So, practically speaking, this data type is of least importance to us: even with maximum entropy, it does not help us convey any information (yes, it seems a bit strange at first glance). So we are not really dealing with such data, and hence there is no real need to compress it.

In practice we use pseudo-random generators, so given a seed we can regenerate the whole sequence. Interestingly, the tuple (seed, message_length) is enough to reproduce a message of a given length from a pseudo-random generator, so it can be thought of as a way to compress data. But is it possible to generate all possible sequences from the generator?
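The seed-as-compression idea can be sketched in a few lines (the seed, length, and alphabet here are arbitrary choices):

```python
import random

def generate(seed, length, alphabet="ab"):
    # A dedicated Random instance makes the sequence a pure
    # function of the seed, independent of global state.
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))

msg = generate(seed=42, length=1000)
# The tiny pair (42, 1000) regenerates the exact same 1000-symbol message:
assert generate(42, 1000) == msg
```

Note, though, that a generator with k bits of internal state can emit at most 2^k distinct sequences, far fewer than the |alphabet|^length possibilities, so by a counting argument this trick cannot represent every message.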

Now, this boils down to the question: Can pseudo-random generators generate all possible sequences?