Why can't I compress an encrypted file?


Let's say I have a txt file, called harry_potter.txt. I can easily compress it with any compression algorithm.

So the entropy of the file is "smaller" than its size on the disk.

But if I encrypt the file with AES-256-CBC or AES-256-ECB (I used openssl and enabled the -nosalt option), the encrypted file can't be compressed anymore.

Why?

  1. Does the resulting encrypted file contain more information/entropy? Or is this a weakness of current compression algorithms?
  2. If the encrypted file does have more entropy, where does the additional entropy come from?
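You can reproduce the effect in a few lines of Python. This is only a sketch: `os.urandom` stands in for ciphertext, which is fair because good ciphertext is designed to be indistinguishable from random bytes.

```python
import os
import zlib

# Ordinary text compresses well because it is full of repeated patterns.
text = b"All was well. " * 4096            # ~57 KB of highly redundant "plaintext"
compressed_text = zlib.compress(text, 9)

# Good ciphertext is computationally indistinguishable from random bytes,
# so os.urandom output is a reasonable stand-in for an encrypted file here.
noise = os.urandom(len(text))
compressed_noise = zlib.compress(noise, 9)

print(len(text), len(compressed_text))     # the text shrinks dramatically
print(len(noise), len(compressed_noise))   # the "ciphertext" does not shrink
```

Swapping in the actual bytes of your openssl output for `noise` gives the same result: the compressed size ends up slightly *larger* than the input, because of container overhead.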
There are 6 answers below.

Accepted answer (score 0)

In the sense of Kolmogorov complexity, the encrypted file doesn't contain much more information than the original, because it can be represented as encrypt(<your key>, decompress(<compressed file contents>)). But discovering this representation would require knowing your key or cracking the encryption, and that's not the business of compression algorithms.
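That representation can be illustrated with a toy example. This sketch uses a made-up XOR stream "cipher" built from SHA-256 (not real AES, just something self-contained): the pair (key, compressed plaintext) is a short description from which the ciphertext can be reconstructed exactly.

```python
import hashlib
import zlib

def keystream(key: bytes, n: int) -> bytes:
    # Toy counter-mode keystream: NOT a real cipher, just a deterministic
    # pseudorandom stream so the example is self-contained.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    # XOR stream encryption; applying it twice with the same key decrypts.
    return bytes(p ^ k for p, k in zip(plaintext, keystream(key, len(plaintext))))

plaintext = b"It was the best of times, it was the worst of times. " * 1000
key = b"sixteen byte key"
ciphertext = encrypt(key, plaintext)

# The ciphertext admits a short description: the key plus the compressed plaintext.
short_description = (key, zlib.compress(plaintext, 9))
rebuilt = encrypt(short_description[0], zlib.decompress(short_description[1]))

print(rebuilt == ciphertext)                    # True: exact reconstruction
print(len(key) + len(short_description[1]))     # far smaller than len(ciphertext)
```

The short description exists, but no generic compressor can find it without the key.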

Answer (score 7)

You can think of compression as an injective function $f:X\to X$ where $X$ is the set of all strings. A string is a finite sequence with elements taken from a finite set called an alphabet.

Now let $X_n$ denote the subset of all strings of length at most $n$. It is finite, and each $X_{n+1}$ contains $X_n$ as a proper subset. Therefore it is impossible for $f$ to map $X_{n+1}$ to $X_n$ and remain injective.

What this means is that for every compression algorithm there is an input for which the output is at least as big as the input. In fact, if the algorithm actually shrinks at least one string, some other input's output must be strictly bigger.
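The counting behind this pigeonhole argument is easy to check directly:

```python
# Pigeonhole check: over a binary alphabet there are 2**n strings of length
# exactly n, but only 2**n - 1 strings of length strictly less than n, so no
# injective "compressor" can shorten every length-n input.
for n in range(1, 20):
    strings_of_length_n = 2**n
    strictly_shorter = sum(2**k for k in range(n))   # 2**0 + ... + 2**(n-1)
    assert strictly_shorter == strings_of_length_n - 1
print("for every n, the shorter strings are one too few to hold all length-n inputs")
```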

So how does compression achieve smaller sizes? It cheats. It compresses only some strings, typically those with enough exploitable patterns, such as text written by a human. It analyzes the input, searches for patterns, and takes advantage of what it finds to compress.

And therefore it won't work on the opposite kind of input: strings with few internal patterns, such as the output of a good encryption algorithm, a random string, or even the output of a previous compression pass (i.e., compressing twice).

But this totally depends on the algorithm. It is in fact possible to create a compression algorithm that does the opposite: works well for encrypted strings and badly for human-readable ones. One such algorithm is "decrypt, then apply classical compression". So the effect isn't really about any absolute notion of "entropy".

Answer (score 3)

Technically speaking, the entropy of the ciphertext is at most the entropy of the plaintext plus the length of the encryption key.

However, the compression algorithm doesn't know how its input was generated. As far as it's concerned, its input stream is just a bunch of random noise, which appears to lack any patterns, redundancy, or structure. So it doesn't compress well.

In theory, a more sophisticated compression tool could crack the encryption, and output a file containing the encryption key plus a well-compressed plaintext. But besides being computationally difficult, that would defeat the point of encrypting the file in the first place.
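As a toy illustration of such a "cracking compressor", here is a sketch built around a deliberately weak single-byte XOR cipher, where brute force is actually feasible (unlike AES):

```python
import zlib

def toy_encrypt(key: int, plaintext: bytes) -> bytes:
    # Deliberately weak single-byte XOR "cipher" so brute force is feasible.
    return bytes(b ^ key for b in plaintext)

def cracking_compressor(ciphertext: bytes) -> bytes:
    # A "compressor" that tries every key, keeps whichever decryption
    # compresses best, and stores the key alongside the compressed plaintext.
    best = min(range(256),
               key=lambda k: len(zlib.compress(toy_encrypt(k, ciphertext), 9)))
    return bytes([best]) + zlib.compress(toy_encrypt(best, ciphertext), 9)

def matching_decompressor(blob: bytes) -> bytes:
    # Re-encrypting the decompressed plaintext reproduces the ciphertext.
    key, payload = blob[0], blob[1:]
    return toy_encrypt(key, zlib.decompress(payload))

plaintext = b"Mr. and Mrs. Dursley, of number four, Privet Drive. " * 500
ciphertext = toy_encrypt(0x5A, plaintext)

blob = cracking_compressor(ciphertext)
print(len(ciphertext), len(blob))                 # big -> small
print(matching_decompressor(blob) == ciphertext)  # True: lossless round trip
```

For a 256-bit AES key the same brute-force loop would take astronomically long, which is exactly the point of the answer above.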


A similar effect can be achieved without encryption, by using algorithmically-generated data, e.g.,

data = bytes([pow(3, i, 256) for i in range(2**20)])
with open('foo.dat', 'wb') as f:
    f.write(data)

The foo.dat file can contain no more entropy than this 103-byte Python script. But on my machine, gzip -9 foo.dat still produces 3189 bytes of output, because gzip just isn't that good at detecting the pattern.

Answer (score 0)

This is somewhat tangential to the precise question you asked, but it may nevertheless be interesting. At a high level, while the compression approaches you applied cannot compress encrypted data, there are schemes that can in fact significantly compress encrypted data, provided the secret key is available at decompression time (but not during compression, of course).


To my knowledge, the first work proposing compression of encrypted data is by Johnson et al. [1]. The main insight is that although the compression algorithm sees only the encrypted stream, if we assume the decoder has access to the secret key, then it can combine decoding and decryption into one step and realize compression gains via a reduction to the Wyner-Ziv source coding problem. I'll give a brief account of the results, restricting to the lossless setting for simplicity.

Structure: Suppose that the secret key $K^n$ is drawn uniformly from a secret-key codebook $\mathcal{K}_n \subset \{0,1\}^n$ of size $2^{nR'}$, and the data $X^n \in \{0,1\}^n$ is drawn iid from some law $p$ and encrypted to give $Y^n = X^n \oplus K^n$, where the XOR is applied bitwise. An encoder maps $Y^n$ to an $nR$-bit string $B$ without knowledge of $K^n$, but knowing the secret-key codebook $\mathcal{K}_n$; a decoder then uses $(B, K^n)$ to produce $\hat{X}^n$ (implicitly, we assume the decoder receives the secret key through some secure channel). Say that the rate pair $(R', R)$ is achievable with Wyner-sense perfect secrecy if we can find an encoder and decoder that ensure $P(X^n \neq \hat{X}^n) \to 0$ as $n$ diverges.

The insight is that, as far as the encoder is concerned, the secret key $K^n$ is just side information available at the decoder. Using this, the paper argues that if you construct $\mathcal{K}_n$ by drawing $2^{nR'}$ iid sequences according to the law of $X$ itself, $p$, and use Slepian-Wolf coding for $X^n$, then the rate pair $(H(X), H(X))$ is achievable with Wyner-sense perfect secrecy. In other words, you can compress to the optimal rate and retain $H(X)$ bits of secrecy (the best possible, since the source itself has only $H(X)$ bits of entropy).


[1]. Johnson, M., Ishwar, P., Prabhakaran, V., Schonberg, D., & Ramchandran, K. (2004). On compressing encrypted data. IEEE Transactions on Signal Processing, 52(10), 2992-3006.

Note that the paper is 20 years old and well cited, so this theory has likely evolved since. I haven't read much on the topic beyond this, but you can explore further by chasing down papers that cite it.

Answer (score 3)

Algorithmic Weakness

Other answers have given good theoretical explanations. The reality is that compression algorithms in practical use tend to be shallow, and by that I mean that they look for structure over relatively small windows of the input. The general strategy is to interpret the input as a sequence of words of bit length N, and then create a dictionary that allows you to do something like a Huffman encoding of those words so that you can match them to codes with an average length shorter than N.

Text files are easy to compress because English text is a sequence of 8-bit words (ASCII/UTF-8, at least), but the byte values that actually occur are far from uniformly distributed over all 256 patterns. Thus, it is easy to construct a dictionary where the most frequent letters get codes of just a few bits, while the rarest might be 16 bits or more. If you increase the window to, say, 64 bits, you'll still see that the "words" which occur in the input are distributed in a very non-uniform manner.

Encryption algorithms, on the other hand, are explicitly designed to produce output that almost perfectly defeats this type of compression. That's because any information about the plaintext word frequencies must not leak into the ciphertext, or the code can be more easily broken. Thus, encryption algorithms explicitly produce outputs with the most uniform word distributions possible, at almost every window size. No matter what scale you look at it, a good ciphertext should be statistically indistinguishable from random noise. And random noise cannot be compressed with a clever dictionary.
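You can see this non-uniformity directly by measuring the empirical byte-level entropy. In this sketch, `os.urandom` again stands in for good ciphertext:

```python
import math
import os
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    # Empirical (zeroth-order) entropy of the byte-value distribution,
    # in bits per byte; the maximum possible is 8.0.
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = ("It is a truth universally acknowledged, that a single man "
        "in possession of a good fortune must be in want of a wife. " * 2000).encode()
noise = os.urandom(len(text))   # stand-in for good ciphertext

print(f"text : {bits_per_byte(text):.2f} bits/byte")   # well below 8: skewed
print(f"noise: {bits_per_byte(noise):.2f} bits/byte")  # essentially 8: uniform
```

The gap between the measured entropy and 8 bits/byte is roughly the headroom a dictionary coder has to work with at the single-byte window size.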

The structure present in the plaintext has been obfuscated by rounds of substitution and permutation (and often some padding) into a string with no manifest structure whatsoever. So you could say that, due to the domain-agnostic nature of general-purpose compression algorithms, ciphertexts are the most difficult inputs to compress.

However, for any compression algorithm to effectively compress encrypted inputs, it would likely have to decrypt them first. Knowing the domain of the source input should give no advantage in compressing the ciphertext; if it did, the encryption algorithm would be leaking information and would be considered weak.

Answer (score 0)

Take an incompressible string $s$. Since the encryption function is injective, in general the encrypted string $E(s)$ will also be incompressible. Now take a compressible string $t$ and its encryption $E(t)$.

We expect a good encryption function to be semantically secure: if an attacker gives us two plaintexts $m_0$ and $m_1$ and we return $E(m_x)$ for a randomly chosen $x$, the attacker cannot determine whether $x = 0$ or $x = 1$ with non-negligible advantage.

Now if $E(t)$ were compressible, that fact could be used to distinguish $E(s)$ from $E(t)$, so the encryption could not be semantically secure. Hence essentially all outputs of a semantically secure scheme must be incompressible.
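A toy version of this distinguisher, as a sketch: repeating-key XOR plays the broken cipher that leaks plaintext structure, while `os.urandom` plays a one-time-pad keystream.

```python
import os
import zlib

# t: a highly compressible plaintext (all zero bytes).
compressible_pt = b"\x00" * 4096
key = os.urandom(16)

# Broken cipher: repeating-key XOR preserves the plaintext's periodicity.
leaky_ct = bytes(b ^ key[i % 16] for i, b in enumerate(compressible_pt))
# Sound cipher: one-time-pad-style XOR with a fresh random keystream.
sound_ct = bytes(b ^ k for b, k in zip(compressible_pt, os.urandom(4096)))

# The leaky ciphertext inherits the plaintext's redundancy and compresses,
# so a compressor alone distinguishes it; the sound ciphertext looks uniform.
print(len(zlib.compress(leaky_ct, 9)) < len(leaky_ct))    # True: distinguishable
print(len(zlib.compress(sound_ct, 9)) >= len(sound_ct))   # True: looks random
```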

Note that in practice this argument applies most cleanly to symmetric encryption. Ciphertexts of asymmetric schemes are not arbitrary bitstrings but elements of structured algebraic sets (for example, integers modulo $N$ or elliptic-curve points), so their standard encodings can contain redundancy that admits some compression; elliptic-curve point compression is a well-known example.