Using entropy to determine if a sentence is likely to be valid English


During my course of study for a cryptography class, I had to develop a simple Python script that would "bruteforce" the ciphertext for a Caesar Cipher. As an additional exercise, I wanted the script to also automatically determine which of the 26 outputs is most likely to be the correct answer.
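For reference, the brute-force loop itself can be sketched like this (an illustrative sketch, not my exact assignment script; `caesar_decrypt` is a hypothetical helper name):

```python
def caesar_decrypt(ciphertext, shift):
    """Shift each letter A-Z back by `shift`, leaving other characters alone."""
    result = []
    for c in ciphertext.upper():
        if 'A' <= c <= 'Z':
            # Map A-Z to 0-25, shift backwards mod 26, map back to A-Z
            result.append(chr((ord(c) - ord('A') - shift) % 26 + ord('A')))
        else:
            result.append(c)
    return ''.join(result)

ciphertext = "QNYYQJ UNLLD BJSY YT YMJ RFWPJY"
for shift in range(26):
    print(f"ROT {shift:2}: {caesar_decrypt(ciphertext, shift)}")
```

One of the 26 printed lines is the plaintext; the question is how to pick it out automatically.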

To do this, I decided to calculate the bits of entropy for each candidate string to see if I could use this to highlight the correct answer. The specific equation being $H(X) = -\sum_c p_c \log_2(p_c)$, where $p_c$ is the probability of letter $c$ appearing in English according to a letter frequency chart (e.g. http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html).

This approach appears to have worked. In all of my test cases, the correct answer was the one with the highest entropy. For example:

Ciphertext: QNYYQJ UNLLD BJSY YT YMJ RFWPJY
Most Likely Plaintext (ROT 5): LITTLE PIGGY WENT TO THE MARKET (H = 6.365)
----------------------
ROT 0:  QNYYQJ UNLLD BJSY YT YMJ RFWPJY (H = 3.399)
ROT 1:  PMXXPI TMKKC AIRX XS XLI QEVOIX (H = 3.946)
ROT 2:  OLWWOH SLJJB ZHQW WR WKH PDUNHW (H = 3.993)
ROT 3:  NKVVNG RKIIA YGPV VQ VJG OCTMGV (H = 3.734)
ROT 4:  MJUUMF QJHHZ XFOU UP UIF NBSLFU (H = 3.655)
ROT 5:  LITTLE PIGGY WENT TO THE MARKET (H = 6.365)
ROT 6:  KHSSKD OHFFX VDMS SN SGD LZQJDS (H = 4.224)
...snip...
ROT 24: SPAASL WPNNF DLUA AV AOL THYRLA (H = 5.611)
ROT 25: ROZZRK VOMME CKTZ ZU ZNK SGXQKZ (H = 3.275)

So it appears to work, at least for these few test cases. What I'm confused about is why it works and what exactly I'm measuring. I initially assumed the correct answer would have the lowest entropy, since it "would be the least random", but it turned out to be the one with the highest. Did I happen to use the entropy formula to calculate something that appears to be useful but isn't actually entropy at all?

Edit: The function in question.

import math

# Calculate the bits of entropy for the given string based on English letter frequencies
def entropy(input_string):
    # Letter Frequency Chart for English
    freq = { 'E': 0.1202, 'T': 0.091, ...snip... }

    # Using the frequency of a letter as p(x), calculate entropy of the string using the formula:
    # H = -sum of (p(x) * log2(p(x)))
    total_entropy = 0
    for letter in input_string:
        if letter in freq:  # skip spaces and any other non-letters
            total_entropy += freq[letter] * math.log(freq[letter], 2)

    return -total_entropy

Example Output:

>>> entropy("LITTLE PIGGY WENT TO THE MARKET")
6.365324803946881
>>> entropy("QNYYQJ UNLLD BJSY YT YMJ RFWPJY")
3.398563530183954

Final Edit:

Just adding the final function for any future students who may stumble across this question.

import math

def entropy(s):
    # Letter Frequency Chart for English
    freq = {
        'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
        'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
        'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
        'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
        'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
        'Z': 0.0007
    }

    # Ensure the string's case matches the dictionary keys
    s = s.upper()

    # Using the frequency of a letter as p(x), calculate the average information
    # content of the string using the formula:
    # H = [sum of -log2(p(x))] / len(s)
    total_entropy = 0
    for c in s:
        if c in freq:  # only score the letters A-Z
            total_entropy += -math.log(freq[c], 2)

    return total_entropy / len(s)

Example Output:

>>> entropy("TEST")
3.491390529605717
>>> entropy("ZLYZ")
7.794603966207939
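Putting it together: with this scoring function, the full attack just keeps the rotation with the lowest score. A sketch of that last step (the `rot` and `crack` helpers are illustrative names, not part of the original assignment; the frequency table is repeated so the snippet is self-contained):

```python
import math

# English letter frequencies, as in the function above
FREQ = {
    'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
    'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
    'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
    'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
    'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
    'Z': 0.0007,
}

def entropy(s):
    # Average -log2(p(x)) over the string, counting letters only
    s = s.upper()
    total = sum(-math.log2(FREQ[c]) for c in s if c in FREQ)
    return total / len(s)

def rot(s, shift):
    # Decrypt a Caesar cipher by shifting letters back by `shift`
    return ''.join(chr((ord(c) - 65 - shift) % 26 + 65) if c.isalpha() else c
                   for c in s.upper())

def crack(ciphertext):
    # The candidate made of common English letters has the smallest
    # average -log2(p), so the lowest score wins
    return min((rot(ciphertext, k) for k in range(26)), key=entropy)

print(crack("QNYYQJ UNLLD BJSY YT YMJ RFWPJY"))
```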
Best Answer

You are computing (it seems)

$$ \sum_{i=1}^n - p(x_i) \log p(x_i) \tag 1$$ where $x_i$ is the letter in the $i$-th position, and $p(x)$ is the probability of letter $x$ in English.

That means nothing, as far as I can see.

What you should compute, I think, is

$$ \tilde{H}= \frac{1}{n} \sum_{i=1}^n - \log p(x_i) \tag 2$$

If $n$ is large, by the law of large numbers, that average tends to the expected value of the quantity being averaged. That is

$$ \tilde{H} \approx E [- \log p(x_i)] \tag 3$$

Now, if our letters really follow the statistics of English, then the above is precisely the entropy: $\tilde{H} \approx H$. [*] [**]

Otherwise, we are averaging over another probability distribution. In that case, a well-known result states that $$\tilde{H} \approx H + D(Q||P) \tag4$$

where $D(Q||P)$ is the Kullback–Leibler divergence (or distance) (or relative entropy) between the real distribution and the ideal distribution. Because $D(Q||P) >0$ if $Q\ne P$, you should get the lowest $\tilde{H}$ for the true distribution.
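A quick numerical check of this (a sketch using the asker's frequency table; $\tilde{H}$ is averaged over letters only, so spaces don't dilute the comparison):

```python
import math

# English letter frequencies (the asker's table), taken as the ideal p(x)
P = {
    'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
    'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
    'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
    'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
    'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
    'Z': 0.0007,
}

def h_tilde(s):
    # Formula (2): average -log2 p(x_i) over the letters of s
    letters = [c for c in s.upper() if c in P]
    return sum(-math.log2(P[c]) for c in letters) / len(letters)

# The entropy of English itself (zero-order): H = -sum p log2 p, about 4.18 bits
H = -sum(p * math.log2(p) for p in P.values())

print(f"H          = {H:.3f}")
print(f"H~(plain)  = {h_tilde('LITTLE PIGGY WENT TO THE MARKET'):.3f}")
print(f"H~(cipher) = {h_tilde('QNYYQJ UNLLD BJSY YT YMJ RFWPJY'):.3f}")
```

The plaintext's $\tilde{H}$ lands close to $H$, while the ROT-0 string's $\tilde{H}$ is pushed up by the $D(Q||P)$ term, as (4) predicts.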


[*] That's actually the zero-order entropy, i.e., the one that does not take into account dependence on past letters. This, of course, is grossly inadequate for English.

[**] BTW, given that you have the true distribution, you could directly compute $H$ and compare.