During my course of study for a cryptography class, I had to develop a simple Python script that "brute-forces" the ciphertext of a Caesar cipher. As an additional exercise, I wanted the script to automatically determine which of the 26 outputs is most likely the correct answer.
To do this, I decided to calculate the bits of entropy of each candidate string to see whether I could use that to highlight the correct answer. The specific equation is $H(X) = -\sum_c p_c \log_2(p_c)$, where $p_c$ is the probability of letter $c$ appearing in English text according to a letter frequency chart (e.g. http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html).
This approach appears to have worked. In all of my test cases, the correct answer was the one with the highest entropy. For example:
Ciphertext: QNYYQJ UNLLD BJSY YT YMJ RFWPJY
Most Likely Plaintext (ROT 5): LITTLE PIGGY WENT TO THE MARKET (H = 6.365)
----------------------
ROT 0: QNYYQJ UNLLD BJSY YT YMJ RFWPJY (H = 3.399)
ROT 1: PMXXPI TMKKC AIRX XS XLI QEVOIX (H = 3.946)
ROT 2: OLWWOH SLJJB ZHQW WR WKH PDUNHW (H = 3.993)
ROT 3: NKVVNG RKIIA YGPV VQ VJG OCTMGV (H = 3.734)
ROT 4: MJUUMF QJHHZ XFOU UP UIF NBSLFU (H = 3.655)
ROT 5: LITTLE PIGGY WENT TO THE MARKET (H = 6.365)
ROT 6: KHSSKD OHFFX VDMS SN SGD LZQJDS (H = 4.224)
...snip...
ROT 24: SPAASL WPNNF DLUA AV AOL THYRLA (H = 5.611)
ROT 25: ROZZRK VOMME CKTZ ZU ZNK SGXQKZ (H = 3.275)
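The brute-force loop that produced a listing like the one above can be sketched as follows (`rot` is my own helper name; note that "ROT $n$" here shifts each letter *back* by $n$, matching the convention in the listing, where ROT 5 recovers the plaintext):

```python
# Minimal Caesar brute force: print all 26 candidate shifts.
def rot(text, n):
    """Shift each A-Z letter of `text` back by n positions, preserving other characters."""
    out = []
    for c in text:
        if 'A' <= c <= 'Z':
            out.append(chr((ord(c) - ord('A') - n) % 26 + ord('A')))
        else:
            out.append(c)  # leave spaces and punctuation alone
    return ''.join(out)

ciphertext = "QNYYQJ UNLLD BJSY YT YMJ RFWPJY"
for n in range(26):
    print(f"ROT {n}: {rot(ciphertext, n)}")
```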
So it appears to work, at least for these few test cases. What I'm confused about is why it works and what exactly I'm measuring. I initially assumed the correct answer would have the lowest entropy, since it "would be the least random", but it turned out to be the one with the highest. Did I happen to use the entropy formula to calculate something that appears to be useful but isn't actually entropy at all?
Edit: The function in question.
import math

# Calculate the bits of entropy for the given string based on English letter frequencies
def entropy(input_string):
    # Letter Frequency Chart for English
    freq = { 'E': 0.1202, 'T': 0.091, ...snip... }
    # Using the frequency of a letter as p(x), calculate entropy of the string using the formula:
    # H = -sum of (p(x) * log[p(x)]_2)
    total_entropy = 0
    for letter in input_string:
        if letter in freq:  # skip spaces and other non-letters
            total_entropy += freq[letter] * math.log(freq[letter], 2)
    return -total_entropy
Example Output:
>>> entropy("LITTLE PIGGY WENT TO THE MARKET")
6.365324803946881
>>> entropy("QNYYQJ UNLLD BJSY YT YMJ RFWPJY")
3.398563530183954
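For what it's worth, each occurrence of a letter adds the fixed amount $-p(x)\log_2 p(x)$ to this sum, and that quantity is larger for common letters than for rare ones (every English letter frequency is below $1/e$, where $-p\log_2 p$ peaks). A quick check of the two extremes of the table:

```python
import math

# Per-occurrence contribution -p * log2(p) for the most and least common letters
for letter, p in [('E', 0.1202), ('Z', 0.0007)]:
    print(f"{letter}: {-p * math.log2(p):.4f}")
# 'E' contributes roughly 50 times more per occurrence than 'Z', so strings full
# of common English letters accumulate a larger sum under this formula.
```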
Final Edit:
Just adding the final function for any future students who may stumble across this question.
import math

def entropy(s):
    # Letter Frequency Chart for English
    freq = {
        'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
        'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
        'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
        'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
        'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
        'Z': 0.0007
    }
    ascii_range = (65, 90)
    # Ensure the string's case matches the dictionary keys
    s = s.upper()
    # Using the frequency of a letter as p(x), calculate entropy of the string using the formula:
    # H = [sum of (-log[p(x)]_2)] / len(s)
    total_entropy = 0
    for c in s:
        if ascii_range[0] <= ord(c) <= ascii_range[1]:  # Only compute for values of A-Z
            total_entropy += -math.log(freq[c], 2)
    return total_entropy / len(s)
Example Output:
>>> entropy("TEST")
3.491390529605717
>>> entropy("ZLYZ")
7.794603966207939
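To tie it together, here is a sketch of how this kind of scoring can pick the shift automatically. The `score` helper below is my own variant: it normalises by the number of letters rather than `len(s)`, so spaces do not dilute the average; with this formula the best candidate is the one with the *lowest* score.

```python
import math

FREQ = {
    'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
    'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
    'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
    'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
    'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
    'Z': 0.0007,
}

def score(s):
    """Average -log2 p(x) over the letters of s (lower = more English-like)."""
    letters = [c for c in s.upper() if c in FREQ]
    return sum(-math.log2(FREQ[c]) for c in letters) / len(letters)

def rot_back(s, n):
    """Shift each letter back by n, preserving non-letters (ROT n as in the listing)."""
    return ''.join(chr((ord(c) - 65 - n) % 26 + 65) if c.isalpha() else c
                   for c in s.upper())

ciphertext = "QNYYQJ UNLLD BJSY YT YMJ RFWPJY"
best = min(range(26), key=lambda n: score(rot_back(ciphertext, n)))
print(f"ROT {best}: {rot_back(ciphertext, best)}")  # the lowest-scoring shift
```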
You are computing (it seems)
$$ \sum_{i=1}^n - p(x_i) \log p(x_i) \tag 1$$ where $x_i$ is the letter in the $i$-th position, and $p(x)$ is the probability of letter $x$ in English.
That means nothing, as far as I can see.
What you should compute, I think is
$$ \tilde{H}= \frac{1}{n} \sum_{i=1}^n - \log p(x_i) \tag 2$$
If $n$ is large, then by the law of large numbers that average tends to the expected value of the quantity being averaged. That is
$$ \tilde{H} \approx E [- \log p(x_i)] \tag 3$$
Now, if our letters really follow the statistics of English, then the above is precisely the entropy: $\tilde{H} \approx H$. [*] [**]
Otherwise, we are averaging over a different distribution. In that case, a well-known result states that $$\tilde{H} \approx H + D(Q||P) \tag4$$
where $D(Q||P)$ is the Kullback–Leibler divergence (also called relative entropy) between the real distribution $Q$ and the ideal distribution $P$. Because $D(Q||P) > 0$ whenever $Q \ne P$, you should get the lowest $\tilde{H}$ for the true distribution.
[*] That's actually the zero-order entropy, i.e., the one that does not take into account dependence on past letters. This, of course, is grossly inadequate for English.
[**] BTW, given that you have the true distribution, you could directly compute $H$ and compare.
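Regarding [**]: the zero-order entropy $H$ of the frequency table in the question can indeed be computed directly, as a sanity check for $\tilde{H}$ (a sketch reusing the asker's table):

```python
import math

# Zero-order entropy of English letters, computed directly from the table:
# H = -sum over letters of p * log2(p)
freq = {
    'E': 0.1202, 'T': 0.091, 'A': 0.0812, 'O': 0.0768, 'I': 0.0731,
    'N': 0.0695, 'S': 0.0628, 'R': 0.0602, 'H': 0.0592, 'D': 0.0432,
    'L': 0.0398, 'U': 0.0288, 'C': 0.0271, 'M': 0.0261, 'F': 0.023,
    'Y': 0.0211, 'W': 0.0209, 'G': 0.0203, 'P': 0.0182, 'B': 0.0149,
    'V': 0.0111, 'K': 0.0069, 'X': 0.0017, 'Q': 0.0011, 'J': 0.001,
    'Z': 0.0007,
}

H = -sum(p * math.log2(p) for p in freq.values())
print(f"H = {H:.3f} bits per letter")  # about 4.18 bits
```

A correct plaintext should therefore produce $\tilde{H}$ close to this value, while wrong rotations land above it by roughly $D(Q||P)$.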