Chi-Squared Hypothesis Testing of Letter Frequency


The hypothesized letter-frequency values below are taken from Pavel Micka's website, which cites Robert Lewand's Cryptological Mathematics.

The Actual Appearances were manually obtained upon reading "A Study in Scarlet" by Arthur Conan Doyle.

Use the Chi-squared statistic to test whether the hypothesized frequencies are correct.

This will be a one-tailed test by design. Use a $5\%$ chance of a Type I error.

        Hypothesized    Actual
Letter  Frequency       Appearances

a   0.08167            19890
b   0.01492            1701
c   0.02782            5556
d   0.04253            10578
e   0.12703            29479
f   0.02228            4252
g   0.02015            5601
h   0.06094            8663
i   0.06966            9267
j   0.00153            276
k   0.00772            2244
l   0.04025            9458
m   0.02406            7184
n   0.06749            13765
o   0.07507            16986
p   0.01929            5887
q   0.00095            153
r   0.05987            7984
s   0.06327            11181
t   0.09056            27087
u   0.02758            5277
v   0.00978            3031
w   0.02360            7670
x   0.00150            200
y   0.01974            3396
z   0.00074            159

Attempted Solution:

I summed the actual appearances to get $n = 216{,}925$. Then I multiplied each hypothesized frequency by that number to get the expected counts $E$. I then used the formula $\chi^2 = \sum \frac{(O-E)^2}{E}$, with the actual appearances as $O$, to get $14598.17$. The critical value I found from the table was $37.652$, so I rejected the null hypothesis.

I was wondering if I did this correctly. I suspect I did not because of how much larger my Chi-square statistic was than my critical value.

Any help would be much appreciated.

EDIT:

I think I need to take the square root of my chi-square statistic, which gives $120.8$. That is still a lot bigger than $37.652$, my critical value.

Best Answer:

If you used 'Actual appearances' for $O$ in the formula for the chi-squared statistic $Q$ and $n$ times 'Hypothesized frequency' for $E$, then your method of computing $Q$ is correct.

You have $k = 26$ 'categories' (letters of the alphabet), so $Q \stackrel{aprx}{\sim}\mathsf{Chisq}(\nu = k - 1 = 25)$ under the null hypothesis. For a test at the 5% level, the critical value is therefore 37.65248 (the upper 5% point of that distribution), as you say.
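For anyone who wants to reproduce the check without Minitab, here is a minimal Python sketch of the computation described above. The variable names (`rows`, `q`, `critical_value`) are my own, the data are copied from the table in the question, and the critical value is hard-coded from the figure quoted above rather than computed.

```python
# (letter, hypothesized frequency, observed count), copied from the question's table.
rows = [
    ("a", 0.08167, 19890), ("b", 0.01492, 1701),  ("c", 0.02782, 5556),
    ("d", 0.04253, 10578), ("e", 0.12703, 29479), ("f", 0.02228, 4252),
    ("g", 0.02015, 5601),  ("h", 0.06094, 8663),  ("i", 0.06966, 9267),
    ("j", 0.00153, 276),   ("k", 0.00772, 2244),  ("l", 0.04025, 9458),
    ("m", 0.02406, 7184),  ("n", 0.06749, 13765), ("o", 0.07507, 16986),
    ("p", 0.01929, 5887),  ("q", 0.00095, 153),   ("r", 0.05987, 7984),
    ("s", 0.06327, 11181), ("t", 0.09056, 27087), ("u", 0.02758, 5277),
    ("v", 0.00978, 3031),  ("w", 0.02360, 7670),  ("x", 0.00150, 200),
    ("y", 0.01974, 3396),  ("z", 0.00074, 159),
]

# Total number of letters counted.
n = sum(obs for _, _, obs in rows)

# Q = sum over letters of (O - E)^2 / E, with E = n * hypothesized frequency.
q = sum((obs - n * p) ** 2 / (n * p) for _, p, obs in rows)

# Degrees of freedom: k - 1 categories.
df = len(rows) - 1

# Upper 5% point of Chisq(25), taken from the answer text (not computed here).
critical_value = 37.65248

print(f"n = {n}, df = {df}, Q = {q:.2f}")
print("reject H0" if q > critical_value else "fail to reject H0")
```

Note that $Q$ itself is compared with the critical value; no square root is involved.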

With a large amount of data, it is not unusual to get a very large value of $Q$, indicating a very bad fit of the observed to expected frequencies.
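One way to see why sample size matters so much: write $p_i$ for the hypothesized frequency of letter $i$ and $\hat p_i = O_i/n$ for the observed proportion (this notation is mine, not from the question). Since $E_i = np_i$,

$$Q = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(n\hat p_i - np_i)^2}{np_i} = n \sum_{i=1}^{k} \frac{(\hat p_i - p_i)^2}{p_i}.$$

So for a fixed pattern of discrepancies between observed and hypothesized proportions, $Q$ grows in proportion to $n$. With your figures, $Q/n = 14598.17/216{,}925 \approx 0.067$, so the same proportions in a much smaller sample would give a much smaller $Q$.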

However, I tried putting your data into Minitab 17 software. When I cut/pasted from your data table, data in the rows for letters 'q' and 'u' were missing. (Maybe there are hidden, non-printing characters in your data table that prevented transfer.) I entered these two rows into the Minitab worksheet by hand. Then, as a check, I summed the hypothesized frequencies to get 1 and the actual appearances to get 216,925, which agrees with your computation. Also, I got the same $Q$ you did. [You do not need to take the square root.]

So my computations agree with yours, and the data do not fit the hypothetical frequencies.

Particularly large 'contributions' to $Q$ come from the letters h, i, r, t, and w: the text has far fewer h's, i's, and r's than hypothesized, and far more t's and w's. There may be some quirk in the subject matter of the Doyle story or in his writing style (over or under usage of common words such as the, in, it, at, to, with, that, which and so on) that accounts for the discrepancy.
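To see which letters drive the statistic, the individual terms $(O-E)^2/E$ can be ranked and labelled by the sign of $O - E$. The following is a rough Python sketch, not the Minitab procedure; the names `rows`, `contributions`, and `ranked` are my own, and the data are again copied from the question's table.

```python
# (letter, hypothesized frequency, observed count), copied from the question's table.
rows = [
    ("a", 0.08167, 19890), ("b", 0.01492, 1701),  ("c", 0.02782, 5556),
    ("d", 0.04253, 10578), ("e", 0.12703, 29479), ("f", 0.02228, 4252),
    ("g", 0.02015, 5601),  ("h", 0.06094, 8663),  ("i", 0.06966, 9267),
    ("j", 0.00153, 276),   ("k", 0.00772, 2244),  ("l", 0.04025, 9458),
    ("m", 0.02406, 7184),  ("n", 0.06749, 13765), ("o", 0.07507, 16986),
    ("p", 0.01929, 5887),  ("q", 0.00095, 153),   ("r", 0.05987, 7984),
    ("s", 0.06327, 11181), ("t", 0.09056, 27087), ("u", 0.02758, 5277),
    ("v", 0.00978, 3031),  ("w", 0.02360, 7670),  ("x", 0.00150, 200),
    ("y", 0.01974, 3396),  ("z", 0.00074, 159),
]

n = sum(obs for _, _, obs in rows)

contributions = []
for letter, p, obs in rows:
    exp = n * p                      # expected count under the hypothesis
    term = (obs - exp) ** 2 / exp    # this letter's contribution to Q
    direction = "over" if obs > exp else "under"
    contributions.append((letter, term, direction))

# Largest contributions first.
ranked = sorted(contributions, key=lambda r: r[1], reverse=True)

for letter, term, direction in ranked[:5]:
    print(f"{letter}: {term:8.1f}  ({direction}-represented in the text)")
```

Sorting by contribution rather than by raw difference $O - E$ keeps rare letters from being swamped by common ones, since each squared difference is scaled by its own expected count.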