I'm trying to use Dunning's method of calculating LLR to compare word instances between two fulltext indexes. His method uses entropy as part of the calculation.
Dunning's blog post: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
But, although I've implemented it in both Excel and Java and they give the same answers, I believe the answers are wrong.
Two reasons I believe my results are wrong:
1: They don't agree with this online calculator (which uses a different formula): http://ucrel.lancs.ac.uk/llwizard.html
2: They are always negative, which is even more troubling.
Link to my faulty XLS sheet: (hope this is OK) https://www.dropbox.com/s/bnzmk7ttf4mv23k/entropy-and-LLR-suspect-gist.xlsx
Some theories I have:
1: Maybe my contingency table is set up wrong? Dunning describes an abstract contingency table, but he doesn't say specifically how to fill it out with term-frequency counts. For example, in cell (1,1) of my table I put the number of times the word "spam" occurs in corpus A, whereas Dunning labels that cell "Event A and B together". So, when you start from term counts, maybe there's some step needed to convert those counts into a proper contingency table?
2: Maybe I'm misunderstanding which denominators to use when calculating the probabilities. In Steps 2, 3 and 4 I always divide by the contingency table's grand total; maybe that's wrong?
3: Maybe my entropy calculation is not the form that Dunning had in mind; perhaps there's some scaling I'm adding or omitting. I found this page http://mail-archives.apache.org/mod_mbox/mahout-dev/201009.mbox/%[email protected]%3E where Dunning replied to a question and mentions "un-normalized entropy", but I couldn't follow the syntax and the conversation well enough to relate it back to what I was doing.
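On theory 1: for comparing one term's frequency across two corpora, a common way to fill the 2x2 table (the same layout the Lancaster LL wizard assumes) is sketched below. The method name and variable names are mine for illustration, not from Dunning's post; the counts reuse the example figures from his blog.

```java
// Sketch: building a 2x2 contingency table for one term ("spam")
// compared across two corpora, starting from raw term counts.
public class ContingencySketch {
    // Returns {k11, k12, k21, k22} where:
    //   k11 = occurrences of the term in corpus A
    //   k12 = occurrences of the term in corpus B
    //   k21 = all OTHER tokens in corpus A (total tokens minus the term)
    //   k22 = all OTHER tokens in corpus B
    public static long[] table(long termInA, long termInB,
                               long tokensInA, long tokensInB) {
        return new long[] {termInA, termInB,
                           tokensInA - termInA, tokensInB - termInB};
    }

    public static void main(String[] args) {
        // e.g. the term occurs 110 times among 2552 tokens in corpus A,
        // and 111 times among 29225 tokens in corpus B
        long[] k = table(110, 111, 2552, 29225);
        System.out.println(k[0] + " " + k[1] + " " + k[2] + " " + k[3]);
        // prints: 110 111 2442 29114
    }
}
```

The key point is that the off-diagonal cells are not probabilities or row totals but the "everything else" token counts, so the four cells sum to the combined size of both corpora.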
Here is a corrected version of your (very nice) spreadsheet.
https://dl.dropboxusercontent.com/u/36863361/entropy-and-LLR-suspect-gist.xlsx
There are two categories of changes.
a) You need to account for zero counts. Changing ln(Cxx) to ln(IF(Cxx=0,1,Cxx)) solves this. The issue is that the limit of p log p is 0 as p → 0, but log p by itself blows up. The special case avoids the problem.
b) My blog post was confusing with respect to sign. The definition I used for entropy lacked a sign change, which I compensated for by inverting the order of the final computation.
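Both fixes can be sketched in code. The version below uses the "un-normalized entropy" form, following the approach taken in Apache Mahout's LogLikelihood class: xLogX returns 0 for a zero count (fix a), and the final line is ordered so the result comes out non-negative (fix b). The class and method names here are my own sketch, not the spreadsheet's.

```java
// Sketch of Dunning's log-likelihood ratio via un-normalized entropy,
// along the lines of Apache Mahout's LogLikelihood class.
public class LlrSketch {
    // x * ln(x), with the 0 * ln(0) -> 0 limit handled explicitly (fix a).
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Un-normalized entropy of a list of counts: N*ln(N) - sum(k*ln(k)).
    static double entropy(long... counts) {
        long total = 0;
        double sum = 0.0;
        for (long k : counts) {
            total += k;
            sum += xLogX(k);
        }
        return xLogX(total) - sum;
    }

    // 2 * (H(rows) + H(cols) - H(cells)); this ordering keeps it >= 0 (fix b).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
        // The example table from the blog post; LLR comes out around 270.7
        System.out.println(logLikelihoodRatio(110, 2442, 111, 29114));
    }
}
```

A quick sanity check on the sign issue: a table whose rows have identical proportions (e.g. 10 vs 1000 in both corpora) gives an LLR of 0, and skewed tables give strictly positive values rather than negative ones.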