I'm not sure I've been using Dunning's log-likelihood statistic correctly to identify characteristic keywords. I calculated the log-likelihood values as $ G^2 = 2\left(a \ln \frac{a}{E_1} + b \ln \frac{b}{E_2}\right) $, but now I'm confused about how to use this information to label a word as more characteristic of TextA vs TextB.
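For concreteness, here is a minimal Python sketch of that calculation. It assumes the usual two-text setup, which I haven't spelled out above: `a` and `b` are the word's counts in TextA and TextB, `c` and `d` are the total token counts of the two texts, and the expected counts are $E_1 = c(a+b)/(c+d)$ and $E_2 = d(a+b)/(c+d)$. The function name and the example counts are made up for illustration.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in TextA (c tokens) and b times in TextB (d tokens)."""
    # Expected counts under H0 (assumed definition, not stated in the post)
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    # A zero count contributes 0, since x * ln(x) -> 0 as x -> 0
    term_a = a * math.log(a / e1) if a > 0 else 0.0
    term_b = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (term_a + term_b)

# Hypothetical counts: 120 of 10,000 tokens in TextA, 60 of 10,000 in TextB
g2 = log_likelihood(a=120, b=60, c=10_000, d=10_000)
print(round(g2, 2))
```

With these definitions, swapping `a` and `b` (and `c` and `d`) gives the same value, so as far as I can tell the statistic alone does not say which text the word favours.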
I've seen:
$H_{0} =$ word is evenly distributed in both texts
$H_{a} =$ word is not evenly distributed between the texts (i.e., it's overrepresented in one)
For example, suppose I've calculated the LL of a word to be 23.54. With 1 degree of freedom, this exceeds the chi-square critical value of 3.84, so it would be statistically significant at the 0.05 level.
What I've been doing is labeling a word as characteristic of TextA if it has a statistically significant LL, and as characteristic of TextB otherwise. Is that correct?
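To make the rule concrete, here is a sketch of it in Python, next to a direction-aware alternative that I'm unsure about: use significance only to decide whether a word is characteristic at all, then compare its relative frequency in each text to decide which one. The function names, the totals `c` and `d`, the 3.84 cutoff (chi-square, 1 degree of freedom, 0.05 level), and the example counts are all my own assumptions.

```python
CRITICAL_05 = 3.84  # approximate chi-square critical value, 1 df, alpha = 0.05

def label_current(g2):
    """The rule I've been using: significant -> TextA, otherwise TextB."""
    return "TextA" if g2 > CRITICAL_05 else "TextB"

def label_by_direction(g2, a, b, c, d):
    """Alternative (assumption): significance decides whether, relative frequency decides which."""
    if g2 <= CRITICAL_05:
        return "neither"
    # a/c and b/d are the word's relative frequencies in TextA and TextB
    return "TextA" if a / c > b / d else "TextB"

# Hypothetical word: 60 of 10,000 tokens in TextA, 120 of 10,000 in TextB, LL about 20.39
print(label_current(20.39))                                # TextA
print(label_by_direction(20.39, 60, 120, 10_000, 10_000))  # TextB
```

The two rules disagree on this example, which is exactly the case I'm worried about.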
Note: I already understand the TFIDF method
Relevant material, although please post relevant material too