Log likelihood ratio text analysis

358 Views Asked by Bumbble Comm At 25 Mar 2026 - 6:31

I'm not sure if I've been using the Dunning log-likelihood statistic correctly to identify characteristic keywords. I calculated the log-likelihood values as $G^2 = 2\left(a \ln \frac{a}{E_1} + b \ln \frac{b}{E_2}\right)$, but now I'm confused about how to use this information to label a word as more characteristic of TextA vs TextB.
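For concreteness, here is a minimal sketch of that calculation. It assumes the usual keyness setup, which the question does not spell out: `a` and `b` are the word's counts in TextA and TextB, `c` and `d` (names chosen here for illustration) are the total token counts of each text, and the expected counts are E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d).

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in TextA (c tokens) and b times in TextB (d tokens).

    Assumes the two-term form of the statistic, with expected counts
    proportional to the size of each text.
    """
    total = c + d
    e1 = c * (a + b) / total  # expected count in TextA under H0
    e2 = d * (a + b) / total  # expected count in TextB under H0

    g2 = 0.0
    # A zero count contributes nothing (x * ln x tends to 0 as x tends to 0).
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


if __name__ == "__main__":
    # Hypothetical counts: 30 vs 10 occurrences in two texts of 1000 tokens each.
    print(log_likelihood(30, 10, 1000, 1000))
```

Note that the value is never negative and is the same whichever text the word favours, so G2 on its own carries no information about direction.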

I've seen:

$H_{0} =$ the word is evenly distributed across both texts
$H_{a} =$ the word is not evenly distributed between the texts (i.e., it's overrepresented in one)

For example, suppose I've calculated the LL of a word to be 23.54. This would be statistically significant at a significance level of 0.05 (the chi-square critical value for 1 degree of freedom is 3.84).
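As a rough check of that significance claim, the sketch below converts a G2 value into a p-value. It assumes G2 is compared against a chi-square distribution with 1 degree of freedom, which is the usual convention for a two-text comparison; the function name `g2_p_value` is made up for this example. With 1 degree of freedom the upper tail has a closed form in terms of `math.erfc`, so no external library is needed.

```python
import math


def g2_p_value(g2):
    """Upper-tail p-value of a chi-square(1) variable at g2.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(g2 / 2.0))


if __name__ == "__main__":
    p = g2_p_value(23.54)
    print(p, p < 0.05)
```

A significant result only rejects $H_{0}$; it says the word is unevenly distributed, not which text it favours.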

What I've been doing is labeling a word as characteristic of TextA if it has a statistically significant LL, and as characteristic of TextB otherwise. Is that correct?
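For comparison, here is a sketch of one common labeling convention, offered as an assumption rather than a definitive answer: use significance only to decide whether a word is characteristic at all, then use the sign of observed minus expected to decide which text it belongs to. The total token counts `c` and `d`, the 3.84 cutoff (chi-square, 1 degree of freedom, 0.05 level) and the function name `label_word` are all illustrative choices.

```python
import math

CRITICAL_VALUE_P05 = 3.84  # chi-square, 1 degree of freedom, alpha = 0.05


def label_word(a, b, c, d, critical=CRITICAL_VALUE_P05):
    """Label a word seen a times in TextA (c tokens) and b times in TextB (d tokens)."""
    total = c + d
    e1 = c * (a + b) / total  # expected count in TextA under H0
    e2 = d * (a + b) / total  # expected count in TextB under H0

    # Two-term G2; zero counts are skipped because they contribute nothing.
    g2 = 2.0 * sum(
        obs * math.log(obs / exp)
        for obs, exp in ((a, e1), (b, e2))
        if obs > 0
    )

    if g2 < critical:
        return "neither"  # not significant: no evidence of uneven distribution
    # Significant: direction comes from observed vs expected, not from G2 itself.
    return "TextA" if a > e1 else "TextB"


if __name__ == "__main__":
    # Hypothetical counts in two texts of 1000 tokens each.
    for a, b in ((30, 10), (10, 30), (10, 10)):
        print(a, b, label_word(a, b, 1000, 1000))
```

Under this convention a non-significant word is not assigned to either text, which differs from treating "not significant" as "characteristic of TextB".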

Note: I already understand the TF-IDF method.

Relevant material, although please post relevant material too
