I have a dataset of individual text documents $D = \{d_0, d_1, \dots, d_n\}$ and a set of keywords $K = \{k_0, k_1, \dots, k_m\}$ that appear in the documents. Each document contains zero or more of the keywords. I want to calculate the mutual information between any two keyword variables.
If I'm given any two keywords $j, k$ from this set, what is $p(j,k)$? I know that for a single keyword being absent or present I would use the binary entropy function, but I'm not sure what the joint form looks like. Here is what I had considered:
- $p(j,k) =$ the fraction of documents in which $j$ and $k$ co-occur. This makes sense intuitively, because I'm trying to determine how likely we are to see one keyword in a document given the presence of the other.
- $p(j,k) =$ the fraction of all keyword co-occurrences that are co-occurrences of $j$ and $k$. This also makes sense to me, because some keywords occur much less commonly than others, and I'm not sure the first method captures that.
I would consider the binary variables $V_k$ that equal 1 when the keyword $K_k$ is present in a randomly chosen text and 0 otherwise. The mutual information between two keywords $K_a$, $K_b$ is:
$\sum_{i,j} P(V_a=i, V_b=j) \log \frac{P(V_a=i, V_b=j)}{P(V_a=i)\,P(V_b=j)}$
You can estimate it with:
$F(0,0)\log\frac{F(0,0)}{F(0,\cdot)F(\cdot,0)} + F(0,1)\log\frac{F(0,1)}{F(0,\cdot)F(\cdot,1)} + F(1,0)\log\frac{F(1,0)}{F(1,\cdot)F(\cdot,0)} + F(1,1)\log\frac{F(1,1)}{F(1,\cdot)F(\cdot,1)}$
where $F(0,0)$ is the fraction of texts in which neither keyword appears, $F(1,0)$ the fraction in which only $K_a$ appears, and so on. The marginals are $F(0,\cdot) = F(0,0)+F(0,1)$, $F(\cdot,0) = F(0,0)+F(1,0)$, etc.
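As a minimal sketch of this estimator (the function name and document representation are my own assumptions; each document is modeled as a set of keywords), the plug-in estimate above looks like:

```python
import math

def mutual_information(docs, ka, kb):
    """Plug-in estimate of I(V_a; V_b) from a list of documents,
    where each document is a set of keywords.

    F[i][j] is the fraction of documents with keyword ka absent/present
    (i = 0/1) and keyword kb absent/present (j = 0/1).
    """
    n = len(docs)
    F = [[0.0, 0.0], [0.0, 0.0]]
    for doc in docs:
        F[int(ka in doc)][int(kb in doc)] += 1.0 / n

    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            marg_a = F[i][0] + F[i][1]  # F(i, .)
            marg_b = F[0][j] + F[1][j]  # F(., j)
            if F[i][j] > 0:             # use the 0 * log 0 = 0 convention
                mi += F[i][j] * math.log(F[i][j] / (marg_a * marg_b))
    return mi

# Toy corpus: in this sample "cat" and "dog" happen to be independent,
# so the estimate comes out to 0.
docs = [{"cat", "dog"}, {"cat"}, {"dog"}, {"fish"}]
print(mutual_information(docs, "cat", "dog"))  # → 0.0
```

Note the guard for empty cells: when a cell frequency is 0, its term is dropped, matching the usual $0 \log 0 = 0$ convention; with small corpora this plug-in estimator is also known to be biased upward, so treat small values with care.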