Calculating mutual information for a dataset


I have a dataset of individual text documents $D = \{d_0, d_1, \ldots, d_n\}$ and a set of keywords $K = \{k_0, k_1, \ldots, k_m\}$ that appear in the documents. Each document contains zero or more of the keywords. I want to calculate the mutual information between any two keyword variables.

Given any two keywords $j, k$ from this set, what is $p(j,k)$? I know that for a single keyword being absent or present I should use the binary entropy function, but I'm not sure what the joint form should be. Here is what I had considered:

  1. $p(j,k) =$ the fraction of documents in which both $j$ and $k$ occur. This makes sense intuitively, because I'm trying to determine how likely I am to see one keyword in a document given the presence of the other.
  2. $p(j,k) =$ the fraction of all keyword co-occurrences that are co-occurrences of $j$ and $k$. This makes sense to me because some keywords occur much less commonly than others, and I'm not sure the first method captures this. (A toy comparison of the two options follows below.)
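
To make the difference concrete with made-up numbers: suppose there are $100$ documents and $j$ and $k$ co-occur in $10$ of them; option 1 gives $p(j,k) = 10/100 = 0.1$. If the corpus contains $500$ co-occurring keyword pairs in total, of which those $10$ are $(j,k)$, option 2 gives $p(j,k) = 10/500 = 0.02$. The two candidates normalize over different sample spaces (documents versus keyword pairs), which is why they disagree.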

1 Answer

I would consider the binary variables $V_k$, equal to $1$ when the keyword $K_k$ is present in a randomly chosen text and $0$ otherwise. The mutual information between two keywords $K_a$, $K_b$ is:

$\sum_{i,j} P(V_a=i, V_b=j)\,\log\frac{P(V_a=i, V_b=j)}{P(V_a=i)\,P(V_b=j)}$

You can estimate it with:

$F(0,0)\log\frac{F(0,0)}{F(0,\cdot)F(\cdot,0)} + F(0,1)\log\frac{F(0,1)}{F(0,\cdot)F(\cdot,1)} + F(1,0)\log\frac{F(1,0)}{F(1,\cdot)F(\cdot,0)} + F(1,1)\log\frac{F(1,1)}{F(1,\cdot)F(\cdot,1)}$

where $F(0,0)$ is the fraction of texts in which neither keyword appears, $F(1,0)$ the fraction of texts in which only $K_a$ appears, and so on.

The marginals are $F(0,\cdot) = F(0,0)+F(0,1)$ and $F(\cdot,0) = F(0,0)+F(1,0)$, and similarly for $F(1,\cdot)$ and $F(\cdot,1)$.
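
As a minimal sketch, here is this estimator in Python; the `mutual_information` function and the toy `docs` list are my own illustration (each document represented by its set of keywords), not something given in the question.

```python
import math

def mutual_information(docs, a, b):
    """Estimate I(V_a; V_b) in bits, where V_a, V_b indicate keyword presence."""
    n = len(docs)
    # F[i][j] = fraction of documents with V_a = i and V_b = j
    F = [[0.0, 0.0], [0.0, 0.0]]
    for d in docs:
        F[int(a in d)][int(b in d)] += 1.0 / n
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            if F[i][j] > 0:  # cells with zero count contribute 0 (0 log 0 = 0)
                marg_a = F[i][0] + F[i][1]  # F(i, .)
                marg_b = F[0][j] + F[1][j]  # F(., j)
                mi += F[i][j] * math.log2(F[i][j] / (marg_a * marg_b))
    return mi

# Toy corpus: each document is the set of keywords it contains
docs = [{"cat", "dog"}, {"cat"}, {"dog"}, set()]
print(mutual_information(docs, "cat", "dog"))  # 0.0 here: presence is independent
```

For real data you would run this over every pair of keywords; the $0 \log 0 = 0$ convention in the zero-count guard keeps empty cells from breaking the sum.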