For a task related to text analysis, I'm looking for a measure of how much the presence of a specific word $w \in W$ determines the context $c \in C$.
Typically, word–context collocations are counted, which allows estimating the joint probability $p(w,c)$ and the Pointwise Mutual Information, a measure of the association between a specific word–context pair: $$pmi(w,c)=\log_{2} \frac{p(w,c)}{p(w)p(c)}$$
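To make the setup concrete, here is a minimal sketch of estimating PMI from raw co-occurrence counts; the words, contexts, and counts are invented for illustration:

```python
import math

# Hypothetical word-context co-occurrence counts (made up for illustration).
counts = {
    ("bank", "money"): 40, ("bank", "river"): 10,
    ("shore", "money"): 1, ("shore", "river"): 19,
}
total = sum(counts.values())

def p_joint(w, c):
    # Joint probability p(w, c) estimated by relative frequency.
    return counts.get((w, c), 0) / total

def p_word(w):
    # Marginal p(w), summing over all contexts.
    return sum(n for (w2, _), n in counts.items() if w2 == w) / total

def p_ctx(c):
    # Marginal p(c), summing over all words.
    return sum(n for (_, c2), n in counts.items() if c2 == c) / total

def pmi(w, c):
    # pmi(w, c) = log2( p(w,c) / (p(w) p(c)) )
    return math.log2(p_joint(w, c) / (p_word(w) * p_ctx(c)))
```

With these counts, `pmi("bank", "money")` is positive (the pair co-occurs more than independence predicts) and `pmi("shore", "money")` is negative.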
Taking the expectation of the PMI over the joint distribution gives the Mutual Information, a measure of dependence between the random variables $W$ and $C$: $$I(W;C)=\sum_{w,c}p(w,c)\log_{2}\left(\frac{p(w,c)}{p(w)p(c)}\right) $$
Since I wanted a measure of dependence between a specific word $w\in W$ and the contexts $C$, I rewrote the Mutual Information as $$I(W;C)=\sum_{w}p(w)I(C;W=w)$$ which is the expectation of this measure of dependence between a word and the contexts: $$ \begin{eqnarray} I(C;W=w)&=&\frac{1}{p(w)}\sum_{c}p(w,c)\log_{2}\left(\frac{p(w,c)}{p(w)p(c)}\right) \\&=&\sum_{c}p(c|w)\log_{2}p(c|w)-\sum_{c}p(c|w)\log_{2}p(c) \\&=&-H(C|W=w)-\sum_{c}p(c|w)\log_{2}p(c) \end{eqnarray}$$
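This decomposition can be checked numerically. The sketch below (with made-up counts) computes $I(C;W=w)$ per word and verifies that its expectation over $p(w)$ recovers $I(W;C)$:

```python
import math

# Hypothetical joint counts, in the same spirit as the question's setup.
counts = {
    ("bank", "money"): 40, ("bank", "river"): 10,
    ("shore", "money"): 1, ("shore", "river"): 19,
}
total = sum(counts.values())
words = {w for w, _ in counts}
ctxs = {c for _, c in counts}

def p(w, c): return counts.get((w, c), 0) / total
def pw(w): return sum(p(w, c) for c in ctxs)      # marginal p(w)
def pc(c): return sum(p(w, c) for w in words)     # marginal p(c)

def i_cw(w):
    # I(C; W=w) = sum_c p(c|w) log2( p(c|w) / p(c) )
    return sum(
        (p(w, c) / pw(w)) * math.log2((p(w, c) / pw(w)) / pc(c))
        for c in ctxs if p(w, c) > 0
    )

# Full mutual information I(W;C).
mi = sum(
    p(w, c) * math.log2(p(w, c) / (pw(w) * pc(c)))
    for w in words for c in ctxs if p(w, c) > 0
)

# The expectation of I(C;W=w) over p(w) recovers I(W;C).
assert abs(sum(pw(w) * i_cw(w) for w in words) - mi) < 1e-12
```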
1. The first term is minus the entropy of the contexts conditioned on $w$, $-H(C|W=w)$, but the second term is quite surprising. Any idea what it could be? Maybe this is not the most straightforward decomposition; I failed to transpose the usual MI/entropy identities to this conditioned setting.
2. Do you think this $I(C;W=w)$ makes any sense? If so, how would you name it? (Not invented here, I bet.)
3. Does information theory support it as a good indicator for picking the words that best characterize a context? I'm trying to rank words for potential applications in locality-sensitive hashing.
OK, I found my question partially answered elsewhere. It can be expressed in terms of the Kullback–Leibler divergence, which in my case is the information gain on the context distribution from fixing a word, i.e., going from $p(C)$ to $p(C|W=w)$: $$ \begin{eqnarray} I(C;W=w) & = & D_{KL}\left(p(C|W=w) \,\|\, p(C) \right) \\ & = & H_{\times}(C|W=w, C) - H(C|W=w) \\ \end{eqnarray} $$ where $H_{\times}$ is the cross-entropy: $$ H_{\times}(C|W=w, C) = -\sum_{c}p(c|w)\log_{2}p(c) $$
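The identity "KL divergence = cross-entropy minus entropy" is easy to sanity-check on a toy pair of distributions (the probabilities below are made up):

```python
import math

# Hypothetical distributions: p_c_given_w plays the role of p(C|W=w),
# p_c plays the role of the marginal p(C).
p_c_given_w = {"money": 0.8, "river": 0.2}
p_c = {"money": 0.5, "river": 0.5}

# D_KL( p(C|W=w) || p(C) ) = sum_c p(c|w) log2( p(c|w) / p(c) )
kl = sum(q * math.log2(q / p_c[c]) for c, q in p_c_given_w.items())

# Cross-entropy H_x = -sum_c p(c|w) log2 p(c)
cross_entropy = -sum(q * math.log2(p_c[c]) for c, q in p_c_given_w.items())

# Entropy H(C|W=w) = -sum_c p(c|w) log2 p(c|w)
entropy = -sum(q * math.log2(q) for q in p_c_given_w.values())

# KL divergence is the cross-entropy minus the entropy.
assert abs(kl - (cross_entropy - entropy)) < 1e-12
```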
Cross-entropy is a new concept for me. For some time I confused it with the joint entropy, since both are often written with the same notation...
I'll leave the question open for some time: this answers 1. and 2., but 3. is still not clear to me. Maybe this more subjective question belongs on Stats.SE?