I am studying mutual information but I am confused about it. I read in a paper that mutual information is:$$I(x,y)=\iint p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,\mathrm dx\,\mathrm dy,$$ where $x, y$ are two vectors, $p(x,y)$ is the joint probability density, and $p(x)$ and $p(y)$ are the marginal probability densities. MI is used to quantify both relevance and redundancy.
To understand MI, I made a small dataset like this: $$ \begin{matrix} &f_1&f_2 &f_3\\ c_2 & -1 & 0 & 1\\ c_1 & 0 & 1 & -1\\ c_1 & 1 &-1 & 0\\ c_2 & 0 & 1 & 1 \end{matrix} $$ where $f_1,f_2,f_3$ are 3 features for classification and $c_1, c_2$ are my classes.
- How can I calculate the joint probability density $p(x,y)$ in this example?
- Can anyone explain how to calculate the mutual information in this example using the equation for $I(x,y)$ above?
Take the first feature $f_1$ and build the joint histogram over $(\text{feature state}, \text{class state})$. Your features have $3$ possible states $\{-1,0,1\}$, and the classes have $2$ possible states $\{c=1,c=2\}$. To build the histogram, simply count the joint occurrences:
\begin{array}{|c|c|c|} \hline & c=1 & c=2 \\ \hline f_1=-1 & 0 & 1 \\ \hline f_1=0 & 1 & 1 \\ \hline f_1=+1 & 1 & 0 \\ \hline \end{array}
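For concreteness, here is a minimal sketch of this counting step in Python (the array names `features`, `classes` and the helper `joint_counts` are my own illustrative choices, not from the question):

```python
import numpy as np

# Toy dataset from the question: rows are samples, columns are f1, f2, f3;
# `classes` holds the class label of each row.
features = np.array([[-1,  0,  1],
                     [ 0,  1, -1],
                     [ 1, -1,  0],
                     [ 0,  1,  1]])
classes = np.array([2, 1, 1, 2])

feature_states = [-1, 0, 1]
class_states = [1, 2]

def joint_counts(f, c):
    """Count joint occurrences of (feature state, class state)."""
    counts = np.zeros((len(feature_states), len(class_states)))
    for i, fs in enumerate(feature_states):
        for j, cs in enumerate(class_states):
            counts[i, j] = np.sum((f == fs) & (c == cs))
    return counts

print(joint_counts(features[:, 0], classes))
# [[0. 1.]
#  [1. 1.]
#  [1. 0.]]
```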
You see that $f_1=0$ is uninformative, because $c=1$ and $c=2$ are equally probable. However, if $f_1=-1$, then with the data we have it must be $c=2$ (because there are zero counts for $c=1$). Mutual information quantifies exactly this. To compute it, you must first normalize your 2D histogram so that $\sum_{ij} h_{ij}=1$, and compute the marginals $p(feature)$ and $p(class)$:
$$ p(feature,class)=\left(\begin{array}{cc} 0 & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & 0 \\ \end{array}\right),\ p(feature)=\left(\begin{array}{c} \frac{1}{4} \\ \frac{1}{2} \\ \frac{1}{4} \\ \end{array}\right),\ p(class)=\left(\frac{1}{2},\ \frac{1}{2}\right) $$ then compute $I(x,y)=\iint p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}\,\mathrm dx\,\mathrm dy$ in its discrete form: $$ I(feature, class)=\sum_{i=1,2,3}\sum_{j=1,2}p(feature\ i,class\ j)\log\frac{p(feature\ i,class\ j)}{p(feature\ i)\,p(class\ j)} $$ Then repeat the same computation for features $f_2$ and $f_3$. The one with the highest mutual information is the most discriminative for guessing the class.
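Worked out for $f_1$: the two cells in the $f_1=0$ row contribute $0$ (since $\frac{1}{4}=\frac{1}{2}\cdot\frac{1}{2}$), empty cells contribute nothing, and each of the two remaining cells contributes $\frac{1}{4}\log\frac{1/4}{1/4\cdot 1/2}=\frac{1}{4}\log 2$, so
$$ I(f_1,class)=2\cdot\tfrac{1}{4}\log 2=\tfrac{1}{2}\log 2\approx 0.35\ \text{nats}\ (=0.5\ \text{bits with a base-2 log}). $$

A minimal NumPy sketch of the whole procedure (the function name `mutual_information` and the choice of a base-2 log are my own, not part of the answer above):

```python
import numpy as np

def mutual_information(f, c, feature_states=(-1, 0, 1), class_states=(1, 2)):
    """Mutual information (in bits) between a discrete feature and the class."""
    # Joint histogram of counts, normalized so it sums to 1.
    p_joint = np.array([[np.sum((f == fs) & (c == cs)) for cs in class_states]
                        for fs in feature_states], dtype=float)
    p_joint /= p_joint.sum()
    p_feature = p_joint.sum(axis=1, keepdims=True)   # marginal over classes
    p_class = p_joint.sum(axis=0, keepdims=True)     # marginal over features
    # Sum over non-empty cells only (0 * log 0 is taken to be 0).
    nz = p_joint > 0
    return np.sum(p_joint[nz] * np.log2(p_joint[nz] / (p_feature @ p_class)[nz]))

features = np.array([[-1,  0,  1],
                     [ 0,  1, -1],
                     [ 1, -1,  0],
                     [ 0,  1,  1]])
classes = np.array([2, 1, 1, 2])

for k in range(3):
    print(f"I(f{k+1}; class) = {mutual_information(features[:, k], classes):.3f} bits")
# On this toy data, f3 comes out with the highest mutual information.
```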