How to calculate mutual information


I am studying mutual information, but I am confused about it. The paper I am reading defines mutual information as $$I(x,y)=\iint p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,\mathrm dx\,\mathrm dy,$$ where $x, y$ are two vectors, $p(x,y)$ is the joint probability density, and $p(x)$ and $p(y)$ are the marginal probability densities. MI is used to quantify both relevance and redundancy.

For understanding the MI, I have provided a small dataset like this: $$ \begin{matrix} &f_1&f_2 &f_3\\ c_2 & -1 & 0 & 1\\ c_1 & 0 & 1 & -1\\ c_1 & 1 &-1 & 0\\ c_2 & 0 & 1 & 1 \end{matrix} $$ where $f_1,f_2,f_3$ are 3 features for classification and $c_1, c_2$ are my classes.

  • How can I calculate the joint probability density $p(x,y)$ in this example?
  • Can anyone explain how to calculate the mutual information in this example, using the above equation for $I(x,y)$?

BEST ANSWER

Take the first feature $f_1$ and build the joint histogram over $(feature\ state,\ class\ state)$. Your features have $3$ possible states $\{-1,0,1\}$, and the classes have $2$ possible states $\{c=1,c=2\}$. To build the histogram, simply count the joint occurrences:

\begin{array}{|c|c|c|} \hline & c=1 & c=2 \\ \hline f_1=-1 & 0 & 1 \\ \hline f_1=0 & 1 & 1 \\ \hline f_1=+1 & 1 & 0 \\ \hline \end{array}

You can see that $f_1=0$ is uninformative, because $c=1$ and $c=2$ are equally probable. However, if $f_1=-1$, then with the data we have it must be $c=2$ (because the count for $c=1$ is zero). Mutual information quantifies exactly this. To compute it, first normalize your 2D histogram so that $\sum_{ij} h_{ij}=1$, then compute the marginals $p(feature)$ and $p(class)$:

$$ p(feature,class)=\left(\begin{array}{cc} 0 & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & 0 \\ \end{array}\right),\ p(feature)=\left(\begin{array}{c} \frac{1}{4} \\ \frac{1}{2} \\ \frac{1}{4} \\ \end{array}\right),\ p(class)=\left(\frac{1}{2},\ \frac{1}{2}\right) $$ then compute $I(x,y)=\iint p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,\mathrm dx\,\mathrm dy$, which in the discrete case becomes the double sum: $$ I(feature, class)=\sum_{i=1,2,3}\sum_{j=1,2}p(feature\ i,class\ j)\log\frac{p(feature\ i,class\ j)}{p(feature\ i)p(class\ j)} $$ (Terms with a zero joint probability contribute nothing, since $x\log x\to 0$ as $x\to 0$.) Then repeat the same computation for features $f_2$ and $f_3$. The one with the highest mutual information is the most discriminative for guessing the class.
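The computation above can be sketched in a few lines of Python. This is a minimal sketch, assuming log base 2 (bits); the formula itself does not fix the base, so using the natural log would just rescale the result.

```python
import math

# Joint distribution p(feature, class) for f_1, from the normalized histogram.
# Rows: f_1 = -1, 0, +1; columns: c = 1, 2.
p_joint = [
    [0.0,  0.25],
    [0.25, 0.25],
    [0.25, 0.0],
]

# Marginals: p(feature) sums over classes, p(class) sums over feature states.
p_feature = [sum(row) for row in p_joint]      # [1/4, 1/2, 1/4]
p_class = [sum(col) for col in zip(*p_joint)]  # [1/2, 1/2]

# I(feature, class) = sum_ij p_ij * log( p_ij / (p_i * p_j) ).
# Cells with p_ij = 0 contribute nothing, so they are skipped.
mi = sum(
    p_joint[i][j] * math.log2(p_joint[i][j] / (p_feature[i] * p_class[j]))
    for i in range(3) for j in range(2)
    if p_joint[i][j] > 0
)
print(mi)  # 0.5 bits
```

Only the two cells with a zero count are skipped; the two cells where $p(x,y)=p(x)p(y)$ contribute $\log 1 = 0$, leaving $I(f_1, c)=\frac12$ bit.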

SECOND ANSWER

Sure. You have 3 natural bins, $\{\{-1\},\{0\},\{1\}\}$. (Sometimes the bin division is not so easy; it can even be the hardest part, for example when you have floating-point numbers and no natural data boundaries.)

A discrete set turns the double integral into a double sum. Let us estimate $p(x,y)$, starting with $f_1$ and class $c_1$: within $c_1$ we have two measurements, one $0$ and one $1$, so the conditional density over $\{-1,0,1\}$ is $\{0,1/2,1/2\}$; multiplying by $p(c_1)=1/2$ gives the joint density $p(f_1,c_1)=\{0,1/4,1/4\}$.

Now do the same for all others.

For $p(x)$ we just count all $f_1$ values regardless of $k$ in $c_k$: we have one "$-1$", two "$0$"s and one "$1$", giving the density $\{1/4,2/4,1/4\}$ for $f_1$. Now continue for the other features.

For $p(y)$ you do the same counting but row-wise instead of column-wise: $c_1$ and $c_2$ each appear twice, so $p(y)=\{1/2,1/2\}$.

Once you have calculated estimates for $p(x,y)$, $p(x)$ and $p(y)$, you just plug the values into the double sum.
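To check the hand computation, the counting procedure from this answer can be sketched in Python directly from the raw samples. This is a sketch under the same assumptions as above (log base 2, frequency counts as density estimates); the helper name `mutual_information` is hypothetical.

```python
import math
from collections import Counter

# The 4 samples from the question: (class, f1, f2, f3).
data = [
    ("c2", -1,  0,  1),
    ("c1",  0,  1, -1),
    ("c1",  1, -1,  0),
    ("c2",  0,  1,  1),
]

def mutual_information(feature_index):
    """Estimate I(f, c) in bits by counting joint and marginal frequencies."""
    n = len(data)
    joint = Counter((row[feature_index], row[0]) for row in data)
    p_x = Counter(row[feature_index] for row in data)  # p(feature), as counts
    p_y = Counter(row[0] for row in data)              # p(class), as counts
    # Zero-count cells never appear in `joint`, so they are skipped for free.
    return sum(
        (cnt / n) * math.log2((cnt / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), cnt in joint.items()
    )

for i, name in [(1, "f1"), (2, "f2"), (3, "f3")]:
    print(name, mutual_information(i))
# f1 0.5, f2 0.5, f3 1.0
```

On this dataset $f_3$ comes out on top with $1$ bit of mutual information, which matches intuition: $f_3=1$ occurs exactly for the $c_2$ samples, so knowing $f_3$ fully determines the class here.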