Why does mutual information use KL divergence?


Mutual information between a pair of random variables $X,Y$ having joint distribution $P_{(X,Y)}$ and marginal distributions $P_X,P_Y$ respectively is defined as

$$I(X,Y)\equiv D_{\text{KL}}(P_{(X,Y)}\|P_X\otimes P_Y ),$$

where $D_{\text{KL}}$ is the KL divergence. Intuitively, this measures how much "information" is revealed about one random variable through observing the other by quantifying how far the joint distribution is from the product of marginals (this distance being zero when $X,Y$ are independent).
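For discrete distributions this definition can be computed directly from the joint table. A minimal sketch (the 2×2 joint distribution below is a hypothetical example of two correlated bits, not from the question):

```python
import numpy as np

# Hypothetical 2x2 joint distribution of two correlated binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

px = joint.sum(axis=1)       # marginal P_X
py = joint.sum(axis=0)       # marginal P_Y
product = np.outer(px, py)   # product of marginals P_X ⊗ P_Y

# I(X,Y) = D_KL(P_(X,Y) || P_X ⊗ P_Y), in nats
mask = joint > 0             # skip zero-probability cells (0 log 0 = 0)
mi = np.sum(joint[mask] * np.log(joint[mask] / product[mask]))
```

If the joint table were exactly the product of its marginals, `mi` would be zero, matching the independence case mentioned above.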

Why not, more flexibly, allow for other notions of statistical distance? That is, why not define

$$\tilde I(X,Y,d)\equiv d(P_{(X,Y)},P_X\otimes P_Y )$$

for arbitrary distance $d$? There are distances that are at least as compelling as KL divergence, such as Jensen-Shannon divergence, which at least symmetrizes KL divergence, or the Wasserstein metric, which is actually a metric and enjoys other attractive properties (as observed in the ML literature).

I understand mutual information as defined has connections with entropy, so perhaps this makes the definition tractable? What merits are there in using the KL divergence vs. other distances?


Very interesting question! In fact, there are other notions of information out there, based on other statistical distances. One big family of divergences you might be interested to look at are the so-called $f$-divergences, which are defined (informally) as follows. Let $f$ be a convex function with $f(1)=0$, and let $\mu$ and $\nu$ be two probability measures with $\mu$ absolutely continuous with respect to $\nu$. Then the corresponding $f$-divergence is defined as $$D_f(\mu\|\nu) = \mathbb{E}_{\nu}\!\left[f\!\left(\frac{d\mu}{d\nu}\right)\right].$$
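For discrete distributions on a common support, this definition is easy to implement generically; the helper below and the two example distributions are illustrative, not from the answer. Plugging in $f(t) = t\log t$ recovers the KL divergence, and $f(t) = \tfrac{1}{2}|t-1|$ recovers the total variation distance:

```python
import numpy as np

def f_divergence(f, mu, nu):
    """D_f(mu || nu) = E_nu[f(dmu/dnu)] for discrete distributions
    with full support on the same finite set."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return np.sum(nu * f(mu / nu))

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.4, 0.4, 0.2])

# f(t) = t log t  ->  KL divergence
kl = f_divergence(lambda t: t * np.log(t), mu, nu)

# f(t) = |t - 1| / 2  ->  total variation distance
tv = f_divergence(lambda t: np.abs(t - 1) / 2, mu, nu)
```

Note that `f_divergence(lambda t: t * np.log(t), mu, nu)` reduces algebraically to `sum(mu * log(mu/nu))`, the familiar discrete KL formula.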

Depending on the choice of $f$, you can notably recover

  • KL Divergence
  • Reverse KL Divergence
  • Jensen-Shannon Divergence
  • Chi-square Divergence
  • Hellinger Divergences
  • Hellinger Distance
  • Total Variation Distance

and many others. From any of these you can define a notion of information measure as $I_f(X, Y) = D_f(P_{(X,Y)} \| P_X \otimes P_Y)$, just as mutual information does with the KL divergence.
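As one concrete instance of this recipe, here is a sketch of the Jensen-Shannon-based information measure for a hypothetical 2×2 joint distribution (the JS divergence between $p$ and $q$ equals $\tfrac{1}{2}D_{\text{KL}}(p\|m) + \tfrac{1}{2}D_{\text{KL}}(q\|m)$ with $m = (p+q)/2$):

```python
import numpy as np

# Hypothetical 2x2 joint distribution and its product of marginals.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

p, q = joint.ravel(), product.ravel()

def kl(a, b):
    # Discrete KL divergence in nats (assumes full support).
    return np.sum(a * np.log(a / b))

# Jensen-Shannon divergence between the joint and the product of
# marginals: an f-divergence-based analogue of mutual information.
m = (p + q) / 2
js_information = 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike KL-based mutual information, this quantity is symmetric in its two arguments and bounded above by $\log 2$ nats.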

I invite you to read more about these to learn what their properties are and how they are formally defined.