Why does mutual information use KL divergence?


Mutual information between a pair of random variables $X,Y$ having joint distribution $P_{(X,Y)}$ and marginal distributions $P_X,P_Y$ respectively is defined as

$$I(X,Y)\equiv D_{\text{KL}}(P_{(X,Y)}\|P_X\otimes P_Y ),$$

where $D_{\text{KL}}$ is the KL divergence. Intuitively, this measures how much "information" is revealed about one random variable through observing the other by quantifying how far the joint distribution is from the product of marginals (this distance being zero when $X,Y$ are independent).
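For discrete distributions this definition can be computed directly from the joint table. A minimal sketch (the 2×2 joint distribution below is a hypothetical example of two correlated bits, not from the question):

```python
import numpy as np

# Hypothetical 2x2 joint distribution of two correlated binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

px = joint.sum(axis=1)       # marginal P_X
py = joint.sum(axis=0)       # marginal P_Y
product = np.outer(px, py)   # product of marginals P_X ⊗ P_Y

# I(X,Y) = D_KL(P_(X,Y) || P_X ⊗ P_Y), in nats
mask = joint > 0             # skip zero-probability cells (0 log 0 = 0)
mi = np.sum(joint[mask] * np.log(joint[mask] / product[mask]))
```

If the joint table were exactly the product of its marginals, `mi` would be zero, matching the independence case mentioned above.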

Why not, more flexibly, allow for other notions of statistical distance? That is, why not define

$$\tilde I(X,Y,d)\equiv d(P_{(X,Y)},P_X\otimes P_Y )$$

for arbitrary distance $d$? There are distances that are at least as compelling as KL divergence, such as Jensen-Shannon divergence, which at least symmetrizes KL divergence, or the Wasserstein metric, which is actually a metric and enjoys other attractive properties (as observed in the ML literature).

I understand mutual information as defined has connections with entropy, so perhaps this makes the definition tractable? What merits are there in using the KL divergence vs. other distances?


Very interesting question! In fact, there are other notions of information out there, based on other statistical distances. One big family of divergences you might be interested to look at are the so-called $f$-divergences, which are defined (informally) as follows. Let $f$ be a convex function with $f(1)=0$, and let $\mu$ and $\nu$ be two probability measures with $\mu$ absolutely continuous with respect to $\nu$. Then the corresponding $f$-divergence is defined as $$D_f(\mu\|\nu) = \mathbb{E}_{\nu}\!\left[f\!\left(\frac{d\mu}{d\nu}\right)\right].$$
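For discrete distributions on a common support, this definition is easy to implement generically; the helper below and the two example distributions are illustrative, not from the answer. Plugging in $f(t) = t\log t$ recovers the KL divergence, and $f(t) = \tfrac{1}{2}|t-1|$ recovers the total variation distance:

```python
import numpy as np

def f_divergence(f, mu, nu):
    """D_f(mu || nu) = E_nu[f(dmu/dnu)] for discrete distributions
    with full support on the same finite set."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return np.sum(nu * f(mu / nu))

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.4, 0.4, 0.2])

# f(t) = t log t  ->  KL divergence
kl = f_divergence(lambda t: t * np.log(t), mu, nu)

# f(t) = |t - 1| / 2  ->  total variation distance
tv = f_divergence(lambda t: np.abs(t - 1) / 2, mu, nu)
```

Note that `f_divergence(lambda t: t * np.log(t), mu, nu)` reduces algebraically to `sum(mu * log(mu/nu))`, the familiar discrete KL formula.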

Depending on the choice of $f$, you can notably recover

  • KL Divergence
  • Reverse KL Divergence
  • Jensen-Shannon Divergence
  • Chi-square Divergence
  • Hellinger Divergences
  • Hellinger Distance
  • Total Variation Distance

and many others. From any of these you can define a notion of information measure as $I_f(X, Y) = D_f(P_{(X,Y)} \| P_X \otimes P_Y)$, just as mutual information does with the KL divergence.
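As one concrete instance of this recipe, here is a sketch of the Jensen-Shannon-based information measure for a hypothetical 2×2 joint distribution (the JS divergence between $p$ and $q$ equals $\tfrac{1}{2}D_{\text{KL}}(p\|m) + \tfrac{1}{2}D_{\text{KL}}(q\|m)$ with $m = (p+q)/2$):

```python
import numpy as np

# Hypothetical 2x2 joint distribution and its product of marginals.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

p, q = joint.ravel(), product.ravel()

def kl(a, b):
    # Discrete KL divergence in nats (assumes full support).
    return np.sum(a * np.log(a / b))

# Jensen-Shannon divergence between the joint and the product of
# marginals: an f-divergence-based analogue of mutual information.
m = (p + q) / 2
js_information = 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike KL-based mutual information, this quantity is symmetric in its two arguments and bounded above by $\log 2$ nats.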

I invite you to read more about these to learn what their properties are and how they are formally defined.