What is KL divergence? Why do I need it? How do I use it?


I am currently studying KL divergence, but it is confusing to me: I don't quite understand why I would ever need it or what it is for. From what I have been reading about mutual information, it seems to be about the amount of entropy between two probability distributions, for example P(A) and P(B|A), especially in the conditional-probability setting. Can somebody give me a clear explanation of KL divergence?

Best answer

What is the KL?

The KL divergence is a way to quantify how similar two probability distributions are. It is not a distance (it is not symmetric) but, intuitively, it plays a very similar role.
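For concreteness (the definition is standard, just not spelled out above): for densities $p$ and $q$,

$$\mathrm{KL}(p\,\|\,q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx \;=\; \mathbb{E}_{p}\!\left[\log\frac{p(X)}{q(X)}\right],$$

with the integral replaced by a sum in the discrete case. It is always non-negative, it is zero exactly when $p = q$ (almost everywhere), and the asymmetry mentioned above is visible directly: swapping $p$ and $q$ generally changes the value.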

There are other ways of quantifying dissimilarity between probability distributions, such as the total variation norm (TV norm) [1] or, more generally, Wasserstein distances [2], but the KL has the advantage that it is relatively easy to work with (particularly so if one of your probability distributions is in the exponential family); in fact, it can be shown to induce a geometry related to the Fisher information matrix [3,3b].
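To see how easy it is to compute in practice, here is a minimal sketch (assuming NumPy and SciPy are available; the two distributions are made up for illustration) comparing the KL divergence and the total variation distance between two discrete distributions, and showing the asymmetry of the KL:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q) for discrete distributions

# Two discrete distributions on the same three-point support
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

kl_pq = entropy(p, q)            # KL(p || q)
kl_qp = entropy(q, p)            # KL(q || p) -- generally different: the KL is not symmetric
tv = 0.5 * np.abs(p - q).sum()   # total variation distance, which *is* symmetric

print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}, TV = {tv:.4f}")
```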

Why do I need it? / How do I use it?

One place where it is widely used, for example, is approximate Bayesian inference [4], where, essentially, one is interested in the following generic problem:

Let $\mathcal F$ be a restricted set of distributions such as the exponential family associated with a sufficient statistic. The problem is to find a distribution $q \in \mathcal F$ that is close in some sense to a target distribution of interest $p$.

Indeed, in Bayesian inference you may (often) end up with a posterior distribution that is hard to use (e.g., hard to sample from, or hard to compute moments of), and you may therefore want to find a "good approximation" in a family of distributions where you know how to do all these things (typically, a Gaussian). The problem is then to find the "best fitting" distribution, and the "best fit" is typically measured in the sense of the KL, which leads to reasonably simple algorithms (depending on how you write the KL, this will lead you to the variational Bayes algorithm implemented in STAN for example, or to Minka's expectation propagation algorithm [5]).
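As a toy illustration of that idea (a minimal sketch only: the bimodal target, the single-Gaussian family and the grid search below are all choices made for illustration, not the Stan or EP machinery), one can fit a Gaussian $q$ to a target $p$ by minimizing a Monte Carlo estimate of $\mathrm{KL}(q\,\|\,p)$, which only requires $p$ up to its normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Unnormalized log-density of the target: a two-component Gaussian mixture."""
    return np.log(0.5 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 2.0) ** 2))

def kl_q_to_p(mu, sigma, n_samples=20_000):
    """Monte Carlo estimate of KL(q || p), up to p's normalizing constant, with q = N(mu, sigma^2)."""
    x = rng.normal(mu, sigma, size=n_samples)                                   # samples from q
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_q - log_p(x))                                            # E_q[log q - log p]

# Brute-force search over a small grid of Gaussian parameters
# (real implementations use gradients instead of a grid)
grid_mu = np.linspace(-3, 3, 31)
grid_sigma = np.linspace(0.5, 3.0, 26)
best = min(((kl_q_to_p(m, s), m, s) for m in grid_mu for s in grid_sigma), key=lambda t: t[0])
print(f"best Gaussian approximation: mu = {best[1]:.2f}, sigma = {best[2]:.2f}")
```

Minimizing $\mathrm{KL}(q\,\|\,p)$ as above is the choice made by variational Bayes; expectation propagation instead works with $\mathrm{KL}(p\,\|\,q)$, which behaves differently (it tends to cover the mass of $p$ rather than lock onto one mode).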

You could consider other distances/divergences but they might not lead you to an easily implementable algorithm (people have tried minimizing the Wasserstein for example, it works but it's hard [6]).

A wide range of other methods use, or can be interpreted as using, the KL; the expectation maximization (EM) algorithm for finding maximum likelihood or maximum a posteriori estimators is one of them.
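For concreteness, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture (the synthetic data and the number of components are chosen purely for illustration); the E-step can be read as the KL-minimizing choice of the distribution over the hidden component labels in EM's lower bound:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn from two well-separated Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])

# Initial guesses for the mixture weights, means and standard deviations
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: responsibilities r[i, k] = p(component k | x_i) under the current parameters.
    # Setting q(z) to this posterior is exactly what minimizes KL(q || p(z | x)) in the EM bound.
    dens = pi * normal_pdf(x[:, None], mu, sigma)      # shape (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibility-weighted data
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", np.round(pi, 2), "means:", np.round(mu, 2), "stds:", np.round(sigma, 2))
```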

Another application is testing for independence between two random variables, where you can compute (an approximation of) the KL between the joint $p_{XY}$ and the product of the marginals $p_X p_Y$ (the G-test is one such test [7]).
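A minimal sketch of that computation (the contingency table below is made up for illustration): the G statistic is $2N$ times the KL divergence between the empirical joint distribution and the product of the empirical marginals.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2x3 contingency table of observed counts for two categorical variables
obs = np.array([[30, 14, 6],
                [10, 22, 18]], dtype=float)
n = obs.sum()

joint = obs / n                                              # empirical joint p_XY
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n    # expected counts under independence
kl = np.sum(joint * np.log(joint / (expected / n)))          # KL(p_XY || p_X p_Y)

G = 2 * n * kl                                               # the G-test statistic
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_value = chi2.sf(G, dof)                                    # asymptotic chi-squared null distribution
print(f"KL = {kl:.4f}, G = {G:.2f}, p-value = {p_value:.4g}")
```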

And there are many more applications.

Conclusion

The KL is probably as ubiquitous in the stats/ML world as the Euclidean norm is in linear algebra. The analogy extends to the fact that both are easy to work with, relatively easy to interpret, and lead to minimization algorithms that are easy to implement (gradient-based).

Some references

[1] https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

[2] https://en.wikipedia.org/wiki/Wasserstein_metric

[3,3b] https://personalrobotics.ri.cmu.edu/files/courses/papers/Amari1998a.pdf, http://arxiv.org/pdf/1412.1193.pdf

[4] https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

[5] http://research.microsoft.com/en-us/um/people/minka/papers/ep/

[6] https://arxiv.org/pdf/1310.4375.pdf

[7] https://en.wikipedia.org/wiki/G-test