"Distance" between distributions and convergence of Markov chains


For what follows, let us restrict ourselves to discussing discrete probability distributions.

When learning about relative entropy (also known as Kullback-Leibler divergence) $D(\cdot || \cdot)$, one typically learns that it does not define a metric, since $D(P||Q) \ne D(Q||P)$ for some distributions $P,Q$.
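A quick numerical check makes the asymmetry concrete (a minimal sketch; the two Bernoulli distributions below are arbitrary choices):

```python
import math

def kl(p, q):
    """Relative entropy D(P||Q), in nats, for two discrete distributions
    given as probability vectors over the same finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]   # fair coin
Q = [0.9, 0.1]   # biased coin

print(kl(P, Q))  # ~0.5108
print(kl(Q, P))  # ~0.3681, so D(P||Q) != D(Q||P)
```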

In general, I don't think there is any well-defined notion of distance between two probability distributions.

However, when learning about Markov chains, we learn about nice conditions under which the distribution $\pi_n$ of the chain at time $n$, started from an arbitrary initial distribution $\pi_0$, converges to the stationary distribution $\pi$ of the Markov chain (assuming it exists).

Since there is no well-defined metric on the set of probability distributions, what is going on here? How is this convergence defined? In precisely what sense is $\pi_n$ approaching $\pi$? Is there some (obvious) underlying topology that I am just not seeing? Can someone help reconcile what's going on here?

1 Answer

Yes, there are many possible topologies on the space of all probability distributions. You can see https://en.wikipedia.org/wiki/Convergence_of_measures for an overview. The most commonly encountered is the weak topology; one of several equivalent definitions is that a sequence of probability measures $P_n$ on, say, $\mathbb{R}$, converges to $P$ iff $\int f\,dP_n \to \int f\,dP$ for every bounded continuous function $f$. It's also called "convergence in law" or "convergence in distribution". It is discussed in most graduate-level probability books (e.g. Durrett, Resnick). The canonical source to learn more is Billingsley, Convergence of Probability Measures.
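To see what weak convergence does and does not require, consider the standard example $P_n = \delta_{1/n}$, the point mass at $1/n$. For any bounded continuous $f$ we have $\int f\,dP_n = f(1/n) \to f(0) = \int f\,d\delta_0$, so $P_n \to \delta_0$ weakly; yet $P_n(\{0\}) = 0$ for every $n$ while $\delta_0(\{0\}) = 1$, so the measures of individual sets need not converge.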

It is true, for instance, that under appropriate conditions, the $n$-step distribution of a Markov chain converges, in the weak topology, to the stationary distribution.
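Here is a minimal numerical sketch of that convergence (the 2-state transition matrix is an arbitrary choice; on a finite state space, weak convergence of probability vectors is just entrywise convergence):

```python
# Iterate pi_{n+1} = pi_n P and watch pi_n approach the stationary distribution.

P = [[0.9, 0.1],   # transition matrix: P[i][j] = Pr(next = j | current = i)
     [0.2, 0.8]]

def step(pi, P):
    """One step of the chain: pi_{n+1} = pi_n P."""
    return [sum(pi[i] * P[i][j] for i in range(len(pi)))
            for j in range(len(P[0]))]

pi = [1.0, 0.0]    # start in state 0 with probability 1
for _ in range(50):
    pi = step(pi, P)

print(pi)          # ~[0.6667, 0.3333]; the stationary pi solves pi = pi P,
                   # and here that gives pi = (2/3, 1/3)
```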

"In general, I don't think there is any well-defined notion of distance between two probability distributions."

Well, you are incorrect. The K-L divergence doesn't happen to be a metric, but there are plenty of genuine metrics on the space of probability distributions. For instance, there's the Lévy metric, which induces the weak topology. The total variation distance is also a metric, but it induces a strictly finer topology (convergence in total variation implies weak convergence, but not conversely), which for many purposes is less useful.
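For discrete distributions on a common support, as in the question, total variation has the simple formula $d_{TV}(P,Q) = \frac{1}{2}\sum_x |P(x) - Q(x)|$, and unlike K-L divergence it is symmetric and satisfies the triangle inequality. A minimal sketch, reusing the two coins from above:

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same support."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Symmetric, unlike K-L divergence:
print(tv_distance([0.5, 0.5], [0.9, 0.1]))  # 0.4
print(tv_distance([0.9, 0.1], [0.5, 0.5]))  # 0.4
```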