A substitute for TV distance on a compact metric space


Could anyone help me understand the following excerpt from an article (which, however, is not available online)?

''In Markov process theory, it is common to show convergence to the invariant measure in total variation distance. However, we have no hope of such convergence in general, since we can always choose the initial distribution so that the process does not converge: if the invariant measure is absolutely continuous, we could choose the initial distribution to be singular with respect to Lebesgue measure, and vice versa. We can avoid this problem by using a weaker metric known as the Kantorovich metric.....''

Q: Why is it common? What are the benefits of the TV norm over others? Why is there no hope of convergence? (I ask because the author gives no counterexample at all, which might be easy to see for a probabilist, but I am very new to the topic of Markov chains on metric spaces.) I also do not understand the remedy involving singular measures. And how does the weaker metric help?

Thanks!


On BEST ANSWER

Without knowing the article in question, it is somewhat difficult to answer, but here are some facts that the paper seems to be using:

A preliminary remark: the total variation norm of a probability measure is always 1, so the TV *distance* between two probability measures is at most 1. (This is almost definitional; the upper bound of 1 is easy to show, and by refining partitions of the sample space you can get arbitrarily close to 1 when the measures are mutually singular.)

1) TV distance is a nice distance because it is bounded by 1. It is also relatively easy to compute if the measures have densities, since it has an explicit form: $d_{TV}(\mu, \nu) = \frac{1}{2} \int |f - g|$, where $f$ and $g$ are the densities. It is also very interpretable as a coupling: $d_{TV}(\mu, \nu)$ is the smallest possible value of $P(X \neq Y)$ over all couplings $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$. And it is easy to visualize if your measures have densities with respect to the same reference measure: look at the area between the density graphs (best done when both have densities w.r.t. Lebesgue measure).
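As a sanity check on the explicit form $\frac{1}{2}\int|f-g|$, here is a minimal numerical sketch (assuming NumPy is available; `tv_distance` and `gauss` are hypothetical helper names, not from the article) that approximates the TV distance between two Gaussian densities:

```python
import numpy as np

def tv_distance(f, g, grid):
    """Approximate d_TV = (1/2) * integral of |f - g| on a uniform grid."""
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(np.abs(f(grid) - g(grid))) * dx

def gauss(mu, sigma):
    """Density of N(mu, sigma^2)."""
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-12.0, 14.0, 200001)
d_close = tv_distance(gauss(0, 1), gauss(0.5, 1), grid)  # strictly between 0 and 1
d_far = tv_distance(gauss(0, 1), gauss(8, 1), grid)      # nearly disjoint mass: close to 1
```

For two unit-variance Gaussians the exact value is $2\Phi(|\mu_1-\mu_2|/2) - 1$, so `d_close` should be about 0.197, while `d_far` illustrates how the distance saturates at 1 as the measures become nearly mutually singular.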

2) The TV distance between a continuous distribution and a discrete one is, however, always 1, so it is useless for comparing the two. For instance, sum i.i.d. discrete random variables (think of Bernoulli) and standardize: by the central limit theorem the standardized sum converges to a normal distribution in distribution, but it does not converge in total variation, since each standardized sum is supported on finitely many points and is therefore at TV distance exactly 1 from the (continuous) normal law. That is a problem.
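The Bernoulli example can be checked numerically. Below is a sketch (assuming SciPy; `kolmogorov_distance` is a hypothetical helper name) showing that the Kolmogorov distance, which here metrizes convergence in distribution, shrinks as $n$ grows, while the TV distance stays exactly 1 because the standardized sum puts all its mass on a Lebesgue-null set:

```python
import numpy as np
from scipy.stats import binom, norm

def kolmogorov_distance(n, p=0.5):
    """sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    k = np.arange(n + 1)
    x = (k - n * p) / np.sqrt(n * p * (1 - p))  # standardized atoms
    # The sup is attained at a jump point; check both sides of each jump.
    upper = np.abs(binom.cdf(k, n, p) - norm.cdf(x))
    lower = np.abs(binom.cdf(k - 1, n, p) - norm.cdf(x))
    return max(upper.max(), lower.max())

# Kolmogorov distance shrinks (convergence in distribution)...
d10, d1000 = kolmogorov_distance(10), kolmogorov_distance(1000)
# ...but the TV distance is exactly 1 for every n: the standardized sum
# is supported on n + 1 points, a set of Lebesgue measure 0 under N(0, 1).
tv = 1.0
```

Note that `tv` is not computed here; it is pinned to 1 by the mutual-singularity argument in the text, which is precisely the point of the example.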

Instead, one can look at the Wasserstein distance between the two measures, which is defined via transportation: say it costs $c(x,y)$ to move one unit of mass from $x$ to $y$; how much will the transport cost if I want to rearrange my measure $\mu$ so that it looks like $\nu$? The infimum of this cost over all transport plans is the Wasserstein distance. It is well defined between measures of any type (discrete, continuous, or mixed), and indeed many convergence arguments can be carried out in this metric. (Convergence in Wasserstein distance also implies convergence in distribution.)
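For the one-dimensional cost $c(x,y) = |x-y|$, SciPy provides `scipy.stats.wasserstein_distance`, so we can see the same Bernoulli/CLT example that fails in TV succeed in Wasserstein. A sketch (`w1_binomial_vs_normal` is a hypothetical helper name, and the normal law is approximated by a fine discrete grid):

```python
import numpy as np
from scipy.stats import binom, norm, wasserstein_distance

def w1_binomial_vs_normal(n, p=0.5):
    """W1 distance between the standardized Binomial(n, p) sum and N(0, 1)."""
    k = np.arange(n + 1)
    atoms = (k - n * p) / np.sqrt(n * p * (1 - p))  # atoms of the discrete law
    pmf = binom.pmf(k, n, p)
    # Approximate N(0, 1) by a fine discrete grid weighted by its density.
    grid = np.linspace(-8.0, 8.0, 20001)
    w = norm.pdf(grid)
    w /= w.sum()
    return wasserstein_distance(atoms, grid, pmf, w)

w_small, w_large = w1_binomial_vs_normal(4), w1_binomial_vs_normal(400)
```

Although each standardized sum stays at TV distance 1 from $N(0,1)$, its Wasserstein distance shrinks as $n$ grows (`w_large < w_small`), which is exactly the sense in which the weaker metric rescues the convergence statement.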