How can I understand Wasserstein Metric?

I've come across the Wasserstein metric in several different topics, mostly in sampling and in mathematical models of machine learning.

For two probability densities $\mu,\nu$ on $\mathbb{R}^d$, the Wasserstein distance between $\mu$ and $\nu$ can be defined as: $$ W_2(\mu,\nu) = \left( \inf_{P \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, dP(x,y) \right)^{1/2} $$ where $\Gamma(\mu,\nu)$ is the set of all joint distributions on $\mathbb{R}^{2d}$ with marginal distributions $\mu$ and $\nu$.

My questions are:

  1. How can I understand the definition intuitively?

  2. My teacher says there is a connection between the Wasserstein distance, partial differential equations, and optimization. How can I understand that?

  3. Are there any good references or notes on this topic?

There are 2 best solutions below

On BEST ANSWER

One way I like to think about the Wasserstein distance, similar to Robert's answer and Milo's comment, is as the minimum amount of effort required to reshape one of the distributions $\mu, \nu$ into the other, or more literally as a distance between two probability distributions. This can be visualized with a picture [1].

This is the most intuitive interpretation I have seen. As for your second question, the Wasserstein distance comes up when looking at sampling algorithms like Metropolis-Hastings, which are commonly used to answer questions similar to those posed in optimization settings. Instead of learning a single optimal parameter (as in direct optimization), you learn the distribution of the optimal parameter. In MCMC, the Wasserstein distance is used to evaluate how close a distribution is to the target, in addition to the more direct connection to optimization mentioned by Robert. A large Wasserstein distance between the target and proposal distributions may suggest a poor choice of parameters or acceptance probability, so it could be used to tune the algorithm toward a low Wasserstein distance, which corresponds to more accurate results.
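To make the "how far are my samples from the target" idea concrete, here is a minimal sketch restricted to one dimension. It relies on the fact that, for two samples of equal size with equal weights on the real line, the $W_2$ distance between their empirical distributions is obtained by sorting both samples and pairing them in order. The function name `empirical_w2_1d` and the Gaussian stand-in samples are assumptions for illustration only; no actual Metropolis-Hastings chain is run here.

```python
import random
from math import sqrt


def empirical_w2_1d(a, b):
    """W_2 between the empirical distributions of two equal-size 1D samples.

    Hypothetical helper: in 1D with equal weights, pairing the sorted
    samples in order gives an optimal transport plan.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally many samples")
    a_sorted, b_sorted = sorted(a), sorted(b)
    mean_sq = sum((x - y) ** 2 for x, y in zip(a_sorted, b_sorted)) / len(a)
    return sqrt(mean_sq)


rng = random.Random(0)
n = 2000
# Stand-ins for "draws from the target" and "draws produced by a sampler".
target = [rng.gauss(0.0, 1.0) for _ in range(n)]
well_tuned = [rng.gauss(0.0, 1.0) for _ in range(n)]   # same distribution
badly_tuned = [rng.gauss(1.5, 1.0) for _ in range(n)]  # shifted by 1.5

d_good = empirical_w2_1d(target, well_tuned)
d_bad = empirical_w2_1d(target, badly_tuned)
print(f"W2(target, well tuned)  ~ {d_good:.3f}")
print(f"W2(target, badly tuned) ~ {d_bad:.3f}")
```

The sampler whose draws are shifted away from the target should show the larger distance. In higher dimensions sorting no longer works and one needs a genuine optimal transport solver.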

A more in-depth source on the Wasserstein distance is the book "An Invitation to Statistics in Wasserstein Space": https://library.oapen.org/bitstream/id/b27ca94b-41a7-486c-863f-8de6b3a8f914/2020_Book_AnInvitationToStatisticsInWass.pdf

On

It may be easier to think of discrete masses rather than densities. Suppose $\mu$ and $\nu$ are discrete probability measures. You try to get $\nu$ from $\mu$ by moving around various packets of mass. The cost of moving a packet of mass $m$ a distance $d$ is $m d^2$. The minimum total cost of doing this gives the Wasserstein distance from $\mu$ to $\nu$ (with the usual normalization, the minimum cost is $W_2(\mu,\nu)^2$, so $W_2$ itself is its square root). Finding the way to transform $\mu$ into $\nu$ at minimum total cost is the connection to optimization.
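The packet-moving picture can be sketched in code for a tiny case. The sketch below assumes (this is not part of the answer above) that both measures put mass $1/n$ on each of $n$ points; in that special case an optimal plan can be taken to be a one-to-one matching, so trying every permutation works for small $n$. The names `squared_distance` and `min_transport_cost` are hypothetical, and brute force is only meant to show the optimization problem, not to solve it efficiently.

```python
from itertools import permutations
from math import sqrt


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_transport_cost(sources, targets):
    """Brute-force the cheapest one-to-one matching of equal-mass packets.

    Assumption: each of the n source and n target points carries mass 1/n.
    Returns (minimum total cost, matching as a tuple of target indices).
    """
    n = len(sources)
    if n == 0 or n != len(targets):
        raise ValueError("this sketch needs equally many (and some) points")
    mass = 1.0 / n
    best_cost, best_match = float("inf"), None
    for match in permutations(range(n)):
        # Moving a packet of mass m over a distance d costs m * d**2.
        cost = sum(mass * squared_distance(sources[i], targets[j])
                   for i, j in enumerate(match))
        if cost < best_cost:
            best_cost, best_match = cost, match
    return best_cost, best_match


mu_points = [(0.0,), (1.0,)]  # mu: mass 1/2 at 0 and at 1
nu_points = [(2.0,), (3.0,)]  # nu: mass 1/2 at 2 and at 3
cost, match = min_transport_cost(mu_points, nu_points)
w2 = sqrt(cost)  # W_2 is the square root of the minimum total cost
print(cost, match, w2)
```

Here sending 0 to 2 and 1 to 3 costs $\tfrac12 \cdot 4 + \tfrac12 \cdot 4 = 4$, while the crossed matching costs $\tfrac12 \cdot 9 + \tfrac12 \cdot 1 = 5$, so the minimum cost is 4 and $W_2 = 2$. For general weights the problem becomes a linear program over transport plans rather than a search over permutations.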