Different versions of the entropy term in the entropy-regularized Wasserstein distance


\begin{equation} \mathcal{W}_\epsilon(\alpha, \beta) = \min_{\pi\in \Pi(\alpha,\beta)} \int c(x,y) \,\mathrm{d}\pi(x,y) + \epsilon H(\pi \| \alpha \otimes \beta) \end{equation} Cuturi (2013) introduced the entropy-regularized Wasserstein distance, or Sinkhorn distance, shown above, where $\epsilon$ is the regularization parameter and $H(\pi \| \alpha \otimes \beta)$ is the relative entropy (KL divergence) between the transport plan and the product of the marginals.
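For concreteness, here is a minimal numerical sketch of the Sinkhorn iterations that solve this regularized problem in the discrete case. This is an illustration, not Cuturi's reference implementation; the function name `sinkhorn` and the iteration count are my own choices.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=500):
    """Sinkhorn iterations for entropy-regularized optimal transport.

    a, b : marginal weight vectors (each sums to 1)
    C    : cost matrix, C[i, j] = c(x_i, y_j)
    eps  : regularization strength epsilon
    Returns the transport plan pi and the unregularized transport cost of pi.
    """
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # rescale to match column marginals
        u = a / (K @ v)                # rescale to match row marginals
    pi = u[:, None] * K * v[None, :]   # recovered transport plan
    cost = np.sum(pi * C)
    return pi, cost

# two uniform marginals on three points, squared-distance cost
x = np.array([0.0, 1.0, 2.0])
C = (x[:, None] - x[None, :]) ** 2
a = b = np.ones(3) / 3
pi, cost = sinkhorn(a, b, C, eps=0.1)
print(pi.sum(axis=1))  # row marginals match a = [1/3, 1/3, 1/3]
```

As $\epsilon$ grows, the returned plan $\pi$ spreads out toward the product measure, which is exactly the "smoothing" role of the entropy term discussed below.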

But I have seen the $H(\cdot)$ term shown in two different ways, one with entropy and the other with relative entropy:

\begin{align} H(\pi) &= \int \pi(x,y) \ln \pi(x,y) \,\mathrm{d}x\,\mathrm{d}y \\ H(\pi \| \alpha \otimes \beta) &= \int \ln \left(\frac{\mathrm{d}\pi (x,y)}{\mathrm{d}\alpha(x)\, \mathrm{d}\beta(y) } \right) \mathrm{d}\pi (x,y) \end{align}

How are these two expressions connected? They are clearly not identical, so why do both versions appear in the literature?


Answer 1 (accepted)

These two are actually equivalent up to a constant when $\pi$ is a coupling of $\alpha$ and $\beta$. I'll assume that $\pi,\alpha, \beta$ all have densities. We can then write:

$$ H(\pi||\alpha\otimes \beta) = \int\ln\left(\frac{d\pi}{d\alpha d\beta} \right)d\pi = \int \pi(x,y) \ln\left(\frac{\pi(x,y)}{\alpha(x)\beta(y)} \right) dx dy $$

Note that $\pi(x,y)$ is the density with respect to the Lebesgue measure, and the same can be said for $\alpha(x)$ and $\beta(y)$. Therefore:

\begin{align} H(\pi\|\alpha\otimes \beta) &= \int\pi(x,y)\ln \pi(x,y)\, dx\, dy - \int\pi(x,y)\ln(\alpha(x))\,dx\,dy - \int\pi(x,y)\ln(\beta(y))\,dx\,dy \\ &= \int \pi(x,y) \ln\pi(x,y)\, dx\, dy - \int\alpha(x)\ln\alpha(x)\, dx -\int \beta(y) \ln \beta(y)\, dy \\ &= H(\pi) - H(\alpha) - H(\beta), \end{align}

where the middle step uses that $\pi$ has marginals $\alpha$ and $\beta$, e.g. $\int \pi(x,y)\, dy = \alpha(x)$.

Since $\alpha$ and $\beta$ are fixed, we get $H(\pi \| \alpha \otimes \beta) = H(\pi) + C$, where $C = -H(\alpha) - H(\beta)$ is a constant independent of $\pi$.
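This identity is easy to verify numerically for a discrete coupling, using the convention $H(p) = \sum p \ln p$ from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# a random discrete coupling pi on a 4x5 grid
pi = rng.random((4, 5))
pi /= pi.sum()
alpha = pi.sum(axis=1)   # row marginal
beta = pi.sum(axis=0)    # column marginal

def neg_entropy(p):
    # H(p) = sum p ln p  (the negative Boltzmann entropy convention used here)
    return np.sum(p * np.log(p))

kl = np.sum(pi * np.log(pi / (alpha[:, None] * beta[None, :])))
identity = neg_entropy(pi) - neg_entropy(alpha) - neg_entropy(beta)
print(np.isclose(kl, identity))  # True
```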

Answer 2

I would like to add a couple of points here which I think shouldn't be overlooked.

Neither of the choices is "wrong". In the 2013 Cuturi paper you reference, the regularizer is the "entropy" (note that this is actually the negative Boltzmann entropy):

$$ H(\pi)= \begin{cases} \int \pi \log \pi~~&\text{when}~\pi~\text{has a density} \\ +\infty & \text{otherwise} \end{cases}. $$

  1. The reason this is a natural choice of regulariser is that it does the "smoothing" or "relaxing" job that regularisation is intended to do. Adding $H$ to the optimal transport problem gives the mass "freedom to spread out". For example, let $\mu$ be concentrated on two points $x_1,x_2 \in \mathbb{R}$ with $\mu(x_1)=\mu(x_2)=\frac{1}{2}$, and let $\nu$ be concentrated on two points $y_1,y_2 \in \mathbb{R}$ with $\nu(y_1)=\nu(y_2)=\frac{1}{2}$; then the coupling $\pi$ which minimises $H$ (equivalently, maximises the Boltzmann entropy) is

$$ \pi(x_i,y_j)=\frac{1}{4},\forall~i,j.$$

  2. Since we have a minimisation problem, it's beneficial to add a strictly convex term, and again $H$ ticks that box!

  3. The choice of adding $H(\pi~\|~\alpha\otimes\beta)$, the relative entropy with respect to the product measure, has its own advantages, as outlined in https://audeg.github.io/publications/these_aude.pdf. As far as I understand, it lets you rephrase the dual problem in a neat way.

  4. Now, comparing the two choices: I don't think it matters much either way; both do the same job. As the other answer points out, the minimiser is the same, and the two objectives differ by a constant $C$. Lastly, remember that when "doing regularisation" you usually have a small parameter $\epsilon \ll 1$ multiplying the regularisation term, hence

$$ \epsilon \Big(H(\pi~||~\alpha\otimes\beta)-H(\pi)\Big)=\epsilon C \ll 1 .$$
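The two-point example in item 1 can be checked numerically. Couplings of two uniform two-point marginals form a one-parameter family $\pi = \begin{pmatrix} t & \frac{1}{2}-t \\ \frac{1}{2}-t & t \end{pmatrix}$ with $t \in [0, \frac{1}{2}]$, and a quick scan confirms that $H(\pi) = \sum \pi \ln \pi$ is minimised by the independent coupling $t = \frac{1}{4}$:

```python
import numpy as np

# couplings of mu = (1/2, 1/2) and nu = (1/2, 1/2): one free parameter t
ts = np.linspace(1e-6, 0.5 - 1e-6, 1001)

def H(t):
    pi = np.array([[t, 0.5 - t], [0.5 - t, t]])
    return np.sum(pi * np.log(pi))   # negative Boltzmann entropy

values = [H(t) for t in ts]
t_best = ts[int(np.argmin(values))]
print(round(t_best, 3))  # 0.25: the independent coupling pi(x_i, y_j) = 1/4
```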
