In optimal transport theory, the optimal re-allocation of a probability distribution $\alpha$'s mass onto another distribution $\beta$ is found by minimizing the transport cost over transport plans $\pi$; the optimal value is the Wasserstein distance:
$$W (\alpha, \beta) = \min_{\pi\in \Pi(\alpha,\beta)} \int c(x,y) \mathrm{d}\pi(x,y) $$
Alternatively, the relative entropy-regularized Wasserstein distance, also called Sinkhorn distance, can be used:
$$W_\epsilon (\alpha, \beta) = \min_{\pi\in \Pi(\alpha,\beta)} \int c(x,y) \mathrm{d}\pi(x,y) + \epsilon H(\pi \| \alpha \otimes \beta)$$

where $\epsilon$ is the regularization parameter and the relative entropy is

$$H(\pi \| \alpha \otimes \beta) = \int \ln \left(\frac{\mathrm{d}\pi (x,y)}{\mathrm{d}\alpha(x) \mathrm{d}\beta(y) } \right) \mathrm{d}\pi (x,y) $$

Aude Genevay points out that in the extreme case where the source and target distributions are identical, $\alpha = \beta$, one would expect the entropy-regularized Wasserstein distance (Sinkhorn distance) to equal $0$, since there is nothing to move; however, it fails to do so. Because of this she proposes the Sinkhorn divergence instead, a normalization which does equal $0$ if $\alpha = \beta$:
$$\bar{W}_\epsilon (\alpha, \beta) = W_\epsilon (\alpha, \beta) - \frac{1}{2} [W_\epsilon (\alpha, \alpha) + W_\epsilon (\beta, \beta) ]$$ In other words, $\bar{W}_\epsilon (\alpha, \alpha) = 0$.
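To make this concrete, here is a minimal numerical sketch (NumPy; the point cloud, $\epsilon$, and iteration count are my own illustrative choices, and this plain Sinkhorn loop is not numerically stabilized): computing $W_\epsilon(\alpha,\alpha)$ by Sinkhorn iterations yields a strictly positive value, while the divergence $\bar{W}_\epsilon(\alpha,\alpha)$ vanishes by construction.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iters=2000):
    """Entropy-regularized OT cost W_eps(a, b) on discrete distributions.

    a, b: probability weight vectors; C: cost matrix; eps: regularization.
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(n_iters):             # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]      # (approximately) optimal coupling
    # relative entropy H(P || a ⊗ b); P > 0 everywhere since K > 0
    kl = np.sum(P * np.log(P / (a[:, None] * b[None, :])))
    return np.sum(P * C) + eps * kl

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))              # support points of alpha
a = np.full(5, 1 / 5)                    # uniform weights
C = (x - x.T) ** 2                       # squared-distance cost
eps = 0.1

w_aa = sinkhorn_cost(a, a, C, eps)
div_aa = w_aa - 0.5 * (w_aa + w_aa)      # Sinkhorn divergence at alpha = beta
print(w_aa)    # strictly positive even though alpha == beta
print(div_aa)  # 0 by construction
```

The point of the sketch is only the sign of `w_aa`: the entropic term keeps it away from $0$ even when the two marginals coincide.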
Questions
- Why (or for what levels of regularization) can't the Sinkhorn distance, shown earlier, achieve $0$?
- Does standard optimal transport, which uses the unregularized Wasserstein distance, also suffer from this incapability (even though I know that the Wasserstein distance by itself, without OT, will achieve $0$)?
- And why, mathematically, does the Sinkhorn divergence achieve $0$?
$$W_\epsilon (\alpha, \beta) = \min_{\pi\in \Pi(\alpha,\beta)} \int c(x,y) \mathrm{d}\pi(x,y) + \epsilon H(\pi \| \alpha \otimes \beta)$$

Note that the transport term $\int c(x,y) \mathrm{d}\pi(x,y)$ is only zero when $\pi(A,B)=\pi(A\cap B,A\cap B)$ (i.e. when $\pi$ puts all the mass onto the diagonal $x=y$); otherwise it is positive. Then you are just left to check what happens to $\epsilon H(\pi \| \alpha \otimes \beta)$ when $\pi$ is of such a form.
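As a quick numerical check of that last step (a sketch with NumPy; the uniform weights and the 1-D grid support are my own choices): for uniform $\alpha$ on $n$ points, the diagonal coupling has zero transport cost, but its relative entropy against $\alpha\otimes\alpha$ is $\ln n$, so the objective it attains is $\epsilon \ln n > 0$.

```python
import numpy as np

n = 5
a = np.full(n, 1 / n)            # uniform alpha on n support points
P_diag = np.diag(a)              # coupling with all mass on the diagonal x = y

# illustrative squared-distance cost on a 1-D grid
x = np.arange(n, dtype=float)
C = (x[:, None] - x[None, :]) ** 2

cost = np.sum(P_diag * C)        # transport term: the diagonal of C is 0
# H(P_diag || a ⊗ a): only the diagonal entries carry mass, so the sum
# reduces to sum_i a_i * log(a_i / a_i^2) = sum_i a_i * log(1 / a_i)
kl = np.sum(a * np.log(a / a**2))

print(cost)                      # 0.0
print(kl)                        # equals log(n): the entropy term never vanishes
```

So for any $\epsilon > 0$ the diagonal plan pays a fixed entropic price, which is exactly why $W_\epsilon(\alpha,\alpha)$ cannot reach $0$.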
When you say "even though I know that the Wasserstein distance by itself, without OT, will achieve 0", this doesn't really make sense: the Wasserstein distance is itself the value of an OT problem. To see why $W_2(\alpha,\alpha)=0$, take $\gamma=f_\#\alpha$, where $f:\mathbb{R}^d\to \text{diag}(\mathbb{R}^d)$ is defined by $f(x)=(x,x)$, and define as a candidate (not necessarily optimal) plan $\pi\in\Pi(\alpha,\alpha)$ by $\pi(A)=\gamma(A \cap \text{diag}(\mathbb{R}^d))$. Here $\text{diag}(\mathbb{R}^d)=\{(x,x)\in\mathbb{R}^{2d}~:~x\in\mathbb{R}^d\}$.
Are you asking how $\bar{W}_\epsilon(\alpha,\alpha)=W_\epsilon(\alpha,\alpha)-\frac{1}{2}(W_\epsilon(\alpha,\alpha)+W_\epsilon(\alpha,\alpha))=0$?
Edit: for question 1) and $c(x,y)=\|x-y\|^2$, proceed like this:
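A chain of inequalities matching the Jensen remark below (a reconstruction, since the original display is not shown) is, for any $\pi\in\Pi(\alpha,\beta)$,

$$\int \|x-y\|^2 \,\mathrm{d}\pi(x,y) \;\geq\; \left(\int \|x-y\| \,\mathrm{d}\pi(x,y)\right)^2,$$

so the cost term is strictly positive unless $\|x-y\|=0$ holds $\pi$-almost everywhere, i.e. unless $\pi$ is concentrated on the diagonal.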
where in the last line we used Jensen's inequality for the strictly convex function $x\mapsto x^2$.