Gradient of the Sinkhorn Distance for Regularized Optimal Transport


Given two probability measures $\mathbf{r}\in \Sigma_n$ and $\mathbf{c}\in \Sigma_n$, Cuturi 2013 [1] defines the Sinkhorn distance as: $$ d_{\mathbf{M},\alpha}(\mathbf{r}, \mathbf{c}) = \min_{\mathbf{P} \in U_{\alpha}(\mathbf{r},\mathbf{c})} \langle \mathbf{P}, \mathbf{M} \rangle $$ where the cost matrix $\mathbf{M}$ defines the underlying metric space, and $U_\alpha$ denotes the set of feasible transportation plans: $$ U_{\alpha}(\mathbf{r},\mathbf{c}) = \{\mathbf{P}\in U(\mathbf{r},\mathbf{c}) \,:\, D_{\text{KL}} (\mathbf{P} \,\|\, \mathbf{r}\mathbf{c}^\top) \leq \alpha \} $$ Here $D_{\text{KL}}$ is the KL-divergence and $\alpha$ controls the regularization. In practice $d_{\mathbf{M},\alpha}(\cdot)$ can be computed efficiently by considering the dual problem. The algorithm looks like:

[Image: Algorithm 1 from Cuturi 2013 — the Sinkhorn–Knopp matrix scaling iterations for computing $d_{\mathbf{M},\alpha}$.]
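Since the algorithm image did not survive, here is a minimal sketch of the matrix-scaling iterations it describes (Algorithm 1 of Cuturi 2013); the function and variable names here are my own:

```python
import numpy as np

def sinkhorn(M, r, c, lam, n_iter=1000):
    """Sinkhorn-Knopp matrix scaling for entropically regularized OT.

    M: (n, n) cost matrix; r, c: marginals in the simplex; lam: regularization.
    Returns the scaling vectors (u, v) and the transport plan
    P = diag(u) @ K @ diag(v), where K = exp(-lam * M).
    """
    K = np.exp(-lam * M)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)  # rescale to match the column marginals
        u = r / (K @ v)    # rescale to match the row marginals
    P = u[:, None] * K * v[None, :]
    return u, v, P
```

The regularized transport cost is then `np.sum(P * M)`.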

Essentially, I would like to minimize energies that involve Sinkhorn distances, such as: $$ \arg\min_{\boldsymbol{r}\in \Sigma_n} \sum\limits_{i=1}^k d_{\mathbf{M},\alpha}(\boldsymbol{r}, \mathbf{c}_i) $$ This is only one example; in general, to this end I need the gradient of the Sinkhorn distance w.r.t. the empirical measures, i.e. $\nabla_{\mathbf{r}} d_{\mathbf{M},\alpha}$ and $\nabla_{\mathbf{c}} d_{\mathbf{M},\alpha}$.

Using [2] (Propositions 1 and 2), I have a hunch that the gradient w.r.t. $\mathbf{r}$ would read: $$ \nabla_\mathbf{r} d_{\mathbf{M}}^\lambda = -\frac{\log(\mathbf{u})}{\lambda}+\frac{\log(\mathbf{u})^\top\mathbf{1}_n}{\lambda n}\mathbf{1}_n $$

where $d_{\mathbf{M}}^\lambda$ refers to the dual problem that is equivalent to the primal $d_{\mathbf{M},\alpha}$, and the pair $(\mathbf{u}\in\mathbb{R}^n_+,\mathbf{v}\in\mathbb{R}^n_+)$ is computed through Sinkhorn matrix scaling applied to $K=\exp(-\lambda \mathbf{M})$. The second term on the right-hand side re-centers the gradient so that it stays in the tangent bundle of the simplex. Does that mean that the gradient w.r.t. $\mathbf{c}$ would be:

$$ \nabla_\mathbf{c} d_{\mathbf{M}}^\lambda = -\frac{\log(\mathbf{v})}{\lambda}+\frac{\log(\mathbf{v})^\top\mathbf{1}_n}{\lambda n}\mathbf{1}_n $$ or is there another expression?

In general, any ideas on computing these quantities without resorting to auto-differentiation would be welcome.
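For concreteness, here is how the two candidate gradients above could be assembled from the Sinkhorn scaling vectors (a sketch of my conjecture, not a verified result; the mean-subtraction implements the re-centering term, keeping each gradient in the tangent space of the simplex):

```python
import numpy as np

def sinkhorn_gradients(u, v, lam):
    """Candidate gradients of the dual Sinkhorn objective w.r.t. r and c.

    g_r = -log(u)/lam and g_c = -log(v)/lam, each re-centred so that it
    sums to zero, i.e. projected onto the tangent space of the simplex.
    """
    g_r = -np.log(u) / lam
    g_c = -np.log(v) / lam
    return g_r - g_r.mean(), g_c - g_c.mean()
```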

1 Answer

Here, the answer comes easily after a reparameterization. First, note that when $P$ satisfies the marginal (equality) constraints, the KL constraint can be written as $H(r)+H(c)-H(P) \leq \alpha$, which can be rewritten as $H(P) \geq h_0$ after absorbing constants. This inequality constraint is saturated at the optimum, since the solution is a saddle point of the Lagrangian, so we can treat it as an equality constraint.

Writing out the Lagrangian as

$$L(P; \alpha, \beta, \epsilon) = \langle P, C\rangle_F - \epsilon H(P) - \alpha^\top (P\mathbf{1} - r) - \beta^\top (P^\top \mathbf{1} - c)$$

Setting $\partial L / \partial P_{ij} = 0$ and solving, we see that (after absorbing a constant factor into $v$)

$$P_{ij} = e^{\frac{\alpha_i + \beta_j - C_{ij} }{\epsilon }} = u_i K_{ij } v_j$$

with the last expression written in the notation of your question. So $u$ and $v$ are really just transformations of the Lagrange multipliers, $\alpha = \epsilon \log(u)$ and $\beta = \epsilon \log(v)$ (up to additive constants). This gives a direct link to the derivative of the objective with respect to $r$, via the interpretation of the Lagrange multipliers as the change in the objective with respect to the value of the constraint. Specifically,

$$\frac {\partial L} {\partial r} = \alpha = \epsilon \log(u)$$

And likewise for $c$.
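This Lagrange-multiplier identity can be sanity-checked numerically with a central finite difference along a direction tangent to the simplex (a sketch; `entropic_ot` is my own minimal Sinkhorn routine, and the multipliers are only defined up to an additive constant, which cancels against a sum-zero direction):

```python
import numpy as np

def entropic_ot(M, r, c, eps, n_iter=500):
    """Value of min_P <P, M> - eps * H(P) subject to marginals (r, c),
    computed by Sinkhorn scaling; also returns the left scaling vector u."""
    K = np.exp(-M / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    P = u[:, None] * K * v[None, :]
    value = np.sum(P * M) + eps * np.sum(P * np.log(P))  # <P,M> - eps*H(P)
    return value, u

rng = np.random.default_rng(1)
n = 5
M = rng.random((n, n))
r = np.full(n, 1.0 / n)
c = rng.random(n)
c /= c.sum()
eps = 0.5

_, u = entropic_ot(M, r, c, eps)
grad = eps * np.log(u)  # alpha = eps*log(u), up to an additive constant

# Central finite difference along a sum-zero (tangent) direction d,
# so r + d and r - d both stay on the simplex:
d = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) * 1e-5
f_plus, _ = entropic_ot(M, r + d, c, eps)
f_minus, _ = entropic_ot(M, r - d, c, eps)
fd = (f_plus - f_minus) / 2.0  # should approximate <grad, d>
```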

It's also possible to show this by noting that, at convergence, the Lagrange multipliers are fixed points of a system of functions, rewriting these as roots of two implicit functions, and invoking the implicit function theorem. The result is the same but, in my opinion, less illuminating and more tedious.

As a last note, you're correct that the gradient should be adjusted for the normalization constraint on $r$. The simplest way to handle this is to treat the gradient as an element of the tangent space of the multinomial manifold at $r$ and use the projection under the Fisher metric. This requires first raising the index of the differential with the metric on the ambient space:

$$ (\sharp\, df)^i = p_i\, \partial_i f$$

And then projection onto the submanifold:

$$ (\operatorname{grad} f)^i = p_i \Big(\partial_i f - \sum_{k} p_k\, \partial_k f\Big) $$

Then, checking that the Fisher inner product with a tangent vector $x$ gives the correct directional derivative: $$g(x, \operatorname{grad} f) = \langle \flat (\operatorname{grad} f), x\rangle = \langle x, \nabla f\rangle - \langle p, \nabla f\rangle \langle x, \mathbf{1}\rangle = \langle x, \nabla f\rangle $$

which is exactly the definition of the directional derivative when $x$ is constrained to sum to $0$.
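As a sketch, the projected Fisher-metric gradient above is a one-liner (function name is my own):

```python
import numpy as np

def fisher_grad(p, df):
    """Riemannian gradient of f on the simplex under the Fisher metric,
    given the Euclidean partials df: (grad f)^i = p_i*(df_i - sum_k p_k df_k).
    The result sums to zero, i.e. it lies in the tangent space at p."""
    return p * (df - p @ df)
```

The directional-derivative identity can then be checked numerically: for any $x$ summing to zero, the Fisher inner product $\sum_i x_i (\operatorname{grad} f)^i / p_i$ equals $\langle x, \nabla f\rangle$.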