What is the gradient of the N-Pair loss?


In the paper "Improved Deep Metric Learning with Multi-class N-pair Loss Objective" [1], the author proposes the N-Pair loss as an efficient and effective alternative to the triplet loss [2]. The author also states that this loss

pushes (N-1) negative examples all at once, based on their similarity to the input example.

Consider an $(N+1)$-tuplet of training samples $\{x, x^+, x_1, \dots, x_{N-1}\}$, where $x^+$ is a positive sample for $x$ and $\{x_i\}_{i=1}^{N-1}$ are negative samples. The N-Pair loss is defined as follows:

\begin{equation}\label{n_pairs_loss} \mathcal{L}(\{x,x^+,\{x_i\}_{i=1}^{N-1}\}; f) = -\log \frac{\exp(f^T f^+)}{\exp(f^T f^+) + \sum_{i=1}^{N-1}\exp(f^T f_i)} \end{equation}

where $f(\cdot;\theta)$ is an embedding kernel defined by a deep neural network; $f$, $f^+$, and $f_i$ denote the embeddings of $x$, $x^+$, and the $i$-th negative sample, respectively. This loss is minimized when the representation of $x$ is closer to that of $x^+$ than to those of all negatives $x_i$.
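To make the definition concrete, here is a minimal NumPy sketch of the loss for a single tuplet (the function name and toy shapes are my own, not from the paper):

```python
import numpy as np

def n_pair_loss(f, f_pos, f_negs):
    """N-pair loss for a single (N+1)-tuplet.

    f      -- embedding of the anchor x, shape (d,)
    f_pos  -- embedding of the positive sample x+, shape (d,)
    f_negs -- embeddings of the N-1 negative samples, shape (N-1, d)
    """
    pos = np.exp(f @ f_pos)    # exp(f^T f^+)
    negs = np.exp(f_negs @ f)  # exp(f^T f_i) for each negative
    return -np.log(pos / (pos + negs.sum()))
```

Since the fraction inside the log is a softmax probability in $(0,1)$, the loss is always positive.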

The N-Pair loss can equivalently be written as:

\begin{equation}\label{n_pairs_loss_alt} \mathcal{L}(\{x,x^+,\{x_i\}_{i=1}^{N-1}\}; f) = \log \left(1 + \sum_{i=1}^{N-1}\exp(f^T f_i - f^T f^+)\right) \end{equation}
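The equivalence follows in one step by dividing the numerator and denominator of the fraction in the first form by $\exp(f^T f^+)$:

\begin{equation} -\log \frac{\exp(f^T f^+)}{\exp(f^T f^+) + \sum_{i=1}^{N-1}\exp(f^T f_i)} = \log \frac{\exp(f^T f^+) + \sum_{i=1}^{N-1}\exp(f^T f_i)}{\exp(f^T f^+)} = \log \left(1 + \sum_{i=1}^{N-1}\exp(f^T f_i - f^T f^+)\right) \end{equation}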

To better understand the effect of optimizing the N-Pair loss we can investigate the gradient.
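Before deriving anything in closed form, one can probe the gradient numerically. Below is a small finite-difference sketch (function names and toy shapes are my own) that estimates $\partial\mathcal{L}/\partial f$ with respect to the anchor embedding; any closed-form expression one derives can be checked against it:

```python
import numpy as np

def n_pair_loss(f, f_pos, f_negs):
    # equivalent form: log(1 + sum_i exp(f^T f_i - f^T f^+))
    return np.log1p(np.exp(f_negs @ f - f @ f_pos).sum())

def numerical_grad(loss, f, eps=1e-6):
    # central finite differences, one coordinate of f at a time
    g = np.zeros_like(f)
    for j in range(f.size):
        step = np.zeros_like(f)
        step[j] = eps
        g[j] = (loss(f + step) - loss(f - step)) / (2 * eps)
    return g
```

The same check applies to $\partial\mathcal{L}/\partial f^+$ and $\partial\mathcal{L}/\partial f_i$ by perturbing those embeddings instead of the anchor.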

What, then, are $\frac{\partial \mathcal{L}}{\partial x}$, $\frac{\partial \mathcal{L}}{\partial x^+}$, and $\frac{\partial \mathcal{L}}{\partial x_i}$?

References

[1] Sohn, Kihyuk. "Improved deep metric learning with multi-class n-pair loss objective." Advances in neural information processing systems 29 (2016): 1857-1865.

[2] Weinberger, Kilian Q., John Blitzer, and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." Advances in neural information processing systems. 2006.