I've been reviewing some basic optimal transport concepts (ref Peyré and Cuturi). I like the notation of the Kantorovich optimal transport problem.
$$ L_C(a,b) = \min_{ P \in U(a,b) } \langle P , C \rangle, \qquad \langle P , C \rangle = \sum_{i,j} C_{i,j}P_{i,j} $$
The optimal transport cost between discrete probability measures $a$ and $b$ is the minimal total cost of transferring mass from $a$ to $b$, attained by an optimal coupling $P \in U(a,b)$. The cost matrix $C$ is defined by some distance between the support points, such as the Euclidean distance.
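To make the definition concrete, here is a minimal numerical sketch that solves the Kantorovich problem as a linear program with SciPy. The function name, the example supports, and the squared-distance cost are my own choices, not from the text; the marginal constraints encode $P \in U(a,b)$ directly.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_cost(a, b, C):
    """Solve min_{P in U(a,b)} <P, C> as a linear program.

    The variable is P flattened row-major; the equality constraints
    encode the marginals: row sums of P equal a, column sums equal b.
    """
    n, m = len(a), len(b)
    A_rows = np.kron(np.eye(n), np.ones(m))  # sum_j P[i, j] = a[i]
    A_cols = np.kron(np.ones(n), np.eye(m))  # sum_i P[i, j] = b[j]
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

# Supports [1, 2, 3] with squared-distance cost (an example choice).
x = np.array([1.0, 2.0, 3.0])
C = (x[:, None] - x[None, :]) ** 2
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.2, 0.6])
cost, P = kantorovich_cost(a, b, C)
```

The returned `P` satisfies both marginal constraints, and `cost` is the value $L_C(a,b)$.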
I'm working on a problem which I believe is a very simple case of optimal transport in this setting, and I wanted feedback on whether my logic is off somewhere.
Assume $a$ and $b$ are discrete probability distributions over the identical support $[1,2,3]$, and we wish to compute the optimal coupling between them. I believe there are 3 simple (trivial) cases where we can compute the optimal coupling for an arbitrary $a$: namely, when $b$ is either $[1,0,0]$, $[0,1,0]$, or $[0,0,1]$.
For example, let $b = [1,0,0]$ and $a = [a_1,a_2,a_3]$. Then I believe there is only one feasible coupling, so it must be optimal. If the couplings $P$ are indexed so that $P_{i,j}$ corresponds to $a_i$ (rows) and $b_j$ (columns), then the column-sum constraint $\sum_i P_{i,j} = b_j$ together with $P_{i,j} \ge 0$ forces columns 2 and 3 to be zero, and the row-sum constraint $\sum_j P_{i,j} = a_i$ then forces $P_{i,1} = a_i$. The only feasible coupling is therefore the one that sends all of the mass into the first column, since the target distribution puts all its weight on a single support point:
$$\begin{bmatrix} a_1 & 0 & 0\\ a_2 & 0 & 0\\ a_3 & 0 & 0 \end{bmatrix}$$
Analogous arguments apply for $b = [0,1,0]$ and $b = [0,0,1]$.
The point I am trying to make is that if $b$ belongs to one of these trivial cases, then we know what the optimal coupling has to be, because it is the only feasible coupling. The same idea should work in arbitrary dimension, not just 3. Is this correct?
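As a self-contained sanity check of this argument (my own sketch, with arbitrary example values): when $b = [1,0,0]$, the cost of the unique feasible coupling is simply $\sum_i a_i C_{i,1}$, and solving the full linear program should recover exactly that coupling and cost.

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([1.0, 2.0, 3.0])
C = np.abs(x[:, None] - x[None, :])  # Euclidean distance on the line
a = np.array([0.1, 0.6, 0.3])        # arbitrary source distribution
b = np.array([1.0, 0.0, 0.0])        # Dirac target at the first point

# Solve min <P, C> subject to the marginal constraints, P >= 0.
n = 3
A_eq = np.vstack([np.kron(np.eye(n), np.ones(n)),   # row sums = a
                  np.kron(np.ones(n), np.eye(n))])  # column sums = b
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
P = res.x.reshape(n, n)

# Closed-form cost of the unique coupling a * e_1^T.
closed_form = a @ C[:, 0]
```

The LP solution matches `np.outer(a, b)`, i.e. the matrix written above, and its cost matches the closed form, so no optimization is actually needed in this case.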
It happens to be the case in machine learning that the "target" measure in a classification problem is often a one-hot label, i.e. all of the weight sits on a single support point, exactly as in this example. I believe this makes the solution to the Kantorovich problem quick and easy to apply as a loss function in classification settings, but that is for the next chapter.
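A sketch of that idea (my own illustration, not from the text): with a one-hot target $e_y$, no linear program is needed at all, since the unique coupling gives $L_C(p, e_y) = \sum_i p_i \, C_{i,y}$.

```python
import numpy as np

def ot_loss_one_hot(p, C, label):
    """Kantorovich cost between prediction p and the one-hot target at `label`."""
    return p @ C[:, label]

# Example: 3 classes on supports [1, 2, 3] with absolute-difference cost.
x = np.array([1.0, 2.0, 3.0])
C = np.abs(x[:, None] - x[None, :])
p = np.array([0.7, 0.2, 0.1])    # model's predicted class probabilities
loss = ot_loss_one_hot(p, C, 0)  # 0.7*0 + 0.2*1 + 0.1*2 = 0.4
```

This reduces the loss to a single dot product with one column of $C$, which is what makes it cheap enough for training.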