The Wasserstein-1 distance can be viewed as the minimum amount of work needed to move one distribution onto another, as if the distributions were piles of earth. The usual definition illustrates this nicely:
$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma} |x - y|$$
where $\Gamma(\mu, \nu)$ is the set of joint distributions with marginals $\mu$ and $\nu$. I read this as follows: a transport plan $\gamma$ says how much mass goes from $x$ to $y$. Weighting it by the distance $|x - y|$ and integrating over $x$ gives the work needed to deliver the mass that ends up at $y$, and integrating over $y$ then gives the total work. The infimum picks the cheapest such plan.
However, the dual representation, which is often used in practice, is
$$W_1(\mu, \nu) = \sup_{\|f\|_L \leq 1} \left( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \right)$$
which I find less intuitive. I understand that the Lipschitz constraint stops $f$ from separating the distributions arbitrarily, but it is unclear to me how this formula can be viewed as optimally moving one distribution onto another.
I'm less interested in the formal proof involving linear programming than in the intuitive reason why these two formulas, which look so different, give the same result.
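
To make the question concrete, here is a minimal numerical sketch of the equality I am asking about, restricted to two discrete distributions on a shared, sorted 1D support. The function names `monotone_plan_cost` and `potential_value` are hypothetical, made up for this illustration. The first evaluates the first formula for one particular plan (fill targets greedily from left to right), which gives an upper bound on $W_1$. The second evaluates the second formula for one particular 1-Lipschitz $f$ (slope $\pm 1$ between neighbouring support points, with the sign chosen by comparing the two CDFs), which gives a lower bound. When the two numbers agree, both choices must be optimal.

```python
def monotone_plan_cost(xs, mu, nu):
    """Cost of the greedy left-to-right transport plan.

    xs: sorted support points; mu, nu: probabilities on xs.
    Any valid plan gives an upper bound on W_1.
    """
    mu, nu = list(mu), list(nu)  # copy, since we consume mass below
    i = j = 0
    cost = 0.0
    eps = 1e-12  # tolerance for "this pile is used up"
    while i < len(xs) and j < len(xs):
        moved = min(mu[i], nu[j])          # mass sent from xs[i] to xs[j]
        cost += moved * abs(xs[i] - xs[j])  # work = mass * distance
        mu[i] -= moved
        nu[j] -= moved
        if mu[i] <= eps:
            i += 1
        if nu[j] <= eps:
            j += 1
    return cost


def potential_value(xs, mu, nu):
    """Value of E_mu[f] - E_nu[f] for a specific 1-Lipschitz f.

    f is piecewise linear with slope +1 where the CDF of nu exceeds
    the CDF of mu and slope -1 otherwise. Any 1-Lipschitz f gives a
    lower bound on W_1.
    """
    f = [0.0]
    cdf_mu = cdf_nu = 0.0
    for k in range(len(xs) - 1):
        cdf_mu += mu[k]
        cdf_nu += nu[k]
        slope = 1.0 if cdf_nu > cdf_mu else -1.0
        f.append(f[-1] + slope * (xs[k + 1] - xs[k]))
    value = sum(fk * (m - n) for fk, m, n in zip(f, mu, nu))
    return value, f


xs = [0.0, 1.0, 2.0, 3.0]
mu = [0.5, 0.5, 0.0, 0.0]  # all mass on the left
nu = [0.0, 0.0, 0.5, 0.5]  # all mass on the right

primal = monotone_plan_cost(xs, mu, nu)
dual, f = potential_value(xs, mu, nu)
print(primal, dual, f)
```

In this example both values coincide, and the chosen $f$ is simply $f(x) = -x$: it is highest where $\mu$ has surplus mass and lowest where $\nu$ needs it. What I am missing is the intuition for why such an $f$ always "knows" the cost of the best plan.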