On page 3 of *Generative Adversarial Imitation Learning*, Ho & Ermon derive the following Lagrangian:
$$\arg\max_{c \in \mathbb R^{\mathcal S\times\mathcal A}}\min_{\rho \in \mathcal D}\bar{L}(\rho,c) = -\bar {H}(\rho)-\psi(c)+\sum_{s,a}(\rho(s,a)-\rho_{E}(s,a))c(s,a)$$
where
- $s$ denotes a state
- $a$ denotes an action
- $c$ is the cost function
- $\rho$ and $\rho_E$ are the state-action visitation distributions associated with the policies $\pi_\rho$ and $\pi_E$, respectively
- $\psi$ is a regularization function (which we assume is constant and omit in the following discussion)
- $\bar H$ is the (causal) entropy
Then, on page 4, they derive:
I'm confused by the explanation in the second paragraph, which explains why $\rho(s,a) = \rho_E(s,a)$. Why can't we derive this directly from the constraint? After all, the dual attains its optimum when the constraint is satisfied, and the convexity of $-\bar H$ ensures that $\rho(s,a) = \rho_E(s,a)$ is optimal for the primal.
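To spell out the reasoning I have in mind (with $\psi$ treated as constant): the supremum over $c$ already enforces the constraint, since

$$\sup_{c}\sum_{s,a}\big(\rho(s,a)-\rho_E(s,a)\big)\,c(s,a)=\begin{cases}0 & \text{if } \rho = \rho_E\\ +\infty & \text{otherwise,}\end{cases}$$

so the saddle-point problem appears to reduce to the primal

$$\min_{\rho \in \mathcal D} -\bar H(\rho)\quad\text{s.t.}\quad \rho(s,a)=\rho_E(s,a)\ \ \forall (s,a),$$

whose feasible set is the single point $\rho_E$. If this reduction is correct, $\rho = \rho_E$ seems to follow immediately, which is why I don't see what their second-paragraph argument adds.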
