I'm working through the paper "Off-Policy Proximal Policy Optimization" (easily found online) and am trying to understand the derivation of the optimal policy probabilities for the off-policy case, given as Equation 16. The derivation uses the Karush-Kuhn-Tucker (KKT) conditions, and I'm struggling to follow the specific steps that lead to the final formula.
Given the following constrained optimization problem for the case $ A_t \leq 0 $:
$$ \min_{\pi} \sum_a \pi(a|s_t) \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} \quad \text{s.t.} \quad \pi(a|s_t) \leq \mu(a|s_t)\big|_{s_t,a_t}, \quad \sum_a \pi(a|s_t) = 1, \quad \pi(a|s_t) > 0 $$
The KKT conditions that I'm considering are:
Stationarity: $$ \frac{\partial \mathcal{L}}{\partial \pi(a|s_t)} = 0 $$
Primal feasibility: $$ \pi(a|s_t) \leq \mu(a|s_t)\big|_{s_t,a_t} $$
Dual feasibility: $$ \lambda_a \geq 0 $$
Complementary slackness: $$ \lambda_a \cdot (\pi(a|s_t) - \mu(a|s_t)\big|_{s_t,a_t}) = 0 $$
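For concreteness, the Lagrangian I am forming (treating the strict positivity constraints as inactive at the optimum) and the stationarity condition it gives me are:

$$ \mathcal{L}(\pi, \lambda, \nu) = \sum_a \pi(a|s_t) \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} + \sum_a \lambda_a \left( \pi(a|s_t) - \mu(a|s_t)\big|_{s_t,a_t} \right) + \nu \left( \sum_a \pi(a|s_t) - 1 \right) $$

$$ \frac{\partial \mathcal{L}}{\partial \pi(a|s_t)} = \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} + 1 + \lambda_a + \nu = 0 \quad \Longrightarrow \quad \pi(a|s_t) = \pi_{\text{old}}(a|s_t)\, e^{-1 - \nu - \lambda_a} $$

My understanding is that complementary slackness should force $\lambda_a = 0$ for the actions whose clip constraint is inactive, leaving those probabilities proportional to $\pi_{\text{old}}(a|s_t)$, but I get stuck matching the normalization constant to the paper's expression.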
However, when applying these conditions, I'm not reaching the expected formula given in the paper:
$$ \pi^{\text{Off-Policy PPO}}_{\text{new}}(a|s_t) = \begin{cases} \frac{\pi_{\text{old}}(a|s_t)(1 - \mu(a_t|s_t))}{1 - \pi_{\text{old}}(a_t|s_t)}, & \text{if } a \neq a_t \\ \mu(a_t|s_t), & \text{if } a = a_t \end{cases} $$
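As a sanity check (all numbers below are made up, not from the paper), I verified numerically that this closed form is feasible and attains the minimum KL divergence against randomly sampled feasible distributions, so the target formula itself appears correct; it is my derivation that is off:

```python
import math
import random

random.seed(0)

n = 4                            # hypothetical action-space size
a_t = 0                          # index of the sampled action
pi_old = [0.5, 0.2, 0.2, 0.1]    # made-up old policy pi_old(.|s_t)
mu_at = 0.3                      # made-up clip target mu(a_t|s_t) < pi_old(a_t|s_t)

def kl(p, q):
    """KL divergence sum_a p(a) log(p(a)/q(a))."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))

# Closed-form candidate from the paper's Eq. 16 (A_t <= 0 case):
# pi(a_t) = mu(a_t|s_t); the other actions keep their pi_old proportions.
pi_new = [mu_at if a == a_t
          else pi_old[a] * (1 - mu_at) / (1 - pi_old[a_t])
          for a in range(n)]

assert abs(sum(pi_new) - 1.0) < 1e-12   # primal feasibility: simplex
assert pi_new[a_t] <= mu_at + 1e-12     # primal feasibility: clip constraint

best = kl(pi_new, pi_old)

# Random feasible competitors: p(a_t) <= mu_at, rest positive, sums to 1.
for _ in range(2000):
    p_at = random.uniform(1e-6, mu_at)
    rest = [random.uniform(1e-6, 1.0) for _ in range(n - 1)]
    scale = (1 - p_at) / sum(rest)
    p = [p_at] + [r * scale for r in rest]
    assert kl(p, pi_old) >= best - 1e-9  # closed form is never beaten
```

(Since the KL objective is convex and the feasible set is convex, the KKT point should be the unique global minimizer, which is what the random search is consistent with.)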
Can someone guide me through the detailed steps of using the KKT conditions to arrive at this result?