I'm working through the paper "Off-Policy Proximal Policy Optimization" (easily found online) and am trying to understand the derivation of the optimal policy probabilities for the off-policy case, given as Equation 16. The derivation uses the Karush-Kuhn-Tucker (KKT) conditions, and I'm struggling to follow the specific steps that lead to the final formula.
Given the following constrained optimization problem for the case $ A_t \leq 0 $:
$$ \min_{\pi} \sum_a \pi(a|s_t) \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} \quad \text{s.t.} \quad \pi(a|s_t) \leq \mu(a|s_t)\big|_{s_t,a_t}, \quad \sum_a \pi(a|s_t) = 1, \quad \pi(a|s_t) > 0 $$
The KKT conditions that I'm considering are:
Stationarity: $$ \frac{\partial \mathcal{L}}{\partial \pi(a|s_t)} = 0 $$
Primal feasibility: $$ \pi(a|s_t) \leq \mu(a|s_t)\big|_{s_t,a_t} $$
Dual feasibility: $$ \lambda_a \geq 0 $$
Complementary slackness: $$ \lambda_a \cdot (\pi(a|s_t) - \mu(a|s_t)\big|_{s_t,a_t}) = 0 $$
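For concreteness, the Lagrangian I am forming (treating the strict positivity constraints as inactive at the optimum) and the stationarity condition it gives me are:

$$ \mathcal{L}(\pi, \lambda, \nu) = \sum_a \pi(a|s_t) \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} + \sum_a \lambda_a \left( \pi(a|s_t) - \mu(a|s_t)\big|_{s_t,a_t} \right) + \nu \left( \sum_a \pi(a|s_t) - 1 \right) $$

$$ \frac{\partial \mathcal{L}}{\partial \pi(a|s_t)} = \log \frac{\pi(a|s_t)}{\pi_{\text{old}}(a|s_t)} + 1 + \lambda_a + \nu = 0 \quad \Longrightarrow \quad \pi(a|s_t) = \pi_{\text{old}}(a|s_t)\, e^{-1 - \nu - \lambda_a} $$

My understanding is that complementary slackness should force $\lambda_a = 0$ for the actions whose clip constraint is inactive, leaving those probabilities proportional to $\pi_{\text{old}}(a|s_t)$, but I get stuck matching the normalization constant to the paper's expression.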
However, when applying these conditions, I'm not reaching the expected formula given in the paper:
$$ \pi^{\text{Off-Policy PPO}}_{\text{new}}(a|s_t) = \begin{cases} \frac{\pi_{\text{old}}(a|s_t)(1 - \mu(a_t|s_t))}{1 - \pi_{\text{old}}(a_t|s_t)}, & \text{if } a \neq a_t \\ \mu(a_t|s_t), & \text{if } a = a_t \end{cases} $$
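As a sanity check (all numbers below are made up, not from the paper), I verified numerically that this closed form is feasible and attains the minimum KL divergence against randomly sampled feasible distributions, so the target formula itself appears correct; it is my derivation that is off:

```python
import math
import random

random.seed(0)

n = 4                            # hypothetical action-space size
a_t = 0                          # index of the sampled action
pi_old = [0.5, 0.2, 0.2, 0.1]    # made-up old policy pi_old(.|s_t)
mu_at = 0.3                      # made-up clip target mu(a_t|s_t) < pi_old(a_t|s_t)

def kl(p, q):
    """KL divergence sum_a p(a) log(p(a)/q(a))."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))

# Closed-form candidate from the paper's Eq. 16 (A_t <= 0 case):
# pi(a_t) = mu(a_t|s_t); the other actions keep their pi_old proportions.
pi_new = [mu_at if a == a_t
          else pi_old[a] * (1 - mu_at) / (1 - pi_old[a_t])
          for a in range(n)]

assert abs(sum(pi_new) - 1.0) < 1e-12   # primal feasibility: simplex
assert pi_new[a_t] <= mu_at + 1e-12     # primal feasibility: clip constraint

best = kl(pi_new, pi_old)

# Random feasible competitors: p(a_t) <= mu_at, rest positive, sums to 1.
for _ in range(2000):
    p_at = random.uniform(1e-6, mu_at)
    rest = [random.uniform(1e-6, 1.0) for _ in range(n - 1)]
    scale = (1 - p_at) / sum(rest)
    p = [p_at] + [r * scale for r in rest]
    assert kl(p, pi_old) >= best - 1e-9  # closed form is never beaten
```

(Since the KL objective is convex and the feasible set is convex, the KKT point should be the unique global minimizer, which is what the random search is consistent with.)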
Can someone guide me through the detailed steps of using the KKT conditions to arrive at this result?