Critic Loss in PPO

1k Views Asked by Bumbble Comm At 26 Mar 2026 - 9:38

TL;DR: How precisely is the critic loss in PPO defined?

I am trying to understand the PPO algorithm so that I can implement it, and I'm confused about the critic loss. According to the paper, the objective that we want to maximize contains the term $$ -c_1 (V_\theta(s_t) - V_t^{targ})^2 $$ which is the loss for the critic (with a minus sign in front, since the objective is maximized). At first I didn't know what $V_t^{targ}$ was supposed to be. After some research online, it looks like it is what we get from GAE for $\hat{A}_t$, i.e. our advantage estimate. $V_\theta(s_t)$, on the other hand, seems to be the output of our critic network, which estimates the value of the state. This would mean that we are trying to push the value of the state toward the advantage estimate, which doesn't make sense: the advantage of an action is defined as the action's value minus the value of the state, so the critic loss would contain "(value of the action - value of the state) - value of the state". The same can be seen in the definition of the (generalized) advantage estimate, where the value of the state has already been subtracted. So I suppose I'm misunderstanding something about the critic loss?


There is 1 best solution below

Bumbble Comm On 14 Jan 2022 - 5:38 BEST ANSWER

For anyone interested, I think I figured it out now. $V_\theta(s_t)$ is indeed the output of the critic network, i.e. our estimate of the state value under our policy. $V_t^{targ}$, on the other hand, is an estimate of the action value based on the trajectory, i.e. the discounted sum of future rewards: $\sum_{i=t}^T \gamma^{i-t}R_\theta(s_i)$. (In particular, it's not $\hat{A}_t$ from GAE, but in the case $\lambda=1$ we have $\hat{A}_t\approx V_t^{targ}-V_\theta(s_t)$; with $\lambda=0$, GAE instead reduces to the one-step TD error $R_\theta(s_t) + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)$.) Since we choose actions according to our policy, the state value under the policy equals, in expectation, the value of the action chosen by that policy in the trajectory. So it does make sense to use the MSE loss between $V_\theta(s_t)$ and $V_t^{targ}$.
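To make the relationship concrete, here is a minimal NumPy sketch, not a reference implementation. It assumes a single finished episode, so the value after the last step is taken to be 0. The function names (`discounted_returns`, `gae`, `critic_loss`) and the toy numbers are made up for illustration. It computes $V_t^{targ}$ as the discounted reward-to-go, takes the MSE against the critic outputs, and checks that GAE with $\lambda=1$ gives exactly $V_t^{targ}-V_\theta(s_t)$ under these assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Reward-to-go G_t = sum_{i>=t} gamma^(i-t) * r_i for one finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimate for one finished episode.

    Assumption: the episode terminated, so the value after the last step is 0.
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def critic_loss(values, targets):
    """Mean squared error between critic outputs and value targets."""
    return np.mean((values - targets) ** 2)

# Toy data (hypothetical): rewards from a rollout and the critic's predictions.
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, 1.5])   # stands in for V_theta(s_t)
gamma = 0.9

v_targ = discounted_returns(rewards, gamma)   # stands in for V_t^targ
loss = critic_loss(values, v_targ)
adv_lam1 = gae(rewards, values, gamma, lam=1.0)
adv_lam0 = gae(rewards, values, gamma, lam=0.0)
```

Here the critic is regressed toward `v_targ`, never toward the advantage. The advantage is what remains after subtracting the critic's prediction from the target: with `lam=1.0` it equals `v_targ - values`, while `lam=0.0` gives the one-step TD errors. Many implementations go the other way and build the target as `advantages + values`, which is the same quantity when $\lambda=1$.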
