How is the optimization step of the PPO algorithm computed?


I'm building the Proximal Policy Optimization (PPO) algorithm from scratch (well, using PyTorch). I've been studying it on my own, but I'm a bit confused about the optimization phase. Here is my understanding.

From what I know...

  1. First, we initialize a Policy Network with random parameters.
  2. Second, we start the policy rollouts. At each time step (t) of the episode, we compute the value function (a function approximator) in order to get the advantage estimate A(s,a); we also compute the clipped surrogate objective term for that time step (t).
  3. At the end of the episode, we sum all the clipped surrogate terms. This gives us an estimate of the expected value over the episode. In the PPO paper, the equation contains an expectation over all time steps (t).
  4. Once we have the expected clipped surrogate objective (the sum of all the per-step clipped surrogate values) at the end of the episode, we run Stochastic Gradient Descent (SGD). SGD needs a loss function, and we have our expected clipped surrogate objective, so we just use -(expected clipped surrogate objective) as the loss. Minimizing this is the same as performing stochastic gradient *ascent* on the objective, i.e., maximizing the expected cumulative reward.

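For reference, here is how I currently compute the clipped surrogate term over a batch of time steps. The numbers and names (`old_log_probs`, `new_log_probs`, etc.) are placeholders I made up, not real rollout data:

```python
import torch

# Placeholder rollout data: log-probs under the old (fixed) policy,
# log-probs under the current policy, and advantage estimates A(s, a)
old_log_probs = torch.tensor([-1.2, -0.8, -1.5])
new_log_probs = torch.tensor([-1.0, -0.9, -1.4], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 1.2])

eps = 0.2  # clip range epsilon from the PPO paper

# Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s)
ratio = torch.exp(new_log_probs - old_log_probs)

# Clipped surrogate objective: elementwise min of the unclipped and
# clipped terms, then the mean over the collected time steps
# (the empirical mean is the sample estimate of the expectation E_t[...])
surrogate = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
objective = surrogate.mean()

# SGD minimizes, so the loss is the negated objective
loss = -objective
loss.backward()  # gradients flow back into the policy parameters
```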
Now my confusion comes in. I thought that the clipped surrogate objective was computed at each individual time step (t) and then, at the end of the episode, all the terms were summed so the result could be optimized (with SGD). The thing is that some authors say the optimization step (SGD) is performed at each time step (t) of the episode instead of at the end of it. But why? Doesn't the clipped objective equation in the paper contain an expectation symbol? If the computation is done at each time step (t), then the expectation symbol in the equation is redundant, isn't it?
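Incidentally, from my reading of the paper, the update seems to happen neither once per time step nor once per episode: a rollout of T steps is collected, and then several epochs of minibatch SGD are run over that fixed batch, with the minibatch mean standing in for the expectation E_t[...]. A sketch of what I mean (random placeholder data; `policy_param` is just a stand-in for the network's outputs, not a real policy):

```python
import torch

# Placeholder rollout buffer of T = 8 time steps
torch.manual_seed(0)
T = 8
advantages = torch.randn(T)
old_log_probs = torch.randn(T)

# Stand-in for the current policy's log-probs (a real implementation
# would recompute these from the policy network each epoch)
policy_param = torch.zeros(T, requires_grad=True)
optimizer = torch.optim.SGD([policy_param], lr=0.01)

# Several epochs of minibatch gradient steps over the SAME rollout;
# each step averages the clipped terms inside the minibatch
for epoch in range(4):
    for start in range(0, T, 4):          # minibatches of size 4
        idx = slice(start, start + 4)
        ratio = torch.exp(policy_param[idx] - old_log_probs[idx])
        clipped = torch.clamp(ratio, 0.8, 1.2)
        surrogate = torch.min(ratio * advantages[idx],
                              clipped * advantages[idx]).mean()
        loss = -surrogate                 # negate to maximize via SGD
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```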

Also, they say that in order to run SGD we need a loss. I thought that by negating the clipped surrogate objective we could take its gradient and minimize it with SGD, which would be the same as maximizing the objective. But the paper presents a different equation as the loss function used in the optimization phase. How does that work?
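If I understand the paper correctly, that other equation is a combined objective that adds a value-function error term and an entropy bonus to the clipped term. My tentative sketch (the coefficients `c1`, `c2` and all the numbers are placeholder values I picked, not from a real run):

```python
import torch

# Placeholder quantities, already averaged over the batch
clip_term = torch.tensor(0.55)        # L^CLIP
value_pred = torch.tensor(1.0, requires_grad=True)
value_target = torch.tensor(1.5)      # e.g. the empirical return
entropy = torch.tensor(0.9)           # policy entropy bonus S[pi]

c1, c2 = 0.5, 0.01                    # weighting coefficients

# Combined objective L^{CLIP+VF+S} = L^CLIP - c1 * L^VF + c2 * S[pi]
vf_loss = (value_pred - value_target) ** 2
objective = clip_term - c1 * vf_loss + c2 * entropy

# As before, negate so that minimizing the loss maximizes the objective
loss = -objective
```

This would explain why the loss in the paper is not just the negated clipped term: when the policy and value networks share parameters, the value error and the entropy bonus have to be folded into the same scalar before the gradient step.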

So my questions are...

  1. When and how is the clipped surrogate objective computed (at each time step (t) or at the end of the episode)? Is my implementation of the computation correct?
  2. When and how is the optimization step performed (at each time step (t) or at the end of the episode)?
  3. Are my thoughts about the loss function correct? If not, what does the paper mean by the loss function it shows?

Thank you in advance:)