I have been trying to learn about policy gradients in reinforcement learning, and I have run into this equation:
$$ \nabla_\theta J(\theta) = E\left[\nabla_\theta \log\pi_\theta (s,a)Q^{\pi_\theta}(s,a)\right]$$
I understand that this is the gradient and we use it to update the weights of our policy; I'm just not sure how to go about computing it. The part I'm getting tripped up on is $\nabla_\theta \log\pi_\theta (s,a)$.
I believe it's the gradient vector, but how would you go about calculating it so you can update the weights of your policy?
In general, you have a model that outputs probabilities for a stochastic policy: $$ p(a|s) = \pi_\theta(s,a) $$ so that you can sample an action $a\sim\pi_\theta$. In other words, $\pi_\theta$ is often a standard deep neural network. Thus, to get $\nabla_\theta\log\pi_\theta$, you need to differentiate through the output probabilities of the network wrt the parameters $\theta$.
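For a concrete case where this gradient can be written in closed form, consider a linear softmax policy, $\pi_\theta(s,a) = \mathrm{softmax}(\theta^\top\phi(s))_a$. For softmax, $\nabla_\theta \log\pi_\theta(s,a)$ has the well-known closed form $\phi(s)\,(\mathbf{1}_a - \pi_\theta(\cdot\,|s))^\top$. A minimal sketch checking this against finite differences (the features `phi_s`, dimensions, and numbers below are all hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Linear softmax policy: pi_theta(a|s) = softmax(theta^T phi(s))_a
# (hypothetical example; phi_s stands in for a feature vector of state s)
n_features, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_features, n_actions))
phi_s = rng.normal(size=n_features)

probs = softmax(phi_s @ theta)  # action probabilities pi_theta(.|s)
a = 1                           # a sampled action

# grad of log pi_theta(a|s) wrt theta, derived analytically for softmax:
# d log pi(a|s) / d theta[:, b] = phi(s) * (1{a==b} - pi(b|s))
grad_log_pi = np.outer(phi_s, np.eye(n_actions)[a] - probs)

# sanity check against a finite-difference approximation
eps = 1e-6
fd = np.zeros_like(theta)
for i in range(n_features):
    for j in range(n_actions):
        t = theta.copy()
        t[i, j] += eps
        fd[i, j] = (np.log(softmax(phi_s @ t)[a]) - np.log(probs[a])) / eps
print(np.allclose(grad_log_pi, fd, atol=1e-4))  # True
```

For a deep network the same quantity is obtained by backpropagating through $\log$ of the selected output probability, which is exactly what an autodiff library does for you.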
In practice, one uses an automatic differentiation library. In pseudocode, we do:
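A minimal self-contained sketch in plain NumPy, on a toy 3-armed bandit with a softmax policy (the environment, rewards, and hyperparameters here are hypothetical illustrations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# REINFORCE on a toy 3-armed bandit, softmax policy pi_theta(a) = softmax(theta)_a.
rng = np.random.default_rng(0)
theta = np.zeros(3)
true_rewards = np.array([0.2, 0.5, 0.9])  # arm 2 has the highest mean reward
alpha = 0.1                               # learning rate

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                 # sample an action from pi_theta
    r = true_rewards[a] + 0.1 * rng.normal()   # noisy reward from the environment
    # grad of log pi_theta(a): one-hot(a) - probs (softmax identity)
    grad_log_pi = np.eye(3)[a] - probs
    # gradient ASCENT step on J(theta), weighting the score by the return
    theta += alpha * grad_log_pi * r

print(np.argmax(theta))  # the policy should come to favor the best arm
```

With an autodiff library you would instead define the scalar `-log_prob(a) * r` and call its backward pass; the update is the same.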
In other words, using REINFORCE is essentially the same as training the network with the surrogate loss $$ \mathcal{L}(\theta) = -\sum_t \log\pi_\theta(s_t,a_t)\, R_t $$ over a sampled trajectory $\mathcal{T}$, where $R_t$ is the (possibly discounted) cumulative reward from time $t$ onward. Minimizing this loss by gradient descent performs stochastic gradient ascent on $J(\theta)$.
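To see the loss-function view concretely: for a single sampled $(s,a)$ with return $R$, the per-sample surrogate loss is $-\log\pi_\theta(s,a)\,R$, and its gradient is $-\nabla_\theta\log\pi_\theta(s,a)\,R$, i.e. the negated policy-gradient term. A quick numerical check (the softmax logits and numbers are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Surrogate loss for one sampled (s, a, R): L(theta) = -log pi_theta(a|s) * R
theta = np.array([0.1, -0.3, 0.7])  # logits of a softmax policy (made up)
a, R = 1, 2.5                       # sampled action and its return

def loss(t):
    return -np.log(softmax(t)[a]) * R

# analytic gradient: -(one-hot(a) - probs) * R, from the softmax identity
analytic = -(np.eye(3)[a] - softmax(theta)) * R

# finite-difference check, one coordinate at a time
eps = 1e-6
numeric = np.array([(loss(theta + eps * np.eye(3)[i]) - loss(theta)) / eps
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```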
For more detail, look up likelihood-ratio gradient methods and the REINFORCE rule.
Hopefully that helps.