I have been trying to learn about policy gradients in reinforcement learning, and I have run into this equation:
$$ \nabla_\theta J(\theta) = E\left[\nabla_\theta \log\pi_\theta (s,a)Q^{\pi_\theta}(s,a)\right]$$
I understand that this is the gradient and we use it to update the weights of our policy; I'm just not sure how to go about computing it. The part I'm getting tripped up on is $\nabla_\theta \log\pi_\theta (s,a)$.
I believe it's the gradient vector, but how would you go about calculating it so you can update the weights of your policy?
In general, you have a model that outputs probabilities for a stochastic policy: $$ p(a|s) = \pi_\theta(s,a) $$ so that you can sample an action $a\sim\pi_\theta$. In other words, $\pi_\theta$ is often a standard deep neural network. Thus, to get $\nabla_\theta\log\pi_\theta$, you need to differentiate through the output probabilities of the network wrt the parameters $\theta$.
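For a concrete case where this gradient can be written in closed form, consider a linear softmax policy, $\pi_\theta(s,a) = \mathrm{softmax}(\theta^\top\phi(s))_a$. For softmax, $\nabla_\theta \log\pi_\theta(s,a)$ has the well-known closed form $\phi(s)\,(\mathbf{1}_a - \pi_\theta(\cdot\,|s))^\top$. A minimal sketch checking this against finite differences (the features `phi_s`, dimensions, and numbers below are all hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Linear softmax policy: pi_theta(a|s) = softmax(theta^T phi(s))_a
# (hypothetical example; phi_s stands in for a feature vector of state s)
n_features, n_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_features, n_actions))
phi_s = rng.normal(size=n_features)

probs = softmax(phi_s @ theta)  # action probabilities pi_theta(.|s)
a = 1                           # a sampled action

# grad of log pi_theta(a|s) wrt theta, derived analytically for softmax:
# d log pi(a|s) / d theta[:, b] = phi(s) * (1{a==b} - pi(b|s))
grad_log_pi = np.outer(phi_s, np.eye(n_actions)[a] - probs)

# sanity check against a finite-difference approximation
eps = 1e-6
fd = np.zeros_like(theta)
for i in range(n_features):
    for j in range(n_actions):
        t = theta.copy()
        t[i, j] += eps
        fd[i, j] = (np.log(softmax(phi_s @ t)[a]) - np.log(probs[a])) / eps
print(np.allclose(grad_log_pi, fd, atol=1e-4))  # True
```

For a deep network the same quantity is obtained by backpropagating through $\log$ of the selected output probability, which is exactly what an autodiff library does for you.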
In practice, one uses an automatic differentiation library. In pseudocode, we do:
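A minimal self-contained sketch in plain NumPy, on a toy 3-armed bandit with a softmax policy (the environment, rewards, and hyperparameters here are hypothetical illustrations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# REINFORCE on a toy 3-armed bandit, softmax policy pi_theta(a) = softmax(theta)_a.
rng = np.random.default_rng(0)
theta = np.zeros(3)
true_rewards = np.array([0.2, 0.5, 0.9])  # arm 2 has the highest mean reward
alpha = 0.1                               # learning rate

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                 # sample an action from pi_theta
    r = true_rewards[a] + 0.1 * rng.normal()   # noisy reward from the environment
    # grad of log pi_theta(a): one-hot(a) - probs (softmax identity)
    grad_log_pi = np.eye(3)[a] - probs
    # gradient ASCENT step on J(theta), weighting the score by the return
    theta += alpha * grad_log_pi * r

print(np.argmax(theta))  # the policy should come to favor the best arm
```

With an autodiff library you would instead define the scalar `-log_prob(a) * r` and call its backward pass; the update is the same.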
In other words, using REINFORCE is essentially the same as training the network with the surrogate loss $$ \mathcal{L}(\theta) = -\sum_t \log\pi_\theta(s_t,a_t)\, R_t $$ over a sampled trajectory $\mathcal{T}$, where $R_t$ is the (possibly discounted) cumulative reward from time $t$ onward. Minimizing this loss by gradient descent performs stochastic gradient ascent on $J(\theta)$.
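To see the loss-function view concretely: for a single sampled $(s,a)$ with return $R$, the per-sample surrogate loss is $-\log\pi_\theta(s,a)\,R$, and its gradient is $-\nabla_\theta\log\pi_\theta(s,a)\,R$, i.e. the negated policy-gradient term. A quick numerical check (the softmax logits and numbers are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Surrogate loss for one sampled (s, a, R): L(theta) = -log pi_theta(a|s) * R
theta = np.array([0.1, -0.3, 0.7])  # logits of a softmax policy (made up)
a, R = 1, 2.5                       # sampled action and its return

def loss(t):
    return -np.log(softmax(t)[a]) * R

# analytic gradient: -(one-hot(a) - probs) * R, from the softmax identity
analytic = -(np.eye(3)[a] - softmax(theta)) * R

# finite-difference check, one coordinate at a time
eps = 1e-6
numeric = np.array([(loss(theta + eps * np.eye(3)[i]) - loss(theta)) / eps
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```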
For more detail, look up likelihood-ratio gradient methods and the REINFORCE rule.
Hopefully that helps.