I have a question about the policy gradient update in the Deep Deterministic Policy Gradient (DDPG) algorithm. I am implementing DDPG in Java using the DeepLearning4J library.
In the algorithm the following update is used:
$$\nabla_{\theta^{\mu}}J \approx \frac{1}{N}\sum_i\nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})|_{s_i}$$
Can this be rewritten to:
$$\nabla_{\theta^{\mu}}J \approx \frac{1}{N}\nabla_{\theta^{\mu}}\left(\sum_i\nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}\mu(s|\theta^{\mu})|_{s_i}\right)$$
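My reasoning, written out for a single sample: if I define $g_i := \nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}$ and treat it as a constant with respect to $\theta^{\mu}$ (since $Q$'s parameters $\theta^Q$ are held fixed while updating the actor), then the chain rule gives

$$\nabla_{\theta^{\mu}}\left(g_i^{\top}\,\mu(s_i|\theta^{\mu})\right) = g_i^{\top}\,\nabla_{\theta^{\mu}}\mu(s_i|\theta^{\mu}) = \nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})|_{s_i},$$

which is exactly the summand in the first formula. (I am assuming here that moving $\nabla_{\theta^{\mu}}$ inside the sum is only valid because $g_i$ is held constant; please correct me if that assumption is wrong.)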
I then want to implement this in the following way: use $\nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}$ as the error term for the backpropagation algorithm. $\mu$ does not have a loss function, so the calculation of $\delta$ for the last layer is just:
$$\delta^{(n)} = f'(z^{(n)}),$$ with $n$ the final layer.
The backpropGradient function in the DeepLearning4J library takes $\epsilon$ as input, which is multiplied element-wise with $f'(z^{(n)})$. So if $\epsilon$ is replaced with $\nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}$, it should give me the correct gradient, right? That is:
$$\delta^{(n)} = \nabla_aQ(s, a | \theta^Q)|_{s=s_i,a=\mu(s_i)}f'(z^{(n)}),$$ with $n$ the final layer.
I would then use this $\delta^{(n)}$ to compute the rest of the backpropagation algorithm as usual.
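To make sure I have the last-layer delta right, here is a minimal plain-Java sketch of the computation I have in mind (this is not DL4J code; the tanh output activation, the single-sample setup, and all names are assumptions of mine):

```java
import java.util.Arrays;

// Sketch of the proposed last-layer delta for the actor network mu:
// delta^(n) = (dQ/da) (element-wise) f'(z^(n)),
// where dQda plays the role of epsilon passed to backpropGradient.
public class ActorDeltaSketch {

    // Derivative of the (assumed) tanh activation: f'(z) = 1 - tanh(z)^2
    static double tanhPrime(double z) {
        double t = Math.tanh(z);
        return 1.0 - t * t;
    }

    // dQda  : the critic's action gradient, nabla_a Q(s, a) at a = mu(s)
    // z     : pre-activations z^(n) of the actor's final layer
    static double[] lastLayerDelta(double[] dQda, double[] z) {
        double[] delta = new double[dQda.length];
        for (int j = 0; j < dQda.length; j++) {
            delta[j] = dQda[j] * tanhPrime(z[j]); // element-wise product
        }
        return delta;
    }

    public static void main(String[] args) {
        double[] dQda = {0.5, -1.0}; // pretend critic gradient (made up)
        double[] z = {0.0, 1.0};     // pretend final-layer pre-activations
        System.out.println(Arrays.toString(lastLayerDelta(dQda, z)));
    }
}
```

The idea is that this per-component product is what replacing $\epsilon$ with the critic's action gradient would compute inside backpropGradient, before the deltas are propagated to earlier layers.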