Problem with gradient of actor in an Actor Critic algorithm using linear function approximation


I have a problem updating $\boldsymbol{\theta}$ (the weight vector for the actor in an actor-critic algorithm). I know that the gradient of the log-policy is
$$\nabla \ln \pi(a \mid s, \boldsymbol{\theta}) = \mathbf{x}(s,a) - \sum_b \pi(b \mid s, \boldsymbol{\theta})\,\mathbf{x}(s,b),$$
where the index $b$ ranges over the possible actions. The result of this difference is always $\mathbf{0}$, and I think I know why: $\mathbf{x}(s,b)$ is the same for all actions, so I can take it out of the sum, and since the $\pi(b \mid s, \boldsymbol{\theta})$ sum to 1, the result is $\mathbf{x}(s,a) - \mathbf{x}(s,a) = \mathbf{0}$. I'm using the stacked-features technique. I have searched through various sources (Coursera, Google, etc.) but cannot find a solution.

My question is: how do I encode $\mathbf{x}(s,a)$ so that each action gets a different representation and this problem is avoided? Thanks in advance.
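To make the question concrete, here is a minimal NumPy sketch of what I understand stacked features to mean: the state features are copied into one block per action (zeros elsewhere), so $\mathbf{x}(s,a)$ genuinely differs across actions and the gradient above no longer cancels. The sizes, function names, and the random test state are hypothetical, just for illustration:

```python
import numpy as np

# Hypothetical sizes for illustration
n_state_features = 4
n_actions = 3

def stacked_features(s_feats, a, n_actions):
    """One block per action: put the state features in the slot for
    action `a` and zeros everywhere else, so x(s, a) differs per action."""
    x = np.zeros(n_actions * len(s_feats))
    x[a * len(s_feats):(a + 1) * len(s_feats)] = s_feats
    return x

def softmax_policy(theta, s_feats, n_actions):
    """Softmax over linear action preferences theta . x(s, a)."""
    prefs = np.array([theta @ stacked_features(s_feats, a, n_actions)
                      for a in range(n_actions)])
    prefs -= prefs.max()          # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_ln_pi(theta, s_feats, a, n_actions):
    """Gradient of ln pi(a|s, theta): x(s,a) - sum_b pi(b|s) x(s,b)."""
    pi = softmax_policy(theta, s_feats, n_actions)
    expected = sum(pi[b] * stacked_features(s_feats, b, n_actions)
                   for b in range(n_actions))
    return stacked_features(s_feats, a, n_actions) - expected

rng = np.random.default_rng(0)
theta = rng.normal(size=n_state_features * n_actions)
s = rng.normal(size=n_state_features)
g = grad_ln_pi(theta, s, a=1, n_actions=n_actions)
print(np.allclose(g, 0))  # False: the gradient no longer vanishes
```

With this layout, the block for the chosen action receives $(1 - \pi(a \mid s))\,\mathbf{s}$ and every other block receives $-\pi(b \mid s)\,\mathbf{s}$, so the full gradient is only zero when the state features themselves are zero.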