Policy Gradient Methods - Proof


I'm trying to get a grasp of reinforcement learning, and among the references I came across *Policy Gradient Methods for Reinforcement Learning with Function Approximation* (Sutton, McAllester, Singh, and Mansour, 1999).

The paper is quite short, and the appendix is the part I'm most interested in. The theorem proved there is

Theorem 1 (Policy Gradient) For any MDP, in either the average reward or start-state formulations, $$ \frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) $$
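To convince myself the statement holds, I ran a quick numerical sanity check on a toy MDP in the average-reward formulation (the transition probabilities, rewards, and all function names below are my own made-up example, not from the paper): compute the theorem's right-hand side for a softmax policy and compare it against a finite-difference gradient of the average reward $\rho$.

```python
import numpy as np

# Toy 2-state, 2-action MDP (made-up numbers, not from the paper).
P = np.array([[[0.9, 0.1],   # P[s, a, s'] = P(s' | s, a)
               [0.2, 0.8]],
              [[0.7, 0.3],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],    # R[s, a] = expected immediate reward
              [0.0, 2.0]])
nS, nA = R.shape

def pi(theta):
    """Softmax policy: pi[s, a] = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    """Stationary distribution d with d = d P_pi and sum(d) = 1."""
    A = np.vstack([P_pi.T - np.eye(nS), np.ones(nS)])
    b = np.zeros(nS + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def rho_and_values(theta):
    """Average reward rho and differential V, Q for the policy pi_theta."""
    p = pi(theta)
    P_pi = np.einsum('sa,sat->st', p, P)   # state-to-state chain under pi
    r_pi = (p * R).sum(axis=1)
    d = stationary(P_pi)
    rho = d @ r_pi
    # Differential values: solve (I - P_pi + 1 d^T) V = r_pi - rho,
    # which enforces the Bellman equation V = r_pi - rho + P_pi V.
    V = np.linalg.solve(np.eye(nS) - P_pi + np.outer(np.ones(nS), d), r_pi - rho)
    Q = R - rho + np.einsum('sat,t->sa', P, V)
    return rho, d, V, Q, p

theta = np.array([[0.3, -0.2], [0.5, 0.1]])
rho, d, V, Q, p = rho_and_values(theta)

# RHS of the theorem: sum_s d(s) sum_a dpi(s,a)/dtheta * Q(s,a).
# For a softmax policy this reduces to d(s) pi(s,a) (Q(s,a) - V(s)).
grad_theorem = d[:, None] * p * (Q - V[:, None])

# LHS: central finite-difference gradient of rho w.r.t. each theta[s, a].
eps = 1e-6
grad_fd = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        tp = theta.copy(); tp[s, a] += eps
        tm = theta.copy(); tm[s, a] -= eps
        grad_fd[s, a] = (rho_and_values(tp)[0] - rho_and_values(tm)[0]) / (2 * eps)

print(np.allclose(grad_theorem, grad_fd, atol=1e-5))  # should agree if the theorem holds
```

The two gradients match to numerical precision on this example, which at least reassures me that I'm parsing the statement correctly.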

When I go through Appendix A.1 I can follow all the math (it's mostly the chain rule and substitutions); it's the very beginning that confuses me. The proof starts with

$$ \frac{\partial V^{\pi}(s)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_a \pi(s,a) Q^{\pi}(s,a) \;\;\; \forall s \in \mathcal{S} $$

The equality is said to hold "by definition", but the quantity $V^{\pi}(s)$ is never actually defined in the paper. Can anyone clarify how it is defined here?
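For context, the step right after that starting line is one I can follow: the product rule, and then (in the average-reward formulation, using the paper's $R^a_s$, $P^a_{ss'}$ notation) substituting the Bellman equation for $Q^{\pi}$. Writing it out as I read it:

$$ \frac{\partial V^{\pi}(s)}{\partial \theta} = \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) + \pi(s,a) \frac{\partial Q^{\pi}(s,a)}{\partial \theta} \right] $$

$$ = \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) + \pi(s,a) \frac{\partial}{\partial \theta} \Bigl( R^a_s - \rho(\pi) + \sum_{s'} P^a_{ss'} V^{\pi}(s') \Bigr) \right] $$

So my question is really only about the meaning of $V^{\pi}$ in that first line.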