Policy Gradient Methods - Proof


I'm trying to get a grasp of reinforcement learning, and among the references I came across *Policy Gradient Methods for Reinforcement Learning with Function Approximation* (Sutton, McAllester, Singh, and Mansour, 1999).

The paper is quite short, and the appendix is the part I'm most interested in. The theorem proved there is

Theorem 1 (Policy Gradient) For any MDP, in either the average reward or start-state formulations, $$ \frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) $$
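To convince myself the statement holds, I ran a quick numerical sanity check on a toy MDP in the average-reward formulation (the transition probabilities, rewards, and all function names below are my own made-up example, not from the paper): compute the theorem's right-hand side for a softmax policy and compare it against a finite-difference gradient of the average reward $\rho$.

```python
import numpy as np

# Toy 2-state, 2-action MDP (made-up numbers, not from the paper).
P = np.array([[[0.9, 0.1],   # P[s, a, s'] = P(s' | s, a)
               [0.2, 0.8]],
              [[0.7, 0.3],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],    # R[s, a] = expected immediate reward
              [0.0, 2.0]])
nS, nA = R.shape

def pi(theta):
    """Softmax policy: pi[s, a] = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    """Stationary distribution d with d = d P_pi and sum(d) = 1."""
    A = np.vstack([P_pi.T - np.eye(nS), np.ones(nS)])
    b = np.zeros(nS + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def rho_and_values(theta):
    """Average reward rho and differential V, Q for the policy pi_theta."""
    p = pi(theta)
    P_pi = np.einsum('sa,sat->st', p, P)   # state-to-state chain under pi
    r_pi = (p * R).sum(axis=1)
    d = stationary(P_pi)
    rho = d @ r_pi
    # Differential values: solve (I - P_pi + 1 d^T) V = r_pi - rho,
    # which enforces the Bellman equation V = r_pi - rho + P_pi V.
    V = np.linalg.solve(np.eye(nS) - P_pi + np.outer(np.ones(nS), d), r_pi - rho)
    Q = R - rho + np.einsum('sat,t->sa', P, V)
    return rho, d, V, Q, p

theta = np.array([[0.3, -0.2], [0.5, 0.1]])
rho, d, V, Q, p = rho_and_values(theta)

# RHS of the theorem: sum_s d(s) sum_a dpi(s,a)/dtheta * Q(s,a).
# For a softmax policy this reduces to d(s) pi(s,a) (Q(s,a) - V(s)).
grad_theorem = d[:, None] * p * (Q - V[:, None])

# LHS: central finite-difference gradient of rho w.r.t. each theta[s, a].
eps = 1e-6
grad_fd = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        tp = theta.copy(); tp[s, a] += eps
        tm = theta.copy(); tm[s, a] -= eps
        grad_fd[s, a] = (rho_and_values(tp)[0] - rho_and_values(tm)[0]) / (2 * eps)

print(np.allclose(grad_theorem, grad_fd, atol=1e-5))  # should agree if the theorem holds
```

The two gradients match to numerical precision on this example, which at least reassures me that I'm parsing the statement correctly.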

When I go through Appendix A.1 I can follow all the math (it's mostly the chain rule and substitutions); it's the very beginning that confuses me. The proof starts with

$$ \frac{\partial V^{\pi}(s)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_a \pi(s,a) Q^{\pi}(s,a) \;\;\; \forall s \in \mathcal{S} $$

The equality is said to hold "by definition", but the quantity $V^{\pi}(s)$ is never actually defined in the paper. Can anyone clarify how it is defined here?
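For context, the step right after that starting line is one I can follow: the product rule, and then (in the average-reward formulation, using the paper's $R^a_s$, $P^a_{ss'}$ notation) substituting the Bellman equation for $Q^{\pi}$. Writing it out as I read it:

$$ \frac{\partial V^{\pi}(s)}{\partial \theta} = \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) + \pi(s,a) \frac{\partial Q^{\pi}(s,a)}{\partial \theta} \right] $$

$$ = \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) + \pi(s,a) \frac{\partial}{\partial \theta} \Bigl( R^a_s - \rho(\pi) + \sum_{s'} P^a_{ss'} V^{\pi}(s') \Bigr) \right] $$

So my question is really only about the meaning of $V^{\pi}$ in that first line.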