Following Sutton et al. (1998), consider a Markov decision process (MDP) where the state, action, and reward at each time $t \in \{0, 1, 2, \dots\}$ are denoted by $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, and $r_t \in \mathbb{R}$, respectively. Let $\mathcal{P}^a_{ss'} = P(s_{t + 1} = s' | s_t = s, a_t = a)$ be the state transition probabilities, and let $\mathcal{R}^a_s = \mathbb{E} \left[r_{t + 1} | s_t = s, a_t=a \right]$ be the expected reward. Also, let the policy, parameterized by $\theta$, be $\pi_\theta (s, a) = P( a_t = a | s_t = s, \theta)$.
The policy gradient theorem states that for any MDP:
$\frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi (s, a)}{\partial \theta} Q^\pi (s, a)$,
where $d^\pi(s)$ is the stationary distribution of the MDP under $\pi$, defined as $d^\pi (s) = \lim_{t \rightarrow \infty} P(s_t = s | s_0, \pi)$; $\rho(\pi)$ is the long-term expected reward per step, given by $\rho(\pi) = \sum_s d^\pi(s) \sum_a \pi (s, a) \mathcal{R}^a_s$; and $Q^\pi (s, a)$ is the value of a state-action pair under the policy $\pi$, given by $Q^\pi (s, a) = \sum \limits_{t = 1}^\infty \mathbb{E} \left[ r_t - \rho(\pi) | s_0 = s, a_0 = a, \pi \right]$.
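As an aside, the theorem is easy to check numerically in this average-reward setting. Below is a sketch on a small random MDP of my own construction (the sizes, the softmax parameterization, and all variable names are my assumptions, not from the paper): it solves the Poisson equation for the differential values, evaluates the right-hand side of the theorem, and compares it against a finite-difference estimate of $\partial \rho / \partial \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 4, 3  # small random MDP; sizes are arbitrary

# Transition probabilities P[a, s, s'] = P^a_{ss'} and rewards R[s, a] = R^a_s
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def avg_reward(theta):
    """rho(pi) plus intermediate quantities, for a softmax policy pi(s,a)."""
    pi = softmax(theta)                      # pi[s, a]
    P_pi = np.einsum('sa,ast->st', pi, P)    # policy-induced chain P^pi(s'|s)
    d = np.full(nS, 1.0 / nS)
    for _ in range(5000):                    # power iteration -> stationary d^pi
        d = d @ P_pi
    rho = d @ (pi * R).sum(axis=1)
    return rho, pi, P_pi, d

theta = rng.random((nS, nA))
rho, pi, P_pi, d = avg_reward(theta)

# Differential values: solve the Poisson equation (I - P_pi) V = r_pi - rho
r_pi = (pi * R).sum(axis=1)
V = np.linalg.lstsq(np.eye(nS) - P_pi, r_pi - rho, rcond=None)[0]
Q = R - rho + np.einsum('ast,t->sa', P, V)   # Q^pi(s, a)

# RHS of the theorem; for softmax it reduces to d(s) pi(s,a) (Q(s,a) - V(s))
grad = d[:, None] * pi * (Q - (pi * Q).sum(axis=1, keepdims=True))

# Central finite-difference estimate of d(rho)/d(theta)
eps, fd = 1e-6, np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps; tm[s, a] -= eps
        fd[s, a] = (avg_reward(tp)[0] - avg_reward(tm)[0]) / (2 * eps)

print(np.abs(grad - fd).max())  # agreement up to finite-difference error
```

Note that the additive constant in $V$ (the Poisson equation only pins $V$ down up to a constant) does not matter here, since $\sum_a \partial \pi(s,a) / \partial \theta = 0$ for any fixed $s$.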
My question concerns a part of the proof where the idea of $d^\pi (s)$ as the stationary distribution is used. In particular, the authors use:
$\sum_s d^\pi (s) \sum_a \pi (s, a) \sum_{s'} \mathcal{P}^a_{ss'} \frac{\partial V^\pi (s')}{\partial \theta} = \sum_{s'} d^\pi(s') \frac{\partial V^\pi (s')}{\partial \theta}$.
I intuitively understand this as a property of the stationary distribution, but I would appreciate a more rigorous proof of why the above identity holds.
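For what it's worth, the identity is easy to verify numerically. Here is a small sanity check on a random MDP (the construction and all variable names are mine), using an arbitrary vector $v$ in place of $\partial V^\pi(s') / \partial \theta$:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 5, 3  # sizes are arbitrary

# Transition probabilities P[a, s, s'] = P^a_{ss'} and a random policy pi[s, a]
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Policy-induced transition matrix P^pi(s'|s) and its stationary distribution
P_pi = np.einsum('sa,ast->st', pi, P)
d = np.full(nS, 1.0 / nS)
for _ in range(10000):
    d = d @ P_pi  # power iteration: d <- d P_pi converges to d^pi

v = rng.random(nS)  # stand-in for dV^pi(s')/dtheta

lhs = np.einsum('s,sa,ast,t->', d, pi, P, v)  # sum_s d sum_a pi sum_s' P v
rhs = d @ v                                   # sum_{s'} d(s') v(s')
print(abs(lhs - rhs))
```

In matrix form, the left-hand side is $d^\top P^\pi v$ and the right-hand side is $d^\top v$, so the check amounts to the stationarity condition $d^\top P^\pi = d^\top$ applied to $v$; what I am after is a rigorous argument for that condition from the limit definition of $d^\pi$.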