How to prove the relation between the difference of the value functions of two policies and the discounted sum of the advantage function over time?


In reinforcement learning, how do you prove the following relation, which expresses the difference between the value functions of two policies as a discounted sum of expected advantages?

The value function $V^\pi(s)$ is the expected cumulative discounted reward obtained by following policy $\pi$ from state $s \in \mathcal{S}$, while the action-value function $Q^\pi(s, a)$ (a.k.a. the $Q$-function) is the expected cumulative discounted reward obtained by taking action $a \in \mathcal{A}$ in state $s$ and following $\pi$ thereafter. The advantage function, defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, measures how much better it is to take action $a$ in state $s$ than to act according to $\pi$ on average.
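For concreteness, here is a minimal sketch (Python/NumPy) of how $V^\pi$, $Q^\pi$, and $A^\pi$ can be computed exactly in the tabular case by solving the Bellman equation; the 2-state, 2-action MDP (`P`, `R`), the policy `pi`, and `gamma` below are made-up numbers for illustration only:

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration):
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],   # pi[s, a] = probability of taking action a in state s
               [0.3, 0.7]])

def value_functions(policy):
    """Solve the Bellman equation exactly for a tabular policy."""
    P_pol = np.einsum('sa,sat->st', policy, P)   # policy-averaged transitions
    r_pol = np.einsum('sa,sa->s', policy, R)     # policy-averaged rewards
    # V = r_pol + gamma * P_pol @ V  =>  (I - gamma * P_pol) V = r_pol
    V = np.linalg.solve(np.eye(len(r_pol)) - gamma * P_pol, r_pol)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)  # Q(s,a) = R(s,a) + gamma E[V(s')]
    A = Q - V[:, None]                            # A(s,a) = Q(s,a) - V(s)
    return V, Q, A
```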

The following relationship links the difference between the value functions of two policies $\pi$ and $\pi'$ to the advantage function under $\pi'$, averaged over the action distribution of $\pi$ and the time-$t$ state distributions induced by $\pi$:

$$ V^{\pi}(s_0) - V^{\pi'}(s_0) = \sum_{t=0}^\infty \gamma^t \, \mathbb{E}_{s\sim P_t(\cdot|s_0,\pi)} \, \mathbb{E}_{a\sim\pi(\cdot|s)} \left[ A^{\pi'}(s,a) \right] $$

where

  • $\pi, \pi'$ are two policies;
  • $\gamma$ is the discount factor;
  • $P_t(\cdot|s_0,\pi)$ is the probability distribution over states reached at time $t$ when starting from state $s_0$ and following policy $\pi$ (so $P_0(\cdot|s_0,\pi)$ is a point mass at $s_0$).
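
While this is not a proof, the identity can at least be sanity-checked numerically on the toy MDP above. The sketch below reuses `value_functions`, `P`, `gamma`, and `pi` from the earlier snippet, truncates the infinite sum at a hypothetical horizon `T` (leaving an error of order $\gamma^T$), and uses another made-up policy `pi_prime`:

```python
def performance_difference_check(pi1, pi2, s0=0, T=200):
    """Compare both sides of the identity with pi = pi1 and pi' = pi2."""
    V1, _, _ = value_functions(pi1)
    V2, _, A2 = value_functions(pi2)
    lhs = V1[s0] - V2[s0]

    P_pi1 = np.einsum('sa,sat->st', pi1, P)   # transitions under pi
    d = np.zeros(len(V1)); d[s0] = 1.0        # P_0(.|s0, pi) is a point mass at s0
    adv = np.einsum('sa,sa->s', pi1, A2)      # E_{a~pi(.|s)}[A^{pi'}(s, a)] per state
    rhs = 0.0
    for t in range(T):
        rhs += gamma**t * (d @ adv)
        d = d @ P_pi1                         # advance state distribution one step
    return lhs, rhs

pi_prime = np.array([[0.9, 0.1],
                     [0.5, 0.5]])
print(performance_difference_check(pi, pi_prime))  # the two numbers should match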