How to prove the relation between the difference of the value functions of two policies and the discounted sum of the advantage function over time?


In reinforcement learning, how do you prove the following relation, which expresses the difference between the value functions of two policies as a discounted sum of expected advantages?

The value function $V^\pi(s)$ is the expected cumulative discounted reward obtained by following policy $\pi$ from state $s \in \mathcal{S}$, while the action-value function $Q^\pi(s, a)$ (a.k.a. the $Q$-function) is the expected cumulative discounted reward obtained by taking action $a \in \mathcal{A}$ in state $s$ and following $\pi$ thereafter. The advantage function, defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, measures how much better it is to take action $a$ in state $s$ than to act according to $\pi$ on average.
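For concreteness, here is a minimal sketch (Python/NumPy) of how $V^\pi$, $Q^\pi$, and $A^\pi$ can be computed exactly in the tabular case by solving the Bellman equation; the 2-state, 2-action MDP (`P`, `R`), the policy `pi`, and `gamma` below are made-up numbers for illustration only:

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration):
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],   # pi[s, a] = probability of taking action a in state s
               [0.3, 0.7]])

def value_functions(policy):
    """Solve the Bellman equation exactly for a tabular policy."""
    P_pol = np.einsum('sa,sat->st', policy, P)   # policy-averaged transitions
    r_pol = np.einsum('sa,sa->s', policy, R)     # policy-averaged rewards
    # V = r_pol + gamma * P_pol @ V  =>  (I - gamma * P_pol) V = r_pol
    V = np.linalg.solve(np.eye(len(r_pol)) - gamma * P_pol, r_pol)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)  # Q(s,a) = R(s,a) + gamma E[V(s')]
    A = Q - V[:, None]                            # A(s,a) = Q(s,a) - V(s)
    return V, Q, A
```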

The following relationship links the difference between the value functions of two policies $\pi$ and $\pi'$ to the advantage function under $\pi'$, averaged over the action distribution of $\pi$ and the time-$t$ state distributions induced by $\pi$:

$$ V^{\pi}(s_0) - V^{\pi'}(s_0) = \sum_{t=0}^\infty \gamma^t \, \mathbb{E}_{s\sim P_t(\cdot|s_0,\pi)} \, \mathbb{E}_{a\sim\pi(\cdot|s)} \left[ A^{\pi'}(s,a) \right] $$

where

  • $\pi, \pi'$ are two policies;
  • $\gamma$ is the discount factor;
  • $P_t(\cdot|s_0,\pi)$ is the probability distribution over states reached at time $t$ when starting from state $s_0$ and following policy $\pi$ (so $P_0(\cdot|s_0,\pi)$ is a point mass at $s_0$).
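
While this is not a proof, the identity can at least be sanity-checked numerically on the toy MDP above. The sketch below reuses `value_functions`, `P`, `gamma`, and `pi` from the earlier snippet, truncates the infinite sum at a hypothetical horizon `T` (leaving an error of order $\gamma^T$), and uses another made-up policy `pi_prime`:

```python
def performance_difference_check(pi1, pi2, s0=0, T=200):
    """Compare both sides of the identity with pi = pi1 and pi' = pi2."""
    V1, _, _ = value_functions(pi1)
    V2, _, A2 = value_functions(pi2)
    lhs = V1[s0] - V2[s0]

    P_pi1 = np.einsum('sa,sat->st', pi1, P)   # transitions under pi
    d = np.zeros(len(V1)); d[s0] = 1.0        # P_0(.|s0, pi) is a point mass at s0
    adv = np.einsum('sa,sa->s', pi1, A2)      # E_{a~pi(.|s)}[A^{pi'}(s, a)] per state
    rhs = 0.0
    for t in range(T):
        rhs += gamma**t * (d @ adv)
        d = d @ P_pi1                         # advance state distribution one step
    return lhs, rhs

pi_prime = np.array([[0.9, 0.1],
                     [0.5, 0.5]])
print(performance_difference_check(pi, pi_prime))  # the two numbers should match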