Understanding the derivation of the Bellman equation for state value function


In reinforcement learning theory, the Bellman equation for a state-value function is given in Sutton and Barto (pages 46-47) as:

\begin{equation} \begin{aligned} v_\pi (s) :&= \mathbb{E}_\pi\left[G_t \mid S_t=s\right]\\ &= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t=s\right]\\ &= \sum_{a}\pi(a|s)\sum_{s', r}p(s', r|s, a)\left[r+\gamma\mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1}=s'\right]\right]\\ &= \sum_{a}\pi(a|s)\sum_{s', r}p(s', r|s, a)\left[r + \gamma v_\pi(s')\right] \end{aligned} \end{equation}

Where

\begin{equation} \begin{aligned} v_\pi(s) &= \text{the state-value function}\\ r &= \text{a reward}\\ G_t &= \text{the discounted return from time } t\\ S_t, s, s' &= \text{the state at time } t \text{, a state, and the next state, respectively}\\ \gamma &= \text{a constant discount factor}\\ p(s', r|s, a) &= \text{probability of transition to state } s' \text{ with reward } r \text{, from state } s \text{ and action } a\\ \pi(a|s) &= \text{probability of taking action } a \text{ in state } s \text{ under stochastic policy } \pi\\ \end{aligned} \end{equation}

I understand the relationship between lines 1 and 2 (it follows from a previous equation in Sutton and Barto, eqn. 3.9, if you're interested), and I also understand the final substitution, i.e. the recursive bit. However, I don't fully understand how you get from line 2 to line 3. I think you probably need to refer to the definition of conditional expectation, i.e.

\begin{equation} \mathbb{E}(X|Y=y) = \sum_{x}x P(X=x|Y=y) \end{equation}

but I'm having a hard time understanding the precise logic. Does anybody have any further insight that could help me understand both mathematically and intuitively why this is so?


There are 2 answers below.


Perhaps you need a bit more detail and some notational enrichment of the equations in order to make the derivation step. These are spelled out in a fair amount of detail at the link below.

https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/


Regarding the conditional expectation, be careful. $\mathbb{E}[X|Y=y]$ is a constant but $\mathbb{E}[X|Y]$ is a random variable. In $\mathbb{E}[X|Y=y]$ we know that we must consider the conditional probability $\Pr\{x|y\}$ when averaging the values of $x$. In $\mathbb{E}[X|Y]$, since $Y$ is now random, $\Pr\{x|Y\}$ is a random variable that depends on the value $Y$ will take, because $\Pr\{x|y_1\}\neq\Pr\{x|y_2\}$ in general. The averaging of $x$ then depends on the probability of each $y\in\{y_1,y_2,\dots\}$ occurring.
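As a toy illustration (not from the book): let $X$ be a fair die roll and $Y$ its parity. Then $\mathbb{E}[X \mid Y=\text{even}]=\frac{2+4+6}{3}=4$ and $\mathbb{E}[X \mid Y=\text{odd}]=\frac{1+3+5}{3}=3$ are constants, while $\mathbb{E}[X \mid Y]$ is a random variable taking the values $4$ and $3$, each with probability $1/2$. Averaging it over $Y$ recovers the unconditional mean: $\mathbb{E}[\mathbb{E}[X \mid Y]]=\tfrac{1}{2}\cdot 4+\tfrac{1}{2}\cdot 3=3.5=\mathbb{E}[X]$.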

The derivation of the Bellman equation is as follows. First, use the linearity of the expectation operator:

\begin{equation} \begin{split} V_\pi(s)&=\mathbb{E}[G_t|S_t=s]\\ &=\mathbb{E}[R_{t+1}|s]+\gamma\mathbb{E}[G_{t+1}|s]\\ \end{split} \end{equation}
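Every remaining step is an application of the law of total expectation: for a random variable $Y$ and any conditioning event $Z$,

\begin{equation} \mathbb{E}[X \mid Z]=\sum_{y}\Pr\{Y=y \mid Z\}\,\mathbb{E}[X \mid Z, Y=y] \end{equation}

Below, this identity is used to condition successively on the action $a$, the next state $s'$, and the reward $r$.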

The first part is easier:

\begin{equation} \begin{split} \mathbb{E}[R_{t+1}|s] &=\sum_{r\in\mathcal{R}}r\Pr\{r|s\}\\ &=\sum_{r\in\mathcal{R}}\sum_{a\in\mathcal{A}(s)}r\Pr\{r|s,A_t=a\}\Pr\{A_t=a|s\}\\ &=\sum_{r\in\mathcal{R}}\sum_{a\in\mathcal{A}(s)}r\Pr\{r|s,a\}\pi(a|s)\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}r\Pr\{r|s,a\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}r\Pr\{r|s,a,S_{t+1}=s'\}\Pr\{S_{t+1}=s'|s,a\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}r\Pr\{r,s'|s,a\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(r,s'|s,a) r\\ \end{split} \end{equation}

Then for the return $G_{t+1}$:

\begin{equation} \begin{split} \mathbb{E}[G_{t+1}|s] &=\sum_{g\in\mathcal{G}}g\Pr\{G_{t+1}=g|s\}\\ &=\sum_{g\in\mathcal{G}}\sum_{a\in\mathcal{A}(s)}g\Pr\{g|s,A_t=a\}\Pr\{A_t=a|s\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{g\in\mathcal{G}} g\Pr\{g|s,a\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{g\in\mathcal{G}} \sum_{s'\in\mathcal{S}}g\Pr\{g|s,a,S_{t+1}=s'\}\Pr\{S_{t+1}=s'|s,a\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{s'\in\mathcal{S}} \Pr\{s'|s,a\}\sum_{g\in\mathcal{G}}g\Pr\{g|s,a,s'\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{s'\in\mathcal{S}} \Pr\{s'|s,a\}\sum_{g\in\mathcal{G}}\sum_{r\in\mathcal{R}} g\Pr\{g|s,a,s',r\}\Pr\{r|s,a,s'\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{r\in\mathcal{R}} \sum_{s'\in\mathcal{S}}\Pr\{r|s,a,s'\}\Pr\{s'|s,a\} \sum_{g\in\mathcal{G}}g\Pr\{g|s,a,s',r\}\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{r\in\mathcal{R}} \sum_{s'\in\mathcal{S}}p(r,s'|s,a) \underbrace{\sum_{g\in\mathcal{G}}g\Pr\{g|s,a,s',r\}}_ {\mathbb{E}[G_{t+1}|s,a,s',r]}\\ \end{split} \end{equation}

Then we have:

\begin{equation} \begin{split} \mathbb{E}[G_{t+1}|s,a,s',r] &=\mathbb{E}[G_{t+1}|S_t=s,A_t=a,S_{t+1}=s',R_{t+1}=r]\\ &=\mathbb{E}[R_{t+2}+\gamma G_{t+2}|S_t=s,A_t=a,S_{t+1}=s',R_{t+1}=r]\\ \end{split} \end{equation}

By the Markov property we know that the rewards from timestep $t+2$ onward, collected in the expression $R_{t+2}+\gamma G_{t+2}$, depend only on $S_{t+1}$; the action $A_{t+1}$ is itself drawn from $\pi(\cdot|s')$.
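Formally, for every value $g$ of the return,

\begin{equation} \Pr\{g \mid S_t=s, A_t=a, S_{t+1}=s', R_{t+1}=r\} = \Pr\{g \mid S_{t+1}=s'\} \end{equation}

So we can simplify the above expression: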

\begin{equation} \begin{split} \mathbb{E}[G_{t+1}|s,a,s',r] &=\mathbb{E}[G_{t+1}|s']\\ &=\mathbb{E}[G_{t+1}|S_{t+1}=s']\\ &=V_\pi(s') \end{split} \end{equation}

Putting everything together:

\begin{equation} \begin{split} V_\pi(s) &=\mathbb{E}[R_{t+1}|s]+\gamma\mathbb{E}[G_{t+1}|s]\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(r,s'|s,a) r + \gamma \sum_{a\in\mathcal{A}(s)}\pi(a|s)\sum_{r\in\mathcal{R}} \sum_{s'\in\mathcal{S}}p(r,s'|s,a) V_\pi(s')\\ &=\sum_{a\in\mathcal{A}(s)}\pi(a|s) \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(r,s'|s,a) \left[r+\gamma V_\pi(s')\right] \end{split} \end{equation}
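To make the final equation concrete, here is a minimal numerical sketch (not from the book; the two-state MDP, its transition probabilities, rewards, and policy are made-up numbers) that computes $v_\pi$ by iterating the Bellman equation and checks it against a Monte Carlo estimate of the expected return:

```python
# Minimal numerical check of the Bellman equation on a toy 2-state MDP.
# All numbers are invented for illustration; nothing here is from Sutton & Barto.
import numpy as np

rng = np.random.default_rng(0)

states = [0, 1]
actions = [0, 1]
gamma = 0.9

# p[s][a] is a list of (s_next, reward, probability) triples: p(s', r | s, a).
p = {
    0: {0: [(0, 1.0, 0.7), (1, 0.0, 0.3)],
        1: [(1, 2.0, 1.0)]},
    1: {0: [(0, 0.0, 0.4), (1, 1.0, 0.6)],
        1: [(0, 3.0, 0.5), (1, -1.0, 0.5)]},
}

# pi[s][a] = probability of taking action a in state s.
pi = {0: [0.5, 0.5], 1: [0.2, 0.8]}

# Iterate v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
# until it converges to the fixed point v_pi.
v = np.zeros(len(states))
for _ in range(1000):
    v_new = np.zeros_like(v)
    for s in states:
        for a in actions:
            for s_next, r, prob in p[s][a]:
                v_new[s] += pi[s][a] * prob * (r + gamma * v[s_next])
    v = v_new

# Monte Carlo estimate of E[G_t | S_t = s]: sample trajectories and
# average the discounted returns.
def rollout(s, horizon=100):
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(actions, p=pi[s])
        outcomes = p[s][a]
        i = rng.choice(len(outcomes), p=[w for _, _, w in outcomes])
        s, r = outcomes[i][0], outcomes[i][1]
        g += discount * r
        discount *= gamma
    return g

for s in states:
    mc = np.mean([rollout(s) for _ in range(5000)])
    print(f"state {s}: Bellman v = {v[s]:.3f}, Monte Carlo ~ {mc:.3f}")
```

The two columns should agree up to Monte Carlo noise, which is exactly the statement $v_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s]$ expressed through the recursive form derived above.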