How does Sutton unroll the consistency condition between the value of a state and the values of its successors (Eq. 3.14)?


Sutton describes the value of a state under a policy $\pi$ using the following equation:

$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = S] \\
&= \mathbb{E}_{\pi}[R_{t+1}+\gamma G_{t+1} \mid S_t = S] \\
&= \sum_a\pi(a \mid s)\sum_{s', r}p(s',r \mid s,a)\,[r+\gamma v_{\pi}(s')]
\end{aligned}$$

My question is: how did he obtain the factors $\sum\limits_a\pi(a|s)$ and $\sum\limits_{s', r}p(s',r|s,a)$ from the preceding line? Shouldn't each of these sums evaluate to 1?

Best answer

First, there is a typo in your description. $S_{t}=S$ should be $S_{t}=s$.

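To answer your first question, condition on the action $A_t=a$, the reward $R_{t+1}=r$, and the next state $S_{t+1}=s^{\prime}$ (law of total expectation), and use the Markov property, $\mathbb{E}_{\pi}\left[G_{t+1}\mid S_{t}=s,A_{t}=a,R_{t+1}=r,S_{t+1}=s^{\prime}\right]=v_{\pi}(s^{\prime})$. This gives $$ \mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1}\mid S_{t}=s\right]=\sum_{s^{\prime},r,a}p(s^{\prime},r,a\mid s)\left(r+\gamma v_{\pi}(s^{\prime})\right) $$ and, for a fixed policy $\pi$, $$ p(s^{\prime},r\mid s,a)p(a\mid s)=\frac{p(s^{\prime},r,s,a)}{p(s,a)}\frac{p(s,a)}{p(s)}=\frac{p(s^{\prime},r,s,a)}{p(s)}=p(s^{\prime},r,a\mid s). $$ Substituting the second identity into the first and summing over $a$ on the outside yields $\sum_a\pi(a\mid s)\sum_{s^{\prime},r}p(s^{\prime},r\mid s,a)\left(r+\gamma v_{\pi}(s^{\prime})\right)$. Note: I have used $p(a\mid s)$ and $\pi(a\mid s)$ interchangeably.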
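The factorization above can be checked numerically. The following is only a minimal sketch on a made-up two-state, two-action MDP: the transition table `p`, the policy `pi`, the discount `gamma`, and the helper `backup` are all illustrative assumptions, not anything from Sutton's book. It estimates $v_\pi$ by repeated backups, then compares the single sum over the joint $p(s^{\prime},r,a\mid s)$ with the nested sums.

```python
# Hypothetical toy MDP (illustrative numbers only, not from the book).
gamma = 0.9
states, actions = (0, 1), (0, 1)

# pi[s][a] = pi(a | s); each row sums to 1.
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}

# p[(s, a)][(s2, r)] = p(s2, r | s, a); each inner table sums to 1.
p = {
    (0, 0): {(0, 0.0): 0.5, (1, 1.0): 0.5},
    (0, 1): {(0, 0.0): 0.1, (1, 0.0): 0.2, (1, 1.0): 0.7},
    (1, 0): {(0, 1.0): 0.3, (1, 0.0): 0.7},
    (1, 1): {(0, 0.0): 0.4, (0, 1.0): 0.1, (1, 0.0): 0.25, (1, 1.0): 0.25},
}


def backup(v, s):
    """Nested sums: over a first, then over (s', r), as in the last line of the equation."""
    return sum(
        pi[s][a]
        * sum(prob * (r + gamma * v[s2]) for (s2, r), prob in p[(s, a)].items())
        for a in actions
    )


# Iterative policy evaluation: apply the backup until v has converged.
v = {s: 0.0 for s in states}
for _ in range(2000):
    v = {s: backup(v, s) for s in states}

# joint[s][(s2, r, a)] = p(s2, r, a | s) = p(s2, r | s, a) * pi(a | s)
joint = {
    s: {
        (s2, r, a): prob * pi[s][a]
        for a in actions
        for (s2, r), prob in p[(s, a)].items()
    }
    for s in states
}

for s in states:
    # Single sum over the joint distribution of (s', r, a) given s.
    single_sum = sum(
        prob * (r + gamma * v[s2]) for (s2, r, a), prob in joint[s].items()
    )
    print(s, sum(joint[s].values()), single_sum, backup(v, s), v[s])
```

For each state, the joint probabilities should add up to one, while the single sum, the nested sums, and $v_\pi(s)$ should agree with each other up to floating-point error.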

As for your second question, let $p\equiv p(x)$ be the probability mass function (PMF) of a discrete random variable $X$. Of course, $p$ sums to one. However, $\mathbb{E}[f(X)]=\sum_x p(x)f(x)$ does not necessarily equal one. This is analogous to your question: $\sum_{s^{\prime},r}p(s^{\prime},r\mid s,a)$ equals one, but $$ \sum_{s^{\prime},r}\left[p(s^{\prime},r\mid s,a)\left(r+\gamma v_{\pi}(s^{\prime})\right)\right] $$ does not necessarily equal one, because each probability is multiplied by a value before summing. A similar statement can be made for $a\mapsto\pi(a\mid s)$.
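To make the distinction concrete, here is a small sketch with a made-up PMF and a made-up function `f` (both are assumptions for illustration; `f` plays the role of $r+\gamma v_{\pi}(s^{\prime})$). The probabilities alone add up to one, but the probability-weighted sum of `f` is an expectation and can be any number.

```python
# Hypothetical PMF of a discrete random variable X on {0, 1, 2}.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}


def f(x):
    # Arbitrary function of X, standing in for r + gamma * v(s').
    return 10.0 * x + 1.0


# Summing the weights alone: this is the normalization of the PMF.
total_probability = sum(pmf.values())

# Summing weights times values: this is E[f(X)], not a normalization.
expectation = sum(px * f(x) for x, px in pmf.items())

print(total_probability)
print(expectation)
```

Here the expectation works out to $0.2\cdot 1+0.5\cdot 11+0.3\cdot 21=12$, even though the weights themselves sum to one.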