Exercise $3.19$ of Sutton's Reinforcement Learning


I'm trying to solve the first part of exercise $3.19$ of Sutton's Reinforcement Learning (Section $3.5$, page $62$, second edition). The question reads:

The value of an action, $q_{\pi}(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

[Backup diagram rooted at the state–action pair $(s, a)$, branching to the possible next states]

Give the equation corresponding to this intuition and diagram for the action value, $q_{\pi}(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_{\pi}(S_{t+1})$, given that $S_t=s$ and $A_t=a$. This equation should include an expectation but not one conditioned on following the policy.

My attempt is something like

$$ q_{\pi}(s, a) = \mathbb{E}\left[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_{t}=s, A_{t}=a\right]$$

Does that make sense?
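One way to sanity-check the equation is to expand the expectation over the environment dynamics $p(s', r \mid s, a)$ and evaluate it on a toy example. The sketch below uses a hypothetical fixed $(s, a)$ with three possible next states, made-up transition probabilities, rewards, and assumed $v_{\pi}$ values (all illustrative, not from the book):

```python
# Toy check of q_pi(s, a) = E[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t=s, A_t=a]
# Expanding the expectation over the (hypothetical) dynamics p(s', r | s, a):
#   q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v_pi(s'))

gamma = 0.9

# For a fixed (s, a): list of (probability, reward, next-state index).
# These numbers are invented for illustration.
transitions = [(0.5, 1.0, 0), (0.3, 0.0, 1), (0.2, -1.0, 2)]

# Assumed state values v_pi(s') for the three possible next states.
v = [2.0, 0.5, -0.5]

q = sum(p * (r + gamma * v[s_next]) for p, r, s_next in transitions)
print(q)  # the expectation, computed as a probability-weighted sum
```

Because the action $a$ is already fixed, no expectation over the policy appears; the only randomness left is the environment's transition, which is why the conditioning is on $S_t = s, A_t = a$ rather than on following $\pi$.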