From reinforcement learning, I am trying to get the Bellman Equation from the standard definition of the state-action value function.
I know that the sum of future rewards $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ can help, because I have the following identity: $$G_t = R_{t+1} + \gamma G_{t+1}$$
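As a quick sanity check of that identity, here is a small numerical experiment on an arbitrary, made-up finite reward sequence (everything after the list is treated as zero reward, so the infinite sum truncates):

```python
# Numerically verify G_t = R_{t+1} + gamma * G_{t+1}
# on a made-up finite reward sequence (rewards are zero afterwards).
gamma = 0.9
rewards = [1.0, 0.5, 2.0, 0.0, 3.0]  # rewards[t] plays the role of R_{t+1}

def G(t):
    # G_t = sum_{k>=0} gamma^k R_{t+k+1}, truncated at the end of the list
    return sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))

for t in range(len(rewards) - 1):
    lhs = G(t)
    rhs = rewards[t] + gamma * G(t + 1)
    assert abs(lhs - rhs) < 1e-12
```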
So far I have
$$Q_{*}(s, a) = \max_{\pi} Q_{\pi} (s,a)$$ $$= E[G_t \mid S_t = s, A_t = a]$$ $$= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$
From there, how can I obtain the Bellman equation, i.e. $Q_{*}(s, a) = E[R_{t+1} + \gamma \max_{\pi} Q_{\pi} (s',a')]$? I am not very strong in maths, so please explain your answer in detail.
Since I have some doubts about the exact derivation too, I'll try to work it out. Suppose there exists a policy $\pi^*$ such that
$$V^*(s)=V_{\pi^*}(s) \ge V_{\pi}(s) \quad \forall \pi, s \quad (*),$$
and
$$Q^*(s,a)= Q_{\pi^*}(s,a) \ge Q_{\pi}(s,a) \quad \forall \pi, s, a \quad (**),$$
which is true for finite MDPs. Then:
\begin{align}
Q^*(s,a) &= \max_{\pi} E_{\pi}\left[ G_t \mid s_t=s, a_t=a \right] = \max_{\pi} E_{\pi}\left[ r_t + \gamma G_{t+1} \mid s_t=s, a_t=a \right] \\
&= E_{r \sim p(r|s,a)}[r_t(s,a)] + \max_{\pi} E\left[ \gamma G_{t+1} \mid s_t=s, a_t=a \right] \\
&= E_{r \sim p(r|s,a)}[r_t(s,a)] + \max_{\pi}\left\{ \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, E_{\pi}\left[ \gamma G_{t+1} \mid s_{t+1}=s', a_{t+1}=a' \right] \right\} \\
&\overset{(*)}{=} E_{r \sim p(r|s,a)}[r_t(s,a)] + \sum_{s'} p(s'|s,a) \max_{\pi}\left\{ \sum_{a'} \pi(a'|s')\, E_{\pi}\left[ \gamma G_{t+1} \mid s_{t+1}=s', a_{t+1}=a' \right] \right\} \\
&= E_{r \sim p(r|s,a)}[r_t(s,a)] + \sum_{s'} p(s'|s,a)\, \gamma V^*(s') \\
&= E_{r \sim p(r|s,a),\, s' \sim p(s'|s,a)}\left[ r_t(s,a) + \gamma \max_{a'} Q^*(s',a') \right] \\
&= E_{r \sim p(r|s,a),\, s' \sim p(s'|s,a)}\left[ r_t(s,a) + \gamma \max_{a'} \max_{\pi} Q_{\pi}(s',a') \right]
\end{align}
I think there is a slight abuse of notation in your definition, and my last equation is more precise. That said, since there always exists a deterministic optimal policy, one may write
$$\max_{a'}\max_{\pi}Q_{\pi}(s', a') = \max_{\pi :\pi(a'|s')=1}Q_{\pi}(s', a'),$$
which is what I think your definition means.
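To make the fixed-point characterization concrete, here is a minimal sketch of Q-value iteration on a made-up 2-state, 2-action finite MDP (the transition matrix `P` and reward table `r` are purely illustrative, not from the question). Repeatedly applying the Bellman optimality backup converges to a $Q$ that satisfies $Q^*(s,a) = E\left[r(s,a) + \gamma \max_{a'} Q^*(s',a')\right]$:

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] : transition probabilities (made-up numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# r[s, a] : expected immediate reward (made-up numbers)
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def backup(Q):
    # Bellman optimality backup: r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    return r + gamma * np.einsum('sap,p->sa', P, Q.max(axis=1))

Q = np.zeros((2, 2))
for _ in range(1000):
    Q = backup(Q)

# At convergence, Q is (numerically) a fixed point of the backup,
# i.e. it satisfies the Bellman optimality equation.
residual = np.abs(Q - backup(Q)).max()
assert residual < 1e-8
```

Since the backup is a $\gamma$-contraction in the sup norm, the iteration converges to the unique fixed point $Q^*$ regardless of the initial guess.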