I'm learning about Markov dynamic programming, and it is said that we must use backward recursion to solve MDP problems.
My thought is that, in a Markov process, the only dependence is that the next stage ($n-1$ stages to go) depends on the current stage ($n$ stages to go), and not the other way around. Is this the reason? But isn't forward recursion also based on this same kind of dependence?
Could anyone help me understand why we must use backward recursion? :-)
Markov dynamic programming recursion
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail)
A Markov decision process involves computing some quantity of the form $$ v^N(i)\equiv\sup_{\pi}\mathbb{E}^{(i)}\left[\sum_{n=0}^{N}\underbrace{\frac{1}{\left(1+\rho\right)^{n}}}_{\text{discount}}u(X_{n}^{\pi})\right] $$ where $(X_{n}^{\pi})_{n\geq0}$ is a controlled, time-homogeneous, Markov chain taking values in a finite state space $\mathcal{S}=\{1,\ldots,M\}$. $\mathbb{E}^{(i)}$ is the expectation conditional on $X_{0}=i$, where $i\in\mathcal{S}$. Note that $N$ is the number of timesteps until the "game" is over.
Without worrying too much about regularity, note that \begin{align*} v^N(i) & \equiv\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[\sum_{n=1}^{N}\frac{1}{\left(1+\rho\right)^{n-1}}u(X_{n}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[v^{N-1}(X_{1}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(i)+\frac{1}{1+\rho}\sum_{j}p_{ij}(\pi)v^{N-1}(j)\right\} \end{align*} where the second equality uses the Markov property and time-homogeneity of the chain (after the first step, the remaining problem is an $(N-1)$-step copy of the original started from $X_1^\pi$), and $p_{ij}(\pi)$ is the probability that a transition from state $i$ to state $j$ occurs under policy $\pi$. This is why the recursion runs backward: $v^N$ is defined in terms of $v^{N-1}$, with the base case $v^0(i)=u(i)$ known at the end of the horizon.
In fact, going a bit further, we can write the above equation in the vector form $$ \mathbf{v}^{N}=\sup_{\pi}\left\{ \frac{P(\pi)}{1+\rho}\mathbf{v}^{N-1}+\mathbf{u}\right\} $$ where $P(\pi)$ is the matrix with coefficients $p_{ij}(\pi)$, and the supremum is to be interpreted row-wise.
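To make the vector-form recursion concrete, here is a minimal NumPy sketch. The two-state, two-action data (`P`, `u`, `rho`) are made up purely for illustration; each policy is reduced to a choice of one action per state, so the row-wise supremum becomes a row-wise max over actions.

```python
import numpy as np

# Hypothetical 2-state, 2-action example (all numbers are made up).
# P[a] is the transition matrix under action a; u is the reward vector.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
u = np.array([1.0, 2.0])
rho = 0.05  # discount rate, so the discount factor is 1/(1 + rho)

def backward_recursion(P, u, rho, N):
    """Iterate v^n = max_a { P(a) v^{n-1} / (1 + rho) + u }, starting from v^0 = u."""
    v = u.copy()  # base case: with 0 steps to go, only the immediate reward remains
    for _ in range(N):
        # q[a, i] is the value of taking action a in state i;
        # the row-wise supremum becomes a max over the action axis.
        q = P @ v / (1.0 + rho) + u
        v = q.max(axis=0)
    return v

v_N = backward_recursion(P, u, rho, N=200)
```

With a discount factor below one the update is a contraction, so for large `N` the iterates approach the steady-state solution discussed below.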
The steady-state case ($N=\infty$) is slightly more complicated. In the steady-state case, $\mathbf{v}^N=\mathbf{v}^{N-1}$, and the above equation becomes (letting $\mathbf{v}=\mathbf{v}^N$) $$ 0=\sup_{\pi}\left\{ \left(\frac{P(\pi)}{1+\rho}-I\right)\mathbf{v}+\mathbf{u}\right\}. $$ This equation can be solved by Howard's policy iteration. For details, see Section 2 (namely, Proposition 2.2) in [1].
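In the same spirit, here is a minimal sketch of Howard's policy iteration for the steady-state equation, again on made-up two-state, two-action data. Each pass solves the linear system for the current fixed policy (policy evaluation), then takes the row-wise argmax (policy improvement); this is only a loose illustration of the method analyzed in [1].

```python
import numpy as np

# Same hypothetical 2-state, 2-action data as above (made-up numbers).
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
u = np.array([1.0, 2.0])
rho = 0.05

def policy_iteration(P, u, rho, max_iter=100):
    """Howard's policy iteration for 0 = sup_pi { (P(pi)/(1+rho) - I) v + u }."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # initial guess: action 0 in every state
    for _ in range(max_iter):
        # Policy evaluation: solve (I - P(pi)/(1+rho)) v = u for the fixed policy.
        P_pi = P[policy, np.arange(n_states)]  # row i taken from matrix P[policy[i]]
        v = np.linalg.solve(np.eye(n_states) - P_pi / (1.0 + rho), u)
        # Policy improvement: row-wise argmax of the right-hand side.
        new_policy = np.argmax(P @ v / (1.0 + rho) + u, axis=0)
        if np.array_equal(new_policy, policy):
            break  # policy is stable, so v solves the optimality equation
        policy = new_policy
    return v, policy

v_star, pi_star = policy_iteration(P, u, rho)
```

At convergence the returned `v_star` satisfies the Bellman optimality equation exactly (up to the linear-solve tolerance), which is what distinguishes policy iteration from simply running the backward recursion for many steps.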
[1] P. Azimzadeh and P. A. Forsyth, "Weakly chained matrices, policy iteration, and impulse control", arXiv preprint arXiv:1510.03928, 2015.