Markov dynamic programming recursion


I'm learning about Markov decision processes, and it is said that we must use backward recursion to solve MDP problems.

My thought is that in a Markov process, the only dependence is that the next stage ($n-1$ stages to go) depends on the current stage ($n$ stages to go), and not the other way around. Is this the reason? But isn't forward recursion also based on this same kind of dependence?

Could anyone help me understand why we must use backward recursion? :-)


BEST ANSWER

A Markov decision process involves computing some quantity of the form $$ v^N(i)\equiv\sup_{\pi}\mathbb{E}^{(i)}\left[\sum_{n=0}^{N}\underbrace{\frac{1}{\left(1+\rho\right)^{n}}}_{\text{discount}}u(X_{n}^{\pi})\right] $$ where $(X_{n}^{\pi})_{n\geq0}$ is a controlled, time-homogeneous, Markov chain taking values in a finite state space $\mathcal{S}=\{1,\ldots,M\}$. $\mathbb{E}^{(i)}$ is the expectation conditional on $X_{0}=i$, where $i\in\mathcal{S}$. Note that $N$ is the number of timesteps until the "game" is over.

Without worrying too much about regularity, note that \begin{align*} v^N(i) & \equiv\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[\sum_{n=1}^{N}\frac{1}{\left(1+\rho\right)^{n-1}}u(X_{n}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[v^{N-1}(X_{1}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(i)+\frac{1}{1+\rho}\sum_{j}p_{ij}(\pi)v^{N-1}(j)\right\} \end{align*} where $p_{ij}(\pi)$ is the probability of a transition from state $i$ to state $j$ under the control $\pi$.

In the second equality above, we have related the value of $v^N$ to that of $v^{N-1}$. The dynamic programming algorithm thus involves computing $v^0,v^1,v^2,\ldots$ in that order. This is why it is referred to as backwards (recall that $v^N$ is the value assuming there are $N$ steps remaining).

In fact, going a bit further, we can write the above equation in the vector form $$ \mathbf{v}^{N}=\sup_{\pi}\left\{ \frac{P(\pi)}{1+\rho}\mathbf{v}^{N-1}+\mathbf{u}\right\} $$ where $P(\pi)$ is the matrix with coefficients $p_{ij}(\pi)$, and the supremum is to be interpreted row-wise.
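The vector equation above translates almost line-for-line into code. Here is a minimal sketch of the backward recursion using NumPy, with a made-up 3-state, 2-action MDP (the matrices `P`, the rewards `u`, and the rate `rho` are all invented for illustration, not taken from the answer):

```python
import numpy as np

# A small hypothetical MDP: 2 actions, 3 states (all numbers made up).
# P[a][i, j] = probability of moving from state i to state j under action a.
P = np.array([
    [[0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]],   # action 0
    [[0.9, 0.1, 0.0],
     [0.2, 0.2, 0.6],
     [0.1, 0.1, 0.8]],   # action 1
])
u = np.array([1.0, 0.0, 2.0])  # reward u(i) for being in state i
rho = 0.1                      # discount rate


def backward_recursion(N):
    """Compute v^N via v^n = sup_pi { u + P(pi) v^{n-1} / (1 + rho) }."""
    v = u.copy()  # v^0(i) = u(i): only the n = 0 term of the sum remains
    for _ in range(N):
        # P @ v has one row per action; np.max over axis 0 is the
        # row-wise supremum from the vector equation.
        v = u + np.max(P @ v, axis=0) / (1 + rho)
    return v
```

Note the order of computation: the loop produces $v^1, v^2, \ldots$ from $v^0$, i.e. it works backward from the end of the game, exactly as described above.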


The steady-state case ($N=\infty$) is slightly more complicated. In the steady-state case, $\mathbf{v}^N=\mathbf{v}^{N-1}$, and the above equation becomes (letting $\mathbf{v}=\mathbf{v}^N$) $$ 0=\sup_{\pi}\left\{ \left(\frac{P(\pi)}{1+\rho}-I\right)\mathbf{v}+\mathbf{u}\right\}. $$ This equation can be solved by Howard's policy iteration. For details, see Section 2 (namely, Proposition 2.2) in [1].
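For the steady-state equation, policy iteration alternates between evaluating a fixed policy (a linear solve) and improving it greedily. Below is a minimal sketch of Howard's policy iteration under the same made-up MDP data as before (again, `P`, `u`, and `rho` are invented for illustration); since $\rho>0$, the evaluation step is a contraction and the iteration terminates:

```python
import numpy as np

# Hypothetical MDP data (same made-up numbers as the earlier sketch).
P = np.array([
    [[0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]],   # action 0
    [[0.9, 0.1, 0.0],
     [0.2, 0.2, 0.6],
     [0.1, 0.1, 0.8]],   # action 1
])
u = np.array([1.0, 0.0, 2.0])
rho = 0.1
M = len(u)  # number of states


def policy_iteration():
    """Howard's policy iteration for 0 = sup_pi {(P(pi)/(1+rho) - I) v + u}."""
    policy = np.zeros(M, dtype=int)          # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - P(pi)/(1+rho)) v = u,
        # i.e. the fixed point v = u + P(pi) v / (1 + rho).
        P_pi = P[policy, np.arange(M)]       # row i taken from P[policy[i]]
        v = np.linalg.solve(np.eye(M) - P_pi / (1 + rho), u)
        # Policy improvement: greedy action per state (row-wise sup).
        new_policy = np.argmax(P @ v, axis=0)
        if np.array_equal(new_policy, policy):
            return v, policy                 # no improvement: optimal
        policy = new_policy
```

At convergence the returned `v` satisfies the steady-state equation, i.e. $\mathbf{v} = \sup_{\pi}\{\mathbf{u} + P(\pi)\mathbf{v}/(1+\rho)\}$, with the greedy policy attaining the supremum.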


[1] P. Azimzadeh, P. A. Forsyth, "Weakly chained matrices, policy iteration, and impulse control", arXiv preprint arXiv:1510.03928, 2015.