Markov dynamic programming recursion


I'm learning about Markov decision processes, and it is said that we must use backward recursion to solve MDP problems.

My thought is that in a Markov process, the only dependence is that the next stage ($n-1$ stages to go) depends on the current stage ($n$ stages to go), and not the other way around. Is this the reason? But isn't forward recursion also based on this same kind of dependence?

Could anyone help me understand why we must use backward recursion? :-)


BEST ANSWER

A Markov decision process involves computing some quantity of the form $$ v^N(i)\equiv\sup_{\pi}\mathbb{E}^{(i)}\left[\sum_{n=0}^{N}\underbrace{\frac{1}{\left(1+\rho\right)^{n}}}_{\text{discount}}u(X_{n}^{\pi})\right] $$ where $(X_{n}^{\pi})_{n\geq0}$ is a controlled, time-homogeneous, Markov chain taking values in a finite state space $\mathcal{S}=\{1,\ldots,M\}$. $\mathbb{E}^{(i)}$ is the expectation conditional on $X_{0}=i$, where $i\in\mathcal{S}$. Note that $N$ is the number of timesteps until the "game" is over.

Without worrying too much about regularity, note that \begin{align*} v^N(i) & \equiv\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[\sum_{n=1}^{N}\frac{1}{\left(1+\rho\right)^{n-1}}u(X_{n}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(X_{0})+\frac{1}{1+\rho}\mathbb{E}^{(i)}\left[v^{N-1}(X_{1}^{\pi})\right]\right\} \\ & =\sup_{\pi}\left\{ u(i)+\frac{1}{1+\rho}\sum_{j}p_{ij}(\pi)v^{N-1}(j)\right\} \end{align*} where $p_{ij}(\pi)$ is the probability of a transition from state $i$ to state $j$ under the control $\pi$.

In the second equality above, we have related the value of $v^N$ to that of $v^{N-1}$. The dynamic programming algorithm thus involves computing $v^0,v^1,v^2,\ldots$ in that order. This is why it is referred to as backwards (recall that $v^N$ is the value assuming there are $N$ steps remaining).

In fact, going a bit further, we can write the above equation in the vector form $$ \mathbf{v}^{N}=\sup_{\pi}\left\{ \frac{P(\pi)}{1+\rho}\mathbf{v}^{N-1}+\mathbf{u}\right\} $$ where $P(\pi)$ is the matrix with coefficients $p_{ij}(\pi)$, and the supremum is to be interpreted row-wise.
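The vector equation above translates almost line-for-line into code. Here is a minimal sketch of the backward recursion using NumPy, with a made-up 3-state, 2-action MDP (the matrices `P`, the rewards `u`, and the rate `rho` are all invented for illustration, not taken from the answer):

```python
import numpy as np

# A small hypothetical MDP: 2 actions, 3 states (all numbers made up).
# P[a][i, j] = probability of moving from state i to state j under action a.
P = np.array([
    [[0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]],   # action 0
    [[0.9, 0.1, 0.0],
     [0.2, 0.2, 0.6],
     [0.1, 0.1, 0.8]],   # action 1
])
u = np.array([1.0, 0.0, 2.0])  # reward u(i) for being in state i
rho = 0.1                      # discount rate


def backward_recursion(N):
    """Compute v^N via v^n = sup_pi { u + P(pi) v^{n-1} / (1 + rho) }."""
    v = u.copy()  # v^0(i) = u(i): only the n = 0 term of the sum remains
    for _ in range(N):
        # P @ v has one row per action; np.max over axis 0 is the
        # row-wise supremum from the vector equation.
        v = u + np.max(P @ v, axis=0) / (1 + rho)
    return v
```

Note the order of computation: the loop produces $v^1, v^2, \ldots$ from $v^0$, i.e. it works backward from the end of the game, exactly as described above.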


The steady-state case ($N=\infty$) is slightly more complicated. In the steady-state case, $\mathbf{v}^N=\mathbf{v}^{N-1}$, and the above equation becomes (letting $\mathbf{v}=\mathbf{v}^N$) $$ 0=\sup_{\pi}\left\{ \left(\frac{P(\pi)}{1+\rho}-I\right)\mathbf{v}+\mathbf{u}\right\}. $$ This equation can be solved by Howard's policy iteration. For details, see Section 2 (namely, Proposition 2.2) in [1].
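For the steady-state equation, policy iteration alternates between evaluating a fixed policy (a linear solve) and improving it greedily. Below is a minimal sketch of Howard's policy iteration under the same made-up MDP data as before (again, `P`, `u`, and `rho` are invented for illustration); since $\rho>0$, the evaluation step is a contraction and the iteration terminates:

```python
import numpy as np

# Hypothetical MDP data (same made-up numbers as the earlier sketch).
P = np.array([
    [[0.5, 0.5, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]],   # action 0
    [[0.9, 0.1, 0.0],
     [0.2, 0.2, 0.6],
     [0.1, 0.1, 0.8]],   # action 1
])
u = np.array([1.0, 0.0, 2.0])
rho = 0.1
M = len(u)  # number of states


def policy_iteration():
    """Howard's policy iteration for 0 = sup_pi {(P(pi)/(1+rho) - I) v + u}."""
    policy = np.zeros(M, dtype=int)          # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - P(pi)/(1+rho)) v = u,
        # i.e. the fixed point v = u + P(pi) v / (1 + rho).
        P_pi = P[policy, np.arange(M)]       # row i taken from P[policy[i]]
        v = np.linalg.solve(np.eye(M) - P_pi / (1 + rho), u)
        # Policy improvement: greedy action per state (row-wise sup).
        new_policy = np.argmax(P @ v, axis=0)
        if np.array_equal(new_policy, policy):
            return v, policy                 # no improvement: optimal
        policy = new_policy
```

At convergence the returned `v` satisfies the steady-state equation, i.e. $\mathbf{v} = \sup_{\pi}\{\mathbf{u} + P(\pi)\mathbf{v}/(1+\rho)\}$, with the greedy policy attaining the supremum.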


[1] P. Azimzadeh, P. A. Forsyth, "Weakly chained matrices, policy iteration, and impulse control", arXiv preprint arXiv:1510.03928, 2015.