Performance Loss for High-order Markov Decision Process Using Lower-order Policy


We know that the optimal policy for an MDP depends only on the current state $s_t$, i.e., $a_{t}\sim\pi^{(1)}(\cdot\mid s_t)$. For an $N$-order MDP (i.e., $P(s_{t}\mid s_{0},a_{0},\cdots,s_{t-1},a_{t-1})=P(s_{t}\mid s_{t-N},a_{t-N},\cdots,s_{t-1},a_{t-1})$), the optimal policy depends on the last $N$ states and $N-1$ actions, i.e., $a_{t}\sim\pi^{(N)}(\cdot\mid s_{t-N+1}, a_{t-N+1}, \cdots, s_{t-1}, a_{t-1}, s_t)$. This can be explained by state augmentation: the tuple $(s_{t-N+1}, a_{t-N+1}, \cdots, s_{t-1}, a_{t-1}, s_t)$ is a Markov state for the augmented process.
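The augmentation step can be sketched in code (a minimal sketch of my own; `step_fn`, a function mapping the recent history and the current action to the next state, is a hypothetical interface, not part of any standard library):

```python
from collections import deque

class AugmentedMDP:
    """Wrap an N-order process so that the augmented state
    (s_{t-N+1}, a_{t-N+1}, ..., s_{t-1}, a_{t-1}, s_t) is Markov.
    `step_fn(history, a)` is a hypothetical transition function."""

    def __init__(self, step_fn, s0, N):
        self.step_fn = step_fn
        # N states interleaved with N-1 actions = 2N - 1 entries;
        # during the first few steps the history is shorter (warm-up).
        self.hist = deque([s0], maxlen=2 * N - 1)

    def state(self):
        # this tuple is the (augmented) Markov state
        return tuple(self.hist)

    def step(self, a):
        s_next = self.step_fn(tuple(self.hist), a)
        self.hist.append(a)       # record the action taken ...
        self.hist.append(s_next)  # ... and the resulting state
        return s_next
```

An ordinary (order-1) policy on this wrapper is exactly an order-$N$ policy on the original process.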

My question is, if we use a lower-order policy $\pi^{(M)}$ to make decisions on a higher-order system ($N$-order MDP, and $M<N$), how can we characterize its performance loss?

I hope to

  1. prove that $\sup_{\pi\in\Pi^{(M)}} V(\pi;s)<\sup_{\pi\in\Pi^{(N)}} V(\pi;s)$ (at least for some $N$-order MDPs; in general only $\le$ can hold, since the higher-order dependence may be degenerate);
  2. characterize some bound on the performance loss $\sup_{\pi\in\Pi^{(N)}} V(\pi;s)-\sup_{\pi\in\Pi^{(M)}} V(\pi;s)$;
  3. prove that $\sup_{\pi\in\Pi^{(M)}} V(\pi;s)$ is nondecreasing in $M$ for $M\le N$ (plausibly because $\Pi^{(M)}\subseteq\Pi^{(M+1)}$).
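For item 1, here is a toy special case of my own construction where the gap is strict: a deterministic 2nd-order chain $s_{t+1}=s_{t-1}$ over $\{0,1\}$ with a prediction-style reward $\mathbf{1}\{a_t=s_{t+1}\}$. The order-2 policy $a_t=s_{t-1}$ predicts perfectly, while every order-1 policy (a function of $s_t$ alone) averages reward $1/2$:

```python
from itertools import product

def value(policy, horizon=10):
    """Exact average reward, uniform over the four initial (s_{-1}, s_0)
    pairs; dynamics s_{t+1} = s_{t-1} are deterministic, and the reward
    is 1 when the action correctly predicts s_{t+1}."""
    vals = []
    for s_prev, s in product((0, 1), repeat=2):
        total = 0
        for _ in range(horizon):
            a = policy(s_prev, s)   # policy sees (s_{t-1}, s_t)
            s_next = s_prev         # 2nd-order dynamics
            total += (a == s_next)
            s_prev, s = s, s_next
        vals.append(total / horizon)
    return sum(vals) / len(vals)

# all four deterministic order-1 policies (functions of s_t alone)
order1_policies = [lambda p, s: 0, lambda p, s: 1,
                   lambda p, s: s, lambda p, s: 1 - s]
best_order1 = max(value(pi) for pi in order1_policies)

print(value(lambda p, s: p))  # 1.0 — optimal order-2 policy
print(best_order1)            # 0.5 — best order-1 policy; gap = 0.5
```

So here $\sup_{\pi\in\Pi^{(1)}} V = 1/2 < 1 = \sup_{\pi\in\Pi^{(2)}} V$, and the loss equals the information about $s_{t+1}$ that the order-1 policy discards.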

I don't know how to prove this rigorously. I think the problem is analogous to $\min_{y}\mathbb{E}[(y-X)^2]$: if $y$ may depend on $X$, we can take $y(X)=X$ and the minimum is $0$; if $y$ must be a constant, the optimal choice is $y=\mathbb{E}[X]$ and the minimum is $\operatorname{Var}(X)$.
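The analogy can be checked numerically on a toy distribution (my own example: $X$ uniform on $\{0,1,2\}$):

```python
# X uniform on {0, 1, 2}: compare an X-dependent predictor y(X) = X
# with the best constant predictor, which is the mean E[X].
xs = [0, 1, 2]
mean = sum(xs) / len(xs)                                   # best constant y
mse_constant = sum((mean - x) ** 2 for x in xs) / len(xs)  # = Var(X) = 2/3
mse_informed = sum((x - x) ** 2 for x in xs) / len(xs)     # y(X) = X gives 0
print(mse_informed, mse_constant)  # 0.0 0.666...
```

The gap $\operatorname{Var}(X) - 0$ plays the role of the performance loss from restricting the information available to the decision rule.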

Thanks for any ideas; special cases would help me too!