Difference of Value Functions in Markov Decision Processes


I have recently started working on MDPs, and there is a particular point I haven't understood.

For the sake of clarity, consider only finite spaces: $S$ a finite state space, $A$ a finite action space, $T^a$ the transition matrix associated with action $a$, and the stage cost at stage $n$ given by $g_n:S\times A\to[0,\bar{g}]$ with $\bar{g}<\infty$. We consider a deterministic strategy $x$.

Consider the discounted (undiscounted when $\beta=1$) total cost $V(x)=\sum_{n\geq 0}\beta^n T_0^n(x)\,g_n(x)$ with $\beta\in(0,1]$, where $T_0^n(x)$ denotes the transition matrix between stages $0$ and $n$ induced by strategy $x$.
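To make the setup concrete, here is a minimal numerical sketch of this cost, assuming for simplicity a stationary deterministic strategy and stationary costs (all sizes, seeds, and variable names below are illustrative, not taken from the question):

```python
import numpy as np

S, A = 3, 2                       # illustrative: 3 states, 2 actions
rng = np.random.default_rng(0)

# One transition matrix T^a per action (each row sums to 1).
T = rng.random((A, S, S))
T /= T.sum(axis=2, keepdims=True)

g = rng.random((S, A))            # stage cost g(s, a) in [0, 1]
x = np.array([0, 1, 0])           # deterministic strategy: play action x[s] in state s
beta = 0.9

# Transition matrix and cost vector induced by strategy x.
T_x = T[x, np.arange(S), :]       # row s is taken from T^{x[s]}
g_x = g[np.arange(S), x]

# V(x) = sum_{n >= 0} beta^n T_x^n g_x, truncated at a large horizon.
V = np.zeros(S)
P = np.eye(S)                     # P = T_x^n, starting from the identity
for n in range(2000):
    V += beta**n * P @ g_x
    P = P @ T_x

# Sanity check against the fixed-point form V = g_x + beta * T_x V.
V_exact = np.linalg.solve(np.eye(S) - beta * T_x, g_x)
assert np.allclose(V, V_exact, atol=1e-8)
```

The vector `V` contains one value per initial state, which matches reading $T_0^n(x)\,g_n(x)$ row by row.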

Let $x^\ast$ be the optimal strategy and define $x^\prime$ as follows: up to stage $N$ you play some $x_n\neq x^\ast_n$ for $n\in\{1,\dots,N\}$, and from stage $N$ onward you play according to $x^\ast$.

Fixing the same initial state, you play either $x^\ast$ or $x^\prime$. The difference in total cost is then written as $V(x^\ast)-V(x^\prime)=\sum_{n\geq N}\beta^n T_0^N(x^\ast)\,T_N^n(x^\ast)\,g_n-\sum_{n\geq N}\beta^n T_0^N(x^\prime)\,T_N^n(x^\ast)\,g_n$, where $T_N^n(\cdot)$ denotes the transition matrix between stages $N$ and $n$.

How is it possible to write $V(x^\ast)-V(x^\prime)=\sum_{n\geq N}\beta^n\bigl(T_0^N(x^\ast)-T_0^N(x^\prime)\bigr)\,T_N^n(x^\ast)\,g_n$?
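The identity can be experimented with numerically. Below is a small sketch, again assuming stationary matrices and costs for simplicity (the names `T_star`, `T_prime`, the sizes, and the truncation horizon are all my own illustrative choices). It sums the two tails separately and then in the factored form:

```python
import numpy as np

S = 3
rng = np.random.default_rng(1)

def random_stochastic(shape):
    """Random matrix with rows summing to 1."""
    M = rng.random(shape)
    return M / M.sum(axis=-1, keepdims=True)

T_star = random_stochastic((S, S))    # one-step matrix under x* (stationary, for simplicity)
T_prime = random_stochastic((S, S))   # one-step matrix under x' during stages 0..N-1
g = rng.random(S)                     # stage cost under x* (stationary, for simplicity)
beta, N, horizon = 0.9, 4, 500

T0N_star = np.linalg.matrix_power(T_star, N)    # T_0^N(x*)
T0N_prime = np.linalg.matrix_power(T_prime, N)  # T_0^N(x')

# Tails of the two values from stage N on, computed as two separate sums.
two_sums = sum(beta**n * T0N_star @ np.linalg.matrix_power(T_star, n - N) @ g
               for n in range(N, horizon)) \
         - sum(beta**n * T0N_prime @ np.linalg.matrix_power(T_star, n - N) @ g
               for n in range(N, horizon))

# The factored form: both tails share the SAME continuation matrices T_N^n(x*),
# so the difference of the N-step matrices can be pulled out by distributivity.
factored = sum(beta**n * (T0N_star - T0N_prime)
               @ np.linalg.matrix_power(T_star, n - N) @ g
               for n in range(N, horizon))

assert np.allclose(two_sums, factored)
```

Algebraically this is just distributivity of matrix multiplication; the substantive point the sketch illustrates is that $T_N^n(x^\ast)$ is a single matrix shared by both terms, with the dependence on where you are at stage $N$ carried entirely by the rows of $T_0^N(\cdot)$.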

My issue is as follows. If you start from the same initial state but play a different strategy up to stage $N$, you might arrive at two different states. These two states might prescribe different optimal actions under $x^\ast$ afterwards, and therefore it should not be possible to factor out the common term $T_N^n(x^\ast)\,g_n$. Am I missing something?