I need to find a method that allows me to determine the optimal policy of a finite-horizon Markov Decision Process with a discount factor without solving the dual problem of the linear programming problem associated. I think I have found a method that works, but I am not sure about the proof I have done.
Let $S$ be the set of states in which the system can be for each stage, D(i) the set of possible decisions for each state i $\in S$, $R_i^k$ the reward obtained by choosing k $\in$ D(i), i$\in S$, and $\beta \in (0, 1)$ the discount factor. For each stationary policy $\pi$, we consider the expected rewards vector: $$ v_i^{\pi} = R_i^\pi + \displaystyle\sum_{n=1} ^{\infty} \displaystyle\sum_{j \in S} \beta^np_{ij}^\pi(n)R_j^\pi $$ where $p_{ij}$(n) is the probability of going from the state i $\in S$ to j $\in S$ in n stages.
We know that $w$ = $\displaystyle\max_{\pi} v^\pi$ is the solution of: $$ \begin{equation*} \begin{array}{ll@{}ll} \text{minimize} & \displaystyle\sum\limits_{i=1}^{N} w_{i}\\ \text{subject to}& w_i \geq R_{i}^k +\displaystyle\sum\limits_{ j \in S} \beta p_{ij}^kw_j, &i \in S, k\in D(i)\\ & w_{i} \in \mathbb{R} \end{array} \end{equation*} $$
Let $w^{*}$ = $g$ be the solution to that problem. The method I propose consists of the following:
- For each state i $\in S$ we choose a decision k $\in$ D(i) for whom the restriction satisfies with equality.
- If there is more than one of such restrictions for one staite i, we choose one and only one of them.
- One optimal policy will be that in which we choose the decisions mentioned above for their respective states.
I have proven the method is correct as follows: let $\pi$ be the optimal stationary policy, we know that for the policy evaluation equation, $v^\pi = \beta P ^\pi v^\pi + R^\pi$, being $P$ the matrix $P^\pi = (p_{i,j}^\pi)$, so choosing the restrictions of the pairs (i, $\pi$(i)), is clear that those restrictions are satisfied with equality by $w^* = v^\pi$, so, for each state i $\in S$ there is at least one of such restrictions so that we can choose at least one k $\in$ D(i). Now lets suppose that it exists $l_0 \in S$ with $\pi (l_0) = k_1$ and $k_2 \in D(l_0)$ so that $k_1 \ne k_2$ and $$ w_{l_0}^{*} = R_{l_0}^{k_2} +\displaystyle\sum\limits_{ j \in S} \beta p_{ij}^{k_2}kw_j $$ then we construct the stationary policy $\delta$ verifying $\delta$(i) = $\pi(i)$ if i $\ne$ $l_0$, $\delta(l_0)$ = $k_2$. We can choose the restriction pairs (i, $\delta$(i)), $i \in S$, and we have: $$ w^{*} = \beta P ^\delta w^{*} + R^\delta $$ so $w^{*} = v^\delta$, which means that $\delta$ is an optimal policy too. In this case, there is more than one optimal policy. $$ $$ I think the proof is correct, but I have been searching it for some time and I haven't seen it in any book or article, so my question is, did I miss something? If yes, what? How can I find such a method then? $$ $$ Thanks in advance.