Finding the optimal policy of an MDP just solving the primal problem associated

36 Views Asked by Bumbble Comm At 29 Mar 2026 - 2:19

I need to find a method that allows me to determine the optimal policy of a finite-horizon Markov Decision Process with a discount factor without solving the dual problem of the linear programming problem associated. I think I have found a method that works, but I am not sure about the proof I have done.

Let $S$ be the set of states in which the system can be for each stage, D(i) the set of possible decisions for each state i $\in S$, $R_i^k$ the reward obtained by choosing k $\in$ D(i), i$\in S$, and $\beta \in (0, 1)$ the discount factor. For each stationary policy $\pi$, we consider the expected rewards vector: $$ v_i^{\pi} = R_i^\pi + \displaystyle\sum_{n=1} ^{\infty} \displaystyle\sum_{j \in S} \beta^np_{ij}^\pi(n)R_j^\pi $$ where $p_{ij}$(n) is the probability of going from the state i $\in S$ to j $\in S$ in n stages.

We know that $w$ = $\displaystyle\max_{\pi} v^\pi$ is the solution of: $$ \begin{equation*} \begin{array}{ll@{}ll} \text{minimize} & \displaystyle\sum\limits_{i=1}^{N} w_{i}\\ \text{subject to}& w_i \geq R_{i}^k +\displaystyle\sum\limits_{ j \in S} \beta p_{ij}^kw_j, &i \in S, k\in D(i)\\ & w_{i} \in \mathbb{R} \end{array} \end{equation*} $$

Let $w^{*}$ = $g$ be the solution to that problem. The method I propose consists of the following:

For each state i $\in S$ we choose a decision k $\in$ D(i) for whom the restriction satisfies with equality.
If there is more than one of such restrictions for one staite i, we choose one and only one of them.
One optimal policy will be that in which we choose the decisions mentioned above for their respective states.

I have proven the method is correct as follows: let $\pi$ be the optimal stationary policy, we know that for the policy evaluation equation, $v^\pi = \beta P ^\pi v^\pi + R^\pi$, being $P$ the matrix $P^\pi = (p_{i,j}^\pi)$, so choosing the restrictions of the pairs (i, $\pi$(i)), is clear that those restrictions are satisfied with equality by $w^* = v^\pi$, so, for each state i $\in S$ there is at least one of such restrictions so that we can choose at least one k $\in$ D(i). Now lets suppose that it exists $l_0 \in S$ with $\pi (l_0) = k_1$ and $k_2 \in D(l_0)$ so that $k_1 \ne k_2$ and $$ w_{l_0}^{*} = R_{l_0}^{k_2} +\displaystyle\sum\limits_{ j \in S} \beta p_{ij}^{k_2}kw_j $$ then we construct the stationary policy $\delta$ verifying $\delta$(i) = $\pi(i)$ if i $\ne$ $l_0$, $\delta(l_0)$ = $k_2$. We can choose the restriction pairs (i, $\delta$(i)), $i \in S$, and we have: $$ w^{*} = \beta P ^\delta w^{*} + R^\delta $$ so $w^{*} = v^\delta$, which means that $\delta$ is an optimal policy too. In this case, there is more than one optimal policy. $$ $$ I think the proof is correct, but I have been searching it for some time and I haven't seen it in any book or article, so my question is, did I miss something? If yes, what? How can I find such a method then? $$ $$ Thanks in advance.

Original Q&A

Finding the optimal policy of an MDP just solving the primal problem associated

Related Questions in OPTIMIZATION

Related Questions in LINEAR-PROGRAMMING

Related Questions in MARKOV-PROCESS

Related Questions in MATHEMATICAL-MODELING

Related Questions in OPERATIONS-RESEARCH

Trending Questions

Popular # Hahtags

Popular Questions