In the following, I will use the vocabulary of Markov Decision Processes (MDPs), because I first thought this problem was one of them. However, now I'm not so sure.
Description of the problem:
- There is a set of states, some of which are transient and others are final. Let {A, B, C} be the subset of final states.
- There is an initial state.
- There is a subset of target states, say {A, B}, among the final states.
- There is a known set of actions associated with each state.
- There is a known set of transitions associated with each action (known probabilities, known successor states).
- There can be cycles over some transient states.
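The structure above maps directly onto nested dictionaries. Here is a minimal sketch with a made-up toy instance (the state names `s0`, `A`, `C` and action names `safe`, `retry` are illustrative, not from the question):

```python
# Hypothetical toy instance of the structure described above.
# "s0" is the initial (transient) state; "A" is a target final state;
# "C" is a non-target final state.
# mdp[state][action] is a list of (probability, next_state) pairs.
mdp = {
    "s0": {
        "safe":  [(0.5, "A"), (0.5, "C")],                # no cycle
        "retry": [(0.7, "s0"), (0.2, "A"), (0.1, "C")],   # cycles back to s0
    },
}
initial_state = "s0"
final_states = {"A", "C"}
target_states = {"A"}

# Sanity check: the transition probabilities of each action sum to 1.
for state, actions in mdp.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _ in transitions) - 1.0) < 1e-12
```

Final states carry no outgoing actions here, which encodes them as absorbing.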
Goals:
- Find the policy corresponding to the highest probability of reaching any of the target states, starting from the initial state and without limitation of time.
- Provide the corresponding probability.
- If several policies achieve that same highest probability, find the one(s) with the shortest expected time to reach a target state.
My first thoughts and approach were:
This problem can be treated as an MDP.
Thus I associated a set of rewards to the set of transitions (R>0 for transitions to target states, R<0 for transitions to other final states, R=0 for transitions to transient states).
I chose a discount factor $\gamma\neq0$.
I used the policy iteration algorithm.
However, I got different optimal policies depending on the values of the rewards and on the discount factor.
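To make the first attempt concrete, here is a policy iteration sketch on a hypothetical toy instance (state/action names, rewards, and $\gamma$ are illustrative choices, not from the question): rewards are attached to entering a state, $R=+1$ for the target final state, $R=-1$ for the other final state, $R=0$ otherwise.

```python
# Hypothetical toy instance: transient "s0", target final "A", non-target final "C".
mdp = {"s0": {"safe":  [(0.5, "A"), (0.5, "C")],
              "retry": [(0.7, "s0"), (0.2, "A"), (0.1, "C")]}}
reward = {"A": 1.0, "C": -1.0, "s0": 0.0}   # reward received on entering a state
gamma = 0.9                                  # illustrative discount factor

def q(s, a, V):
    """Expected discounted return of taking action a in state s."""
    return sum(p * (reward[s2] + gamma * V.get(s2, 0.0)) for p, s2 in mdp[s][a])

policy = {s: next(iter(acts)) for s, acts in mdp.items()}   # arbitrary start
V = {s: 0.0 for s in mdp}
while True:
    for _ in range(500):                                    # policy evaluation
        V = {s: q(s, policy[s], V) for s in mdp}
    improved = {s: max(mdp[s], key=lambda a: q(s, a, V)) for s in mdp}
    if improved == policy:                                  # policy improvement
        break
    policy = improved
print(policy["s0"], V["s0"])
```

On this instance "retry" wins, but as the question observes, the value $V$ (and in less lopsided instances the policy itself) depends on the chosen reward magnitudes and on $\gamma$.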
Then my second thoughts were:
This might not be the best approach.
$\gamma$ should be equal to 1, since there is no time limit.
I could set R=0 for non-target final states, without changing the result.
The problem does not really depend on the reward values (at most, the sign of the rewards designates the target states), so the solution should not depend on them.
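That observation suggests dropping rewards altogether and computing reachability probabilities directly: fix the value of target final states to 1 and of other final states to 0, then iterate the Bellman operator with $\gamma=1$ over the transient states. A sketch on a hypothetical toy instance (names and numbers are illustrative):

```python
# Hypothetical toy instance: transient "s0", target "A", non-target "C".
# mdp[state][action] is a list of (probability, next_state) pairs.
mdp = {"s0": {"safe":  [(0.5, "A"), (0.5, "C")],
              "retry": [(0.7, "s0"), (0.2, "A"), (0.1, "C")]}}

# V[s] = maximum probability of eventually reaching a target state from s.
V = {"s0": 0.0, "A": 1.0, "C": 0.0}   # targets fixed at 1, other finals at 0
for _ in range(200):                   # value iteration, gamma = 1, no rewards
    for s, actions in mdp.items():
        V[s] = max(sum(p * V[s2] for p, s2 in trans)
                   for trans in actions.values())

best = {s: max(acts, key=lambda a: sum(p * V[s2] for p, s2 in acts[a]))
        for s, acts in mdp.items()}
print(V["s0"], best["s0"])
```

Here "safe" reaches a target with probability 0.5, while "retry" satisfies $x = 0.7x + 0.2$, i.e. $x = 2/3$, so "retry" is optimal; no reward values were needed.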
Thus my question is:
To which class of problems does this problem belong, and what are the methods/algorithms for solving it?
Thanks in advance for your answers.
Your approach looks reasonable if you set $R=1$ for transitions from transient to target states, $R=0$ for all other transitions, and $\gamma>0$. Because your state space and action sets are finite, an alternative to policy iteration is to use linear programming to solve Bellman's equations, with a variable for each state and a constraint for each state-action pair.
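That LP can be sketched with `scipy.optimize.linprog` on a hypothetical toy instance (one transient state `s0` with actions `safe` and `retry`; all names and numbers are illustrative): minimize $\sum_s x_s$ subject to $x_s \ge \sum_{s'} P(s'\mid s,a)\,x_{s'} + R(s,a)$ for each state-action pair, with target states contributing value 1 and other final states 0.

```python
from scipy.optimize import linprog

# Hypothetical toy instance, one variable x = value of transient state "s0":
#   action "safe":  0.5 -> target A, 0.5 -> non-target C
#   action "retry": 0.7 -> s0, 0.2 -> A, 0.1 -> C
# With R = 1 on transitions into targets and 0 elsewhere, the constraints are
#   x >= 0.5                    (action "safe")
#   x >= 0.7*x + 0.2            (action "retry"), i.e. 0.3*x >= 0.2
res = linprog(c=[1.0],                   # minimize x
              A_ub=[[-1.0], [-0.3]],     # -x <= -0.5 ; -0.3*x <= -0.2
              b_ub=[-0.5, -0.2],
              bounds=[(0.0, 1.0)])
print(res.x[0])   # optimal reachability probability: 2/3 here
```

The optimal policy can then be read off as the actions whose constraints are tight at the optimum (here "retry", since $x = 2/3 > 0.5$).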