It is known from Chapter 5 of Dynamic Programming and Optimal Control, Vol. II that a uni-chain Markov Decision Process (MDP) has a unique gain-bias solution $(J,\vec{h})$ to the infinite-horizon average-cost Bellman policy evaluation equations $$ \vec{h} + J\mathbf{1} = \vec{C} + \mathbf{P} \vec{h} $$ where $\vec{C}$ is a cost/reward vector and $\mathbf{P}$ is the transition matrix of the induced Markov chain, together with the normalisation $h(x) = 0$ for one designated $x \in \mathcal{X}$, called the distinguished state. This extra equation turns the underdetermined system of $|\mathcal{X}|$ equations in $|\mathcal{X}|+1$ unknowns ($J$ and the entries of $\vec{h}$) into a square system that can be solved directly. Uniqueness here refers to the pair for a fixed policy: given a policy $\pi \in \Pi$ and its gain $J$, the normalised bias $\vec{h}$ is unique. It does not say that only one policy attains a given gain.
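As a concrete sketch of this (my own illustration, not from the book): since $h(x_0) = 0$, the column of $(\mathbf{I}-\mathbf{P})$ belonging to $h(x_0)$ is unused and can carry the unknown $J$ instead, giving a square linear system:

```python
import numpy as np

def evaluate_policy(P, C, x0=0):
    """Solve h + J*1 = C + P h subject to h[x0] = 0.

    Rearranged as (I - P) h + J*1 = C; because h[x0] = 0, the column of
    (I - P) multiplying h[x0] contributes nothing, so it can carry J.
    """
    n = len(C)
    M = np.eye(n) - P
    M[:, x0] = 1.0               # column for J replaces the h[x0] column
    sol = np.linalg.solve(M, C)  # sol[x0] = J, sol[i] = h[i] otherwise
    J = sol[x0]
    h = sol.copy()
    h[x0] = 0.0
    return J, h

# Two-state example: the gain J equals the stationary average of C.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
C = np.array([1.0, 2.0])
J, h = evaluate_policy(P, C)
print(J, h)   # J = 12/7, h = [0, 10/7]
```

One can verify the result against the stationary distribution of $\mathbf{P}$, which here is $(2/7, 5/7)$, giving the same gain $J = 12/7$.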
Policy iteration typically runs in two phases. First, it finds the optimal gain $J^*$: from some iteration $k$ onwards, $J_{k+1} = J_k = J^*$ while $\vec{h}_{k+1} \preceq \vec{h}_k$, where $\preceq$ denotes element-wise comparison. Phase two then explores $\pi \in \Pi_2$, where $\forall \pi,\pi' \in \Pi_2: J_{\pi'} = J_\pi$, and improves the bias until termination at $\vec{h}_{k+1} = \vec{h}_k$. Hence, $\Pi_2$ is the set of gain-optimal policies, and termination occurs at a gain-bias optimal policy $(J^*,\vec{h}^*)$.
Intuitively, $|\Pi_2| \geq 1$ for a uni-chain MDP because it has a single recurrent class plus, possibly, some transient states. The transient states shape the bias, whereas the recurrent class produces the gain (in the limit). Hence, phase two of policy iteration changes/optimises the policy with respect to the transient states.
It seems reasonable to assume the following for an ergodic MDP, i.e. a uni-chain MDP that has no transient states: $$ |\Pi_2| = 1 $$ because there are no transient states over which to optimise the bias. Hence, an ergodic MDP would have a unique gain-optimal policy, in contrast to the general uni-chain MDP.
I could not find any literature that shows, proves, supports, or states this, which makes me wonder whether the result is correct or perhaps considered too obvious to state. Can you please help me verify whether this assumption is correct? A citation/resource would be greatly appreciated as well.
I am providing a partial answer to my own question. It is partial in that I only provide evidence, from a simulation experiment, to support my assumption that for recurrent (ergodic) MDPs $$ |\Pi_2| = 1, $$ i.e. that the gain-optimal policy is unique. I provide the Python code below. It is not optimised for speed; in particular, the `for` loops over actions make both the MDP construction and the policy improvement slow, so making `A` large dramatically slows the code down.

Briefly, the code randomly generates either a uni-chain or a recurrent MDP and performs policy iteration on it, tracking the gain and the corresponding policies. $\Pi_2$ is identified once the gain stops changing. We then search for the largest sampled $|\Pi_2|$ for both types of MDP. Note that for the uni-chain MDPs, the Markov chains of all decisions share the same structure and hence the same transient states; this ensures that no policy can make $\mathbf{P}_{\pi}$ recurrent. As for the recurrent MDPs, no Markov chain corresponding to a decision has a zero entry anywhere; this ensures that no policy can induce a non-recurrent uni-chain $\mathbf{P}_{\pi}$.
Here is the code, i.e. the experiment and its helper functions:
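A minimal self-contained sketch of such an experiment follows. All function names, parameters, and the exact generation scheme are my own choices for illustration (the text above only fixes the intent: a shared zero pattern for the uni-chain case, strictly positive transition matrices for the recurrent case); `G1` and `G2` collect the sampled $|\Pi_2|$ values:

```python
import numpy as np

def make_mdp(n, A, kind="recurrent", rng=None):
    """Random MDP: P has shape (A, n, n), C has shape (A, n) (rewards here).

    kind="recurrent": every entry of every P[a] is strictly positive, so
    any policy induces a recurrent chain.
    kind="unichain":  all actions share one zero pattern, with {0, 1} the
    only recurrent class and states 2..n-1 transient under every policy.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = rng.dirichlet(np.ones(n), size=(A, n))   # strictly positive rows
    C = rng.random((A, n))
    if kind == "unichain":
        P[:, :2, 2:] = 0.0                       # {0, 1} never leaves {0, 1}
        P[:, :2, :2] /= P[:, :2, :2].sum(axis=2, keepdims=True)
    return P, C

def evaluate(P, C, x0=0):
    """Solve h + J*1 = C + P h with h[x0] = 0 (distinguished state)."""
    n = len(C)
    M = np.eye(n) - P
    M[:, x0] = 1.0              # h[x0] = 0, so its column can carry J instead
    sol = np.linalg.solve(M, C)
    J, h = sol[x0], sol.copy()
    h[x0] = 0.0
    return J, h

def policy_iteration(P, C, max_iter=500):
    """Howard policy iteration (reward-maximising); records J_k and pi_k."""
    A, n, _ = P.shape
    idx = np.arange(n)
    pi = np.zeros(n, dtype=int)
    gains, policies = [], []
    for _ in range(max_iter):
        J, h = evaluate(P[pi, idx], C[pi, idx])
        gains.append(J)
        policies.append(pi.copy())
        Q = C + P @ h                       # Q[a, i] = C[a, i] + P[a, i] . h
        new_pi = Q.argmax(axis=0)
        tie = np.isclose(Q[pi, idx], Q.max(axis=0))
        new_pi[tie] = pi[tie]               # keep the old action on ties
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return np.array(gains), policies

def count_gain_optimal(P, C):
    """|Pi_2| proxy: distinct policies visited after the gain settles."""
    gains, policies = policy_iteration(P, C)
    k = int(np.argmax(np.isclose(gains, gains[-1])))
    return len({tuple(p) for p in policies[k:]})

rng = np.random.default_rng(123)
G1, G2 = [], []                             # sampled |Pi_2| values
for _ in range(100):
    G1.append(count_gain_optimal(*make_mdp(6, 3, "unichain", rng)))
    G2.append(count_gain_optimal(*make_mdp(6, 3, "recurrent", rng)))
print("largest sampled |Pi_2|, uni-chain :", max(G1))
print("largest sampled |Pi_2|, recurrent:", max(G2))
```

The distinguished-state trick in `evaluate` (reusing the $h(x_0)$ column of $\mathbf{I}-\mathbf{P}$ for $J$) keeps the evaluation a single square solve; the tie-keeping rule in the improvement step is one standard way to guarantee finite termination.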
Note that for the uni-chain MDPs, the recurrent class consists of only two states. Why? The recurrent states are responsible for the gain, and a small recurrent class means policy iteration very quickly optimises their actions and finds the optimal gain. Thereafter, it can start extending $\Pi_2$ as it optimises actions over the transient states. A large number of transient states allows more combinations of actions over them and hence a larger $\Pi_2$.
Lastly, note that `G1` and `G2` are arrays, so you can plot histograms of them if you are interested in a probability distribution over $|\Pi_2|$. I don't know what application that might have, but it seems interesting nonetheless.

My results with `np.random.seed(123)` are as follows: