Does an ergodic Markov Decision Process have a unique optimal gain?


It is known from Chapter 5 of Dynamic Programming and Optimal Control Vol. II that a uni-chain Markov Decision Process (MDP) has a unique gain-bias solution $(J,\vec{h})$ to the following infinite-horizon average-cost Bellman policy evaluation equations $$ \vec{h} + J\mathbf{1} = \vec{C} + \mathbf{P} \vec{h} $$ where $\vec{C}$ is a cost/reward vector and $\mathbf{P}$ is a transition matrix/Markov chain. Moreover, $h(x) = 0$ for one $x \in \mathcal{X}$ called the distinguished state. This extra condition turns the underdetermined system of $|\mathcal{X}|$ equations in $|\mathcal{X}|+1$ variables into a square system of $|\mathcal{X}|+1$ equations that can be solved directly. Uniqueness here is per policy: given $\pi \in \Pi$, the pair $(J_\pi, \vec{h}_\pi)$ is unique, but across policies the gain $J$ need not be unique.
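To make the evaluation step concrete, here is a minimal sketch of solving these equations for a fixed 2-state chain (the matrix `P` and cost vector `C` are illustrative, not from the book):

```python
import numpy as np

# Unknowns are (h(0), h(1), J); the extra row pins h(0) = 0 at the
# distinguished state so the augmented system has a unique solution.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
C = np.array([1.0, 3.0])
N = len(C)

A = np.zeros((N + 1, N + 1))
A[:N, :N] = np.eye(N) - P   # (I - P) h ...
A[:N, -1] = 1.0             # ... + J * 1 = C
A[-1, 0] = 1.0              # h(0) = 0 (distinguished state)
b = np.concatenate((C, [0.0]))

sol = np.linalg.solve(A, b)
h, J = sol[:N], sol[-1]
print(J, h)  # gain J = 2.0, bias h = [0.0, 2.0]
```

Here the stationary distribution is uniform, so the gain is the average cost $(1+3)/2 = 2$, as the solve confirms.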

Policy iteration typically runs in two phases. The first phase finds the optimal gain $J^*$, after which $J_{k+1} = J_k = J^*$ at every iteration $k$ while $\vec{h}_{k+1} \preceq \vec{h}_k$, where $\preceq$ denotes element-wise comparison. The second phase then explores $\pi \in \Pi_2$, where $\forall \pi,\pi' \in \Pi_2: J_{\pi'} = J_\pi$, and improves the bias until termination at $\vec{h}_{k+1} = \vec{h}_k$. Hence, $\Pi_2$ is the set of gain-optimal policies, and termination occurs at a gain-bias optimal policy $(J^*,\vec{h}^*)$.

Intuitively, $|\Pi_2| \geq 1$ for a uni-chain MDP because it has a single recurrent class and possibly some transient states. The transient states tailor the bias, whereas the recurrent states produce the gain (in the limit). Hence, part two of policy iteration changes/optimises the policy with respect to the transient states.

It seems reasonable to assume the following for an ergodic MDP, i.e. a uni-chain MDP that has no transient states: $$ |\Pi_2| = 1 $$ because there are no transient states to optimise the bias over. Hence, an ergodic MDP would have a unique gain-optimal policy, in contrast to the general uni-chain MDP.
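The contrast can be illustrated with a toy uni-chain construction (mine, not from the book): state 0 is absorbing and forms the recurrent class, state 1 is transient, and the two actions have identical transitions but different costs at the transient state. Both policies then reach the same recurrent class and hence share the same gain, so $|\Pi_2| > 1$; the hypothetical helper `gain_of` just performs policy evaluation as above.

```python
import numpy as np

# Two actions with identical transitions; they only differ in the
# cost charged at the transient state 1.
P = [np.array([[1.0, 0.0], [1.0, 0.0]]),  # action 0
     np.array([[1.0, 0.0], [1.0, 0.0]])]  # action 1
C = [np.array([1.0, 5.0]),                # action 0 costs
     np.array([1.0, 7.0])]                # action 1 costs

def gain_of(P_pi, C_pi):
    # policy evaluation with h(0) = 0 at the distinguished state
    n = len(C_pi)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P_pi
    A[:n, -1] = 1.0
    A[-1, 0] = 1.0
    return np.linalg.solve(A, np.concatenate((C_pi, [0.0])))[-1]

# enumerate the two policies, which differ only at the transient state
gains = []
for a in (0, 1):
    P_pi = np.vstack((P[0][0], P[a][1]))
    C_pi = np.array([C[0][0], C[a][1]])
    gains.append(gain_of(P_pi, C_pi))
print(gains)  # both policies have gain 1.0
```

Both policies are gain-optimal with gain 1 (the cost of the absorbing state) while their biases differ, which is exactly the freedom that disappears once there are no transient states.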

I could not find any literature that shows, proves, supports, or states this. This makes me wonder whether such a result is correct or perhaps considered too obvious to state. Can you please help me verify whether this assumption is correct or not? A citation/resource would be greatly appreciated as well.

I am providing a partial answer to my own question. It is partial in that I provide evidence to support my assumption that for recurrent-chain MDPs $$ |\Pi_2| = 1 $$ such that only one gain-optimal policy exists, hence making it unique. It is only evidence because it is a simulation experiment. I provide the Python code below. It is not optimised for speed. More specifically, the for loops over actions make it slow in both the MDP construction and the policy improvement, so making A large dramatically slows the code down.

Briefly, the code randomly generates either a uni-chain or a recurrent MDP and performs policy iteration on it. It tracks the gain and the corresponding policies. $\Pi_2$ is identified once the gain stops changing. We search for the largest sampled $\Pi_2$ for both types of MDP. Note that for the uni-chain MDP, the Markov chains of all decisions share the same structure, and hence the same transient states. This ensures that no policy can make $\mathbf{P}_{\pi}$ recurrent. As for the recurrent MDPs, no Markov chain corresponding to a decision has a zero entry anywhere. This ensures that no policy can induce a $\mathbf{P}_{\pi}$ that is a non-recurrent uni-chain system.

Here is the code:

import numpy as np
from tqdm import tqdm

def get_random_MDP(N,actions=3,transient=0,progress=True):
    # Generates A random transition matrices and cost vectors. The last
    # `transient` states are unreachable from the first N1 states, so the
    # first N1 states form the (shared) recurrent class of every action.
    assert transient < N
    if progress:
        track = tqdm
    else:
        track = lambda x: x # identity
    N1 = N - transient
    A = actions
    Pset = []
    Cset = []
    for a in track(range(A)):
        alphas = np.random.randint(1,100,size=N) # Dirichlet concentration parameters
        Pa = np.random.dirichlet(alphas,size=N)
        Pa[:N1,N1:] = 0 # recurrent states cannot reach transient ones
        Pa = Pa/(Pa.sum(axis=1))[:,None] # renormalise the zeroed rows
        Pset.append(Pa)
        Ca = np.random.uniform(0,10,size=N)
        Cset.append(Ca)
    return Pset,Cset

def load_policy(pi,Pset,Cset):
    # Assemble P_pi and C_pi: row i comes from the action pi[i] chosen at state i.
    N = len(pi)
    P = np.zeros((N,N))
    C = np.zeros(N)
    for i,a in enumerate(pi):
        P[i] = Pset[a][i]
        C[i] = Cset[a][i]
    return P,C

def average_eval(pi,Pset,Cset):
    # Solve the augmented system [[I - P, 1], [e_d^T, 0]] [h; J] = [C; 0],
    # where e_d pins the bias of a random distinguished state to zero.
    P,C = load_policy(pi,Pset,Cset)
    G = np.concatenate((C,[0])) # last entry enforces h(distinguished) = 0
    N = len(pi)
    I = np.eye(N)
    distinguished_state = np.random.randint(0,N)
    A = np.zeros((N+1,N+1))
    A[-1,distinguished_state] = 1
    A[:-1,-1] = 1
    A[:-1,:-1] = I - P
    sol = np.linalg.solve(A,G)
    bias = sol[:-1]
    gain = sol[-1]
    return gain,bias,P,C

def policy_improvement(h,Pset,Cset):
    # Greedy step on the bias: pi(i) = argmin_a C_a(i) + (P_a h)(i).
    N = len(h)
    A = len(Pset)
    Q = np.zeros((A,N))
    for a in range(A):
        Q[a] = Cset[a] + Pset[a].dot(h)
    pi = np.argmin(Q,axis=0)
    return pi

def average_PI(Pset,Cset,pi0=None,maxiter=100,progress=True):
    if progress:
        track = tqdm
    else:
        track = lambda x: x # identity
    N = len(Cset[0])
    A = len(Cset)
    if pi0 is None:
        pi0 = np.random.randint(0,A,size=N)
    history = []
    for niter in track(range(maxiter)):
        gain,bias,_,_ = average_eval(pi0,Pset,Cset)
        history.append([pi0,gain,bias])
        pi1 = policy_improvement(bias,Pset,Cset)
        if np.all(pi0==pi1): # terminate once the policy is stable
            break
        pi0 = pi1
    return pi0,history,niter
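As a quick sanity check on the evaluation step (a standalone sketch that re-implements the linear system rather than reusing average_eval above), the gain of a fully recurrent chain should equal the stationary distribution dotted with the cost vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random fully recurrent chain (all entries positive) and cost vector.
n = 4
P = rng.dirichlet(np.ones(n), size=n)
C = rng.uniform(0, 10, size=n)

# Gain via the augmented linear system, with state 0 distinguished.
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - P
A[:n, -1] = 1.0
A[-1, 0] = 1.0
gain = np.linalg.solve(A, np.concatenate((C, [0.0])))[-1]

# Gain via the stationary distribution: solve mu P = mu, mu.sum() = 1,
# using the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

print(np.isclose(gain, mu @ C))  # True
```

Any fully recurrent chain works here; the agreement of the two computations is the sanity check.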

Here are the experiment and its helper functions:

def get_gain_optimal_policies(history):
    gains = []
    for h in history:
        gains.append(h[1])
    optimal_gain = gains[-1] # the final gain; also the most bias-optimal
    gains = np.array(gains)
    # compare with a tolerance rather than exact float equality
    idxs = np.where(np.isclose(gains,optimal_gain))[0]
    policies = []
    for idx in idxs:
        policies.append(history[idx][0])
    return policies,gains


def search_largest_gain_set(N,A,transient,n_samples=1000):
    gain_set_sizes = np.zeros(n_samples,dtype=int)
    for i in tqdm(range(n_samples)):
        Pset,Cset = get_random_MDP(N,A,transient,progress=False)
        _,history,_ = average_PI(Pset,Cset,progress=False)
        gain_optimal_policies,_ = get_gain_optimal_policies(history)
        gain_set_sizes[i] = len(gain_optimal_policies)
    return gain_set_sizes

if __name__ == '__main__':
    N = 500
    A = 5
    # Unichain search
    G1 = search_largest_gain_set(N,A,transient=N-2,n_samples=1000)
    # Recurrent chain search
    G2 = search_largest_gain_set(N,A,transient=0,n_samples=1000)
    print('\nLargest gain-optimal policy set:')
    print('Unichain: ',G1.max())
    print('Recurrent: ',G2.max())

Note that for the uni-chain MDPs, the recurrent class consists of only two states. Why? They are responsible for the gain, and a small recurrent class means that policy iteration very quickly optimises their actions so as to find the optimal gain. Thereafter, it can start extending $\Pi_2$ as it optimises actions over the transient states. A large number of transient states allows more possible combinations of actions over them and hence a larger $\Pi_2$.

Lastly, note that G1 and G2 are arrays, so you can plot histograms of them if you are interested in a distribution over the sizes of $\Pi_2$. I don't know what application that might have, but it seems interesting nonetheless.

My results with np.random.seed(123) follow:

100%|██████████| 1000/1000 [04:08<00:00,  4.03it/s]
100%|██████████| 1000/1000 [04:15<00:00,  3.91it/s]
Largest gain-optimal policy set:
Unichain:  5
Recurrent:  1