Solving for an optimal policy in an infinite-horizon and finite-state-space MDP with expected average reward


I've just jumped into this area, and I was told that the average reward criterion can make things somewhat more difficult compared to the usual discounted reward criterion. I would really appreciate it if someone could answer the following questions and, ideally, point me to some extra material in addition to Puterman's book "Markov Decision Processes: Discrete Stochastic Dynamic Programming" (which I'm struggling with). I'd also be glad if you could correct any misunderstandings I might have.

In particular, I have an MDP with a finite state space, infinite horizon, stationary transition probabilities, and stationary, bounded rewards. Given a policy $\pi$ and the reward $r_{t,\pi}$ received at time $t$ in the state visited under that policy, the policy evaluation function $f$ takes the form $$f(\pi) = \mathbb{E} \left[ \liminf_{T\rightarrow\infty} \frac{1}{T} \sum_{t=1}^T r_{t,\pi} \right],$$ and I seek an optimal policy $\pi^* = {\arg\max}_{\pi\in\Pi} \ f(\pi)$ over the set $\Pi$ of all admissible policies.
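To make sure I understand the criterion, here is how I would evaluate $f(\pi)$ for one fixed policy on a toy two-state chain of my own invention (the matrix `P` and rewards `r` below are made-up numbers, not from Puterman). I'm assuming the induced chain is unichain, so the $\liminf$ is an actual limit and equals the stationary distribution dotted with the rewards:

```python
import numpy as np

# Made-up example: a fixed policy induces this transition matrix P
# and these per-state rewards r on a two-state chain.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 4.0])

# Under the unichain assumption, the long-run average reward of the
# policy is mu @ r, where mu is the stationary distribution solving
# mu P = mu with sum(mu) = 1. Stack the constraints and least-squares.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

avg_reward = mu @ r
print(mu, avg_reward)  # mu = [5/6, 1/6], average reward = 1.5
```

Please correct me if this stationary-distribution reading of the criterion is only valid under recurrence assumptions I'm glossing over.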

My questions are as follows.

  1. Is my MDP recurrent if it structurally looks like this image: MDP structure? That is, every state has a non-zero probability of being revisited, unless the policy deliberately avoids the actions that lead back to certain states.

  2. Do the usual value iteration and policy iteration algorithms for MDPs with the discounted reward criterion still work on this problem? If not, could you tell me, or point me to material on, the algorithm that I need?
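Regarding question 2, my current (possibly wrong) understanding is that plain value iteration diverges without a discount factor, and that one instead subtracts a running gain estimate each sweep (what I've seen called relative value iteration). Here is my attempt at a sketch on a made-up 2-state, 2-action MDP; the arrays `P`, `R` and the reference state are my own choices, not from any source:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] = transition probability, R[a, s] = reward.
P = np.array([
    [[0.9, 0.1], [0.5, 0.5]],   # action 0
    [[0.2, 0.8], [0.1, 0.9]],   # action 1
])
R = np.array([
    [1.0, 4.0],   # action 0
    [0.0, 2.0],   # action 1
])

ref = 0            # reference state whose relative value is pinned to 0
h = np.zeros(2)    # relative value (bias) estimates
for _ in range(1000):
    Q = R + P @ h          # Q[a, s] = R[a, s] + sum_s' P[a, s, s'] h[s']
    h_new = Q.max(axis=0)  # greedy one-step lookahead
    gain = h_new[ref]      # current estimate of the optimal average reward
    h_new = h_new - gain   # subtract the gain so h stays bounded
    if np.max(np.abs(h_new - h)) < 1e-10:
        break
    h = h_new

policy = Q.argmax(axis=0)  # greedy policy at convergence
print(gain, policy)
```

Is this the right kind of algorithm for my setting, and does its convergence require the recurrence property I asked about in question 1?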

Many thanks in advance!