Direct proof of the existence of optimal memoryless deterministic policies in MDPs


It is well known that (finite-state, finite-action, discrete-time) MDPs admit an optimal policy that is memoryless and deterministic (sometimes called pure).

The proof of this fact for the discounted-reward criterion is not simple, but not too complicated either. However, for the limit-average (a.k.a. mean-payoff) criterion, the proofs I have seen are quite involved: they go either through an analysis of the Bellman equations and the solutions of an LP, or through a Cesàro-limit argument that reduces to the discounted case.
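For comparison, the discounted-case argument I have in mind is the standard fixed-point one (notation mine): for discount factor $\gamma \in [0,1)$, the Bellman optimality operator is a $\gamma$-contraction, so the optimality equation

$$v^*(s) \;=\; \max_{a \in A(s)} \Big[ r(s,a) \;+\; \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s') \Big]$$

has a unique solution $v^*$, and any policy that, in each state $s$, deterministically picks an action attaining the maximum is memoryless and optimal. No such short contraction argument seems available for the mean-payoff criterion, since the limit-average value is insensitive to any finite prefix of the run.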

Additionally, the textbook proof (I am referring to Puterman's book) is divided into cases based on the chain structure of the MDP.

My question:

Is there a direct proof that a finite MDP has a pure optimal policy?

If not, is there any reason that the proof would be inherently messy?