It is well known that a (finite-state, finite-action, discrete-time) MDP admits an optimal policy that is memoryless and deterministic (sometimes called pure).
The proof of this fact for the discounted-reward criterion is not trivial, but not too complicated. However, for limit-average (a.k.a. mean-payoff) reward, the proofs (at least the ones I've seen) are quite involved: they go either through analyzing the Bellman equations and the solutions of an LP, or through a Cesàro limit built on the discounted case.
Additionally, the textbook proof (in Puterman's book) is split into cases based on the structure of the MDP.
My question:
Is there a direct proof that a finite MDP has a pure optimal policy?
If not, is there any reason that the proof would be inherently messy?
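To make the claim concrete for the discounted case, here is a minimal sketch (the MDP, transition/reward tables, and function names are all hypothetical, chosen only for illustration): value iteration converges to the optimal value function, and the greedy policy extracted from it selects a single action per state, i.e. it is memoryless and deterministic.

```python
GAMMA = 0.9  # discount factor (assumed for this sketch)

# Hypothetical 2-state, 2-action MDP:
# P[s][a] = list of (next_state, probability), R[s][a] = immediate reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 1.0}}

def value_iteration(P, R, gamma=GAMMA, tol=1e-10):
    # Iterate the Bellman optimality operator to a fixed point.
    V = {s: 0.0 for s in P}
    while True:
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                   for a in P[s])
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

def greedy_policy(V, P, R, gamma=GAMMA):
    # The argmax picks exactly one action per state, so the policy is pure.
    return {
        s: max(P[s], key=lambda a: R[s][a]
               + gamma * sum(p * V[t] for t, p in P[s][a]))
        for s in P
    }

V = value_iteration(P, R)
pi = greedy_policy(V, P, R)
print(pi)  # a single action per state: a memoryless, deterministic policy
```

Of course, this only illustrates the discounted case that the question already concedes; the difficulty asked about is precisely that no equally direct argument seems available for limit-average reward.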