What is a mixing condition of an MDP? I'm reading a paper called Experts in a Markov Decision Process, and it says
Before we provide our algorithm, a few definitions are in order. For every stationary policy $\pi(a|s)$, we define $P^{\pi}$ to be the transition matrix induced by $\pi$, where the component $[P^{\pi}]_{s, s'}$ is the transition probability from $s$ to $s'$ under $\pi$. Also, define $d_{\pi, t}$ to be the state distribution at time $t$ when following $\pi$, i.e. $$d_{\pi, t} = d_1(P^{\pi})^t $$ where we are treating $d_1$ as a row vector here.
Assumption 1 (Mixing) We assume the transition model over states, as determined by $\pi$, has a well defined stationary distribution, which we call $d_{\pi}$. More formally, for every initial state $s$, $d_{\pi,t}$ converges to $d_{\pi}$ as $t$ tends to infinity, and $d_{\pi}P^{\pi} = d_{\pi}$. Furthermore, this implies there exists some $\tau$ such that for all policies $\pi$, and distributions $d$ and $d'$, $$|| dP^{\pi} - d'P^{\pi}||_1 \le e^{-1/\tau}||d - d' || _1$$ where $||x||_1$ denotes the $l_1$ norm of a vector $x$. We refer to $\tau$ as the mixing time and assume that $\tau > 1$.
What exactly is the inequality saying? My rough interpretation is that the left-hand side is the distance between the next-step state distributions, while the right-hand side is an exponentially shrunken distance between arbitrary state distributions. How does $\tau$ act as a mixing constant that guarantees such a bound?
As stated, this contraction inequality cannot hold for every MDP: for any $\tau > 0$ we have $e^{-1/\tau} < 1$, so the inequality demands a strict one-step contraction, which some chains do not satisfy. A simple counterexample is an MDP where there is only one possible action at each state (meaning, there are no decisions) and the corresponding Markov chain has three states $\{1, 2, 3\}$ and transition matrix $$P = \begin{bmatrix}0&1&0\\0&0&1\\1/2&0&1/2\end{bmatrix}$$
This is irreducible and aperiodic and has a unique stationary distribution $(1/4 , 1/4 , 1/2)$. Define:
\begin{align} d &= (1 , 0 , 0) \implies dP = (0 , 1 , 0)\\ d' &= (0 , 1 , 0) \implies d' P = (0 , 0 , 1) \end{align} So $$ || d P - d'P||_1 = ||(0 , 1 , -1)||_1 = 2 = ||(1 , -1 , 0) ||_1 = ||d - d'||_1$$ which contradicts the desired contraction inequality $$ ||dP - d'P||_1 \leq \underbrace{e^{-1/\tau}}_{<1} || d - d'||_1 $$
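If it helps, the counterexample is easy to check numerically. This is a quick sketch (helper names like `mat_vec` are mine, not from the paper):

```python
# Verify that the chain above does NOT contract the l1 distance in one step.

def mat_vec(d, P):
    """Row vector times matrix: returns d @ P."""
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(x, y):
    """l1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

P = [[0,   1, 0],
     [0,   0, 1],
     [0.5, 0, 0.5]]

d  = [1, 0, 0]   # dP  = (0, 1, 0)
dp = [0, 1, 0]   # d'P = (0, 0, 1)

print(l1(mat_vec(d, P), mat_vec(dp, P)))  # 2.0  (distance after one step)
print(l1(d, dp))                          # 2    (distance before the step)
```

The distance stays at 2 after one step, so no factor $e^{-1/\tau} < 1$ can bound it.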
We can get the desired contraction inequality if we introduce additional assumptions:
Define a stochastic matrix to be a square matrix with nonnegative entries whose components in each row sum to 1. Let $S = \{1, 2, ..., n\}$ be the state space of our MDP (with $n$ states). Let $\mathcal{P}$ be the class of policies. Suppose there is an $\epsilon \in (0,1]$ and an $n \times n$ stochastic matrix $R=(r_{ij})$ with identical rows (so that $r_{aj} = r_{bj}$ for all $a,b,j \in \{1, ..., n\}$) such that: $$ P_{ij}^{\pi} \geq \epsilon r_{ij} \quad, \forall i, j \in S, \forall \pi \in \mathcal{P} \quad \mbox{(Assumption 1)}$$
We say that a vector $x \in \mathbb{R}^n$ is a probability mass function (PMF) if its components are nonnegative and sum to 1. We treat probability mass functions as $1 \times n$ row vectors.
Claim:
If Assumption 1 holds, then for all $\pi \in \mathcal{P}$ and all PMFs $x$ and $y$, we have: $$ \boxed{||xP^{\pi} - yP^{\pi}||_1 \leq (1-\epsilon)||x-y||_1} $$
Proof:
Fix $\pi \in \mathcal{P}$. Fix $\delta \in (0, \epsilon)$. Define $Q = \frac{1}{1-\delta}(P^{\pi} - \delta R)$ and notice that $Q$ is a stochastic matrix: its entries are nonnegative because $P_{ij}^{\pi} \geq \epsilon r_{ij} \geq \delta r_{ij}$, and each of its rows sums to $\frac{1-\delta}{1-\delta} = 1$. Then $$ P^{\pi} = \delta R + (1-\delta) Q $$ Fix $x, y$ as two PMFs. Then: \begin{align} ||xP^{\pi} - yP^{\pi}||_1 &= ||\delta(x-y)R + (1-\delta)(x-y)Q||_1\\ &\overset{(a)}{=}(1-\delta)||(x-y)Q||_1\\ &\overset{(b)}{\leq} (1-\delta)||x-y||_1 \end{align} where (a) holds because $xR = yR$ (since $R$ has identical rows and the components of $x$ and $y$ each sum to 1); (b) holds because $||zQ||_1\leq ||z||_1$ for any $z \in \mathbb{R}^n$ and any $n\times n$ stochastic matrix $Q$. Taking $\delta\rightarrow\epsilon$ gives the result.
$\Box$
An example of when Assumption 1 holds is when there is an $\epsilon>0$ such that $P_{ij}^{\pi} \geq \epsilon/n$ for all $i,j \in \{1, ..., n\}$ (in which case the $R$ matrix is the matrix composed of all $1/n$ entries). Another example is if $P_{i1}^{\pi} \geq \epsilon$ for all $i \in \{1, ..., n\}$, meaning we go to state 1 next with probability at least $\epsilon$, regardless of the current state (in which case the $R$ matrix has 1s in column 1 and zeros in all other columns).
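Here is a numerical sanity check of the claim for the first example above. The matrix `P` and the value `eps` are arbitrary illustrative choices of mine: every entry of `P` is at least $0.1 = \epsilon/n$ with $\epsilon = 0.3$ and $n = 3$, so Assumption 1 holds with $R$ the all-$1/3$ matrix.

```python
# Check ||xP - yP||_1 <= (1 - eps) ||x - y||_1 over many random PMF pairs,
# for a P satisfying P[i][j] >= eps/n (so Assumption 1 holds with R = all 1/n).
import random

def mat_vec(d, P):
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
eps = 0.3  # every entry of P is >= eps/3 = 0.1

random.seed(0)
for _ in range(1000):
    x, y = random_pmf(3), random_pmf(3)
    assert l1(mat_vec(x, P), mat_vec(y, P)) <= (1 - eps) * l1(x, y) + 1e-12
print("contraction holds on all sampled PMF pairs")
```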
A corollary is that if Assumption 1 holds and if we use the same policy $\pi$ on each slot, then there is a unique stationary PMF $d^{\pi}$ (satisfying $d^{\pi} = d^{\pi} P^{\pi}$) and we converge to $d^{\pi}$ exponentially fast from any initial PMF $x$: $$ ||x(P^{\pi})^k - d^{\pi}||_1 \leq (1-\epsilon)^k||x-d^{\pi}||_1 \quad, \forall k \in \{0, 1, 2, 3, ...\}$$
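A sketch of this corollary, reusing the same illustrative `P` and `eps` (mine, not from the paper): iterate $x \leftarrow x P^{\pi}$ and watch the $l_1$ distance to the stationary PMF shrink by at least a factor $(1-\epsilon)$ per step.

```python
# Check ||x P^k - d_pi||_1 <= (1 - eps)^k ||x - d_pi||_1 for k = 1..10.

def mat_vec(d, P):
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
eps = 0.3  # P[i][j] >= eps/3 for all i, j

# Approximate the stationary PMF by iterating long enough
# (the contraction itself guarantees geometric convergence).
d_pi = [1.0, 0.0, 0.0]
for _ in range(200):
    d_pi = mat_vec(d_pi, P)

x = [0.0, 0.0, 1.0]
dist0 = l1(x, d_pi)
for k in range(1, 11):
    x = mat_vec(x, P)
    assert l1(x, d_pi) <= (1 - eps) ** k * dist0 + 1e-9
print("geometric convergence verified for 10 steps")
```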
Another corollary is that if Assumption 1 holds, then for any two initial PMFs $x$ and $y$ and any sequence of policies $\{\pi_k\}_{k=1}^{\infty}$, we get $$|| xP^{\pi_1} P^{\pi_2} \cdots P^{\pi_k} - y P^{\pi_1}P^{\pi_2}\cdots P^{\pi_k}||_1 \leq (1-\epsilon)^k ||x-y||_1 \leq 2 (1-\epsilon)^k \quad, \forall k \in \{1, 2, 3, …\} $$ so the $k$-step distributions “merge” as $k \rightarrow\infty$ and the effect of the initial conditions decays exponentially fast, regardless of the sequence of policies we use.
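The second corollary can be sketched the same way, now alternating between two transition matrices standing in for two different policies. Both matrices are illustrative choices of mine; each has all entries at least $\epsilon/3$ with $\epsilon = 0.3$, so Assumption 1 holds along the whole sequence.

```python
# Two trajectories started from different PMFs, driven by an alternating
# sequence of transition matrices, still merge geometrically in l1.

def mat_vec(d, P):
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

P1 = [[0.5, 0.3, 0.2],
      [0.1, 0.6, 0.3],
      [0.2, 0.2, 0.6]]
P2 = [[0.2, 0.4, 0.4],
      [0.3, 0.3, 0.4],
      [0.4, 0.5, 0.1]]
eps = 0.3  # both matrices have all entries >= eps/3 = 0.1

x, y = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
d0 = l1(x, y)
for k in range(1, 11):
    P = P1 if k % 2 else P2     # the policy may change every step
    x, y = mat_vec(x, P), mat_vec(y, P)
    assert l1(x, y) <= (1 - eps) ** k * d0 + 1e-9
print("distributions merge under the alternating policy sequence")
```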
In Proposition 2.1 of the following paper, my students give a more general condition than Assumption 1 under which desirable mixing results hold:
https://arxiv.org/abs/1709.03465