Let $X=(X_t)_{t=0}^\infty$ be an irreducible and aperiodic Markov chain on a finite state space $\Omega$. We know that $X$ admits a unique stationary state $\pi$. For each $x\in \Omega$, let $p_x^{(t)}$ denote the distribution of $X_t$ given that the starting state is $x$. In other words, $p_x^{(t)}(y)$ is the probability of landing on $y$ at the $t$-th step if we had started at $x$.
We define $\Delta_x(t)=\|p^{(t)}_x-\pi\| = \frac{1}{2}\sum_{y\in \Omega} |p^{(t)}_x(y)-\pi(y)|$. That is, $\Delta_x(t)$ is the variation distance between $p^{(t)}_x$ and $\pi$.
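The variation distance is easy to compute directly; here is a minimal sketch (the helper name `tv_distance` and the two-state example are my own, not from the text):

```python
# Total variation distance between two distributions on a finite
# state space, each given as a dict mapping state -> probability.
def tv_distance(p, q):
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

# Two-state example: ||(0.9, 0.1) - (0.5, 0.5)|| = 0.5 * (0.4 + 0.4) = 0.4
p = {"a": 0.9, "b": 0.1}
q = {"a": 0.5, "b": 0.5}
print(tv_distance(p, q))  # 0.4
```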
Fact. $\Delta_x(t)$ is a non-increasing function of $t$.
Writing $P$ to denote the transition matrix of the Markov chain, the above fact is easy to see by a simple application of the triangle inequality to $\Delta_x(t+1) = \|p^{(t)}_x P - \pi P\|$.
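One can also check this contraction numerically: multiply the current distribution by $P$ and watch the distance to $\pi$. A minimal sketch, with an assumed 2-state chain (the matrix and $\pi$ below are my own example, not from the text):

```python
# One step of the chain: row vector times transition matrix P.
def step(dist, P):
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Total variation distance between two distributions given as lists.
def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

P = [[0.7, 0.3],
     [0.4, 0.6]]
pi = [4/7, 3/7]      # stationary: pi P = pi for this P
p = [1.0, 0.0]       # chain started at state 0

d_prev = tv(p, pi)
for _ in range(5):
    p = step(p, P)
    d = tv(p, pi)
    assert d <= d_prev + 1e-12   # Delta_x(t) never increases
    d_prev = d
```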
But my question is the following.
In this PDF, the above fact is proved in Claim 3.8 using a coupling argument. I was able to follow the argument formally, but can somebody please provide some intuition behind it?
Claim 3.8 $\Delta_x(t)$ is non-increasing in $t$.
Proof: Let $X_0 = x$ and $Y_0$ have the stationary distribution $\pi$. We fix $t$ and couple the distributions of the random variables $X_t$ and $Y_t$ such that $\Pr[X_t \ne Y_t] = \| p_x^{(t)} - \pi\| = \Delta_x(t)$, which is possible because of the Coupling Lemma. We now use this coupling to define a coupling of the distributions of $X_{t+1}$ and $Y_{t+1}$ as follows:
- If $X_t = Y_t$, then set $X_{t+1} = Y_{t+1}$.
- Otherwise, let $X_t \to X_{t+1}$ and $Y_t \to Y_{t+1}$ independently.
Then we have $$\Delta_x(t+1) \equiv \|p_x^{(t+1)} - \pi\| \le \Pr[X_{t+1} \ne Y_{t+1}] \le \Pr[X_t \ne Y_t] = \Delta_x(t).$$ The first inequality holds because of the Coupling Lemma, and the second inequality is true by the construction of the coupling.
The intuition behind the Coupling Lemma is that $\Delta_x(t) = \|p_x^{(t)} - \pi\|$ is the smallest that the probability $\Pr[X_t \ne Y_t]$ can get for any two random variables such that $X_t$ has the $p_x^{(t)}$ distribution and $Y_t$ has the $\pi$ distribution.
When we generate such a coupling in this case, you shouldn't think of $X_t$ and $Y_t$ as being the states of the $t^{\text{th}}$ step of some Markov chain. $X_t$ is just a random state sampled from the $p_x^{(t)}$ distribution, and $Y_t$ from the $\pi$ distribution. They're coupled so as to be equal with the highest probability possible. (Just pick $X_t = Y_t = s$ with probability $\min\{p_x^{(t)}(s), \pi(s)\}$ for each state $s$, and then fill in the remaining probability mass arbitrarily to make $X_t$ and $Y_t$ have the right marginal distributions.)
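The parenthetical recipe above can be sketched in code. This is an illustrative assumption of how one might fill in the leftover mass (drawing the two leftovers independently); the function name and the example distributions are my own:

```python
import random

# Maximal coupling of two distributions p and q (dicts state -> prob):
# with probability min(p(s), q(s)) output X = Y = s; otherwise draw X
# and Y from the normalized leftover masses, which have disjoint
# supports, so X != Y in that branch.
def sample_maximal_coupling(p, q, rng=random):
    states = list(set(p) | set(q))
    overlap = {s: min(p.get(s, 0.0), q.get(s, 0.0)) for s in states}
    stay = sum(overlap.values())          # = 1 - ||p - q||
    if rng.random() < stay:
        s = rng.choices(states, weights=[overlap[s] for s in states])[0]
        return s, s
    wx = [p.get(s, 0.0) - overlap[s] for s in states]
    wy = [q.get(s, 0.0) - overlap[s] for s in states]
    return rng.choices(states, weights=wx)[0], rng.choices(states, weights=wy)[0]

p = {"a": 0.9, "b": 0.1}
q = {"a": 0.5, "b": 0.5}
rng = random.Random(0)
draws = [sample_maximal_coupling(p, q, rng) for _ in range(20000)]
print(sum(x != y for x, y in draws) / 20000)   # close to ||p - q|| = 0.4
```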
We use the probabilities from $p_x^{(t)}$ and $\pi$ to put two tokens $X$ and $Y$ down on the Markov chain. Then they take a step, following the Markov chain rules, and also following the rule that if they start in the same state, they get stuck together.
This guarantees that $\Pr[X_{t+1} \ne Y_{t+1}] \le \Pr[X_t \ne Y_t]$: tokens that are stuck together stay together, while separated tokens may or may not meet. Since the Coupling Lemma gives $\Delta_x(t+1) \le \Pr[X_{t+1} \ne Y_{t+1}]$, we conclude that $\Delta_x(t+1) \le \Delta_x(t)$.
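The two-token picture can be simulated directly. A minimal sketch, under an assumed 2-state chain (the transition probabilities and starting states are my own illustration): pairs that start apart either coalesce or stay apart, and once stuck together they never separate, so the empirical fraction of mismatched pairs can only go down.

```python
import random

# Transition rules of an assumed 2-state chain: state -> [(next, prob), ...]
P = {"a": [("a", 0.7), ("b", 0.3)],
     "b": [("a", 0.4), ("b", 0.6)]}

def step(s, rng):
    states, weights = zip(*P[s])
    return rng.choices(states, weights=weights)[0]

rng = random.Random(1)
trials = 5000
pairs = [("a", "b")] * trials        # start the two tokens apart
mismatch = []
for t in range(6):
    mismatch.append(sum(x != y for x, y in pairs) / trials)
    new_pairs = []
    for x, y in pairs:
        if x == y:                   # stuck together: move as one token
            nx = step(x, rng)
            new_pairs.append((nx, nx))
        else:                        # apart: move independently
            new_pairs.append((step(x, rng), step(y, rng)))
    pairs = new_pairs
print(mismatch)                      # a non-increasing sequence
```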