Basic setting
I am working with a sequence of random variables $\mathbf{X} := X_1, X_2, \dots$, for which I know the Markov property does not hold exactly, but approximately: $$ \Pr[X_{n+1}=x \mid X_{n}=x_{n}] \approx \Pr[X_{n+1}=x \mid X_1=x_1, \dots, X_{n}=x_{n}] $$
I am approximating $\mathbf{X}$ by a Markov chain $\mathbf{Y}$, given by $$ \Pr[Y_{n+1}=x \mid Y_{n}=x_{n}] := \Pr[X_{n+1}=x \mid X_{n}=x_{n}] $$
Goal
I want to bound the error introduced by approximating $\mathbf{X}$ by $\mathbf{Y}$. One reasonable approach is measuring the KL divergence (I am open to other approaches if needed): $$ D_\text{KL}(\mathbf{X} \| \mathbf{Y}) := \mathbb{E}_{x \sim \mathbf{X}}\left[\log \frac{\Pr[\mathbf{X}=x]}{\Pr[\mathbf{Y} = x]}\right] $$
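To make this concrete, here is a small numerical sketch of the quantity in question. It builds a hypothetical (randomly chosen, hence slightly non-Markov) joint law for three binary variables $(X_1, X_2, X_3)$, constructs the Markov approximation $\mathbf{Y}$ from its one-step conditionals as defined above (assuming $\mathbf{Y}$ starts from the same initial distribution as $\mathbf{X}$), and evaluates $D_\text{KL}(\mathbf{X} \| \mathbf{Y})$ directly:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A hypothetical joint law of (X1, X2, X3) on {0,1}^3; a generic
# random choice, so the process is not exactly Markov.
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)

# Ingredients of the Markov approximation Y:
#   Pr[Y1] := Pr[X1],  Pr[Y_{n+1} = b | Y_n = a] := Pr[X_{n+1} = b | X_n = a]
p1 = p.sum(axis=(1, 2))                     # Pr[X1]
p12 = p.sum(axis=2)                         # Pr[X1, X2]
p23 = p.sum(axis=0)                         # Pr[X2, X3]
t12 = p12 / p12.sum(axis=1, keepdims=True)  # Pr[X2 | X1]
t23 = p23 / p23.sum(axis=1, keepdims=True)  # Pr[X3 | X2]

# Joint law of Y, and the divergence D_KL(X || Y).
q = np.einsum('a,ab,bc->abc', p1, t12, t23)
kl = sum(p[a, b, c] * np.log(p[a, b, c] / q[a, b, c])
         for a, b, c in product(range(2), repeat=3))
print(kl)  # nonnegative; zero exactly when X is Markov
```

For three steps this divergence equals the conditional mutual information $I(X_1; X_3 \mid X_2)$, which is one candidate "fundamental assumption": requiring it to be small says the past beyond the current state carries little extra information.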
Question
Is there a reasonable/interpretable assumption I can place on $\mathbf{X}$ that ensures that $\mathbf{X}$ is close to its Markov chain approximation $\mathbf{Y}$?
Details
- Technically, I am asking for an assumption on $\mathbf{X}$ that implies that $D_\text{KL}(\mathbf{X} || \mathbf{Y})$ is small.
- Of course, I can directly bound $D_\text{KL}(\mathbf{X} || \mathbf{Y})$, but I would prefer to place a more fundamental, simpler assumption on $\mathbf{X}$ that implies small $D_\text{KL}(\mathbf{X} || \mathbf{Y})$.
- Ideally, the assumption should be standard/well-established.
- In my case, I am actually working with a Markov chain of order $m$, but I am assuming any answer for order $1$ can be generalized to order $m$.
- In my case, my Markov chain is finite, but again, I am assuming this does not make much of a difference.
I can't see a simple solution to this: the problems (or perhaps opportunities) arise even in the case when both $X$ and $Y$ are Markov.
For arbitrarily small $\varepsilon$ we can perturb the governing equation of the $X$ chain and get a chain $Y$ whose equation differs at any point by at most $\varepsilon$ (in the transition probabilities, say), yet whose equilibrium distribution makes $D_{KL}( X \| Y )$ as large as desired.
This can be done, for example, by taking realizations of the Metropolis-Hastings algorithm with the same proposal distribution and two completely different target distributions, provided the targets vary slowly enough (e.g. they are Lipschitz with a sufficiently small constant and the proposal distribution decays quickly enough) to ensure that the disturbance in the transition matrix is small enough (cf. Wiki: Metropolis-Hastings algorithm).
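The effect can be checked numerically. The sketch below (my own illustrative construction, not from the text above) builds Metropolis-Hastings kernels on $\{0, \dots, N-1\}$ with the same symmetric $\pm 1$ proposal and two slowly varying targets with opposite trends: the transition matrices are uniformly $O(\varepsilon)$-close, while the equilibrium distributions are nearly disjoint.

```python
import numpy as np

N, eps = 200, 0.05
x = np.arange(N)

def mh_kernel(log_pi):
    """Metropolis-Hastings transition matrix on {0,...,N-1} with a
    symmetric +/-1 random-walk proposal (mass kept in place at the ends)."""
    P = np.zeros((N, N))
    for i in range(N):
        for j in (i - 1, i + 1):
            if 0 <= j < N:
                P[i, j] = 0.5 * min(1.0, np.exp(log_pi[j] - log_pi[i]))
        P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put
    return P

# Two slowly varying targets (log-densities eps-Lipschitz) with opposite trends.
pi1 = np.exp(eps * x); pi1 /= pi1.sum()
pi2 = np.exp(-eps * x); pi2 /= pi2.sum()

P1, P2 = mh_kernel(np.log(pi1)), mh_kernel(np.log(pi2))

# The kernels are uniformly close ...
print(np.abs(P1 - P2).max())          # O(eps)
# ... yet the equilibria are concentrated at opposite ends of the state space.
print(0.5 * np.abs(pi1 - pi2).sum())  # total variation close to 1
```

Detailed balance guarantees that `pi1` and `pi2` really are the stationary distributions of `P1` and `P2`, so this is exactly the situation described: an $O(\varepsilon)$ perturbation of the transition matrix with a completely different equilibrium.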
In summary: if equilibrium behavior is what you are trying to model, even a very small change in the equations can change it completely; if, on the other hand, you are only interested in local (finite-horizon) behaviour, there might be some hope.