Let $H_1$ and $H_2$ be two instances of a finite Hidden Markov Model (HMM) $H$. That is, $H_1$ and $H_2$ have identical state spaces $Q$ as well as identical transition $A$ and emission probabilities $B$. However, they have different initial distributions $\pi_1$ and $\pi_2$, respectively. Let $(O_0, O_1, O_2\ldots)$ be a sequence of observations applied to both $H_1$ and $H_2$.
Has anyone considered the question how long does it take both $H_1$ and $H_2$ to forget their initial distributions? That means their belief states $b_1: Q \rightarrow [0, 1]$ and $b_2: Q \rightarrow [0, 1]$ become arbitrarily close to each other?
By "arbitrarily close" here I mean "Given any $\epsilon > 0$, what is $N(\epsilon) > 0$ such that, for all observations $O_n, n > N(\epsilon):\ dist(b_1, b_2) < \epsilon$ for some distance function $dist$?"
I stumbled on two papers that might hold clues:
J. Wolfowitz "Products of Indecomposable, Aperiodic, Stochastic Matrices," AMS, 1962.
J. Hajnal, "The ergodic properties of non-homogeneous finite Markov chains," 1956.
(I don't have access to Hajnal's paper)
The main theorem is that:
We may apply this to HMMs as follows: Let $\mathbf{A}$ be the HMM transition matrix (assumed indecomposable and aperiodic). Let $\mathbf{A}^{(k)}$ be the product of $\mathbf{A}$ with the $k^{th}$ column of $\mathbf{B}$, the observation matrix. That is: $\mathbf{A}^{(k)}_{ij} = \mathbf{A}_{ij} \times \mathbf{B}_{jk}$
Then all $\mathbf{A}^{(k)}$'s are normalized to obtain row stochastic matrices (What conditions must $\mathbf{B}$ satisfy so that the normalized $\mathbf{A}^{(k)}$'s are SIA?). This way, given any observation sequence $O_0, O_1, O_2, \ldots$, we obtain that the final belief state is given by:
$\mathbf{b}_f = \mathbf{b}_0 \times \mathbf{A}^{(O_1)} \times \mathbf{A}^{(O_2)} \times \ldots$
which must converge to the unique row of the rank-one matrix $\prod_{i} \mathbf{A}^{(O_i)}$. Note that $\mathbf{b_0}$ is the initial belief state multiplied by probabilities of the initial observation $O_0$ and normalized.
So it seems inhomogeneous Markov chains hold a clue. Maybe someone with a deeper knowledge on these can elaborate the missing conditions on $\mathbf{B}$ or point out flaws of this reasoning, if there is any.