Deriving Properties of Markov Chains


I had the following question about Discrete Time Markov Chains.

Suppose we define a Discrete Time Markov Chain as a stochastic process $\{X_n : n = 0, 1, 2, \dots\}$ such that, for any states $i_0, i_1, \dots, i_s$ and any time steps $t > s$, the probability of being in a state $j$ at time $t$ given the history of the process up to time $s$ depends only on the state $i_s$ at time $s$:

$$ P(X_t = j \mid X_s = i_s, X_{s-1} = i_{s-1}, \dots, X_0 = i_0) = P(X_t = j \mid X_s = i_s) $$

where $P(X_t = j \mid X_s = i)$ is the probability of the process being in state $j$ at time $t$ given that it was in state $i$ at time $s$.

For simplicity's sake, let's say that this Markov Chain is time-homogeneous: for any states $i$ and $j$, the probability of transitioning from state $i$ to state $j$ in one step depends only on the states $i$ and $j$, not on the time $t$:

$$ P(X_{t+1} = j | X_t = i) = P_{ij} $$

where $P_{ij}$ denotes the transition probability from state $i$ to state $j$. For this Markov Chain, we can define the corresponding transition matrix as follows:

$$ P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix} $$

where:

  • $P_{ij} = \frac{n_{ij}}{\sum_{k=1}^n n_{ik}}$, with $n_{ij}$ the number of observed transitions from state $i$ to state $j$ in the data.
  • All entries of $P$ are nonnegative: $P_{ij} \geq 0$ for all $i, j$.
  • The row sums of $P$ are equal to 1: $\sum_{j=1}^n P_{ij} = 1$ for all $i$.
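As a concrete illustration of the $P_{ij} = n_{ij} / \sum_k n_{ik}$ construction, here is a minimal Python sketch (the function name and example sequence are my own, not from the post) that estimates the transition matrix from an observed state sequence by counting transitions and normalizing each row:

```python
import numpy as np

def estimate_transition_matrix(states, n_states):
    """Estimate P_ij = n_ij / sum_k n_ik from an observed state sequence."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # guard states that were never left (their row stays all zeros)
    row_sums[row_sums == 0] = 1
    return counts / row_sums

seq = [0, 1, 1, 0, 1, 2, 0, 0, 1]
P_hat = estimate_transition_matrix(seq, 3)
print(P_hat)  # each row sums to 1 (or is all zeros if state i was never left)
```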

My Question: For a long time, I have been interested in how to calculate properties of these transition probabilities $P_{ij}$. For example, suppose we observe some data and fit a Discrete Time Markov Chain to it (i.e. estimate the transition probabilities). How do we then find the variance and confidence intervals for $\hat{P}_{ij}$ (the estimate of $P_{ij}$)?

Intuitively, I think the $\hat{P}_{ij}$ behave like "dice rolls": given that you have just rolled a number $i$, what is the probability of next rolling some number $j$ out of a total of $n$ possible numbers? If this is the case, it reminds me of the Multinomial Probability Distribution:

$$ P(X_1 = x_1, X_2 = x_2, ..., X_k = x_k) = \frac{n!}{x_1!x_2!...x_k!} p_1^{x_1} p_2^{x_2} ... p_k^{x_k} $$

where:

  • There are $k$ possible outcomes with probabilities $p_1, p_2, ..., p_k$
  • And $X_1, X_2, ..., X_k$ are the number of times each outcome occurs in $n$ trials
  • And $\sum_{i=1}^k x_i = n$, with each $x_i$ a nonnegative integer.

As the Multinomial Distribution is a well-studied distribution, we know that:

  • The Mean is $\operatorname{E}[X_i] = n p_i$
  • The Variance is $\operatorname{Var}[X_i] = n p_i (1 - p_i)$.
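These moments are easy to check numerically. The sketch below (numbers are arbitrary, chosen for illustration) draws multinomial samples with NumPy and compares the empirical mean and variance of one component against $n p_i$ and $n p_i (1 - p_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.2, 0.3, 0.5])

# 200k samples of the count vector (X_1, X_2, X_3)
draws = rng.multinomial(n, p, size=200_000)

print(draws[:, 0].mean())  # ~ n * p_1 = 20
print(draws[:, 0].var())   # ~ n * p_1 * (1 - p_1) = 16
```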

Thus, if I were to extend this logic to Markov Chains, I would conclude that the 95% Confidence Interval for any given $\hat{P}_{ij}$ (using the Wald Confidence Interval) can be given by:

$$ \text{95% CI for } \hat{p}_{ij} = \hat{p}_{ij} \pm 1.96\sqrt{\frac{\hat{p}_{ij}\cdot (1-\hat{p}_{ij})}{n_{i}}} $$

where:

  • $\hat{p}_{ij}$ is the estimate (e.g. the Maximum Likelihood Estimate) of $P_{ij}$
  • $n_i$ is the number of times the Markov Chain left state $i$
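A minimal sketch of this interval in Python (the function name and the counts are made up for illustration):

```python
import math

def wald_ci(n_ij, n_i, z=1.96):
    """Wald CI for p_ij, given n_ij transitions i -> j out of n_i exits from i."""
    p_hat = n_ij / n_i
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_i)
    return p_hat - half, p_hat + half

lo, hi = wald_ci(n_ij=30, n_i=100)
print(lo, hi)  # roughly (0.21, 0.39)
```

Note that the Wald interval is known to behave poorly when $n_i$ is small or $\hat{p}_{ij}$ is near 0 or 1; the Wilson score interval is a common alternative in those regimes.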

I am not sure if my logic is correct - can someone please tell me if I am correctly applying the mathematical logic here?

Thanks!

Accepted Answer:

Your intuition is completely correct, with some minor mathematical incompleteness, part of which involves assumptions on your chain, namely that it is irreducible and finite.

You propose a common-sense estimator for $p_{i,j}$, namely $\hat{p}_{i,j} = N_{i,j} / N_i$. This is in fact the maximum likelihood estimator (see links below). The key to concluding your calculation is to observe that you want to compute the variance of your estimator conditioned on $N_i$, that is: \begin{align*}\mathrm{Var}(\hat{p}_{i,j} | N_i = n_i) &= \mathrm{Var}\left( \frac{N_{i,j}}{N_i} \bigg | N_i = n_i \right)\\ &= \frac{1}{n_i^2} \mathrm{Var}(N_{i,j} | N_i = n_i) \\ &= \frac{1}{n_i^2} n_i p_{i,j} ( 1- p_{i,j}). \end{align*} I did this explicitly, even though you already did, because I wanted to be clear about where the factor of $n_i$ in the denominator comes from.
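This conditional variance can be sanity-checked numerically: conditioned on $N_i = n_i$, the count $N_{i,j}$ is Binomial$(n_i, p_{i,j})$, so one can draw many such batches and compare the empirical variance of $\hat{p}_{i,j}$ with $p_{i,j}(1 - p_{i,j})/n_i$. A sketch, with arbitrary numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
p_ij, n_i, reps = 0.3, 500, 100_000

# each draw: N_ij | N_i = n_i, i.e. how many of n_i exits from i went to j
hits = rng.binomial(n_i, p_ij, size=reps)
p_hat = hits / n_i

print(p_hat.var())              # empirical variance of the estimator
print(p_ij * (1 - p_ij) / n_i)  # theory: 0.00042
```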

Conditional probabilities can be tricky and some mathematical rigor is required, but one cannot abandon common sense entirely. As you say, in real life, you know how many times you visited state $i$, namely $n_i$ (at least for the simple, finite state space models that you consider here, in which you know all the states). A key observation is that each time you are at $i$, you have a new chance to move to any $j$, completely independently of everything that happened in the past. In other words, for each state $i$, you have independent and identically distributed samples from a multinomial distribution. This is a consequence of the (strong) Markov property.

So, you can pretend you have no idea about the Markov chain, and that you are observing exits from a specific state $i$, which you have $n_i$ of. It follows that indeed, $$\frac{\hat{p}_{i,j} - p_{i,j}}{\sqrt{p_{i,j}(1-p_{i,j})/n_i}} \Rightarrow N(0, 1)$$ as $n_i \rightarrow \infty$, where $\Rightarrow$ denotes convergence in distribution (see for instance Proposition 6.3 part 2. in these notes). It can also be shown that $N\pi_i / N_i \rightarrow 1$ almost surely, where $N$ is the total number of moves of the chain and $\pi_i$ is the stationary distribution of the chain. You can put confidence intervals as you said.
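The claim $N\pi_i / N_i \rightarrow 1$ is easy to check numerically for a small chain: compute $\pi$ as the left eigenvector of $P$ for eigenvalue 1, simulate the chain, and compare visit frequencies. A sketch with an arbitrary two-state chain (not from the answer):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# stationary distribution: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print(pi)  # [0.8, 0.2] for this chain

# simulate and compare N_i / N with pi_i
rng = np.random.default_rng(2)
N, x, visits = 50_000, 0, np.zeros(2)
for _ in range(N):
    visits[x] += 1
    x = rng.choice(2, p=P[x])
print(visits / N)  # close to pi
```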

There are also plenty of references on this on the web.

Since you commented on a stats.stackexchange question on this topic back in December 2022, and have asked a similar question again, is there something else that needs to be clarified?

Edit:

To clarify some of the discussion in the comments, I attach a little example. Consider the three-state Markov chain on $\{0, 1, 2\}$ with the transition matrix $P$ given in the code below (the original answer included a transition diagram here).

Suppose that $p_{0,2} = 0$, $p_{1,2} = \varepsilon \ll 1$, so that it takes many steps to get to state 2 starting from 0 or 1. @Adam points out that the notes I linked above take care to establish a Central Limit Theorem with $N$ denoting the number of steps of the whole Markov Chain (see the use of Anscombe's lemma). To use the form of the CLT described in the notes, one must justify the approximation $N_i \approx N\pi_i$, where $\pi$ denotes the stationary distribution.

But this isn't necessary if you just want to put confidence intervals on, say, $\hat{p}_{0,0}$. In fact, you can put confidence intervals on $\hat{p}_{0,0}$ without ever having visited state 2! Given $N_0$ visits to $0$, you can apply the classical CLT to just the i.i.d. sequence of transitions $T_k = 1_{\{(0,0)\}}(X_{\tau_k}, X_{\tau_k+1})$, $k=0, \dots, N_0-1$. Notice that I make no claims about $\hat{p}_0$, the probability of being at 0.

Here's a quick and dirty implementation of this in Python. Caveat emptor: I scribbled this code very quickly, so there may be mistakes. I'm also applying an "engineer's" version of error bars, i.e. $\pm 2\sigma/\sqrt{n}$.


import numpy as np
import matplotlib.pyplot as plt

eps = 1e-30
# define the transition matrix with epsilon
# so small that we never visit state 2
P = np.array(
    [[0.3, 0.7, 0],
     [0.6, 0.4 - eps, eps],
     [0, eps, 1 - eps]])

# simulate the chain
n = 10000
X = [0]
for i in range(n):
    new_x = np.random.choice([0, 1, 2], p=P[X[-1]])
    X.append(new_x)
X = np.array(X)

# for each visit to state 0, record whether the next step stayed at 0
at_0 = X[:-1] == 0
transition_00 = (X[:-1] == 0) & (X[1:] == 0)
N_00 = transition_00[at_0]
N_0 = np.arange(len(N_00)) + 1

# running estimate \hat{p}_{0,0}; sigma is the sample standard deviation,
# roughly sqrt(p_00 * (1 - p_00)) ~= sqrt(0.3 * 0.7)
p_00 = np.cumsum(N_00) / N_0
sigma = np.sqrt(N_00.sum() / (N_0[-1] - 1) - p_00[-1]**2)

plt.plot(N_0, p_00)
plt.fill_between(N_0, p_00[-1] + 2*sigma/np.sqrt(N_0),
                 p_00[-1] - 2*sigma/np.sqrt(N_0), alpha=0.3)
plt.xlabel('# times at 0', fontsize=20)
plt.ylabel(r"$\hat{p}_{0,0}$", fontsize=20)
plt.axis([100, len(N_0), 0.2, 0.4])
plt.show()

Is this method optimal? Possibly not. Are the error bars meaningful? Yes.

I suppose the final thing to address in this story is: if making error bars for transition probabilities looks so simple, why is this so hard in practical problems involving, say, MCMC? One reason is that in hard sampling problems, you don't even know all the states, you don't quite know which states you transition to, and you certainly don't know whether the part of the Markov chain you're exploring is where the action is, or whether there is some distant part of the chain (such as state 2) that holds more overall mass (as measured by $\pi$) and is as yet unexplored. The inability to quantify the unexplored $\pi$-volume is, in a sense, the insurmountable problem of sampling.

Another Answer:

See [1], where this and related problems are discussed for reversible Markov chains, and [2] for a follow-up in the non-reversible case.

[1] Hsu, Daniel, Aryeh Kontorovich, David A. Levin, Yuval Peres, Csaba Szepesvári, and Geoffrey Wolfer. "Mixing time estimation in reversible Markov chains from a single sample path." Annals of Applied Probability 29, no. 4 (2019): 2439-2480. https://projecteuclid.org/journals/annals-of-applied-probability/volume-29/issue-4/Mixing-time-estimation-in-reversible-Markov-chains-from-a-single/10.1214/18-AAP1457.pdf

[2] Wolfer, Geoffrey, and Aryeh Kontorovich. "Estimating the mixing time of ergodic Markov chains." In Conference on Learning Theory, pp. 3120-3159. PMLR, 2019.