How to test whether a 0-1 series is "Bernoulli or Markov distributed"?


I'm currently studying for a stat exam and in a list of sample questions that came up in previous exams I stumbled upon this problem which I can't solve:

  • Given a series of 0s and 1s, how can one test whether the series is Bernoulli or Markov distributed?

I've never actually heard the term "Markov distribution" before, but I guess the question asks whether the series consists of independent Bernoulli trials or whether there is some dependency, with the Markov property being the weakest kind.

We've covered nonparametric tests (e.g. chi-squared) for specific distributions, but if I'm not mistaken, they rest on an i.i.d. assumption. Since the question is exactly whether that assumption holds, those tests are no good here.

Is there a common statistical test that does the trick or how should I approach this problem?

I think your supposition is correct, but the problem is not crisply stated. In a two-state Markov Chain (MC) let the conditional change probabilities be $\alpha = P(X_i = 1|X_{i-1} = 0)$ and $\beta = P(X_i = 0|X_{i-1} = 1)$. With this notation, if $\alpha = \beta = 1/2,$ then the MC is not distinguishable from an independent Bernoulli sequence.
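With this parameterization the chain is easy to simulate; here is a minimal R sketch (the helper name `sim_mc` and its defaults are mine, for illustration only):

```r
# Simulate a two-state Markov chain with
# alpha = P(X_i = 1 | X_{i-1} = 0) and beta = P(X_i = 0 | X_{i-1} = 1).
# (Helper name 'sim_mc' is mine, not part of the answer.)
sim_mc <- function(n, alpha, beta, x1 = 0) {
  x <- numeric(n)
  x[1] <- x1
  for (i in 2:n) {
    p1 <- if (x[i-1] == 0) alpha else 1 - beta  # P(next state is 1)
    x[i] <- rbinom(1, 1, p1)
  }
  x
}
# With alpha = beta = 1/2, sim_mc(n, 1/2, 1/2) is an i.i.d. Bernoulli(1/2) series.
```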

But if the MC is not trivial in this sense, and does have a dependent structure, then you might be able to detect it using a 'runs test' or by looking at autocorrelations.

Runs test. In an independent Bernoulli sequence with $P(X_i = 0) = P(X_i = 1) = 1/2,$ runs of $0$'s should average 2 in length, and similarly for runs of $1$'s. So by looking at the average lengths of runs, you can detect dependent structure. See this page from the NIST Engineering Statistics Handbook. (Before revision, I suggested the Wikipedia page on this topic, but then I read it carefully, found it cryptic and possibly misleading, and regretted that suggestion.)
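As a quick sanity check of the "average length 2" claim, R's `rle()` gives the run lengths directly (the seed below is my own choice, just for reproducibility):

```r
set.seed(2025)                    # seed chosen here just for reproducibility
x <- rbinom(1000, 1, 1/2)         # independent Bernoulli(1/2) sequence
r <- rle(x)                       # run values and run lengths
mean(r$lengths[r$values == 0])    # average run length of 0's, should be near 2
mean(r$lengths[r$values == 1])    # average run length of 1's, should be near 2
```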

Autocorrelation function (ACF). The (estimated) autocorrelation at lag $\ell$ of a sequence is (roughly) the correlation of the subsequence $x_1, x_2, \dots, x_{n-\ell}$ with the subsequence $x_{\ell+1}, x_{\ell+2}, \dots, x_n.$ (The quibble is that the mean and SD of the whole sequence are used in computing the correlation.) The ACF is the sequence of autocorrelations for lags $\ell = 0, 1, 2, \dots.$
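To make that definition concrete, the lag-$\ell$ autocorrelation can be computed "by hand" with the overall mean and variance, and it agrees with what `acf()` reports (the helper name `acf_lag` is mine):

```r
# Lag-ell sample autocorrelation, using the mean of the *whole* sequence,
# exactly as R's acf() does. (Helper name 'acf_lag' is mine.)
acf_lag <- function(x, ell) {
  n <- length(x); m <- mean(x)
  sum((x[1:(n - ell)] - m) * (x[(ell + 1):n] - m)) / sum((x - m)^2)
}
# acf_lag(x, 1) matches acf(x, plot = FALSE)$acf[2], and so on.
```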

Here is an example for a sequence from a MC simulated in R statistical software.

n = 1000;  x = numeric(n)
x[1] = 0  # start chain in state 0 at step 1
for (i in 2:n) {
  if (x[i-1]==0) x[i] = rbinom(1, 1, 1/3)  # P(X_i = 1 | X_{i-1} = 0) = 1/3
  else           x[i] = rbinom(1, 1, 2/3)  # P(X_i = 1 | X_{i-1} = 1) = 2/3
}
acf(x)  # plot the sample ACF

Of course, the autocorrelation for lag 0 is unity. The ACF plot shows significant autocorrelations for lags 1 and 2 (bars outside the dotted blue boundaries). Autocorrelations for larger lags are not significantly different from 0, because Markov dependence on a state several steps back "wears away" as the chain reaches steady state.

[Figure: ACF plot of the simulated sequence, with significant bars at lags 1 and 2.]

There are several relevant Wikipedia pages; you could google 'autocorrelation test', 'autocorrelation function', and so on. Autocorrelation methods are widely used in a number of fields, including economics.

Note: These 1000 values from a simulated MC (about half 0's and half 1's) happened to have 350 runs, whereas we would expect about 500 runs in a Bernoulli sequence of fair-coin tosses; 350 is far fewer than expected, so we would reject the null hypothesis of independent trials. See the NIST page linked above for details. (In R, one can extract the runs with rle(x), where 'rle' stands for 'run length encoding'; type help(rle) at the R prompt for details.)
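To carry that comparison out formally, the runs-test z statistic from the NIST handbook takes only a few lines (the helper name `runs_z` is mine; the expected-runs and variance formulas follow the NIST page):

```r
# Runs-test z statistic, using the expected-runs and variance formulas
# from the NIST handbook. (Helper name 'runs_z' is mine.)
runs_z <- function(x) {
  n1 <- sum(x == 1); n2 <- sum(x == 0); n <- n1 + n2
  R  <- length(rle(x)$lengths)              # observed number of runs
  ER <- 2 * n1 * n2 / n + 1                 # expected runs under independence
  VR <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))
  (R - ER) / sqrt(VR)
}
# |runs_z(x)| > 1.96 rejects independence at roughly the 5% level
# (too few runs suggests clumping; too many suggests alternation).
```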