Probability of some die face being missed N or more times in a row in M rolls? [Clarified]

694 Views Asked by At

2015/01/28 Clarification: Question rewritten to remove ambiguity that elicited (interesting) responses to a different problem (use edit should you wish to view).

From Markus' feedback on the earlier question:

Let's consider a die with $F$ faces. This die is rolled $M$ times. What is the probability, that each window of size $N \leq M$ within this series of $M$ rolls contains always all $F$ faces?

Or alternatively, consider an urn with $F$ differently colored balls, where $M$ balls are drawn (with replacement after each draw) and the colors noted. What is the probability given $F$, $M$, and some $N \leq M$ that every consecutive $N$ draws of the $M$ total draws exhibited at least one example each of all $F$ colors?

Obviously, if $M<N$ or $N<F$ the probability is 0. I'm stuck for the $F\leq N \leq M$ cases.

I can do this using recursive calculation or a Markov chain, but only for the most trivial (small) cases, since the state space obviously explodes quite quickly.

I thought of treating it like a runs/streak problem, that is, I thought that if some face/color is seen, followed by $N$ all of differing face/color resulting in a "failure", I could just use existing means of calculating a streak of $N$ events with probability $(faces - 1)/(faces)$ for the event of a differing face/color. I was unable to arrive at a solution this way.

I then thought of treating it as a coupon collector's problem. I can easily calculate the probability of all faces/colors being seen in the first $N$, so I thought using that, the next "window" from 2 to N+1 is a "failure" if those $N$ are all different from the 1st element (probability $((faces - 1)/(faces))^N$), continuing this for each step of the "window" by 1 to get a net probability. I get stuck here with the dependency between successive "windows", and I'm not even sure if it's a valid tack.

I attempted to find a generating function by looking at actual example counts of differing cases, and searching OEIS for sequences along with checking for generating/sequence functions in Mathematica. I found no matches, and I'm a neophyte with generating functions, so I hit a dead end there.

Bottom line, given $F, M, N$, how might I calculate this?

A couple of examples follow to clarify the above problem description.

Given an alphabet of two elements ${0,1}$, ($F=2$) the possible words of length 4 ($M=4$) are

enter image description here

With a "window" of 3 ($N=3$), the only length 4 words that have no missing alphabet elements in all possible length 3 "windows" are

enter image description here

So, my probability of missing some alphabet element in a 3-windowed 4-length string on a 2-element alphabet is 6/16.

For a more involved example, take the length 6 strings on $0,1,2,3$. There are 4096 of them (I'll not list them for obvious reasons). For a "window" of 5, only 528 of them have the complete alphabet in all possible length 5 consecutive positions. A few examples:

enter image description here

Note that in all of these, positions 1 through 5 and 2 through 6 for each have at least one example of every element in the alphabet.

That number (528 in this example, or equivalently 3568 for the complement) is what I'm after.

I hope that clarifies the question, apologies to readers for any confusion I caused.

2015/01/30 - I've awarded the current bounty to Markus, even though the answer was for a different problem (my fault, my ambiguously written OP), rather than have it vanish into the ether. His response prompted me to look at the problem in other ways and was most interesting in itself. I shall add another bounty if/when it is allowed, and/or award one manually should a solution present itself.

3

There are 3 best solutions below

3
On

If the numbers involved are small, you just have to compute them. If $F$ is large, the chance of two in a row on a given pair is $1/F^2$ and the chance of a pair (because the first can be anything) is $1/F$ You can use a Poisson distribution here. You haven't specified the calculation well enough to do it.

11
On

The "easy" way to enumerate strings of a particular length that avoid any collection of words is the so-called Goulden-Jackson cluster method. This method is elegantly described in a wonderful paper by Zeilberger and Noonan. (This paper is intended as a gentle introduction to the Goulden-Jackson cluster method and has almost no prerequisites.) In it they concretely describe how to easily obtain the generating function for words of various lengths that avoid a collection of "bad words". In your case, say $a_M$ is the number of rolls of length $M$ that avoid all the bad words of the form $i^N$ for $1\le i\le F$. Their method shows that the corresponding generating function $\sum_{M\ge0}a_M z^M$ is $$\frac{1-z^N}{1-Fz+(F-1)z^{N}}.$$ Of course, there's still the problem of extracting the coefficient of $z^M$ from this rational function, but perhaps this is enough to get you started.

11
On

Attention: [2015-01-28] According to a reformulation of OPs question clarifying ambiguous aspects, this answer is no longer valid. It instead adresses the problem stated a few paras below. Nevertheless, since the elaboration might be useful for the interested reader, I will not remove it.


Note: The following is a two-step approach. I will start with a simpler problem in order to be well prepared for the more challenging problem stated by OP. We also take a look at small examples as simple plausability check. We will see that the resulting generating function is the one stated by @RusMay.


The following is based upon section I.4.1 and example III.24 from Analytic Combinatorics by P. Flajolet and R.Sedgewick.

Let's denote the $F$ faces with $a_1,\ldots,a_F$. Each result from $M$ rolls can be seen as word of length $M$ consisting of characters from the alphabet $\mathcal{A}=\{a_1,\ldots,a_F\}$.We can reformulate OPs question and get following

Problem: Let's use an alphabet $\mathcal{A}=\{a_1,\ldots,a_F\}$ of $F$ characters. What is the number $C_F(M,N)$ of different words of length $M$ with longest runs of a character having length less than $N(\geq 2)$.

The corresponding probability OP is asking for is the number $C_F(M,N)$ divided by the number $F^M$ of all words of length $M$ over the alphabet $\mathcal{A}$.

First Step: Analysing the case $F=2,N\geq 2$

We consider binary strings over the alphabet

\begin{align*} \mathcal{A}=\{0,1\} \end{align*}

and are looking for the number of strings of length $M$ having less than $N$ consecutive $0$s and $1$s.

We build the binary strings starting with simpler patterns. Let $$\text{SEQ}(1)=\{\varepsilon,1,11,111,\ldots\}$$ denote all strings containing only $1$s with length $\geq0$. The empty string is denoted with $\varepsilon$. The corresponding generating function is $$z^0+z^1+z^2+\ldots=\frac{1}{1-z},$$ with the exponent of $z^n$ marking the length $n$ of a string of $1$s and the coefficient of $z^n$ marking the number of strings of length $n$. Similarly, we define $\text{SEQ}(0)=\{\varepsilon,0,00,000,\ldots\}$.

Let $\text{SEQ}^{\leq k}(1)$ be the set of all strings of $1$s with length $\leq k$. The generating function for $\text{SEQ}^{\leq k}(1)$ is $$1+z+z^2+\ldots+z^{k}=\frac{1-z^{k+1}}{1-z}$$

Even more simplified: We consider $$F=2, N=2$$ namely strings having runs of $0$s and $1$s with maximum length $1$.

How could we describe all these strings? We may have a leading $1$. So, we start with zero or one $1$s. Formally: \begin{align*} \text{SEQ}^{\leq 1}(1)=\{\varepsilon,1\} \end{align*} with generating function \begin{align*} 1+z\tag{1} \end{align*}

Then we may have zero or more groups, each group containing a $0$ followed by a $1$. Formally:

$$\text{SEQ}(01)=\{\varepsilon,01,0101,010101,\ldots\}$$

with generating function

\begin{align*} 1+z^2+z^4\ldots=\frac{1}{1-z^2}\tag{2} \end{align*} The strings may finish with zero or one $0$, formally $\text{SEQ}^{\leq 1}(0)$ with generating function $1+z$.

We conclude: The binary strings having runs of $0$s and $1$s of maximum length $1$ are modelled by

\begin{align*} \text{SEQ}^{\leq 1}(1)\text{SEQ}(01)\text{SEQ}^{\leq 1}(0)\tag{3} \end{align*}

with generating function \begin{align*} \left(1+z\right)\left(\frac{1}{1-z^2}\right)\left(1+z\right)=\frac{1+z}{1-z} \end{align*}

according to (1) and (2)


Generalisation: $F=2,N\geq 2$

We consider now the more general problem $N\geq 2$. Since we have to consider runs of less than $N$ $0$s and $1s$ we have to substitute the outer occurrences $0$ and $1$ by up to $N-1$ $0$s resp. $1$s and the inner $0$s and $1$s enlarge by up to $N-2$ $0$s resp. $1$s. We obtain

\begin{align*} \text{SEQ}^{\leq N-1}(1)\text{SEQ}\left(0\text{SEQ}^{\leq N-2}(0)1\text{SEQ}^{\leq N-2}(1)\right)\text{SEQ}^{\leq N-1}(0) \end{align*}

with corresponding generating function $A(z)$

\begin{align*} A(z)&=\left(\frac{1-z^N}{1-z}\right)\left(\frac{1}{1-z\frac{1-z^{N-1}}{1-z}z\frac{1-z^{N-1}}{1-z}}\right)\left(\frac{1-z^N}{1-z}\right)\\ &=\frac{(1-z^N)^2}{(1-z)^2-z^2(1-z^{N-1})^2}\\ &=\frac{1-z^N}{1-2z+z^N}\tag{4} \end{align*}

Note: Observe, that (4) corresponds to the generating function given by @RusMay with $F=2$


Example: $F=2, N=3$

The generating function counting the number of binary strings with consecutive runs of length less than 3 is

\begin{align*} \frac{1-z^3}{1-2z+z^3}=1+2z+4z^2+6z^3+10z^4+\ldots \end{align*}

the corresponding strings are

\begin{align*} \{&\varepsilon,\\ &0,1,\\ &00,01,10,11,\\ &001,010,011,100,101,110,\\ &0010,0011,0100,0101,0110,1001,1010,1011,1100,1101,\\ &\ldots\} \end{align*}

This was the warm-up exercise and we are now well-prepared for the general solution.


Second Step: General solution $F\geq 2,N\geq 2$

Words with consecutive runs having length exactly one are called Smirnov words. Observe, in (3) we described Smirnov words over a binary alphabet with the generating function $\frac{1+z}{1-z}$.

Let $\mathcal{W}=SEQ(\mathcal{A})$ be the set of words over the alphabet $\mathcal{A}=\{a_1,\ldots,a_F\}$, and $\mathcal{S}$ be the set of Smirnov words. We need two observations:

  • Aspect 1: The generating function $W(v_1,\ldots,v_F)$ counting the number of words of $\mathcal{W}$ with $v_j$ denoting the number of occurrences of the letter $a_j$, $1 \leq j \leq F$ is

\begin{align*} W(v_1,\ldots,v_F)=\frac{1}{1-(v_1+\ldots+v_F)}\tag{5} \end{align*}

Note: Here I follow Flajolet's advice and omit the redundant variable $z$ in $W$ which is otherwise used to mark the length of the words. This job is already done by $v_1,\ldots,v_F$ and so I don't need to use $W(z;v_1,\ldots,v_F)=\frac{1}{1-(v_1+\ldots+v_F)z}$ instead.

  • Aspect 2: We can take an arbitrary word from $\mathcal{W}$ and substitute each run of characters by only one occurrence of this character in order to get a Smirnov word. On the other hand, we can find to each word $w\in \mathcal{W}$ a Smirnov word $s_w\in\mathcal{S}$ and expand each character of $s_w$ by a proper run length of this character in order to get the corresponding word $w\in \mathcal{W}$.

So, arbitrary words in $\mathcal{W}$ are derived from Smirnov words by a simultaneous substitution:

\begin{align*} \mathcal{W}=\mathcal{S}[a_1\mapsto SEQ^{\geq 1}(a_1),\ldots,a_F\mapsto SEQ^{\geq 1}(a_F)]\tag{6} \end{align*}

Let $S(v_1,\ldots,v_F)$ denote the generating function counting the number of Smirnow words in $\mathcal{S}$.

The substitution in (6) implies the relation:

\begin{align*} W(v_1,\ldots,v_F)=S\left(\frac{v_1}{1-v_1},\ldots,\frac{v_F}{1-v_F}\right)\tag{7} \end{align*} since similarly to (2) the generating function for \begin{align*} SEQ^{\geq 1}(a_j)=\{a_j,a_ja_j,a_ja_ja_j,\ldots\} \end{align*} is $$\frac{v_j}{1-v_j}$$

We observe $S(v_1,\ldots,v_F)$ is specified in (7) implicitely by $W(v_1,\ldots,v_F)$. Since the inverse function of $\frac{v}{1-v}$ is $\frac{v}{1+v}$ we obtain with the help of (5):

\begin{align*} S(v_1,\ldots,v_F)&=W\left(\frac{v_1}{1+v_1},\ldots,\frac{v_F}{1+v_F}\right)\\ &=\left(1-\sum_{j=1}^{F}\frac{v_j}{1+v_j}\right)^{-1} \end{align*}

We can proceed even simpler, since we don't need to differentiate between the characters of the alphabet $\mathcal{A}$. So, if we set $v_j=z$ and so forget the composition of the words into letters, we obtain the generating function for Smirnov words

\begin{align*} \frac{1}{1-F\frac{z}{1+z}}&=\frac{1+z}{1-(F-1)z}\tag{8}\\ &=1+\sum_{n=1}^{\infty}F(F-1)^{n-1}z^n \end{align*}

From here it's only a small step to find the generating function of words, that never contain runs of $N$ consecutive equal letters. We follow aspect 2 and substitute each character by a run of length less than $N$. In terms of generating functions we have to substitute in (8)

$$z\mapsto z\frac{1-z^{N-1}}{1-z}.$$

We finally obtain the

Solution: The generating function $\tilde{S}(z)$ counting all words over the alphabet $\mathcal{A}=\{a_1,\ldots,a_F\}$ with runs of length less than $N$ consecutive characters is

\begin{align*} \tilde{S}(z)=\frac{1}{1-F\frac{z\frac{1-z^{N-1}}{1-z}}{1+z\frac{1-z^{N-1}}{1-z}}}=\frac{1-z^{N}}{1-Fz+(F-1)z^{N}} \end{align*}

If $C_F(M,N)$ denotes the coefficient of $[z^M]\tilde{S}(z)$ the wanted probability is $$\frac{C_F(M,N)}{F^M}$$

Note: Observe that $\tilde{S}(z)$ is the generating function also given by @RusMay.


Let's finish with two small examples:

Example: $F=3,N=3,\mathcal{A}=\{0,1,2\}$

The generating function counting the number of ternary strings with consecutive runs of length less than 3 is

\begin{align*} \frac{1-z^3}{1-3z+2z^3}=1+3z+9z^2+24z^3+66z^4+\ldots \end{align*}

the corresponding strings are

\begin{align*} \{&\varepsilon,\\ &0,1,2\\ &00,01,02,10,11,12,20,21,22\\ &001,002,010,011,012,020,021,022,\\ &100,101,102,110,112,120,121,122,\\ &200,201,202,210,211,212,220,221,\\ &0010,0011,0012,0020,0021,0022,0100,0101,0102,0110,0112,\\ &0120,0121,0122,0200,0201,0202,0210,0211,0212,0220,0221,\\ &1001,1002,1010,1011,1012,1020,1021,1022,1100,1101,1102,\\ &1120,1121,1122,1200,1201,1202,1210,1211,1212,1220,1221,\\ &2001,2002,2010,2011,2012,2020,2021,2022,2100,2101,2102,\\ &2110,2112,2120,2121,2122,2200,2201,2202,2210,2211,2212,\\ &\ldots\} \end{align*}

Example: $F=3,N=4,\mathcal{A}=\{0,1,2\}$

The generating function counting the number of ternary strings with consecutive runs of length less than 4 is

\begin{align*} \frac{1-z^4}{1-3z+2z^4}=1+3z+9z^2+27z^3+78z^4+\ldots \end{align*}

the corresponding strings are

\begin{align*} \{&\varepsilon,\\ &0,1,2\\ &00,01,02,10,11,12,20,21,22\\ &000,001,002,010,011,012,020,021,022,\\ &100,101,102,110,111,112,120,121,122,\\ &200,201,202,210,211,212,220,221,222,\\ &0001,0002,0010,0011,0012,0020,0021,0022,0100,\\ &0101,0102,0110,0111,0112,0120,0121,0122,0200,\\ &0201,0202,0210,0211,0212,0220,0221,0222,1000,\\ &1001,1002,1010,1011,1012,1020,1021,1022,1100,\\ &1101,1102,1110,1112,1120,1121,1122,1200,\\ &1201,1202,1210,1211,1212,1220,1221,1222,2000,\\ &2001,2002,2010,2011,2012,2020,2021,2022,2100,\\ &2101,2102,2110,2111,2112,2120,2121,2122,2200,\\ &2201,2202,2210,2211,2212,2220,2221,\\ &\ldots\} \end{align*}


Note: According to P. Flajolet (p.205) this example applies equally well to non-uniform letter probabilities and to a collection of run-length upper-bounds and lower-bounds dependent on each particular letter. This topic is in particular pursued by different methods in several works of S. Karlin and coauthors.