How can one calculate the distribution of this "multinomial" analog of the geometric distribution?


The specific word problem that motivated this question was:

Generate random numbers 0-9 uniformly. Define $W$ to be the number of trials required for at least one 4, at least one 5, and at least one 6 to show up.

For example, if the generated sequence is $444444440980981245550986$, then $W=24$.
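As a quick empirical check of the word problem, here is a small Monte Carlo sketch (function and variable names are my own, not part of the original question) that simulates $W$ for uniform digits $0$–$9$ with targets $\{4,5,6\}$:

```python
import random

def sample_W(rng, targets=frozenset({4, 5, 6}), n_categories=10):
    """Draw uniform digits 0..n_categories-1 until every target category
    has appeared at least once; return the number of draws used."""
    seen = set()
    draws = 0
    while not targets <= seen:
        draw = rng.randrange(n_categories)
        draws += 1
        if draw in targets:
            seen.add(draw)
    return draws

# Estimate the mean of W for the word problem (K = 10, k = 3).
rng = random.Random(0)
samples = [sample_W(rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 10*(1/3 + 1/2 + 1) ≈ 18.33
```

The sample mean should land near $10\left(\frac13+\frac12+1\right) \approx 18.33$: by memorylessness, the waiting times for the first, second, and third distinct target are geometric with success probabilities $\frac{3}{10}$, $\frac{2}{10}$, $\frac{1}{10}$.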

More generally:

Question 1: Consider a Multinoulli random variable (e.g. a fair die) with probability distributed uniformly over $K$ categories. Define $W$ to be the number of trials required for some subset of $k$ of the categories, $1 \le k \le K$, to all appear at least once. What is the distribution of $W$?

Possible solution: Calculating $\mathbb{P}(W \le n)$ corresponds (I think) to calculating

  • the number of length $n$ sequences in which each of the $k$ distinguished categories shows up at least once, split according to the number $m$ of trials landing in those $k$ categories (the remaining $n-m$ trials take values among the other $K-k$ categories),

and then dividing this number by

  • the number of all possible length $n$ sequences.

So we can break this up into sub-problems as follows:

I think the first number is (according to the fundamental counting principle):

$$ \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} (K - k)^{n-m} $$

and the second number is $K^n$, so is the probability just:

$$ \mathbb{P}(W \le n) \overset{?}{=} \sum_{m=k}^n \sum_{x_1 + \dots + x_k = m \,, \\ \neg(x_1 = \dots = x_k = 0)} \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K - k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,? $$

Question 2: If the above formula is correct, is there a way to simplify it?

It does appear to simplify to the geometric CDF for $p= \frac{1}{2}$ (take $K=2$ and $k=1$). A simplification which held more generally would be nice.

EDIT: It occurred to me that $m > 0$ always, since $m \ge k \ge 1$, whereas $x_1 = \dots = x_k = 0$ would imply $x_1 + \dots + x_k = m = 0$, a contradiction. Therefore the second condition in the sum is redundant, and the expression above simplifies to:

$$\sum_{m=k}^n \sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \left(\frac{K - k}{K} \right)^{n-m} \left(\frac{1}{K}\right)^m \,, $$

which we can get a simplified closed form for if we can simplify

$$\sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \quad \text{for }k \le m \le n \,. $$

When $m=n$ the multinomial theorem gives that this sum equals $k^n$, but I don't know how to evaluate it in general.

Progress so far: In the case that $k=1$, this reduces to a regular geometric distribution, since we can "clump" all of the remaining $K-1$ categories into one super-category.

However, for $k>1$, this trick no longer works, because the events that the individual categories among the $k$ show up are not independent. E.g. in the example above, if a trial generates a $4$, one also knows that that trial did not generate a $6$.

And it isn't sufficient to calculate the probability that the categories of interest appear $k$ times in total, since all of those appearances could come from a single category, e.g. $444$ in the above example.

It seems like it might be easier to calculate the probabilities:

$$\mathbb{P}(W > n)\,, \quad n \ge k \,. $$

This would allow us to get the distribution, since:

$$F_W(n) = \begin{cases} 0 & 0 \le n < k \\ 1 - \mathbb{P}(W > n) & n \ge k \end{cases} $$

and because $W$ is non-negative we even get the expectation for free:

$$\mathbb{E}[W] = \sum_{n=0}^{\infty} \mathbb{P}(W > n) \,. $$
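This tail-sum identity is easy to sanity-check in the $k=1$, $K=2$ case, where $W$ is geometric with $p = \frac12$, so $\mathbb{P}(W > n) = \left(\frac12\right)^n$ and $\mathbb{E}[W] = \frac1p = 2$ (a minimal sketch):

```python
# Check E[W] = sum_{n>=0} P(W > n) for a geometric W with p = 1/2:
# P(W > n) = (1/2)**n, so the tail sum should converge to 1/p = 2.
p = 0.5
tail_sum = sum((1 - p) ** n for n in range(200))  # truncated; remainder ~ 2**-200
print(tail_sum)  # ≈ 2.0
```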

The only thoughts I've had for calculating $\mathbb{P}(W > n)$ so far are:

  • find a recurrence relation for $\mathbb{P}(W >n)$ in terms of $\mathbb{P}(W > n-1)$ (and possibly also $\mathbb{P}(W > n-2)$), and then try to derive a probability generating function from this somehow.
  • Brute-force count the number of sequences of interest using some combinatorial principle and then multiply by $\left(\frac{1}{K}\right)^n$.
BEST ANSWER

Let your sample space $X$ be the set of strings of length $n$ with elements from $K$ categories, so that $|X|=K^n$. Suppose $1 \ldots k$ are the categories defining the random variable $W$. Let $X_i$ be the subset of $X$ whose elements do not contain an element of category $i$, as $i$ varies from $1$ to $k$. Then the event that $W > n$ is exactly the subset of the sample space $\displaystyle \bigcup_{i=1}^{k} X_i$.

Using inclusion-exclusion, the size of this union is $$\displaystyle \sum_{j=1}^k (-1)^{j+1}{k \choose j}(K-j)^n$$

In this summation, $j$ indexes the number of categories from $1 \ldots k$ chosen to be missing from the string. The factor $(-1)^{j+1}$ comes from inclusion-exclusion, ${k \choose j}$ counts the ways to pick which $j$ of the $k$ categories are skipped, and $(K-j)^n$ counts the strings in which every position avoids those $j$ categories.
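A brute-force check of this inclusion-exclusion count on a small case (illustrative code; the function names are my own):

```python
from itertools import product
from math import comb

def count_missing_some(n, K, k):
    """Inclusion-exclusion count of length-n strings over K categories
    that miss at least one of the first k categories."""
    return sum((-1) ** (j + 1) * comb(k, j) * (K - j) ** n
               for j in range(1, k + 1))

# Small case: K = 4 categories, the first k = 2 are special, n = 6.
K, k, n = 4, 2, 6
brute = sum(1 for s in product(range(K), repeat=n)
            if not set(range(k)) <= set(s))  # strings missing a special category
assert brute == count_missing_some(n, K, k)
print(brute, K ** n)  # 1394 4096, so P(W > 6) = 1394/4096 here
```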

SECOND ANSWER

Regarding simplifying

$$\sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} \quad \text{for }k \le m \le n \,, $$

one has that:

$$\binom{n}{x_1, x_2, \dots, x_k} = \frac{n!}{x_1! \dots x_k! (n-m)!} \,,$$

so therefore

$$\sum_{x_1 + \dots + x_k = m } \binom{n}{x_1, x_2, \dots, x_k} = \sum_{x_1 + \dots + x_k = m } \frac{n!}{x_1! \dots x_k! (n-m)!} = \frac{n!}{(n-m)!}\sum_{x_1 + \dots + x_k = m } \frac{1}{x_1! \dots x_k!}$$

$$ = \frac{n!}{m! (n-m)!}\sum_{x_1 + \dots + x_k = m } \frac{m!}{x_1! \dots x_k!} = \binom{n}{m} k^m \,.$$

This also makes sense reasoning directly from the fundamental counting principle: we want the number of ways to select $m$ of the $n$ slots and then assign each of those $m$ slots to one of the $k$ categories. The number of ways to do the former is $\binom{n}{m}$, and the number of ways to do the latter is $k^m$.
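This identity is easy to verify by direct enumeration for small parameters (a sketch; the helper names are my own):

```python
from itertools import product
from math import comb, factorial

def multinomial(n, parts):
    """The 'padded' multinomial n! / (x_1! ... x_k! (n - m)!) used above,
    where m = sum(parts)."""
    out = factorial(n) // factorial(n - sum(parts))
    for x in parts:
        out //= factorial(x)
    return out

# Check sum over x_1 + ... + x_k = m of the multinomial equals C(n, m) * k^m.
n, k = 7, 3
for m in range(k, n + 1):
    lhs = sum(multinomial(n, xs)
              for xs in product(range(m + 1), repeat=k) if sum(xs) == m)
    assert lhs == comb(n, m) * k ** m
print("identity verified for n=7, k=3")
```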

However, it turns out that simplifying that sum doesn't actually help us, because I misremembered/misinterpreted the problem. More specifically, one has that:

$$ \neg(x_1 = \dots = x_k = 0) \not= [x_1 \not=0 \land x_2 \not=0 \land \dots \land x_k \not=0] \,, $$ for example, $x_1 \not = 0, x_2 = \dots = x_k =0$ satisfies the former condition but not the latter. So the sum we are actually interested in is:

$$\sum_{x_1 + \dots + x_k = m \\ x_1 \ge 1, \dots , x_k \ge 1} \binom{n}{x_1, x_2, \dots, x_k} = \binom{n}{m}\sum_{x_1 + \dots + x_k = m \\ x_1 \ge 1, \dots , x_k \ge 1} \binom{m}{x_1, x_2, \dots, x_k} =: \binom{n}{m} S(m,k) \,, $$

where $S(m,k)$ counts the surjections from an $m$-set onto $k$ labeled categories, and equals $k!$ times the Stirling number of the second kind (just found this out this morning). We can split up the calculation of $S(m,k)$ by expressing it in terms of smaller instances of the same quantity. Specifically, consider the sum

$$\sum_{x_1 + \dots + x_k = m \\ x_1 \ge 1, \dots , x_k \ge 1} \binom{m}{x_1, x_2, \dots, x_k} = \sum_{x_1 + \dots + x_k = m } \binom{m}{x_1, x_2, \dots, x_k} - \sum_{x_1 + \dots + x_k = m \\ x_1 = 0 \lor \dots \lor x_k = 0} \binom{m}{x_1, x_2, \dots, x_k} $$

$$ = k^m - \sum_{x_1 + \dots + x_k = m \\ x_1 = 0 \lor \dots \lor x_k = 0} \binom{m}{x_1, x_2, \dots, x_k} \,.$$

We can tackle the last sum by decomposing it into sub-problems and applying the fundamental counting principle. Specifically, we choose $1 \le \ell \le k-1$ of the $x_1, \dots, x_k$ to be exactly zero; for any given $\ell$ there are exactly $\binom{k}{\ell}$ ways to do this. Then we count the ways to distribute the $m$ slots among the remaining $k - \ell$ categories with each used at least once, i.e.

$$\sum_{x_1 + \dots + x_k = m \\ x_1 = 0 \lor \dots \lor x_k = 0} \binom{m}{x_1, x_2, \dots, x_k} = \sum_{\ell=1}^{k - 1} \binom{k}{\ell} \sum_{x_{\ell+1} + \dots + x_k = m \\ x_{\ell+1} \ge 1, \dots, x_k \ge 1} \binom{m}{x_{\ell+1}, \dots, x_k} = \sum_{\ell = 1}^{k-1} \binom{k}{\ell} S(m, k - \ell) \,.$$
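The resulting recurrence $S(m,k) = k^m - \sum_{\ell=1}^{k-1} \binom{k}{\ell} S(m, k-\ell)$ can be checked numerically against direct inclusion-exclusion. In this sketch (my own code), $S(m,k)$ denotes the constrained sum itself, i.e. the number of ways to assign $m$ slots to $k$ categories with every category used at least once:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def S(m, k):
    """k^m total assignments of m slots to k categories, minus those
    leaving at least one category empty (via the recurrence above)."""
    return k ** m - sum(comb(k, l) * S(m, k - l) for l in range(1, k))

# Cross-check against direct inclusion-exclusion for small (m, k).
for m in range(1, 8):
    for k in range(1, m + 1):
        direct = sum((-1) ** (k - l) * comb(k, l) * l ** m
                     for l in range(k + 1))
        assert S(m, k) == direct
print("recurrence agrees with inclusion-exclusion")
```

Note that the recurrence needs no separate base case here: for $k=1$ the sum is empty and $S(m,1) = 1$, and it correctly yields $S(m,k) = 0$ when $m < k$.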

If nothing else, this seems to show that there's no way of getting around the Stirling numbers of the second kind in our counting problem. So we have either that the number of sequences of interest is

$$ \sum_{m=k}^n \binom{n}{m}\, S(m,k)\, (K-k)^{n-m} \,,$$

or that (equivalently) it is

$$ \sum_{m=k}^n \binom{n}{m} \left[ k^m - \sum_{\ell=1}^{k-1} \binom{k}{\ell} S(m, k-\ell) \right] (K-k)^{n-m} \,. $$

The explicit formula for $S(m,k)$ relies (at least implicitly) on the Inclusion-Exclusion principle; it is $k!$ times the usual explicit formula for the Stirling numbers of the second kind:

$$S(m,k) = \sum_{\ell=0}^k (-1)^{k - \ell} \binom{k}{\ell} \ell^m \,. $$

So for the final probability, we get an expression with a double sum in the numerator:

$$\mathbb{P}(W \le n) = \frac{\sum_{m=k}^n \binom{n}{m} \left[ \sum_{\ell = 0}^k (-1)^{k - \ell} \binom{k}{\ell} \ell^m \right] (K-k)^{n-m}}{K^n} \,. $$
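Numerically, this count by slot positions (note the factor $(K-k)^{n-m}$ for the non-distinguished slots) agrees with the accepted answer's inclusion-exclusion count; a sketch comparing the two (function names are my own):

```python
from math import comb

def surj(m, k):
    """Number of surjections from an m-set onto k categories."""
    return sum((-1) ** (k - l) * comb(k, l) * l ** m for l in range(k + 1))

def p_W_le(n, K, k):
    """P(W <= n): choose the m slots holding the k special categories,
    surject them onto those categories, fill the rest from the other K - k."""
    total = sum(comb(n, m) * surj(m, k) * (K - k) ** (n - m)
                for m in range(k, n + 1))
    return total / K ** n

def p_W_le_incl_excl(n, K, k):
    """Same probability via the accepted answer: 1 - P(W > n)."""
    missing = sum((-1) ** (j + 1) * comb(k, j) * (K - j) ** n
                  for j in range(1, k + 1))
    return 1 - missing / K ** n

for n in range(3, 12):
    assert abs(p_W_le(n, 10, 3) - p_W_le_incl_excl(n, 10, 3)) < 1e-12
print(p_W_le(24, 10, 3))  # P(W <= 24) for the original word problem
```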

We can check that this reduces to the right expression (namely the CDF of the geometric with $p = \frac{1}{2}$) when $k=1$ and $K=2$, since $S(m,1)=1$ for all $m \ge 1$ and the factor $(K-k)^{n-m}$ is identically $1$:

$$ = \frac{\sum_{m=1}^n \binom{n}{m} S(m, 1) }{2^n} = \frac{\sum_{m=1}^n \binom{n}{m} (1)}{2^n} = \frac{\sum_{m=0}^n \binom{n}{m} - \binom{n}{0}}{2^n} = \frac{2^n - 1}{2^n} = 1 - \left( \frac{1}{2} \right)^n = 1 - (1-p)^n \,.$$

The accepted answer is presumably equivalent (if not, then this answer is wrong), but it is obviously much simpler.