Distribution of number of trials before $k$ distinct successes


I am familiar with the negative binomial distribution $NB(p, k)$, which gives the number of failures before $k$ successes occur in a Bernoulli process with success probability $p$. I am wondering, however, whether the distribution of the number of failures before getting $k$ *distinct* successes is a well-known one.

More precisely, suppose we have some set of $N$ elements; without loss of generality, we may assume this set is $[N] = \{1, \cdots, N\}$. We have a subset $S \subseteq [N]$ which contains the successes, while every element of $[N] \setminus S$ is a failure. In our process, we select an element uniformly at random from $[N]$ (with replacement), and keep doing this until we have attained some $k \leq |S|$ distinct successes. Let $X$ be the total number of trials performed before we stop the process. What is the distribution of $X$, in terms of $k$, $N$, and $p = |S| / N$? If we can't concisely describe this distribution, can we at least come up with some (relatively tight) tail bounds? And how valid is the approximation $X \approx NB(p, k)$?

I know this problem reduces to the coupon collector's problem when $S = [N]$ and $k = N$, but I'm more interested in the case where $0 < p < 1$ and $k \ll |S| = pN$.
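The process itself is easy to simulate directly, which gives a baseline against which any closed form or approximation can be checked. A Monte Carlo sketch (function names and the parameter values $N = 100$, $|S| = 30$, $k = 5$ are mine):

```python
import random

def trials_until_k_distinct(N, S, k, rng):
    """Draw uniformly from {1, ..., N} with replacement until k distinct
    elements of S have been seen; return the total number of draws."""
    seen = set()
    draws = 0
    while len(seen) < k:
        x = rng.randrange(1, N + 1)
        draws += 1
        if x in S:
            seen.add(x)
    return draws

# Example: N = 100, |S| = 30 (so p = 0.3), k = 5
N, s, k = 100, 30, 5
S = set(range(1, s + 1))
rng = random.Random(42)
samples = [trials_until_k_distinct(N, S, k, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # empirical E[X]
```

The empirical mean can then be compared against the exact mean $\sum_{i=0}^{k-1} N/(s-i)$ or the negative binomial value $k/p$ (counting total trials).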

There are 2 answers below.

Answer 1:

$X$ can be written as a sum of $k$ independent geometric random variables, $X=\sum_{i=0}^{k-1} G_i$, where $G_i$ is geometric with parameter $\frac{|S|-i}{N}$: once $i$ distinct successes have been seen, each uniform draw from $[N]$ hits a new success with probability $(|S|-i)/N$. This representation makes it easy to calculate the mean, variance, and moment generating function of $X$. In particular, for large $k$, $X$ is quite concentrated near its mean; after subtracting the mean and normalizing by the standard deviation, it is asymptotically Gaussian as $k \to \infty$. If $k=o(N)$ as $N \to \infty$, then the law of $X$ is well approximated by a negative binomial law.
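The mean and variance implied by this decomposition can be computed directly and compared with the $NB(p,k)$ values. A sketch (parameter values mine; assuming uniform draws from $[N]$ with replacement, so each $G_i$ counts trials up to and including the next new success, with success probability $(s-i)/N$):

```python
def mean_var_X(N, s, k):
    """Exact mean and variance of X = sum of k independent geometrics,
    where the i-th has success probability p_i = (s - i)/N."""
    mean = var = 0.0
    for i in range(k):
        p_i = (s - i) / N
        mean += 1 / p_i                # geometric mean: 1/p
        var += (1 - p_i) / p_i ** 2    # geometric variance: (1-p)/p^2
    return mean, var

N, s, k = 100, 30, 5
p = s / N
m, v = mean_var_X(N, s, k)
# NB(p, k) counts failures before k successes; adding the k successes
# gives total trials, with mean k/p and variance k(1-p)/p^2.
print(m, v)
print(k / p, k * (1 - p) / p ** 2)
```

Since every $p_i \le p$, the exact mean always exceeds the $k/p$ of the negative binomial approximation; the gap is small when $k \ll s$.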

Answer 2:

I will try to find $P(X = m)$. If $m < k$, this probability is clearly $0$. Otherwise, the probability is the sum over all $k$-element subsets of $S$ of the probability of that exact subset being the "winning" numbers. Since this is the same for every subset, the probability is $\binom{s}{k}$ times the probability of $\{1, \cdots, k\}$ being the winning numbers and none of $\{k+1, \cdots, s\}$ showing up, where $s = |S|$. Then, because the last trial must be a number in $\{1, \cdots, k\}$ that hasn't been seen before, this is equal to

$$\binom{s}{k}\frac{k}{n}\cdot P(\text{all of $\{1, 2, ..., k-1\}$ getting chosen in $m-1$ trials and none of $\{ k, k+1, ..., s \}$ being chosen})$$

This inner probability is equal to $$\frac{|\text{set of $(m-1)$-tuples such that all of $\{1, 2, ..., k-1\}$ appear}|}{n^{m-1}},$$ where instead of $n$ choices for each element of the tuple there are only $n-s+k-1$, which accounts for $\{k, \cdots, s\}$ not being allowed.

Counting these tuples by inclusion-exclusion on which elements of $\{1, \ldots, k-1\}$ are missing, this is $$\frac{\sum_{i=0}^{k-1} (-1)^{i}\binom{k-1}{i}(n-s+k-1-i)^{m-1}}{n^{m-1}} = \frac{\sum_{i=0}^{k-1} (-1)^{k-1-i}\binom{k-1}{i}(n-s+i)^{m-1}}{n^{m-1}}$$
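The inclusion-exclusion count (tuples over an alphabet of $A$ letters that contain all of $r$ designated letters) can be verified by brute force for small parameters. A sketch with made-up small values:

```python
from itertools import product
from math import comb

def covering_tuples_ie(A, r, length):
    """Count length-`length` tuples over an alphabet of A letters that
    contain all of r designated letters, via inclusion-exclusion on how
    many designated letters are missing."""
    return sum((-1) ** i * comb(r, i) * (A - i) ** length
               for i in range(r + 1))

def covering_tuples_brute(A, r, length):
    """Same count by exhaustive enumeration (feasible only for tiny A)."""
    designated = set(range(r))
    return sum(1 for t in product(range(A), repeat=length)
               if designated <= set(t))

for A, r, length in [(4, 2, 3), (5, 3, 4), (3, 3, 3)]:
    assert covering_tuples_ie(A, r, length) == covering_tuples_brute(A, r, length)
print("inclusion-exclusion count verified")
```

In the formula above, $A = n-s+k-1$, $r = k-1$, and the tuple length is $m-1$; the case $A = r$ recovers the usual surjection count.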

So the final probability is

$$P(X=m)=\frac{k\binom{s}{k}}{n^m} \sum_{i=0}^{k-1} (-1)^{k+i+1}\binom{k-1}{i}(n-s+i)^{m-1}$$
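The closed form can be sanity-checked numerically: summed over $m \ge k$ it should give $1$, and its mean should match $\sum_{i=0}^{k-1} n/(s-i)$ from the geometric decomposition in the other answer. A sketch with parameter values of my choosing:

```python
from math import comb

def pmf(m, n, s, k):
    """P(X = m) per the closed-form expression above."""
    if m < k:
        return 0.0
    tot = sum((-1) ** (k + i + 1) * comb(k - 1, i) * (n - s + i) ** (m - 1)
              for i in range(k))
    return k * comb(s, k) * tot / n ** m

n, s, k = 30, 10, 3
# Truncate at m = 399; the tail beyond this is negligible for these values.
probs = [pmf(m, n, s, k) for m in range(k, 400)]
print(sum(probs))                                        # ≈ 1
print(sum(m * pmf(m, n, s, k) for m in range(k, 400)))   # ≈ 30/10 + 30/9 + 30/8
```

For a hand-checkable case, $n=30$, $s=10$, $k=3$, $m=3$ gives $(10/30)(9/30)(8/30)$, matching the formula.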

I'm not sure how good the negative binomial approximation to this distribution is, though.