Expected number of successes before first failure (Hypergeometric distribution)

1.6k Views Asked by At

Think of the following scenario:

We are a group of $42$ people. I tag you, and you tag another person. This other person tags another person, etc. The "chain" of tagging stops when a person has been tagged twice. (A person can't tag the person that tagged them).

I am trying to find a general formula to calculate the expected number of people who are tagged. I looked at the mean of a hypergeometric distribution, but it doesn't seem to be what I need, since it's used to determine the expected number of successes in n trials. In my case, the trials stop after the first failure.

My second guess was a negative hypergeometric distribution, but apart from a disputed wikipedia article and this Encyclopedia of math article, information seems very scarce.

Following the latter, I calculated the mean as follows: $E[X] = m(N-M)/(M+1)$

Where $M = 1$ (number of tagged people), $N-M = 41$ (untagged), $N = 42$ (total), and $m = 1$ (number of failures before stopping). This yields $E[X] = 20.5$.

Is this accurate ? Is there a way this formula can be derived using only the hypergeometric distribution (which is widely more documented than the negative hypergeometric distribution).

Thank you.

2

There are 2 best solutions below

0
On

Original poster here. I think I have got it figured out. I assumed that every "tagging" trial is independant, with a varying probability. (41/42, 40/42, 39/42 and so on).

The expected number of "successes" before the first failure is simply the sum of all probabilities. See https://en.wikipedia.org/wiki/Poisson_binomial_distribution

This yields an expected number of 20.5 taggings.

0
On

Let there be $n$ people labeled $\{a_1, a_2, \ldots, a_n\}$. Without loss of generality, assume that person $a_1$ tags person $a_2$, who then tags person $a_3$, as we can always number the people accordingly.

Define $X$ to be the random number of distinct individuals tagged when we observe the first instance of a previously tagged person being tagged a second time. Then $$\Pr[X = 3] = \frac{1}{n-2}.$$ The denominator is explained by the fact that no person ever tags themselves or the person that tagged them. The numerator is explained by the fact that person $a_1$ must be tagged by $a_3$ in order for $X = 3$ (person $a_2$ cannot be tagged because of the above rule).

In a similar fashion, we compute $$\Pr[X = 4] = \frac{n-3}{n-2} \cdot \frac{2}{n-2}.$$ This is because the fourth person to be tagged can be selected in $n-3$ ways, and this person can terminate the tagging process if they tag $a_1$ or $a_2$. And we also have $$\Pr[X = 5] = \frac{3(n-3)(n-4)}{(n-2)^3}.$$

Now, with this in mind, we see that a pattern develops, and in fact, we can assume that $a_i$ represents the $i^{\rm th}$ potential person to be tagged before the $X^{\rm th}$ person tags someone who was already tagged. That is to say, $$\begin{align*} \Pr[X = x] &= \frac{(x-2)(n-3)!}{(n-x)!(n-2)^{x-2}}, \quad x = 3, 4, \ldots, n.\end{align*}$$ Now that we have the exact probability distribution, we can compute for $n = 42$ $$\operatorname{E}[X] = \frac{216377989685883373635784276129625729114158059321}{ 22517998136852480000000000000000000000000000000} \approx 9.60911.$$

This result is supported by $10^7$ simultations in Mathematica:

F[n_] := NestWhile[Append[#, RandomChoice[Complement[Range[n],
         Take[#, -2]]]] &, {1, 2}, # == DeleteDuplicates[#] &]
G[n_] := Length[F[n]] - 1
Mean[Parallelize[Table[G[42], {10^7}]]]

The output I obtained in my run was 2402501/250000 which agrees very closely to the theoretical result above. A slight modification of the above code produces the empirical probability distribution, which I have plotted alongside the theoretical distribution below. As you can see, the difference in observed and expected frequencies is so small that it is effectively indistinguishable. enter image description here

A plot of the differences in observed and expected probabilities is shown below, and we can be confident that the result is valid. enter image description here


Note: I have assumed that person $a_1$ is by default "tagged" themselves, even though strictly speaking, no one tagged the originator. If we do not make this assumption, then the analysis is nearly identical, which I leave to the reader to work out.