The Coupon Collector's Problem (CCP) is very useful in many applications. However, the "default" CCP is relatively simple: suppose you have an urn containing $n$ pairwise different balls. Now you want to draw a ball from the urn with replacements until you have seen each of the $n$ balls at least one. Now you can compute the average waiting time to get the number of draws overall needed by the formula \begin{align} \mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[X_i] = nH_n \end{align} where $H_n$ is defined as the harmonic series and $\mathbb{E}$ is the expected value. Also, the random variable $X$ is defined as the random number of draws you have to make in order to get all $n$ balls at least once. $X_i$ denotes the additional number of draws one has to make in order to get from $i-1$ different balls to $i$ different balls drawn. Additionally, each ball has an equal probability of $1/n$.
Now consider an advanced CCP question: how does the formula change in case you want to draw $p\geq 1$ pairwise different balls (instead of only one as in the default CCP) per draw, called packets? In other words: Given an urn containing $n$ balls, how many balls do I need to draw in order to get all $n$ balls when drawing always $p\geq 1$ (pairwise different) balls out of the urn? The set of balls is drawn with replacement. (Therefore all balls of one package are different, but different packages can contain same balls.)
An answer gives this paper on top of page 20, and also this german lecture gives an answer on slide 229, 14.7b). A third -- at the same time very intuitive to get -- answer is given on the german Wikipedia, subsection "Päckchen".
Now two questions arise.
- Why do the answers in the paper and the lecture differ? If you plug in some numbers, you get different results for numbers above 1000.
- How do I get from these solutions to the one given on Wikipedia? For me it seems like an approximation of the real value, since it is very fast to compute compared to the "scientific" answers, and the results is always "in the near of" the results of the other computations.
Since I am interested in understanding the formula on Wikipedia, can anyone help understanding the equation how the formula is derived or give some insight?
The German Wikipedia formula is indeed wrong.
It's hard to figure out why someone comes up with a wrong solution for something. However, you could think of an experiment (different from the CCP) where the formula would give the right answer.
Say we have an urn with n balls numbered from 1 to $n$. Now we draw one ball at a time with replacement, until we got every number at least once. This is CCP with $p = 1$. If we have already seen $k$ distinct numbers, the expected value for the necessary draws to get the $(k+1)$st distinct number is $\frac{n}{n-k}$. Therefore, the expectation of the total number of necessary draws is $$ \sum_{k=0}^{n-1} \frac{n}{n-k}. $$
Now let's change the setting a little bit. We start again from scratch and draw one ball at a time, basically with replacement. But any time we get a previously unseen number, we do not replace it. However, as soon as we have seen $p$ distinct numbers, we replace all of them. Then we keep drawing balls with replacement (from all $n$ balls again); any time we get a previously unseen number, we do not replace it, until we have seen another $p$ distinct numbers; then we replace again all $p$ balls into the urn, and so on. This is a mixture of with and without replacement sampling. You could view this as a series of "with replacement" episodes, where episode $k$ lasts until you get the $k$th distinct ball, and where the number of balls in the urn during episode $k$ is $n-((k-1)\mod p)$, including $n-(k-1)$ previously unseen balls. Therefore, the expected duration for the $k$th episode is $$\frac{n-((k-1)\mod p)}{n-(k-1)},$$ and thus the total number of necessary draws is in expectation
$$\sum_{k=1}^n \frac{n-((k-1)\mod p)}{n-(k-1)} = \sum_{k=0}^{n-1} \frac{n-(k\mod p)}{n-k}.$$
This is the Wikipedia formula you mentioned.
Note that in the CCP with $p$ coupons in each package, we draw $p$ coupons without replacement, then replace them, draw again $p$ coupons without replacement and so on, so in some sense this as also a series of draws, where the number of balls in the urn varies between $n$ and $n-(p-1)$. This similarity seems to have fooled the Wikipedia author.
If $p$ is small and $n$ is large, the CCP with $p > 1$ coupons may be approximated by CCP with $p=1$, and in this case the experiment decribed above is equal to CCP. Therefore the (wrong) Wikipedia formula cannot be way off in this case. But I suspect the discrepancies to be larger if $p$ is large (or $n$ is small).