What is the probability of having $d$ duplicates when sampling $n$ elements from a pool of $N$ elements?

199 Views Asked by At

Question

What is the probability $P(d | n,N)$ of having $d$ duplicates in my sample if I sample with replacement $n$ elements from a pool of $N$ elements?

How to count duplicates?

The number of duplicates is defined as the number of pairwise comparison that are the same. That is

$$d = \sum_i \sum_{j>i} f(x_i,x_j)$$

where

$$ f(x_i,x_j) = \begin{cases} 1, & \text{if $x_i = x_j$} \\ 0, & \text{if $x_i ≠ x_j$} \end{cases}$$

For example [1,3,3,1] has 2 duplicates. [1,3,1,1] has 3 duplicates. [1,1,1,1] has 6 duplicates and [1,3,1,1,2,2,1,3] has 8 duplicates.

For $d$ equals 0

I think it is correct that $P(d=0 | n,N) = \prod_{i=1}^{n-1} 1 - \frac{i}{N}$. But I could not go any further.

1

There are 1 best solutions below

4
On

The number of duplicates is approximately Poisson distributed, with expectation parameter $\lambda= \binom n 2 / N$. In particular, if $n$ and $N$ go to infinity in such a way that $\binom n 2/N\to\lambda$, the number of duplicates converges in distribution to $\operatorname{Pois}(\lambda)$. This is an easy consequence of the "Chen-Stein" method, as described in many places. The Chen-Stein method itself is too complex to sketch here, but is well described in a very readable book Poisson Approximation by Barbour, Holst and Janson. Its application to the birthday problem is described in a paper.