How to compute the expected number of duplicates

993 Views Asked by At

Consider $n$ objects each with an associated probability $p_i$, $i\in\{1,\dots,n\}$. If I sample objects $k$ times independently with replacement according to the probability distribution defined by the $p_i$, how does one compute the expected number of times you sample an object you have sampled before?

We can assume that $n > k$.

2

There are 2 best solutions below

7
On BEST ANSWER

Let $X_i = 1$ if the $i$th number sampled has already been sampled before, and $X_i = 0$ otherwise. Then we are looking for $E(X_1 + \cdots + X_k) = E(X_1) + \cdots E(X_k)$. Now, $E(X_1) = 0$, since $X_1 \equiv 0$. If $i > 0$, $$ P(X_i = 0) = \sum_{j = 1}^n\underset{\substack{\text{not choosing item $j$}\\\text{on the first $i-1$ times}}}{\underbrace{(1 - p_j)^{i-1}}}\underset{\substack{\text{choosing item $j$}\\\text{on the $i$th time}}}{\underbrace{p_j}}. $$ Therefore $E(X_i) = 1 - \sum_{j = 1}^n(1 - p_j)^{i-1}p_j$.

Summing up, we find that $$ \begin{align} E(X_1) + \cdots + E(X_k) &= E(X_2) + \cdots + E(X_k) \\ &= k - 1 - \sum_{i = 2}^k \sum_{j = 1}^n (1-p_j)^{i-1}p_j \\ &= k - \sum_{i = 1}^k \sum_{j = 1}^n (1-p_j)^{i-1}p_j \\ &= k - \sum_{j = 1}^n p_j \sum_{i = 0}^{k-1} (1-p_j)^i \\ &= k - \sum_{j = 1}^n p_j \frac{1 - (1-p_j)^k}{1-(1-p_j)} \\ &= k - n + \sum_{j = 1}^n (1-p_j)^k \end{align} $$

2
On

Re-sampling $m$ objects means sampling $k-m$ unique objects, which means avoiding $n-(k-m)=n+m-k$ unique objects, so using inclusion-exclusion, we can write $$E_k(m)=\sum_{m=0}^{k-1}\left\{m\sum_{u=0}^{k-m}\left[(-1)^u{n+m+u-k \choose u}\sum_{J\subseteq I; |J|=n+m+u-k}\left(1-\sum_{j\in J}p_j\right)^k\right]\right\}$$That's a long way from a closed form solution, but it is at least an expression for the value that you want.

Edit: Not much point to this since there are two answers nicer than mine now, but I just realized you can do the inclusion-exclusion on the objects you do pick, rather than the ones you don't, which leads to the somewhat simpler $$E_k(m)=\sum_{m=0}^{k-1}\left\{m\sum_{u=0}^{k-m}\left[(-1)^u{n+m+u-k \choose u}\sum_{J\subseteq I; |J|=k-m-u}\left(\sum_{j\in J}p_j\right)^k\right]\right\}$$ which is easier to simplify, but again, not much point in using this now that Did got it down to $k-n+\sum_i \left(1-p_i\right)^k$