Experimenting with numpy and generating random arrays of integers (discrete uniform distribution) I noticed something but can't explain how it would be calculated.
I generate a random sample of natural numbers from 0 to $n$ of size $n$. From this sample I choose (with replacement) $n$ numbers randomly.
Now I compare both samples and calculate the number of values which are in the first sample but not in the second one via symmetric difference. While increasing the size of $n$ I noticed the percentage seems to converge to around $16.36\%$
How would I calculate this number for any $n$? Is there a formula?
Problem seems related to the Coupon Collectors Problem but I was not able to find a formula which converges to $16.36$ for high $n$.
When the first sample is not randomized f.e. 1 to n the value approaches $36.78\% $ so $ 1/e$.
2026-02-23 22:37:48.1771886268
Percentage of non picked balls with replacement
29 Views Asked by Bumbble Comm https://math.techqa.club/user/bumbble-comm/detail At
1
Just to be clear. You pick $n$ values with replacement uniformly iid from $S_0= \{0,1,2,\ldots, n-1\}$ to form a multiset $S_1$ and then you pick $n$ values with replacement uniformly iid from $S_1$ to form a multiset $S_2$, and you seem to be interested in the number of distinct values in $S_1 \cap S_2^c$.
Consider an individual value $k$. The probability $k$ is not in $S_1$ is $\left(\frac{n-1}n\right)^n$, and the probability $k$ is in $S_1$ is $1-\left(\frac{n-1}n\right)^n$. You can go further and say that the probability $k$ appears $t$ times in $S_1$ is ${n\choose t}\frac{(n-1)^{n-t}}{n^n}$.
If $k$ appears $t$ times in $S_1$, then the probability $k$ does not appear in $S_2$ is $\left(\frac {n-t}n\right)^n$, so the probability $k$ appears in $S_1$ but not $S_2$ is $$\sum\limits_{t=1}^n {n\choose t}\frac{(n-1)^{n-t}}{n^n}\left(\frac {n-t}n\right)^n$$ and so this is also the expected proportion of the distinct original $n$ values in appearing in $S_1 \cap S_2^c$.
For $n=1,2,3,4,5$ this gives $0, 0.125, 0.13992, 0.14685, 0.150669$ respectively while for $n=1000000$ it gives just over $0.163584$, which seems to match your observation.
As an approximation for large $n$, the number of times $k$ appears in $S_1$ is approximately Poisson distributed with mean $1$ so $P(T=t) \approx \frac{e^{-1}}{t!}$; with $t=0$ this matches your $\frac1e \approx 0.3678$. Meanwhile $\left(\frac {n-t}n\right)^n \approx e^{-t}$ for large $n$ and fixed $t$, so we are looking at $\sum\limits_{t=1}^\infty \frac{e^{-1}}{t!}e^{-t} $ $= e^{-1}\left(e^{e^{-1}}\sum\limits_{t=0}^\infty \frac{e^{-e^{-1}}(e^{-1})^t}{t!} - 1\right)$ $= e^{-1}\left(e^{e^{-1}} - 1\right) \approx 0.1635842.$