If an item is chosen at random $k$ times from a set of $n$ items, the probability the chosen items are all different is exactly $\dfrac{n^\underline{k}}{n^k}=\dfrac{n!}{(n-k)!n^k}$. For large $n$, the expression is said to be approximately equal to $\exp\left(\dfrac{-k(k-1)}{2n}\right)$, which works out to probability of collision of about $\dfrac{k^2}{2n}$ for $1 \ll k \ll n$. How does one derive the former approximation? Apparently the Stirling formula first, and then I see some terms that remind me of $\left(1+\dfrac1x\right)^x \approx e^x$, but it doesn’t quite work out for me.
2026-03-28 09:54:29.1774691669
Hash Collision Probability Approximation
302 Views Asked by Bumbble Comm https://math.techqa.club/user/bumbble-comm/detail At
1
The probability of choosing different items is
$$ \left(1-\frac0n\right)\left(1-\frac1n\right)\left(1-\frac2n\right)\cdots\left(1-\frac{k-1}n\right)\;. $$
For $n\gg k^2$, we can keep only the terms up to first order in $\frac1n$ to obtain
$$ 1-\frac1n\sum_{j=0}^{k-1}j=1-\frac{(k-1)k}{2n}\;. $$
To get the exponential form, you can either approximate this directly as $\exp\left(\dfrac{-k(k-1)}{2n}\right)$ or first approximate the factors in the product:
$$ \exp\left(-\frac0n\right)\exp\left(-\frac1n\right)\exp\left(-\frac2n\right)\cdot\exp\left(-\frac{k-1}n\right)=\exp\left(-\frac1n\sum_{j=0}^{k-1}j\right)=\exp\left(-\frac{(k-1)k}{2n}\right)\;. $$
The error in the exponential form is $O\left(k^3/n^2\right)$, whereas the error in the first form is $O\left(k^4/n^2\right)$, since the exponential form only has incorrect $\frac1{n^2}$ terms for each factor individually, whereas the first version drops an $\frac1{n^2}$ term for each pair of factors.