Hash Collision Probability Approximation

302 Views Asked by At

If an item is chosen at random $k$ times from a set of $n$ items, the probability the chosen items are all different is exactly $\dfrac{n^\underline{k}}{n^k}=\dfrac{n!}{(n-k)!n^k}$. For large $n$, the expression is said to be approximately equal to $\exp\left(\dfrac{-k(k-1)}{2n}\right)$, which works out to probability of collision of about $\dfrac{k^2}{2n}$ for $1 \ll k \ll n$. How does one derive the former approximation? Apparently the Stirling formula first, and then I see some terms that remind me of $\left(1+\dfrac1x\right)^x \approx e^x$, but it doesn’t quite work out for me.

1

There are 1 best solutions below

0
On BEST ANSWER

The probability of choosing different items is

$$ \left(1-\frac0n\right)\left(1-\frac1n\right)\left(1-\frac2n\right)\cdots\left(1-\frac{k-1}n\right)\;. $$

For $n\gg k^2$, we can keep only the terms up to first order in $\frac1n$ to obtain

$$ 1-\frac1n\sum_{j=0}^{k-1}j=1-\frac{(k-1)k}{2n}\;. $$

To get the exponential form, you can either approximate this directly as $\exp\left(\dfrac{-k(k-1)}{2n}\right)$ or first approximate the factors in the product:

$$ \exp\left(-\frac0n\right)\exp\left(-\frac1n\right)\exp\left(-\frac2n\right)\cdot\exp\left(-\frac{k-1}n\right)=\exp\left(-\frac1n\sum_{j=0}^{k-1}j\right)=\exp\left(-\frac{(k-1)k}{2n}\right)\;. $$

The error in the exponential form is $O\left(k^3/n^2\right)$, whereas the error in the first form is $O\left(k^4/n^2\right)$, since the exponential form only has incorrect $\frac1{n^2}$ terms for each factor individually, whereas the first version drops an $\frac1{n^2}$ term for each pair of factors.