Birthday problem with large $n, d$ values

121 Views Asked by At

In the Birthday problem, the formulas

$${\displaystyle {\begin{aligned}p(n;d)&={\begin{cases}1-\displaystyle \prod _{k=1}^{n-1}\left(1-{\frac {k}{d}}\right)&n\leq d\\1&n>d\end{cases}}&\approx 1-e^{-{\frac {n(n-1)}{2d}}}&\approx 1-\left({\frac {d-1}{d}}\right)^{\frac {n(n-1)}{2}}\end{aligned}}}$$

work well for $d = 365$ and $n=23$, and gives the usual estimation that if you have 23 people in the same room, the probability to have at least two people born the same day is $\geq 50 \%$.

Question: what formula is available for $p(n; d)$ with more precise error terms?


Concrete application:

I'm using random 5-alphanumeric-character identifiers for an inventory of objects.

Example: V4QH7, WYJ9X, LK6H4, etc.

If I have $n = 10,000$ objects, what is the probability that at least 2 objects have the same ID?

Note: the last formula (the one after the one with exponential function above) gives Error, numeric exception: overflow in Maple when I take $d=(26+10)^5=60,466,176$ and $n=10,000$.

The formula with exp gives $p \approx 3.8 \%$ but since no error term is given, I don't know if this is accurate.


Edit: Mistake: $p \approx 3.8 \%$ was obtained when I took $d=33^6$ (6-alphanumeric characters with a few letters removed for easier identification: I vs 1, etc.). With $d=36^5$, we get $56.3 \%$ probability of having a collision with the exp formula above, which is in accordance with the accepted answer.

1

There are 1 best solutions below

2
On BEST ANSWER

This can be done pretty straightforwardly as follows: we want to upper bound

$$- \sum_{k=1}^{n-1} \log \left( 1 - \frac{k}{d} \right)$$

and we can do this by applying an upper bound for $f(x) = - \log \left( 1 - x \right)$. On $x \in [0, a]$, Taylor's theorem with remainder gives that

$$|f(x) - x| \le \frac{f''(\xi)}{2} x^2$$

where $\xi$ maximizes $f''(x)$ for $x \in [0, a]$. We have $f'(x) = \frac{1}{1 - x}$ and $f''(x) = \frac{1}{(1 - x)^2}$, which is maximized when $x = a$. This gives that

$$- \log (1 - x) \le x + \frac{x^2}{2(1 - a)^2}, x \in [0, a]$$

and setting $a = \frac{n}{d}$ gives

$$- \sum_{k=1}^{n-1} \log \left( 1 - \frac{k}{d} \right) \le \sum_{k=1}^{n-1} \left( \frac{k}{d} + \frac{k^2}{2 (d^2 - n^2)} \right) \le \frac{1}{d} {n \choose 2} + \frac{n^3}{6(d^2 - n^2)}.$$

On the other hand, by convexity we have $-\log (1 - x) \ge x$ which gives a lower bound

$$- \sum_{k=1}^{n-1} \log \left( 1 - \frac{k}{d} \right) \ge \frac{1}{d} {n \choose 2}.$$

Exponentiating both of these gives

$$\boxed{ 1 - \exp \left( - \frac{1}{d} {n \choose 2} \right) \le p(n, d) \le 1 - \exp \left( - \frac{1}{d} {n \choose 2} - \frac{n^3}{6(d^2 - n^2)} \right) }.$$

Plugging in $n = 10^4, d = 36^5$ gives

$$\boxed{ 0.5625\color{red}{6} \ldots \le p(n, d) \le 0.5625 \color{red}{8} \ldots }$$

which disagrees with your answer. Heuristically these estimates tell us to expect a non-negligible probability of a collision once $d \approx {n \choose 2}$ (equivalently, once the expected number of collisions, which is exactly $\frac{1}{d} {n \choose 2}$ and which is an upper bound on the probability, is approximately $1$), and we have ${n \choose 2} \approx 5 \times 10^7$ and $d \approx 6 \times 10^7$, so if you were given these numbers by someone else that was probably deliberate.