What's the error in this birthday-problem estimation?

55 Views Asked by At

Theorem. Choose $Q$ random natural numbers in the set $\{1,2, ..., M\}.$ The probability of getting at least one collision is

$$P_C(Q) = 1 - \frac{M - (Q - 1)}{M} P_{\neg C}(Q-1).$$

Notation. By $P_C$, I mean the probability of getting a colision. By $P_{\neg C}$ I mean the probability of not getting a collision.

Remark. This is the birthday problem.

Remark. So $P_C(Q)$ is just being computed by using its complement. The reason I express the theorem this way is because its induction proof relates directly to it being enunciated this way.

Theorem. $$P_C(Q) \approx 1 - e^{-\tfrac{(Q-1)Q}{2M}}.$$

Proof. We know $$e^{-x} = 1 - x + x^2/2! - x^3/3! + x^4/4! - ...$$

If we take the two terms of this expansion, we get $e^{-x} \approx 1 - x$. Then

\begin{align} P(Q)&= \prod_{i=1}^{Q-1} \left(1 - \dfrac{i}{M}\right)\\ &\approx \prod_{i=1}^{Q-1} e^{-i/M} \\ &= e^{-1/M} e^{-2/M} \dotsc\ e^{-(Q - 1)/M} \\ &= e^{-\sum_{i=1}^{Q-1} i/M} \\ &= e^{-\dfrac{1}{M} (Q-1)Q/2}\\ &= e^{-\dfrac{(Q-1)Q}{2M}}, \end{align}

So $P_C(Q) \approx 1 - P_{\neg C}(Q),$ as desired.

Question. How can I (at least) have some notion of how off the right number I am if I use this estimation to compute the probability of getting a collision in a concrete case?

1

There are 1 best solutions below

4
On BEST ANSWER

Some empirical results:

Your approximation for the probability of a collision of $1-e^{-{(Q-1)Q}/({2M})}$ is never too high and is only exact when $Q=1$ or $0$

Clearly there is an error when $Q\gt M$, as the probability of a collision is $1$ while your approximation gives a value slightly less than $1$. In particular then $Q=M+1$ the error is $e^{-({M+1})/{2}}$, which is very small for large $M$ and the error gets even smaller when $Q$ is larger still

This is not the largest error when $M\gt 2$. Instead the largest error then seems to occur when $Q \approx \sqrt{3M}$ and seems to be between $\dfrac{0.19}{\sqrt{M}}$ and $\dfrac{0.26}{\sqrt{M}}$, declining slowly as $M$ increases

As an illustration, with $M=10$ the largest error seems to be when $Q=6$, which is near $\sqrt{30}$, with the exact probability of a collision being $0.8488$ but your approximation giving about $0.77687$; the difference of about $0.07193$ is around $\frac{0.2274}{\sqrt{10}}$. The following is a table of the probabilities of collisions for different $Q$s when $M=10$

    Q  Exact_prob Approx      Error
    1  0          0.00000000  0
    2  0.1        0.09516258  0.00483742
    3  0.28       0.25918178  0.02081822
    4  0.496      0.45118836  0.04481164
    5  0.6976     0.63212056  0.06547944
    6  0.8488     0.77686984  0.07193016
    7  0.93952    0.87754357  0.06197643
    8  0.981856   0.93918994  0.04266606
    9  0.9963712  0.97267628  0.02369492
   10  0.99963712 0.98889100  0.01074612
   11  1          0.99591323  0.00408677
   12  1          0.99863963  0.00136037
   13  1          0.99959027  0.00040973
   14  1          0.99988833  0.00011167
   15  1          0.99997246  0.00002754
   16  1          0.99999386  0.00000614
   17  1          0.99999876  0.00000124
   18  1          0.99999977  0.00000023
   19  1          0.99999996  0.000000037
   20  1          0.99999999  0.0000000056