Hash collision probability

973 Views Asked by At

I am trying to show that the probability of a hash collision with a simple uniform 32-bit hash function is at least 50% if the number of keys is at least 77164. I have figured out how to plot a graph on python and then read off the values and percentages there, but I can't seem to figure out a formal proof. I've played around a bit and have come to this formula

$$1-e^{-k(k-1)/(2N)}.$$

That is what I used in python and I used a loop to sub in different values for $k$. Does anyone know how I can use this to solve it by hand since I don't think my proof via python would be valid?

1

There are 1 best solutions below

0
On

First of all, the formula is approximate, see the below part for the approxiamation;

$$ p(n;H) \approx 1 - e^{-n(n-1)/(2H)} \approx 1-e^{-n^2/(2H)} \tag{1}\label{1}$$

This is the probability of uniform randomly selecting $n$ values from $H$ values with repetitions.

What I understand that you need a graph that shows the percentages like $10\%, 20\%, 30\%, 40\%,$, etc. For this, you need to invert the above expression \ref{1}.

$$n(p;H) \approx \sqrt{2H\ln\frac{1}{1-p}}$$

Simply plug your required probability to get the required selection.

$$n(0.5;H) \approx 1.774 \sqrt{H}$$

If we plug $H=2^{32}$ on Wolfram Alpha, see here, then

$$n(0.5;H) \approx \sqrt{2\cdot 2^{32} \ln\frac{1}{1-(1/2)}} = 77162.74324...$$

I can't seem to figure out a formal proof.

The formal proof comes with the birthday pardox, we will look at the only hash part, where the hash function has output space $H$. Let \bar p(n) is the probability of no two hash value collide is then $$p(n) = 1 - \bar p(n) \tag{2}\label{2} $$ is the probability of at least two hash values are colliding.

$$\bar p(n) = \left(\frac{H-1}{H}\right) \cdots \left(\frac{H-k}{H}\right) = \prod_{k=1}^{n-1}\left(1-\frac{k}{H}\right) \tag{3}\label{3}$$ where $p(n)$ is the probability of at least two hash is colliding.

The approximation starts with $e^x \approx 1 +x $ for $|x| \ll 1$

Set $x = - \frac{a}{H}$ then $e^{-a/H} \approx 1 - \frac{a}{H}$. Now use this formula to replace each of the product in equation \ref{3}

\begin{align} \bar p(n) & \approx e^{-1/H} \cdot e^{-2/H} \cdots e^{-(n-1)/H}\\ & \approx e^{-\left( 1+2+\cdots + (n-1)/H\right)}\\ & \approx e^{-n(n-1)/(2H)} \end{align}

Now use \ref{3} with the above result $$ p(n) = 1 - \bar p(n) \approx 1 -e^{-n(n-1)/(2H) $$ We can now parametrize as

$$p(n) \approx 1 - e^{-n(n-1)/(2H)} \approx 1-e^{-n^2/(2H)} $$

Note: I've used the $n$ instead of $k$.