How many SHA-256 hashes of emails are duplicates of each other?

1.9k Views Asked by At

There are $5$ billion unique email addresses in the World. If I created a database containing their SHA-256 hashes, how any unique hashes would we expect that database to contain?

By my crude methods, a SHA-256 hash is $256$ bits long so there are $k=2^{256}$ possible hashes. Therefore the probability of any two email addresses chosen at random having the same hash is roughly $p=\frac{1}{k}=2^{-256}$.

Let $n=5\times10^9$ be the number of distinct email addresses.

Therefore the expected number of duplicates of any one chosen hashed email address is given by $E=np=4.3\times10^{-68}$

Now suppose we need to check all hashed email addresses in turn for duplicates we will expect to find $n^2p=2.16\times10^{-58}$ email addresses having a duplicate hash. We would then roughly halve this since any duplicates would have been found twice (and by the smallness of the probability, the likelihood of there existing triplets does not materially affect the result).

So in intuitive terms, we would need to inhabit $10^{58}$ planet Earths to the same density as we do this one, before the expected number of duplicate hashed email addresses reached $1$.

Therefore we can be extremely certain that every hash will be unique. Is this correct, and can this estimate be refined or improved? A key (reasonable?) assumption I've made is that the common structure which email addresses share does not impact the likelihood of hashes matching.

2

There are 2 best solutions below

1
On BEST ANSWER

Your reasoning is basically right.

You can compute two related but different things.

First, the expected number of coincidences (or collisions). This is given by

$$C= \frac{n (n-1)}{2} \frac{1}{k} \approx \frac{n^2}{2k} $$

Then the probability that there is at least a collission. This is

$$P = 1- \frac{k!}{n!}\frac{1}{k^n} \approx 1 - \exp \left( -\frac{n^2}{2k}\right)$$

Both numbers almost coincide if the probability of collissions is small. For details see the birthday problem.

0
On

This is all very reasonable and the calculations are correct.

As an extension, often we are interested not in the expected value being $1$ but the expected value being "reasonably close" to $1$. You can rework this to obtain a formula for the expected value as a function of the number of earths, and use that to answer questions such as "when expected value > $1/2$" or "when is expected value > $1-\varepsilon$."