Approximating a probability threshold for hash collisions

129 Views Asked by At

First off: I know this is on the verge of being subjective and i'll try my best to make it as specific as possible.

I am in a situation where I need to generate a fixed number of hashes (k), say 10,000. My goal is to choose a small as possible number space for those hashes (N) while keeping collision probability below a fixed threshold (p).

The collision probability can be approximated by: $p\left(k,N\right)=1-e^{\frac{-k\left(k-1\right)}{2N}}$

The key question now is: How low is low enough for my collision threshold? I've tried numerous things to come up with a number:

  • Comparing various unlikely events like winning the Powerball, meteorite hits, lightning strikes, etc. (While doing this I found out that China is going to be hit by a meteorite in 2027)
  • Looking into the failure rates of technical items like uncorrectable bit errors in hard drives
  • Read a lot of Stack, Mathematics and CS posts about the topic which all explain the birthday problem and show me why I have a problem but are unfortunately not helping to solve it.

Yet nothing seems to do the trick for me. The system I am implementing relies on the hashes generated and collision would corrupt the functionality of the system at its core. Unfortunately perfect hashing is not an option for reasons not relevant here.

Does anyone have an Idea how I can tackle this problem and come up with a solid, profound probability threshold that is not just made up from the top of my head?

1

There are 1 best solutions below

2
On

It all depends on cost of increasing space of hashes and cost of collision (and number of trials you'll have, below I'll assume it's $1$) - there is no universal answer. If you, say, can make a chip at cost of one dollar with space of $10^9$ hashes, chip at cost of two dollars with space of $10^{18}$ or chip at cost of three with space of $10^{30}$, then the first is better if cost of failure is less then 10 dollars, the second if it is from 10 dollars to 10 billion dollars, and the third is better if cost of failure is greater then 10 billion dollars.