Beginner-level question: how to derive the probability of a hash collision


A hash function indexes items into a hash table, and near items are then retrieved via hash-table lookup. A hash table is a data structure composed of buckets, each of which is indexed by a hash code.

The hash table is defined by a function family $G = \{g : S \rightarrow U^k\}$ such that $g(p) = (h_1(p),\ldots,h_k(p))$, where $h_i \in H$, giving a $k$-bit hash code. The query point $q$ is hashed into all $l$ hash tables via $\{g_1(q),\ldots,g_l(q)\}$. The candidate set $\{p_1,p_2,\ldots,p_m\}$ consists of the points in any hash table that are hashed into the same bucket as the query point $q$. Two different messages collide when they are assigned the same hash code; the probability of this is commonly analyzed via the birthday paradox.
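For concreteness, this construction can be sketched in Python. The choice of random-hyperplane hashes for the family $H$, and the helper names `make_h`/`make_g`, are illustrative assumptions on my part, not anything fixed by the definition above:

```python
import random

def make_h(seed, dim):
    """One random hyperplane hash h_i: the sign bit of a dot product
    with a random Gaussian vector (an illustrative choice of H)."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda p: int(sum(wi * pi for wi, pi in zip(w, p)) >= 0)

def make_g(k, dim, seed=0):
    """g(p) = (h_1(p), ..., h_k(p)): a k-bit hash code."""
    hs = [make_h(seed + i, dim) for i in range(k)]
    return lambda p: tuple(h(p) for h in hs)

g = make_g(k=8, dim=3)
p, q = (1.0, 0.2, -0.5), (1.1, 0.25, -0.45)  # two nearby points
print(g(p))
print(g(q))  # nearby points tend to share many bits, often the whole code
```

With several such functions $g_1, \ldots, g_l$, each code is used as a bucket index, and the candidate set is everything sharing a bucket with the query in at least one table.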

A desirable property of a good hash function is that hash collisions should be rare (they cannot be avoided entirely once there are more keys than hash codes, by the pigeonhole principle).

How can I calculate the probability of hash collision?

Best answer:

I'll start with the answer: the number of keys falling into a given bucket quite accurately follows a Poisson distribution:

$$P(n) = e^{-\lambda}\frac{\lambda^n}{n!}$$

Here, $\lambda$ is the average number of keys per bucket, equal to the number of keys $K$ divided by the number of buckets $B$. The absolute sizes of $K$ and $B$ are essentially irrelevant, so long as both are larger than about $10$; only their ratio matters.
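As a sanity check of this claim, here is a small Python simulation (uniform random bucket assignment stands in for an ideal hash function; the names are mine) comparing the empirical occupancy of one fixed bucket with the Poisson formula:

```python
import math
import random

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

def bucket_counts(K, B, trials=200, seed=0):
    """Hash K uniformly random keys into B buckets repeatedly and
    tally how often a fixed bucket holds exactly n keys."""
    rng = random.Random(seed)
    tally = {}
    for _ in range(trials):
        counts = [0] * B
        for _ in range(K):
            counts[rng.randrange(B)] += 1
        n = counts[0]  # occupancy of one fixed bucket
        tally[n] = tally.get(n, 0) + 1
    return {n: c / trials for n, c in sorted(tally.items())}

K, B = 1000, 500  # lambda = K/B = 2 keys per bucket on average
lam = K / B
empirical = bucket_counts(K, B)
for n in range(5):
    print(n, round(empirical.get(n, 0.0), 3), round(poisson_pmf(n, lam), 3))
```

With $K = 1000$ and $B = 500$ (so $\lambda = 2$), the empirical frequencies land close to the Poisson probabilities, and they stay close for any $K, B$ with the same ratio.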

An easy way to derive this formula is to start from the formula for the number of heads in $K$ throws of a biased coin:

$$P(n\text{ heads}) = \frac{K!}{n!(K-n)!}h^n(1-h)^{K-n}$$

In this formula, $h$ is the probability that a given throw will come out heads.

The hash collision problem for a single bucket can be viewed as a coin flip where "heads" means the key landed in the bucket and "tails" means it landed somewhere else. The probability of landing in a given bucket is $1/B = \lambda/K$, so we take $h = \lambda/K$:

$$P(n\text{ heads}) = \frac{K!}{n!(K-n)!}\left(\frac{\lambda}{K}\right)^n\left(1-\frac{\lambda}{K}\right)^{K-n}$$

Now let's suppose $K$ is very large. We make three approximations:

  • $K!/(K-n)!$ is roughly $K^n$.
  • $(1-\lambda/K)^{-n}$ is roughly $1$.
  • $(1-\lambda/K)^K$ is roughly $e^{-\lambda}$. This is a good formula to know; I use it again lower down.
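The third approximation is easy to check numerically; a minimal sketch:

```python
import math

lam = 2.0
for K in (10, 100, 1000, 100000):
    # (1 - lam/K)^K should approach e^{-lam} as K grows
    print(K, (1 - lam / K) ** K)
print("e^-lambda:", math.exp(-lam))
```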

Cancel the two factors of $K^n$, rearrange a bit, and you get the Poisson distribution.
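You can also watch the binomial distribution converge to the Poisson distribution as $K$ grows with $\lambda$ held fixed; a quick numeric check (the function names are mine):

```python
import math

def binom_pmf(n, K, h):
    """Exact binomial probability of n heads in K throws."""
    return math.comb(K, n) * h**n * (1 - h) ** (K - n)

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

lam = 2.0
for K in (10, 100, 10000):
    h = lam / K
    print(K, [round(binom_pmf(n, K, h), 4) for n in range(5)])
print("Poisson:", [round(poisson_pmf(n, lam), 4) for n in range(5)])
```

Each row of binomial probabilities moves closer to the Poisson row as $K$ increases.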

But I think you are asking for the probability of two or more keys landing in any bucket, whereas I have so far given a formula for the number landing in a single bucket. The interesting case is when $\lambda$ is very small, because otherwise we're almost guaranteed to have a collision somewhere in the table. Let's look at the first few values of $P(n)$:

  • $P(0)$ is $e^{-\lambda}$, which is close to (a bit smaller than) $1$.
  • $P(1)$ is $e^{-\lambda}\lambda$, which is close to $\lambda$.
  • $P(2)$ is $e^{-\lambda}(\lambda^2/2)$, which is close to $\lambda^2/2$.
  • $P(3)$ is smaller by a further factor of $\lambda/3$, which is very small, and subsequent terms are even smaller. These cases basically don't happen, so let's ignore them.

$n=0$ and $n=1$ are not collisions, so the probability of a collision in a single bucket is about $P(2)$, which is about $\lambda^2/2$. Therefore, the probability of no collision in any bucket is about $P(\text{good}) = (1-\lambda^2/2)^B$. Here I have also made the approximation that the buckets are independent, which is good for large $B$.

We would prefer a formula involving $K$ and $B$, so let's substitute $\lambda = K/B$. This gives $P(\text{good}) = \left(1 - \frac{K^2}{2B^2}\right)^B = \left(1 - \frac{K^2/(2B)}{B}\right)^B$. For large $B$, this approaches $e^{-K^2/(2B)}$, by the same exponential limit used above.

This gives you the birthday result: the chance of a collision somewhere in the table becomes significant when $K^2$ approaches $2B$, i.e. when $K$ is on the order of $\sqrt{B}$.
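Finally, the approximation $e^{-K^2/(2B)}$ can be compared against the exact birthday-problem probability of no collision, $\prod_{i=0}^{K-1}(1 - i/B)$; a short sketch:

```python
import math

def p_no_collision_exact(K, B):
    """Exact probability that K uniformly hashed keys land in
    K distinct buckets out of B (birthday-problem product)."""
    p = 1.0
    for i in range(K):
        p *= (B - i) / B
    return p

B = 365  # the classic birthday setting: buckets = days of the year
for K in (10, 23, 50):
    exact = p_no_collision_exact(K, B)
    approx = math.exp(-K * K / (2 * B))
    print(K, round(exact, 4), round(approx, 4))
```

With $B = 365$, the probability of no collision drops below $1/2$ at $K = 23$, the classic birthday-paradox threshold, and the exponential approximation tracks the exact product closely.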