How to overcome the wastefulness of uniform distributions in a hash function

196 Views Asked by At

Recently on the programming StackOverflow I posted a query about my hash function - I was looking for mistake when there was none. I was alarmed by an empty bucket count of around 36%, as I thought, for an uniform distribution, that that number was erroneously high. Later today I figured it out using the binomial distribution and I realised that the counter-intuitive (similar to birthday problem) 36% “inefficiency” makes perfect sense. Since something cannot surely be more fair than a uniform distribution, how can a real world example of a hashing system reliably and evenly distribute data across ALL its buckets - is it even possible? The programmers at StackOverflow tell me that using a modulus of a large prime would help, but mathematically that wouldn’t change the binomial result of 0.3679... - or would it? Please enlighten me or tell me that I’m worrying over nothing and that the 0.367... is unavoidable! Thanks

Note: I am aware that a real world example would have more data points than buckets which would make the number of empty buckets decrease to zero (or would it? I’ve lost confidence in my probability intuitions) - I’m curious about the one-to-one case, where we have near identical numbers in search of a uniform spread

The equations I used to calculate probabilities were: (excuse the mobile phone formatting)

Let N = the number of buckets in the table

Let C = the number of repeat entries or collisions.

The probability of any bucket having C collisions is:

nCr(N, C) * (1/N^C) * (N-1/N)^(N-C) ≈ 0.368, when C = 0 [empty, wasted bucket], as N tends to infinity.

1

There are 1 best solutions below

5
On BEST ANSWER

If we have $N$ buckets and place $M$ entries in them, with equal probability, the number of elements in each bucket, for large $N,M$, approximately follows ("Poissonization") a Poisson distribution $P_k= e^{-\lambda} \frac{\lambda^k}{k!}$ with $\lambda=M/N$.

If we denote $X_i=1$ if the bucket $i$ is empty, $0$ otherwise, we have $E[X_i]\approx e^{-M/N}$ , and if $Z=\sum X_i$ denotes the number of empty buckets we have

$$E[Z] \approx N e^{-M/N}$$

Hence the proportion of empty buckets is approximately $e^{-M/N}$. If $M=N$ we get $0.368$, the $36\%$ you are getting. Notice, however, that this goes to $13.5\%$ for $M=2N$ and so on,

You cannot expect having a better occupation rate for a "universal" hashing function.

Only in special cases (where your key domain has some restrictions) you can do better, see eg https://en.wikipedia.org/wiki/Perfect_hash_function

Edit: To prove that the uniform distribution minimizes the expected number of empty buckets you might write

$$E[Z]=\sum_{i=1}^N E[X_i]= \sum_{i=1}^N (1-h_i)^M $$

where $h_i$ is the hashing distribution function. You need only to prove that $h_i=1/N$ attains the minimum, subject to the restrictions $ \sum_{i=1}h_i=1$, $h_i \ge 0$.