Finding a minimal perfect hash function for small sets quickly

1.2k Views Asked by At

I'm trying to solve the computer science problem "Minimal perfect hash function" (MPHF). I have an algorithm that can generate a MPHF for very large sets in $O(n)$ that only needs 1.54 bits/key, very close to the theoretical limit of 1.44 bits/key, and it is relatively fast. But I would like to make it faster. To do that, I need to find MPHFs for "small" sets, lets say of size $n=16$. Right now I can do that in $O(\frac{n^n}{n!})$, but maybe there is a $O(n \log{n})$ algorithm.

I would like to reformulate the problem to another domain (for example Chinese remainder theorem, prime number factorization, Galois Field, finding the GCD), and solve it like this, or prove brute force is the the best algorithm.

Warning: I'm not good at math, so the following is probably ridiculous, but maybe somebody can help to improve my question and use the right tags.

So I came up with a (integer only) formula that has multiple solutions, and one unknown, $x$. I'm trying to find the smallest $x$ quickly. There are a number of known, randomly distributed, positive integers $h_n$, in the range [0 .. 65536). The size $s$ is quite small, let's say 16 (the higher the better). The right hand side of the equation is a constant.

$$ \sum_{n=1}^{s} {s ^ {(\lfloor\frac{h_n x}{65536}\rfloor \bmod s)}} = \sum_{n=1}^{s} {s ^ {n}} $$

How can the smallest $x$ be found quickly. For $s = 10$, $x$ is probably below 12'000, with $s = 11$ below 32'000, with $s = 12$ below 85'000, and so on.

The $\sum_{n=1}^{s} {s ^ {n}}$ is a weird way to do "bitwise OR". What I'm trying to do is ensure there are no collisions, or written in another way, in case $s=3$:

$$ \lfloor\frac{h_1 x}{65536}\rfloor \not\equiv \lfloor\frac{h_2 x}{65536}\rfloor \not\equiv \lfloor\frac{h_3 x}{65536}\rfloor \bmod 3 $$

The function $f = (\lfloor\frac{h x}{65536}\rfloor \bmod s)$ is just an example. It is a "supplemental hash function", and is supposed to returns an integer between $0$ and $s$ with about equal probability. Any other function with the same property can be used instead. A good supplemental hash function is a bit more complex, but this one is good enough for my use case (it doesn't need to be perfect).

2

There are 2 best solutions below

0
On

The approach described in the paper "Constructing Minimal Perfect Hash Functions Using SAT Technology" describes a way to find an MPHF for a small set that is faster than brute force. There is also a reference implementation. It works for sets up to about 40 keys. Still, it only finds a solution with a certain probability at a certain size. In practise, it may not be faster than brute force, and it may not use less space: as far as I understand, it depends a lot on the SAT solver implementation.

0
On

A new algorithm has been proposed for this problem, in ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force. This is done using a small, heavily overloaded cuckoo hash table. The entries of the small set (e.g. 24 entries each time) are hashed using a seed, and then added to a cuckoo hash table that is 100% full. Each entry can be in two positions. This will likely not work, so a new seed is used. At some point this works, and the seed is stored. Storing this seed uses about 0.5 bits per entry. Also stored is, for each entry, at which location it is stored. This information is stored using a retrieval data structure - e.g. a ribbon filter, with a bit more than one bit per entry.