Hash functions for unordered data

671 Views Asked by At

I am interested in stream hash functions for unordered data: Hash functions $g$ on lists which, given a two lists of integers $L_1, L_2$ where $L_2$ is a permutation of $L_1$, $g(L_1)=g(L_2)$.

I can imagine some ways of constructing such a function. One obvious method is to just add the members of the list together. This fails badly in many cases (for example the lists [1], [1,0] produce the same hash).

A slightly better function can be generated by first using a 'good' hash to hash the members of $L$, and then adding the results, but this still gives a bad hash function (for example for any list [A,A], the last bit of the hash will be 0).

An obvious 'good' way of hashing would just be to sort $L$ and then use any hash function on the result, but I would prefer not to pay that expense if it is avoidable.

Is there any published research on such hash functions? Will any stream hash function on unordered lists be inevitably of poor quality?

1

There are 1 best solutions below

0
On

Since you're looking at unordered data, the only thing you care about are the counts of the distinct elements in each stream. Luckily, there's a data structure designed specifically for this purpose, the count-min sketch.

You haven't mentioned all of your specific requirements, but it does satisfy the one requirement you do mention (i.e.: that hashes are stable under permutations of the input stream). I'm fairly confident that you can control the level of collision-resistance by varying the number of rows and columns of the sketch. However, if you require pre-image resistance, you'll probably need to layer on another level of hashing.