I'm trying to come up with an effective way of handling slight variations in sets of numbers when hashing in an Information Retrieval system. I have a simple solution that is reasonably effective and is definitely computationally cheap, but it's not accurate in edge cases. I'm interested to see if there's another trick that can improve on this without adding much time complexity.
To give a working example of what I mean, $A$ and $B$ below have almost identical values, which in this particular system would indicate they are actually the same document. The sum of each set below is (arbitrarily) exactly $1145$; however, set $C$ clearly comprises very different values and, as such, is considered highly dissimilar. In practice, these sets will often have reasonably similar sums but relatively distinct values. Also, their order is not relevant, so $A$ and $B$ could be completely shuffled and they'd still be considered equally similar.
$A = \{203, 145, 305, 236, 222, 34\}$
$B = \{205, 145, 304, 235, 221, 35\}$
$C = \{100, 300, 310, 100, 150, 185\}$
Maths is not my strong suit, so my current solution is basic: take the sum of the squares of each value and compare the variance between the two sums. I have also tried the sum of the cubes, but I'm not sure how much this improves things; I'll write up a test in due course.
The sum of the squares seems to work well for a lot of random sets, and generally produces variances well in excess of $10\%$, but some edge cases can of course return very similar values. In the example given, we can see that it doesn't work too well:
$Var(A,B) = 0.24\%$
$Var(A,C) = 0.79\%$
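For reference, here's a minimal sketch of the comparison I'm doing, assuming "variance" means the relative difference between the two sums of squares (the exact percentages depend on how the difference is normalised, so they may not match my figures above exactly):

```python
def sum_of_squares(values):
    """Order-insensitive signature: the sum of squared elements."""
    return sum(v * v for v in values)

def variance_pct(x, y):
    """Relative difference between two signatures, as a percentage."""
    sx, sy = sum_of_squares(x), sum_of_squares(y)
    return abs(sx - sy) / max(sx, sy) * 100

A = [203, 145, 305, 236, 222, 34]
B = [205, 145, 304, 235, 221, 35]
C = [100, 300, 310, 100, 150, 185]

print(variance_pct(A, B))  # small: A and B are near-identical
print(variance_pct(A, C))  # larger, but still under 1% -- the edge case
```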
So, does anyone have any suggestions for a hash function with this property: a hash that can determine, with a high degree of confidence, that two sets are almost identical?
Thanks, Rich.
Sounds like a problem in descriptive statistics: "How do I come up with a value (or values) that characterizes a set of numbers?" Do you absolutely have to hash to a single number? If not, I'd suggest that the mean and standard deviation give a very good idea of a set of numbers, and are insensitive to ordering. (They encode the same information as the sum and sum of squares, but are scaled a bit better.)
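To make that concrete, here's a quick sketch using Python's standard `statistics` module on your example sets. One caveat worth noting: because all three sets sum to the same value and have the same length, their means are identical, so in this particular example the standard deviation does all the distinguishing work:

```python
import statistics

A = [203, 145, 305, 236, 222, 34]
B = [205, 145, 304, 235, 221, 35]
C = [100, 300, 310, 100, 150, 185]

def signature(values):
    """Order-insensitive (mean, population standard deviation) pair."""
    return statistics.mean(values), statistics.pstdev(values)

for name, s in [("A", A), ("B", B), ("C", C)]:
    print(name, signature(s))  # all three means are 1145/6
```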
Another possibility: if you have a (large) collection of number sets that are characteristic of the kind you expect to see, you could run some sort of machine learning algorithm on your data, and it would identify the most useful statistic(s) for distinguishing your samples. Principal component analysis is the one I would start with. You said math isn't your strong suit, but I'll bet there's a Python package that could do it easily. Unfortunately, I only know the abstract theory behind it and can't point you to an easy-to-use implementation.
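For what it's worth, the abstract idea can at least be sketched in plain Python for a two-feature case (a toy illustration with a hypothetical corpus; for real data you'd reach for a package such as scikit-learn's `PCA`): summarise each set by a (mean, standard deviation) feature vector, centre the features, and take the dominant eigenvector of the resulting 2x2 covariance matrix. That direction is the first principal component, i.e. the single combined statistic that best spreads out your samples.

```python
import math
import statistics

# Hypothetical example corpus: each document is a set of numbers.
corpus = [
    [203, 145, 305, 236, 222, 34],
    [205, 145, 304, 235, 221, 35],
    [100, 300, 310, 100, 150, 185],
    [50, 60, 55, 58, 52, 61],
    [400, 390, 410, 395, 405, 398],
]

# Step 1: summarise each set by a (mean, stdev) feature vector.
features = [(statistics.mean(s), statistics.pstdev(s)) for s in corpus]

# Step 2: centre the features.
mx = statistics.mean(f[0] for f in features)
my = statistics.mean(f[1] for f in features)
centred = [(x - mx, y - my) for x, y in features]

# Step 3: 2x2 covariance matrix [[a, b], [b, c]].
n = len(centred)
a = sum(x * x for x, _ in centred) / n
b = sum(x * y for x, y in centred) / n
c = sum(y * y for _, y in centred) / n

# Step 4: largest eigenvalue of the symmetric 2x2 matrix, closed form.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

# Its eigenvector is the first principal component direction.
if abs(b) > 1e-12:
    vx, vy = lam - c, b
else:
    vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Projecting each document onto pc1 gives one score per document;
# by construction these scores have the largest possible variance
# among all single directions in the (mean, stdev) plane.
scores = [x * pc1[0] + y * pc1[1] for x, y in centred]
print(pc1, scores)
```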