Probability of collision with a hashing function

841 Views Asked by At

I have a data set with N = 80,000 supposedly unique individuals. Individuals are coded by a 9 digit hexadecimal "hashed" version of their 9 digit decimal US SSN. I have no idea what the "hashing" function is, but there appears to be collisions. I am interested in knowing what the expected number of collisions is with a general purpose 36 bit hashing function and 80,000 entries.

1

There are 1 best solutions below

1
On BEST ANSWER

This is an odd question, you are increasing the amount of bits required to encode the SSN. A decimal is encodable is roughly 3.32 bits, a hex value is 4 bits, so you could uniquely describe every single US SSN possible in that value, with room to spare, much less your 80,000. This means there should be no collisions, for reference. You could encode every SSN Possible in roughly 30 bits, so a 32 bit integer would be sufficient (available in most programming platforms, as I assumed this is for a program).

As far as hashing algorithms go, there is no "general," hashing mechanism, there are a couple of common ones, sum mod n and md5 leap to mind, however these aren't really standard.

The goal of a hashing algorithm is, in the context of this discussion, to create a uniform distribution of the domain to the range. This will allow for the fewest hash collisions. The probability of a collision, assuming this is working correctly, given hashing n items into k locations is $n-k+k(1-1/k)^n$, which in your case is $80000-2^{36}+2^{36}(1-1/2^{36})^{80000}$, which is 4.6%.

https://math.dartmouth.edu/archive/m19w03/public_html/Section6-5.pdf