Scaling a threshold for anomaly detection depending on the sample size


I'm studying and testing 32-bit hash functions - specifically, the probability of collisions (repeated outputs for different inputs). I'm struggling to define a threshold for outliers/anomalies, because the acceptable random variance, expressed as a percentage of the ideal, appears to tighten as the ideal number of collisions increases.

First, this formula gives the approximate expected number of collisions after $n$ outputs of an ideal 32-bit hash function (one indistinguishable from random): $$ideal=\frac{n^2-n}{2(2^{32})}$$

This can be considered the ideal result to compare against.

So, given two sample sizes $n=466550$ and $n=1470580$, the expected numbers of collisions are $25.339941022684798$ and $251.76024830667302$, respectively.
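For concreteness, the two ideal values above can be computed directly from the formula (the function name `ideal_collisions` is my own):

```python
def ideal_collisions(n, bits=32):
    """Expected number of collisions among n outputs of an ideal hash
    function with 2**bits possible values: (n^2 - n) / (2 * 2**bits)."""
    return (n * n - n) / (2 * 2 ** bits)

print(ideal_collisions(466550))   # ≈ 25.339941...
print(ideal_collisions(1470580))  # ≈ 251.760248...
```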

And my problem is this: 39 is a 53.9072% increase over 25.33994, and 282 is only a 12.0113% increase over 251.76, yet the probability of observing each of those collision counts, given its sample size, should be roughly comparable. That is to say, the same 53% increase would be considered much more of an outlier at a higher sample size than at a lower one.

For reference, I calculate the percent increase/decrease using this basic formula:

$$\frac{collisions - ideal}{ideal}\times100$$
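Applied to the two examples above (helper name `pct_deviation` is my own):

```python
def pct_deviation(collisions, ideal):
    """Percent increase/decrease of an observed collision count
    relative to the ideal: (collisions - ideal) / ideal * 100."""
    return (collisions - ideal) / ideal * 100

print(pct_deviation(39, 25.339941022684798))    # ≈ 53.9072
print(pct_deviation(282, 251.76024830667302))   # ≈ 12.0113
```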

And so, I'm wondering whether these percentages can be normalized or scaled to be proportional, knowing only the ideal collision value described above and the sample size $n$, so that a threshold for identifying anomalies can be scaled accordingly.

Or in other words: can we determine the probability of getting 39 collisions after generating 466550 hashes, if we know that the ideal (or mean) is 25.339941022684798? And how would that compare to the probability of getting 282 collisions after generating 1470580 hashes, if we know the ideal is 251.76024830667302?
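For context, here is one framing I've been experimenting with (a sketch, not something I'm sure is correct): if the collision count is modeled as approximately Poisson-distributed with mean equal to the ideal value, the tail probabilities $P(X \ge 39)$ and $P(X \ge 282)$ of the two observations could be compared directly. The `poisson_tail` helper below is my own, stdlib-only:

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): 1 minus the CDF summed
    over 0..k-1, with the pmf accumulated in log space for stability."""
    log_pmf = -lam              # log of pmf(0) = exp(-lam)
    cdf = math.exp(log_pmf)
    for i in range(1, k):
        # pmf(i) = pmf(i-1) * lam / i, done in logs to avoid overflow
        log_pmf += math.log(lam) - math.log(i)
        cdf += math.exp(log_pmf)
    return 1.0 - cdf

p_small = poisson_tail(39, 25.339941022684798)    # chance of >= 39 collisions
p_large = poisson_tail(282, 251.76024830667302)   # chance of >= 282 collisions
print(p_small, p_large)
```

If this model is appropriate, the two tail probabilities would be the sample-size-aware quantities to threshold on, rather than the raw percent increases.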