How do I caclulate the probability of hash collisions?

455 Views Asked by At

I have a 10Gb file and the entire file is overwritten with random data every day.

Afterwards, I divide the data into blocks and hash each block to generate a fingerprint.

I am trying to choose an appropriate1 hash size to make it unlikely that that will be a collision (a false match in the hash for a particular block as compared to the previous iteration: a match is OK if the data is randomly the same but not if it's showing the same hash for different data).

How can I calculate the probability of a false match in 10 years given the block size and hash length?


1 appropriate means: as small as possible without making the first collision more than 50% likely in 10 years

1

There are 1 best solutions below

1
On BEST ANSWER

Assuming that the hash function behaves like a random oracle, then the probability that any given block hashes to the same value as the previous version of the same block is 2-n, for a hash function output size of n bits. If you have k blocks in your file, then, over ten years, you are computing k·3653 block hashes (or k·3652 or k·3654, depending on leap years), so the probability of not hitting a collision in 10 years will be equal to (1-2-n)k·3653. It is simpler to count the average time between two occurrences of a collision: it will be 2n/k days.

For instance, if using 1 kB blocks in a 10 GB file, then k = 107 (ten million blocks). If n = 128, then the average time between two collisions will be close to 1040 years, i.e. a lot of time. Said otherwise, if you want collisions to occur once every 3653 days on average, then you need a hash function with an output of about 35 bits, because 235 is roughly equal to 3653·107. More bits lower collision rate, of course.


All of the above assumes that you are fighting against random collisions. If some evil attacker tries to force collisions to happen, and can choose the contents of some blocks, then you need the hash function to be collision resistant up to the computational power that the attacker could muster; in that case, use a hash function with an output of at least 160 bits, corresponding to a theoretical resistance of 280 (the traditional limit of attacker's power). Note that some theoretical weaknesses have been found in SHA-1 (the usual 160-bit hash function) so you would prefer using SHA-256, which offers a 256-bit output. If size is constrained, then you may want to truncate the SHA-256 output, but don't reduce it below 160 bits.

An interesting point is that when fighting an attacker, then how many hashes you compute is no longer important; what matters is how many the attacker may be able to compute, and he may compute billions per second, way more than your millions per day.