TL;DR: How bad is it to XOR MD5 checksums? Will this lose MD5's advantages, and if so, what else can be used to build a fast, online checksum for a key-value store with a low collision rate?
We have a key-value store and want to implement a simple, very fast checksum that can be updated online (on every put/delete operation). We don't need a cryptographic hash, so we don't need something as complex as LtHash.
A simple XOR would work, but it leads to too many collisions. That's how we came up with the idea: compute MD5 for each key-value pair, then XOR the result into the store's checksum. In other words:
    Operation(current_checksum, key, value):
        current_checksum ^= MD5(key)
        current_checksum ^= MD5(value)
This can be used for both delete and put operations (delete 'subtracts' the checksum of the deleted pair, put 'adds' the checksum of the new pair).
The naïve assumption is that MD5 provides enough randomness and that XOR gives us the ability to "add/subtract" checksums.
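To make the scheme concrete, here is a minimal Python sketch of the operation described above. The names `md5_int` and `toggle`, the sample keys, and the choice to hold the checksum as a 128-bit integer are only assumptions for illustration, not part of our actual store. It shows the two properties we rely on: the result does not depend on the order of operations, and applying the same operation twice cancels it out.

```python
import hashlib


def md5_int(data: bytes) -> int:
    """Return the MD5 digest of `data` as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def toggle(checksum: int, key: bytes, value: bytes) -> int:
    """XOR the hashes of key and value into the checksum.

    XOR is its own inverse, so the same call serves as both
    put ("add") and delete ("subtract").
    """
    checksum ^= md5_int(key)
    checksum ^= md5_int(value)
    return checksum


# Two puts, starting from an empty store (checksum 0).
checksum = 0
checksum = toggle(checksum, b"user:1", b"alice")
checksum = toggle(checksum, b"user:2", b"bob")
after_two_puts = checksum

# The same pairs applied in the opposite order.
reordered = toggle(toggle(0, b"user:2", b"bob"), b"user:1", b"alice")

# Deleting the second pair "subtracts" it again.
checksum = toggle(checksum, b"user:2", b"bob")
only_first = toggle(0, b"user:1", b"alice")

print(after_two_puts == reordered)  # order of operations does not matter
print(checksum == only_first)       # delete undoes the matching put
```

Note that under this scheme, overwriting an existing key would presumably need two calls: one to toggle out the old pair and one to toggle in the new pair.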