It is often said in judicial opinions and legal briefs that the hash value derived from a file is like a fingerprint that uniquely corresponds to the source file. While this may be true in a practical sense, I question whether it is literally true. It is my understanding that the hash function is a many-to-one function, and therefore multiple source files could produce the same hash value. I understand that the utility of the hash function is that making a small change to a source file does not result in a small change to hash value derived from that file.
Can different source files produce the same hash value?
1.3k Views Asked by Bumbble Comm https://math.techqa.club/user/bumbble-comm/detail AtThere are 4 best solutions below
On
Yes, different files can produce the same hash value. As you say, a hash function goes from a "big" discrete space $\mathcal{X}$ to a smaller image space $\mathcal{Y}$, and is thus many-to-one.
A good hash function, however, will take images "as uniformly distributed as possible" on $\mathcal{Y}$, in order to minimize the probability (over the input) that such a collision occurs.
(For many applications, one would also ask that the hash function be "hard to reverse-engineer" -- see the crypto paragraph on the Wikipedia article.)
On
Yes. At least for the common hash functions, different files can produce the same hash, because those hash functions return byte arrays of constant length.
This is easily proven by the fact that there is an infinite number of possible files, and only a limited number of hashes. For example, for SHA1 there are $2^{160}$ possible hashes, for SHA512 $2^{512}$, etc. (And obviously there is an infinite number of files. E.g. start with "0", and add another "0" ad infinitum.)
But it might be possible to construct a hash function of variable length for which the answer would be no. See this answer for some ideas on how to do that. Though that would depend on calling a variable output-length function a hash function.
An arbitrary file, with many million bytes is mapped to a few bytes typically. So there ought to be collisions.
For a good cryptographic hash function it is however hard to compute a message which results in a given has value. See e.g. preimage attack.
Note that this is a mapping from a huge set (e.g. $2^{8\cdot 10^6}$ possible elements) to still a very large set (e.g. $2^{160}$ elements).
It should result in a noticeable change in the hash. E.g. see the properties listed here.