Can different source files produce the same hash value?

1.3k Views Asked by At

It is often said in judicial opinions and legal briefs that the hash value derived from a file is like a fingerprint that uniquely corresponds to the source file. While this may be true in a practical sense, I question whether it is literally true. It is my understanding that the hash function is a many-to-one function, and therefore multiple source files could produce the same hash value. I understand that the utility of the hash function is that making a small change to a source file does not result in a small change to hash value derived from that file.

4

There are 4 best solutions below

6
On BEST ANSWER

It is my understanding that the hash function is a many-to-one function, and therefore multiple source files could produce the same hash value.

An arbitrary file, with many million bytes is mapped to a few bytes typically. So there ought to be collisions.

For a good cryptographic hash function it is however hard to compute a message which results in a given has value. See e.g. preimage attack.

In the legal community, hash codes are generally taken as being unique identifiers.

Note that this is a mapping from a huge set (e.g. $2^{8\cdot 10^6}$ possible elements) to still a very large set (e.g. $2^{160}$ elements).

I understand that the utility of the hash function is that making a small change to a source file does not result in a small change to hash value derived from that file.

It should result in a noticeable change in the hash. E.g. see the properties listed here.

0
On

Yes. People have created examples of this for many popular has functions. See this answer.

4
On

Yes, different files can produce the same hash value. As you say, a hash function goes from a "big" discrete space $\mathcal{X}$ to a smaller image space $\mathcal{Y}$, and is thus many-to-one.

A good hash function, however, will take images "as uniformly distributed as possible" on $\mathcal{Y}$, in order to minimize the probability (over the input) that such a collision occurs.

(For many applications, one would also ask that the hash function be "hard to reverse-engineer" -- see the crypto paragraph on the Wikipedia article.)

0
On

Yes. At least for the common hash functions, different files can produce the same hash, because those hash functions return byte arrays of constant length.

This is easily proven by the fact that there is an infinite number of possible files, and only a limited number of hashes. For example, for SHA1 there are $2^{160}$ possible hashes, for SHA512 $2^{512}$, etc. (And obviously there is an infinite number of files. E.g. start with "0", and add another "0" ad infinitum.)

But it might be possible to construct a hash function of variable length for which the answer would be no. See this answer for some ideas on how to do that. Though that would depend on calling a variable output-length function a hash function.