Let be a hash function. For any such there is always a trivial collision-finding attack running in time simply evaluate on distinct inputs; by the pigeonhole principle, two of the outputs must be equal. Is this the best possible attack?
Let us generalize the above algorithm by taking as a parameter. Say we choose uniform (distinct) inputs compute for all , and check whether any of the are equal. As noted, if then there is certainly a collision. When we can no longer guarantee a collision, but there is clearly some nonzero probability that a collision occurs. It is somewhat difficult to analyze this probability when is arbitrary, and so we instead consider the idealized case where is treated as a random function. (It can be shown that this is the worst case, and collisions occur with higher probability if deviates from random.) That is, for each we assume that the value is uniformly distributed in and independent of all the other values (recall all the are distinct). We have thus reduced our problem to the following:
If we generate uniform what is the probability that there exist distinct with
This question has been extensively studied, and is related to the so-called birthday problem ,for this reason the collision-finding algorithm described above is one of a class of algorithms called birthday attacks.
The birthday problem is this: if people are in a room, what is the probability that some two of them share a birthday? (Assume birthdays are uniformly and independently distributed among the 365 days of a non-leap year.)
This is analogous to our problem: if is the birthday of person , then we have uniform and independent , and matching birthdays correspond to distinct with (i.e., matching birthdays correspond to collisions).
We refer to the stated event as a collision, and let denote the probability of this event. ()
The following shows that when , the probability of a collision is alternately, for the probability of a collision is constant.
Fix a positive integer and say elements are chosen uniformly and independently from a set of size . Then
The upper bound, which holds for arbitrary can be proven by a simple application of the union bound. Recall that a collision means that there exist distinct with Let Coll denote the event of a collision, and let Coll denote the event that It is immediate that Coll for any distinct . Furthermore,Coll = Coll and so repeated application of the union bound implies that
For the lower bound, let NoColl be the event that there is no collision among that is, for all . Then NoColl is the event that there is no collision at all. If NoColl occurs then NoColl must also have occurred for all
Thus, NoColl NoColl NoColl NoColl NoColl NoColl
Now, NoColl since cannot collide with itself. Furthermore, if event NoColl occurs then contains distinct values; so, the probability that collides with one of these values is and hence the probability that does not collide with any of these values is . This means
and so
Since for all we have and so
We conclude that
(note that . Use below: For all with it holds that
We show that when are uniform in , then if the probability of a collision is roughly (In the case of birthdays, once there are only 23 people the probability that some two of them have the same birthday is roughly
In our setting, this means that when the hash function has output length (and so has range of size evaluating on inputs yields a collision with probability roughly .
From a concrete-security perspective, this implies that for a hash function to be collision resistant against attackers running in time it is required that have output at least bits long. Taking specific parameters: if we want finding collisions to be as difficult as an exhaustive search over 128 -bit keys, then we need the output length of the hash function to be at least 256 bits. (We stress that having output this long is only a necessary condition, not a sufficient one.)