Sometimes I let my mind wonder and I get crazy questions. Today was a good example, I encountered a MD5 hash and I started to wonder, would there be a hash which would (when hashed again) be the same?
Thus: MD5(x) = x
This would be a kind of MD5 quine, when fed into the algorithm you get the original value back. This is actually called an MD5 fixed point.
Information I’ve found
So I started investigating, soon I discovered this website about collisions. Its well known that all hashing algorithms must have collisions, you can’t always produce a unique hash for input larger then the output of course.
The output of the MD5 sum is 128 bit (16 byte) long, so the input should also have the same length. But the MD5 algorithm is defined to have 512 bits as input. This isn’t really a problem because the algorithm will extend smaller input with padding. Lets assume the MD5 sum of any input is uniformly distributed over all possible sums, then the probability that a 128-bit string is a fixed point is 1/2^128. This isn’t a crazy assumption because all hashing-algorithms are designed to distribute the output as uniformly as possible to avoid collisions.
So, the probability that no 128-bit string is a real fixed point is (1 - 1/2^128)^(2^128). The probability that there IS a fixed point is 1 - (1 - 1/2^128)^(2^128).
Since the limit as N goes to infinity of (1 - 1/N)^N is 1/e, and 2^128 is most certainly a very large number, this probability is almost exactly 1 - 1/e = 63.21%.
But, of course, there is no randomness involved here - there either is a fixed point or there isn’t. But, we can be 63.21% confident that there is a fixed point. (Also, notice that this number does not depend on the size of the keyspace - if MD5 sums were 32 bits or 1024 bits, the answer would be the same).
Looking for the fixed point
I’ve just implemented a small program to look for these hashes, even though I know it will take millions of years to check all the numbers. But you never know, I might get lucky ;-)
The first algorithm I created took a single random String as input and kept applying the algorithm to the output. Eventually it will:
- Go into a loop
- Find a fixed point (which is a loop of size 1)
The weird thing is, if it ends up in a loop of size 1… I’ve found two things. Not only a md5 fixed point, which creates itself after applying the algorithm. But also an input-value that produces this md5 as output, a collision!
Another interesting thing would be a complete graph of all md5 answers. Which loops can we find, which md5 has most collisions etc. But this would take eternity to calculate, even using all the machines in the world.
- Are there loops? (It is possible there aren’t any loops at all…?)
- Are there loops of size 1, a.k.a. fixed points?
- Which/how many 128 bit combinations can’t be created with the input values?
- Which/how many collisions will you get with all the input values?
More thoughts, errors, solutions..?