## Introduction

Lately, I've been seeing some discussion online about cryptographic hashing functions, along with some confusion between a cryptographic digest, a cryptographic signature, and a message authentication codes. At least in that last post, I think I did well defining and clarifying the differences between those terms, but I also feel like I could take this discussion a lot further. So, I decided to dedicate a series to generic cryptographic hashing functions, which will include building compression frameworks with security proofs, specific implementations of cryptographic hashing functions, and some implementations of these functions. So, without further ado, let's get started.

## Collisions

When we talk about a hashing function (cryptographic or otherwise), we are referring to any function that can take an arbitrary length of data, and compress it into a fixed-length digest. Typically, this digest is called a "fingerprint", a "checksum", or a "hash". The goal, is that any time we input the same data, our function outputs the same digest. Further, it's important that not only can I produce that digest, but anyone can produce the same digest. This gets us prepared for the Random Oracle, but we still have some ground to cover first.

Because our hashing function has a fixed length output, say 128-bits, then an ideal function would map every input to one of those outputs. In other words, our function maps an element in the domain (our data to be hashed) to exactly one element in the range (our actual hash). So, if our function produces 128-bit digests, then there are a total of 2^128 digests in the range. This means, that we have at least a one-to-one mapping of elements in the domain to elements in the range. Again, speaking about an ideal hashing function.

However, we know that there are many more inputs than just 2^128; there are infinitely many, actually. But think about it for a second. Take the number zero, and send it through our hashing function. Increment that number by 1, then hash that number. Continue in this manner, assuming infinite computing resources and infinite time, until you've hashed every number between 0 and 2^128. Ideally, you've produced exactly 2^128 unique digests. But, what happens when you now want to hash 2^128+1? Now we have what is called a collision. In other words, two distinct inputs was hashed to the same output. To put it formally:

Definition: A collision is when two distinct pieces of data hash to the same digest, checksum, or fingerprint.
Theorem: For any fixed-length hashing function, there are infinitely many collisions.
Proof: This can be proven using the pigeon-hole principle. Given a fixed-length hashing function of n-bits of output, hashing n+1 inputs from the domain will produce a collision in the range. As n tends to infinity, the collisions tend to infinity. Q.E.D.

I don't think I need to tell you how much larger infinity is to 128-bits. As a result, collisions are overwhelming. In fact, would you like to see a collision in practice? Below are 2 different hexadecimal strings. The differences are very subtle, but they indeed distinct (emphasized in bold red). Here, we'll take the two strings, and hash them with the known MD5 algorithm. Then, just to show I'm not cheating, we'll hash the same strings with SHA-1. While we produce a collision in MD5, we have distinct digests with SHA-1. Go ahead, and verify that you get the same results.

```\$ INPUT1=d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89\
d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
\$ INPUT2=d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89\
d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
\$ printf "\$INPUT1" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
\$ printf "\$INPUT2" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
\$ printf "\$INPUT1" | xxd -r -p | sha1sum
a34473cf767c6108a5751a20971f1fdfba97690a  -
\$ printf "\$INPUT2" | xxd -r -p | sha1sum

Crazy, right? With an ideal hashing function, it should be at least as difficult as a brute force search to find these collisions, and it should take searching an entire 128-bit domain to find a collision. Unfortunately, however, finding blind collisions with a brute force search turns out to be much faster, thanks to the Birthday Paradox. The Birthday Paradox says the following:

In a room of just 23 people, there is a 50% probability that at least two of them share the same birthday. In a room of just 75 people, there is a 99.9% probability that at least two of them share the same birthday.

Wait, what? Uhm, last I checked, there are 366 days days in a year, assuming leap year. Soooo, if there are 23 people in a room, then there should be a 23/366, or about a 6% probability that two people share the same birthday. Unfortunately, this isn't how it works. There may be a 6% chance someone shares your birthday, but there is a 50% chance two arbitrary people share the same birthday. Now do you see the problem? Not only must you compare your birthday to everyone, but so must everyone else. This is a case of permutations. So, with 23 people in the room, there are actually 253 possible comparisons that must be made (23*22/2). The math gets a little hairy, and to be honest, it's a bit outside the scope of this post, and this series (it's going to be long enough as it is). Refer to the Wikipedia article if you want to work through the theory and the proof.

We can use this Birthday Paradox to work out an attack on finding two distinct inputs that produce an identical digest. This is called the Birthday Attack, and it's the primary driver in finding collisions. The attack basically says something like this:

To find a collision in a n-bit range with approximately 50% probability, you need to only search the the square root of 'n' of elements in the domain.

So, for a 128-bit digest (2^128 possible distinct outputs), using the Birthday Attack, I only need to search 2^64 possible inputs to have approximately a 50% probability that I have found a collision. If you don't think 2^64 is very small, the bitcoin network is currently mining 2^64 SHA-256 digests about every 20 seconds.

## Blind, preimage, and second preimage collisions

Armed with this knowledge, we can now formalize some definitions of collision attacks. This might be confusing, so I'll define it first, then give some examples.

Collision attack:
A blind search, where two distinct inputs produce the same digest.
Preimage attack:
A search to find an input that matches a defined digest.
Second preimage attack:
A search to find a second file that matches the digest of a defined file.

Let's break these down individually. A collision search is literally a blind search, without any respect to inputs or outputs. You don't know what the inputs will be nor do you know what their outputs will be. You only know that you have found two distinct inputs that collide to the same output, all of which is entirely arbitrary.

A preimage attack is where you have a digest in your possession, but you would like to find an input that matches it. In this case, while the input is completely arbitrary, the output is static. For example, suppose you have the 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9". A preimage attack would be successfully finding the input that produced it. In this case, it was "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz".

A second preimage attack means having both the input "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz" and its 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9", and finding a second input that produces that same digest.

Usually, when breaking cryptographic hash functions, the first thing to break is the compression function, which I'll cover in later posts. Once the compression function is broken, the next step is to break searching for blind collisions. This is generally done by analyzing the weaknesses in mathematics, find bias in the output, observe the quality of the avalanche effect, and so forth. You eventually learn where the hashing function is weak, and where you can take "shortcuts" to get to your goal. Eventually, the algorithm is broken to the point that finding blind collisions is practical. MD5 is broken in this regard.

After breaking the compressing function, and weakening the algorithm to the point of practical collision attacks, preimage attacks become the next focus of analysis. However, when the compression function is broken, such as in the case of SHA-1, it's a strong sign to start moving away from the algorithm, long before you find collisions. So, analysis tends to slow down after collisions have been found, because no one should be using the function anymore. This also means continuing to find second preimage collisions gets even less attention.

## Avalanche Effect

The final property of cryptographic hashing functions that needs to be addressed is the "avalanche effect". It is absolutely critical in cryptographic hashing functions that even though inputs may be sequential, their outputs do not show that to be the case. For example, consider the SHA-256 of the first 10 digits:

```\$ for I in {1..10}; do printf "\$I: "; printf "\$I" | sha256sum -; done
2: d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35  -
4: 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a  -
6: e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683  -
7: 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451  -
8: 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3  -
9: 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7  -

Notice that there is no clear indication on sequential digests. For all practical purposes, they are truly randomized output, despite the sequential input (merely flipping a single bit on each input from the previous). However, can we formally define the avalanche effect? What would be ideal is that with each bit change on the input, every bit in the digest output has as close to a 50% chance of being flipped as theoretically possible.

I'll talk more about "rounds" in future posts when I talk about specific implementations and designs. Suffice it to say that a cryptographic hashing function will iterate through the compression functions a certain number of times, before outputing the state. On each round, the bits in the output should each have a 50% chance of being flipped. So, on each output of each round iteration, close to half of the bits have been flipped in some pseudorandom manner. After a certain number of rounds, the final output should be indistinguishable to true random noise.

When a single input bit is flipped, each output bit should change with a 50% probability.

Of course, the cryptographic strength doesn't rest solely on the avalanche effect. There are mathematical properties that determine that. But, the output should be completely unpredictable. You could apply the "next bit test", in that there is no algorithm you could produce that would determine the next state of the next bit, without actually compromising the state of the machine (this is a test held to cryptographically secure pseudorandom number generators).

Unfortunately, all we have to test the avalanche effect is standard randomness tests, such as the chi-square distribution, Monte Carlo for Pi, and the Birthday Paradox, among others. This doesn't say anything about the cryptographic strength of the hashing function, but says a lot about randomness properties (non-cryptographic hashing functions can also exhibit strong randomness qualities).

There are a couple software utilities we can use to test and analyze cryptographic hashing functions. First, we have standard randomness tests, such as Dieharder and the FIPS 140-2 suite. But, for something more specific on analyzing cryptographic primitives, I would recommand Cryptol. On the one side, this isn't an out-of-the-box software solution for just running a battery of tests and analysis. It is actually a domain-specific language that will require a bit of a learning curve. On the other hand, it's Free Software, and you'll probably learn more about cryptanalysis with this tool, than just playing with randomness tests.

## Conclusion

This was just a primer post to get you thinking about cryptographic hashes, specifically thinking about their output, and the task of finding collisions. The rest of the posts in the series will cover specific functions such as MD5, SHA-1, -2, and -3, as well as some others. We'll talk about hashing constructions, and where you'll find cryptographic functions in practice (I think you'll be surprised). I may even throw in a post or two about random oracles, and how we want cryptographic hashing functions to not only imitate them, but be proven secure under the "Random Oracle Model".

Regardless, this post will get you started, and hopefully excited for what is to come.