What Is A Cryptographic Hashing Algorithm?
Cryptographic hashing algorithms are one-way functions that produce a message digest that represents a given input. Because the keyspace is so astromically large, it should be practically infeasible to find a different input that represents the same digest. The input is typically referred to as the message while the output is typically referred to as the digest.
Cryptographic hashes usually hold to four main principles:
- Computing a digest should be easy for any given message (some would also say it should be fast).
- When changing the message, the digest should exhibit the "avalanche effect".
- It should be practically infeasible to find two messages that produce the same digest.
- It should be practically infeasible to produce the message for a given digest.
Of all of the many cryptographic hashing algorithms in the world today, the following list are still considered to be secure:
- RIPEMD-160, RIPEMD-256, RIPEMD-320
- SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256
- SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, SHAKE256
- Skein-256, Skein-512
SHA3 is a new standard adopted by the United States National Institude of Standards and Technology (NIST) which uses Keccak as the underlying function. Keccak uses the sponge contruction for creating its digests, which has some interesting properties that we'll look at in a second.
The Workhorse Of Cryptography
Many call cryptographic hashing algorithms the workhorse of cryptography, and rightfully so. These hashing functions are used for a great deal of things:
- Password storage- Hash the password, and store to disk.
- Pseudo-anonymous user- Hash the username, and pseudonymize.
- Data deduplication- Hash the data blocks, and compare.
- Message authentication- Hash the message, and send.
Cryptographic hashes are used in a lot of ways. PGP fingerprints are nothing more than SHA-1 hashes of a timestamp. Similarly, OpenSSH fingerprints are MD5 results. ZFS uses SHA-256 for data and metadata integrity in its filesystem. Bitcoin uses double SHA-256 functions for the block chain, transactions, and mining. Cryptographic hashing algorithms are used EVERYWHERE. Unfortunately, developers keep reinventing the wheel, and don't pay attention to the problems that come with cryptographic hashing algorithms.
Failure #1- Raw Speed
While certainly an advantage, the speed at which cryptographic hashing algorithms creates the digest can also be a big drawback. Consider storing passwords to disk. Hopefully by now, developers know not to store the passwords in plaintext in the database. If not, get a new job. So, if we hash the password with say SHA-512, then according to our principles discussed at the beginning of this post, it should be infeasible to find the password (message) that produced the hash stored on disk.
If this hashed password database is leaked, the password cracker will be attempting to hash as many messages as possible to retrieve the given digest. Typically, this is done using custom word lists, hashing each word, and comparing it to what is stored in the password database. The faster the password cracker can go, the sooner they can recover the digests. On my wimpy laptop, using the CPU alone, I can hash approximately 1 million passwords per second with SHA-512. Using cheap GPU clusters, it's feasible to go upwards of 50 billion passwords per second, for basement-type miscreants living in their mother's basement.
Possible Speed Fix #1- Key Streching
One way around brute forcing passwords is a concept called key stretching. This is the concept where I would take the hash of the hash of the message. In other words, the message is recursively hashed. So, if my password is hashed 5,000 times, then it would take the password cracker 5,000 times more work to get to the original message. So, my wimpy laptop is reduced to going through only 200 passwords per second, instead of 1 million. Anything to slow the attackers down, is a good thing.
A password key stretched twice would look like this, where "H" is the hashing algorithm:
For example, taking the SHA-256 of "password" twice gives the following hexadecimal output:
SHA-256(SHA-256("password")) = 73641c99f7719f57d8f4beb11a303afcd190243a51ced8782ca6d3dbe014d146
Possible Speed Fix #2- Key-based Derivation Functions
Another alternative, and there is a good following for this on the Internet, is using key-based derivation functions. Without getting into too much detail here (I can save this for another post), key-based derivation functions like PBKDF2, bcrypt(), and scrypt() are slow, slow, slow, and very resource intensive. As such, this greatly limits the speed in which the password cracker can recover my password, much like key stretching previously mentioned.
Failure #2- Hashed Messages
Unfortunately, there's this thing called rainbow tables. Rainbow tables simply put, are large databases with message to digest mappings. Rainbow tables are built by hashing messages of all types, and storing both the message and the digest in the database. Then, all it takes is to look up the digest in the database. If it exists, you have the message that produced it. In this case, you have the password.
Thankfully, rainbow tables are usually bloated in size, typically several terabytes. The standard password cracker isn't going to dedicate 10 TB of disk just to store rainbow tables for MD5, SHA-1, and NTLM hashes. However, due to the one-to-one relationship (in practice) of messages to digests, rainbow tables are a very effective way to recover messages from digests only.
Possible Message Fix- Salting Hashed Messages
The number one way to thwart rainbow table attacks, is to prepend a cryptographic nonce, also called a "salt" to the password. The larger the space of the salt, the more possibilities the password could be when stored to disk. For example, if the salt is chosen from a base64 set of characters, and the salt is 8 characters long, then the salt's keyspace is 64^8 or 281,474,976,710,656 possible salts. This means that the rainbow table must store 281,474,976,710,656 digests for every message. As mentioned, rainbow tables are already several TB in size. Now they must be 281 trillion times larger.
A salted message looks like this, where "H" is the hashing algorithm, and "||" indicates concatenation:
H(salt || message)
For example, if my salt was "FhuVTwA710", and the password is "password", then taking the SHA-256 would give the following output:
SHA-256("FhuVTwA710" || "password") = 4c700717f3e5eb92a872362108f2f716d2ff179ea94d1f10853a50e181a43663
Failure #3- Salted Message Authentication Codes
Message authentication codes, or MACs for short, are a way to send a message between two parties that have previously authenticated. The idea is to send a message from Alice to Bob, such that when Alice's message reaches Bob, he knows whether or not if the message has been tampered with, based on if the hash of the message produces the same hash as what Alice sent. Alice and Bob must have met previously, and agreed on a key that they would use as a "salt" for each message they send.
In other words, their MAC setup looks like this, again where "H" is the hashing algorithm, "key" is the key both Alice and Bob agreed on, and "message" is what is being sent:
H(key || message)
For example, suppose the message was "The quick brown dog jumps over the lazy dog.", and the agreed-upon key is "J6n^,1=/#;RCA5bC10". Then our SHA-256 MAC would look like this:
SHA-256("J6n^,1=/#;RCA5bC10" || "The quick brown dog jumps over the lazy dog.") = 10c98713fbaa574ad9b2ffc053662c89f6fd387a8a350f6324b966d36936d1d3 MAC = ("The quick brown dog jumps over the lazy dog.", "10c98713fbaa574ad9b2ffc053662c89f6fd387a8a350f6324b966d36936d1d3")
Alice would send the full MAC (both the message and its digest). The idea is if the message changes, the digest will change, and as such, Bob will be able to detect tampering. Eve, a malicious 3rd party, should not be able to send fraudulent message to Bob, because Eve does not know the key Alice and Bob agreed on.
However, SHA-256 uses the Merkle–Damgård construction. Merkle–Damgård is iterative in nature- IE: block-by-block. So, while Eve may not know the private key, she was able to grab both the message and its digest in transit. So, all Eve needs to do is concatenate additional data after the message, know the length of the key, calculate the new digest, and ship this message to Bob. This is known as the length-extension attack.
In other words, if you know what the digest of H(key || message) is, then you can easily determine what H(key || message || ANYTHING) is. So to compromise the MAC system, you just need to see one message and digest MAC, then you can impersonate the sender from there on out. This should bother you.
Possible MAC Fix #1- The Secret Suffix
At this point, you may be thinking to switch the order of the key and the message, or H(message || key). This is known as the "secret suffix", and it is certainly a very valid fix to address the length-extension attack. However, it comes with a large assumption that you are making about the underlying cryptographic hashing algorithm.
That assumption is that the algorithm does not have any known digest collisions. In other words, if two messages m1 and m2 can be found to produce the same digest, then the attacker will be able to use this collision on the sent message, to use a different start message, putting the hashing algorithm state in the same starting position, and thus producing the same MAC, even though the key is still unknown.
Currently, both SHA-1 and SHA-2 do not have any known collisions, but it's only a matter of time.
Possible MAC Fix #2- The Enveloped MAC
Another possible solution is to envelope the message around the key, or H(key || message || key). This requires that the attacker know the key length, in order to identify the starting point of the message, and thus be able to forge a valid MAC. While more secure than the secret suffix, there has been some research in this area that suggests weaknesses to this approach, even when two different keys are used.
Possible MAC Fix #3- SHA-224, SHA-384, or SHA-3 (Keccak)
An actual solid fix is to use SHA-3, which is based on Keccak, for your hashing algorithm. SHA-3 is not vulnerable to the length-extension attack, and as such, can be used for a MAC. This is due to the fact that SHA-3 is based on the sponge construction, which is not an iterative block-by-block compression function like Merkle–Damgård is. With the existence of SHA-3, we may not need to worry about the next section for sending MACs.
Also, SHA-224 and SHA-384 are not vulnerable to the length-extension attack, due to the internal truncation of the internal state, whereas MD5, SHA-1, SHA-256, and SHA-512, among others, output the entire state.
Solution- The Hash-based Message Authentication Code (HMAC)
HMAC securely addresses the MAC length-extension attack. It does so by taking the hash function twice in a deterministic manner, first with an inner key, and again with an outer key. The algorithm is shown here:
- Prepare the key:
- If the key is less than the hashing algorithm block size, then append zeros to the key until it is the same size as the hashing algorithm block size.
- If the key is greater than the hashing algorithm block size, then take the hash of the key, using the digest as the new key.
- Prepare the ipad and the opad:
- To create the ipad, repeat the byte 0x36 the hashing algorithm block size times.
- To create the opad, repeat the byte 0x5C the hashing algorithm block size times.
- Take the XOR of the key and the ipad. Call this the ipad_key.
- Take the XOR of the key and the opad. Call this the opad_key.
- Take H(ipad_key || message). Call this digest1.
- Take H(opad_key || digest1). This is your MAC.
HMAC has the flexibility that ANY hashing function can work that is not vulnerable to preimage attacks. This includes MD5 and SHA1, as well as the other functions listed at the beginning of this post. The security of HMAC rests on the fact that Eve can take as many MAC messages and digets as she wants, but she'll never be able to determine the digest of the a message she hasn't seen yet.
In reality, HMAC can be used to store passwords. It can be used as data integrity. It can be used to pseudonymize usernames. In fact, a good rule of thumb on when to use HMAC and when not to, is asking yourself the following question: Is what I am storing or sending need to be secret? If so, HMAC.