Now Using miniLock

I have long been a proponent of OpenPGP keys as a way to communicate securely. I have used my personal key for signing emails since about 2005. I have used my key at dozens and dozens of keysigning parties. I have used my key to store account passwords and credentials with vim(1), Python, and so many other tools. I have used my key to encrypt files to myself. And so much more. There is only one problem.

No one is using my key to send me encrypted data.

Sure, when attending keysigning parties, it's an encryption orgy. Attendees will sign keys, then send encrypted copies to recipients. And, of course, people will send encrypted emails before and after the party, for various reasons. But when the party dies down, and people get back to their regular lives, very few actually send encrypted data with OpenPGP keys. Realistically, it's rare. Let's be honest. Sure, there's the one-off, either as a corporation (XMission uses OpenPGP extensively internally for password storage) or individuals (sending encrypted tax forms to a spouse or accountant), but by and large, it's rarely used.

I have a good idea of why, and it's nothing groundbreaking: OpenPGP is hard. It's hard to create keys. It's hard to manage the keys. It's hard to grasp the necessary concepts of public keys, private keys, encryption, decryption, signatures, verification, the Web of Trust, user identities, keysigning parties, revocation certificates, and so much more.

OpenPGP is just hard. Very hard.

Well, in an effort to encourage more people to encrypt sensitive data, such as my family and friends who would not use OpenPGP, I've jumped on board with miniLock. What is miniLock? Currently, it's a Free Software browser application for the Google Chrome/Chromium browser (not an extension). It uses ECC, bcrypt, BLAKE2, zxcvbn, and a number of other tools that you really don't need to worry about, unless you want to audit the project. All you need is an email address and a password. The keys are deterministically generated from those.

Think about this for a second. You don't need a public and private keyring to store your keys. You don't need to upload them to a key server. You don't need to attend keysigning parties, worry about the Web of Trust, or any of that other stuff that makes OpenPGP the nightmare it is.

All you need is an email and a password.

Unfortunately, this does have one big drawback: your email or password can't change without changing your keys. However, miniLock keys are cheap; that is, you can change them at any time, or create as many as you want. You only need to distribute your miniLock ID. In fact, the miniLock ID is the entire public key. So, they don't even need to be long term. Generate a one-time session miniLock ID for some file that you need to send to your accountant during tax season, and call it good.

However, I prefer long-term keys, so I created three IDs, one for each email account that I use. If you want to send me encrypted data, without the hassle of OpenPGP, feel free to use the miniLock ID paired with the appropriate email address.

Email                     miniLock ID
aaron.toponce@gmail.com   mWdv6o7TxCEFq1uN6Q6xiWiBwMc7wzyzCfMa6tVoEPJ5S
atoponce@xmission.com     qU7DJqG7UzEWYT316wGQHTo2abUZQk6PG8B6fMwZVC9MN
aaron.toponce@utah.edu    22vDEVchYhUbGY9Wi6EdhsS47EUeLKQAVEVat56HK8Riry

Don't misunderstand me. If you have an OpenPGP key, and would prefer to use that instead, by all means do so. However, if you don't want to set up OpenPGP, and deal with the necessary overhead, I can now decrypt data with miniLock. Maybe that will be a better alternative for you.

md5crypt() Explained

Recently, the Password Hashing Competition announced its winner, namely Argon2, as the future of password hashing. It has long since been agreed that using general-purpose cryptographic hashing algorithms for passwords is not a best practice. This is due to their speed. Cryptographic hashing algorithms are designed to be lightning fast, while also maintaining large margins of security. However, Poul-Henning Kamp noticed in the early 1990s that the DES-based crypt() function was no longer providing the necessary margins of security for hashing passwords. He noticed how fast crypt() had become, and that greatly bothered him. Even worse was the realization that FPGAs could mount practical attacks against crypt(). As he was the FreeBSD release engineer, this meant putting something together that was intentionally slow, but also with safe security margins. He chose MD5 as the basis for his new "md5crypt password scrambler", as he called it.

Before delving into the algorithm, the first thing you'll notice is the strange number of steps and mixing that PHK does with his md5crypt() algorithm. When I was reading the algorithm, the first question that popped into my mind was: "Why not just do standard key-stretching with the password?" Something like this (pseudocode):

digest = md5(password + salt).digest()
rounds = 1000
while rounds > 0:
  digest = md5(password + salt + digest).digest()
  rounds -= 1

This certainly seems to be the most straightforward approach, and the entirety of the security is based on the cryptographic security of MD5. If you were concerned about the output digest being recognizable, it might make sense to scramble it. You could scramble the resulting bytes in a deterministic fashion, which PHK actually ends up doing before saving to disk.

But then it hit me: PHK wanted his new algorithm to be intentionally slow, even if using MD5. This means adding additional steps to mixing the password, which requires more CPU, and thus, more time. If raw MD5 could process 1,000,000 hashes per second, then standard key stretching with 1,000 iterations would bring it down to 1,000 hashes per second. However, if additional operations slow each iteration down by a factor of N, then the resulting throughput would be 1,000/N hashes per second. I can see the thinking: anything to slow down the process, without overburdening the server, is a gain. As such, the md5crypt() function was born.

Here is the algorithm, including what I think may be a bug:

  1. Set some constants:
    "pw" = user-supplied password.
    "pwlen" = length of "pw".
    "salt" = system-generated random salt, 8-characters, from [./0-9A-Za-z].
    "magic" = the string "$1$".
    "itoa64" = is our custom base64 string "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    
  2. Initialize digest "a", and add the password, magic, and salt strings to it:
    da = MD5.init()
    da.update(pw)
    da.update(magic)
    da.update(salt)
    
  3. Initialize digest "b", and add the password, salt, and password strings to it:
    db = MD5.init()
    db.update(pw)
    db.update(salt)
    db.update(pw)
    final = db.digest()
    
  4. Update digest "a" by repeating digest "b", providing "pwlen" bytes:
    for(pwlen; pwlen > 0; pwlen -= 16):
      if(pwlen > 16):
        da.update(final)
      else:
        da.update(final[0:pwlen])
    
  5. Clear virtual memory
    memset(final, 0, length(final))
    
  6. Update digest "a" by adding a character at a time from either digest "final" or from "pw" based on each bit from "pwlen":
    for(i = pwlen; i; i >>= 1):
      if i % 2 == 1:
        da.update(final[0])
      else:
        da.update(pw[0])
    dc = da.digest()
    
  7. Iterate 1,000 times to prevent brute force password cracking from going too fast. Mix the MD5 digest while iterating:
    for(i=0; i<1000; i++)
      tmp = MD5.init()
      if i % 2 == 0:
        tmp.update(dc)
      else:
        tmp.update(pw)
      if i % 3 != 0:
        tmp.update(salt)
      if i % 7 != 0:
        tmp.update(pw)
      if i % 2 == 0:
        tmp.update(pw)
      else:
        tmp.update(dc)
      dc = tmp.digest()
    
  8. Convert groups of three 8-bit values from digest "c" into four 6-bit values:
    final = ''
    for a, b, c in ((0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 5)):
      v = ord(dc[a]) << 16 | ord(dc[b]) << 8 | ord(dc[c])
      for i in range(4):
        final += itoa64[v & 0x3f]
        v >>= 6
    v = ord(dc[11])
    for i in range(2):
      final += itoa64[v & 0x3f]
      v >>= 6
    
  9. Clear virtual memory:
    memset(dc, 0, length(dc))
    

Notice that between steps 5 and 6, the virtual memory is cleared, leaving digest "final" as all NULLs. Yet, in step 6, the for-loop still addresses the first byte of digest "final". It seems clear that PHK introduced a bug in this algorithm, which was never fixed. As such, every implementation must use a NULL byte in step 6 in place of final[0]. Otherwise, you will end up with different output than PHK's original source code produces.

Anyway, that's the algorithm behind md5crypt(). Here's a simple Python implementation that creates valid md5crypt() hashes:

from hashlib import md5

# $ mkpasswd --method='md5' --salt='2Z4e3j5f' --rounds=1000 --stdin 'toomanysecrets'
# $1$2Z4e3j5f$sKZptx/P5xzhQZ821BRFX1

pw = "toomanysecrets"
salt = "2Z4e3j5f"

magic = "$1$"
pwlen = len(pw)
itoa64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Start digest "a"
da = md5(pw + magic + salt)

# Create digest "b"
db = md5(pw + salt + pw).digest()

# Update digest "a" by repeating digest "b", providing "pwlen" bytes
i = pwlen
while i > 0:
    da.update(db if i > 16 else db[:i])
    i -= 16

# Update digest "a" by adding either a NULL or the first char from "pw"
i = pwlen
while i:
    da.update(chr(0) if i & 1 else pw[0])
    i >>= 1
dc = da.digest()

# iterate 1000 times to slow down brute force cracking
for i in xrange(1000):
    tmp = md5(pw if i & 1 else dc)
    if i % 3: tmp.update(salt)
    if i % 7: tmp.update(pw)
    tmp.update(dc if i & 1 else pw)
    dc = tmp.digest()

# convert 3 8-bit words to 4 6-bit words
final = ''
for x, y, z in ((0, 6, 12), (1, 7, 13), (2, 8, 14), (3, 9, 15), (4, 10, 5)):
    v = ord(dc[x]) << 16 | ord(dc[y]) << 8 | ord(dc[z])
    for i in range(4):
        final += itoa64[v & 0x3f]
        v >>= 6
v = ord(dc[11])
for i in range(2):
    final += itoa64[v & 0x3f]
    v >>= 6

# output the result
print "{0}${1}${2}".format(magic, salt, final)

Ulrich Drepper created "sha256crypt()" and "sha512crypt()" functions, which are very similar in design, and which I'll blog about later.

It's important to note that while PHK may have announced md5crypt() as insecure, it's not for the reasons you think. Yes, MD5 is broken, horribly, horribly broken. However, these breaks only deal with the compression function and collision attacks. MD5 is not broken with respect to preimage or second-preimage attacks. In the case of a stored md5crypt() hash, it requires either a brute force search or a preimage attack to find a plaintext that produces the hash, and MD5 remains secure against preimage attacks. The reason md5crypt() has been deemed "insecure" is because MD5 is fast, fast, fast. Password hashing should instead be slow, slow, slow, and no amount of creativity with MD5 can adequately address its performance. As such, you should migrate to a password hashing solution designed specifically to slow attackers, such as bcrypt or scrypt, with appropriate parameters for security margins.

The Chaocipher With Playing Cards

As you know, I am a cryptography hobbyist. More specifically, I have an interest in pencil and paper ciphers, also referred to as "hand ciphers" or "field ciphers". Since Bruce Schneier released his Solitaire Cipher for Neal Stephenson's book "Cryptonomicon" (known in the book as "Pontifex"), I have had a real desire to learn hand ciphers that use playing cards, which I'll refer to as "card ciphers".

Further, in 1918, John F. Byrne invented a mechanical encryption system that he called "Chaocipher". He released an autobiography titled "The Silent Years", in which he describes the system without revealing the algorithm, and provides a series of ciphertext exhibits for cryptography experts to break.

Unfortunately, I think because he didn't release the algorithm, no one took his encryption system seriously, despite his best efforts to get the War Department to use it. It wasn't until 2010, when the widow of John F. Byrne's son donated the Chaocipher papers, mechanics, and artifacts to the National Cryptologic Museum in Maryland, that we finally understood the algorithm fully.

In this post, I am going to describe the algorithm using playing cards, whereas John F. Byrne's original invention required two circular rotating disks. Another hindering aspect of Byrne's invention was the mechanical engineering required. At best, the device is clunky and awkward to use. Using playing cards, however, I think you'll find the system much more elegant, easier to carry around, and less incriminating than carrying an obvious cryptographic device. Playing cards were certainly available in the early 1900s, so it's unfortunate that he didn't think of using them as the primary mechanism for the cipher.

Setup

The Chaocipher uses the concept of lookup tables to encrypt or decrypt a message. This is done by maintaining two separate alphabets. When encrypting a message, a character is located in the plaintext alphabet, and its position is noted. Then, the ciphertext character is identified by locating the character at the same position in the ciphertext alphabet. After the ciphertext character has been recorded, both the plaintext and ciphertext alphabets are permuted. We'll get into those details in a moment, but first, let's get some definitions out of the way.

Because John F. Byrne's original invention required two circular disks, there are two definitions that you should be aware of:

Zenith
The top of the wheel or circle. In our case, this will be the top of the pile.

Nadir
The bottom of the wheel or circle. In our case, this will be in the middle of the pile (the 14th of 26 cards).

Deck
All 52 cards in a standard poker deck.
Pile
Either the red playing cards (Diamonds and Hearts) dedicated to the ciphertext alphabet, or the black playing cards (Clubs and Spades) dedicated to the plaintext alphabet. Each pile is exactly 26 cards.

Left alphabet
The red pile of playing cards containing the ciphertext characters A-Z.
Right alphabet
The black pile of playing cards containing the plaintext characters A-Z.

We will be treating our two piles (the red and black piles) as circular. The piles will always be face-up on the table and in our hands. The top card in the face-up pile will be the 1st card, while the bottom card will be the 26th card. Because the pile is circular in nature, the top and bottom cards in the pile are "next" to each other in succession. Further, this means that the 14th card in the pile is our nadir, while the top card, the 1st card in the pile, is our zenith.

Now that we've set that aside, we need to create some definitions so we know exactly which playing card in every suit is assigned to which English alphabet character. I've assigned them as follows:

Hearts and Spades:  A 2 3 4 5 6 7 8 9 10 J Q K
                    A B C D E F G H I J  K L M

Clubs and Diamonds: A 2 3 4 5 6 7 8 9 10 J Q K
                    N O P Q R S T U V W  X Y Z

This means that the English character "X" would be the Jack of Clubs in the plaintext "black" pile ("right alphabet" in Chaocipher-speak), and the Jack of Diamonds in the ciphertext "red" pile ("left alphabet" in Chaocipher-speak). This also means that the 8 of Spades would be the English character "H", just as much as the 8 of Hearts.

On a side note, if you wish to program this in software, and you populate an array with the values of 1 through 52 to represent each card in the deck, it's standard to use 1-13 for the Clubs, 14-26 for the Diamonds, 27-39 for the Hearts, and 40-52 for the Spades ("bridge order").
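
If you do want to play with this in software, here is a minimal Python sketch of the card-to-letter mapping defined above. The two-character card names (rank then suit, with "T" standing in for the ten) are just my own convention for illustration:

RANKS = "A23456789TJQK"

def card_to_letter(card):
    """Map a card such as '8S' or 'JD' to its English letter."""
    rank, suit = card[:-1], card[-1]
    offset = RANKS.index(rank)       # 0-12 within the suit
    if suit in "HS":                 # Hearts and Spades cover A-M
        return chr(ord("A") + offset)
    else:                            # Clubs and Diamonds cover N-Z
        return chr(ord("A") + 13 + offset)

print(card_to_letter("8S"))  # H
print(card_to_letter("JD"))  # X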

Algorithm

The algorithm can follow a standard "simple" set of steps, or use a more advanced "takeoff pattern" for enciphering and deciphering the message. First, let me discuss the simple pattern, then I'll look at the more advanced takeoff pattern.

An overview of the algorithm could be described as follows:

  1. Determine the ciphertext character according to the plaintext character (and vice versa for decryption).
  2. Permute the red pile.
  3. Permute the black pile.

Each of these three steps is executed in order until the message is exhausted.

Enciphering Plaintext

Looking at it in closer detail, suppose I had the following red and black piles (using "+" to identify the zenith, and "*" to identify the nadir):

            +                                      *
  red (ct): 7D 3H TH 2H JD AD 8H 8D 5H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH 7H 6H 4H 4D 3D
black (pt): TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS 2S 5C KS 6C

If I wanted to encrypt the character "A", then according to our table above, that would be the Ace of Spades in the black pile. As such, I need to find the Ace of Spades in my black pile. While locating the card, I need to be counting, so I know the position in the pile that the Ace of Spades is in. In this case, the Ace of Spades is the 22nd card in the black pile. Thus, the 22nd card in the red pile, the Seven of Hearts, is my ciphertext card:

            +                                      *                       ↓
  red (ct): 7D 3H TH 2H JD AD 8H 8D 5H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH 7H 6H 4H 4D 3D
black (pt): TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS 2S 5C KS 6C

The Seven of Hearts produces the English character "G". Thus, with these two piles, "A" encrypts to "G" before permutation. Conversely, "G" would decrypt to "A" with these starting piles.

Permuting the Red and Black Piles

Now that we've discovered our plaintext and ciphertext characters, we need to cut each pile such that the plaintext and ciphertext cards are at the zenith of their respective piles. The resulting piles would then be as follows:

            +                                      *
  red (ct): 7H 6H 4H 4D 3D 7D 3H TH 2H JD AD 8H 8D 5H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH
black (pt): AS 2S 5C KS 6C TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C

Permuting the Red Pile

Permuting the red pile follows these steps:

  1. Remove the zenith + 1 card (2nd card) from the red pile.
  2. Place the removed card into the nadir of the red pile (will be the 14th card).

So, we'll follow these steps by taking the zenith + 1 card (2nd card), which is the "6H", and placing it at the nadir of the red pile (14th card). The resulting red pile will look as follows:

            +                                      *
  red (ct): 7H .. 4H 4D 3D 7D 3H TH 2H JD AD 8H 8D 5H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH

            +                                      *
  red (ct): 7H 4H 4D 3D 7D 3H TH 2H JD AD 8H 8D 5H .. TD QH 9H JH 2D 6D KH QD 9D 5D KD AH

            +                                      *
  red (ct): 7H 4H 4D 3D 7D 3H TH 2H JD AD 8H 8D 5H 6H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH

Permuting the Black Pile

Permuting the black pile follows these steps:

  1. Take the zenith (top) card and place it at the bottom of the black pile.
  2. Remove the zenith + 2 card (3rd card) from the black pile.
  3. Place the removed card into the nadir of the black pile (will be the 14th card).

So, we'll follow these steps by taking the zenith card (top card), which is the "AS", and placing it at the bottom of the black pile. The resulting black pile will look as follows:

            +                                      *
black (pt): 2S 5C KS 6C TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS

Now take the zenith + 2 (3rd card), which is the "KS" and place it at the nadir of the black pile (14th card). The final black pile will look as follows:

            +                                      *
black (pt): 2S 5C .. 6C TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS

            +                                      *
black (pt): 2S 5C 6C TC 3S JS 2C 5S AC 4C KC 9S TS .. 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS

            +                                      *
black (pt): 2S 5C 6C TC 3S JS 2C 5S AC 4C KC 9S TS KS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS

As such, both the red and black piles should look like the following after enciphering the plaintext character "A" and permuting both piles:

            +                                      *
  red (ct): 7H 4H 4D 3D 7D 3H TH 2H JD AD 8H 8D 5H 6H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH
black (pt): 2S 5C 6C TC 3S JS 2C 5S AC 4C KC 9S TS KS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS

To summarize, the algorithm steps are as follows:

  1. Find the plaintext character in the black pile.
  2. Record the position of this card in the black pile.
  3. Find the ciphertext character in the red pile by counting to that position.
  4. Bring the plaintext character to the zenith by cutting the black pile at that position.
  5. Bring the ciphertext character to the zenith by cutting the red pile at that position.
  6. Permute the red pile:
    1. Remove the zenith + 1 card from the red pile (2nd card).
    2. Insert the removed card into the nadir of the red pile (14th location).
  7. Permute the black pile:
    1. Move the card at the zenith to the bottom of the black pile.
    2. Remove the zenith + 2 card from the black pile (3rd card).
    3. Insert the removed card into the nadir of the black pile (14th location).

Make sure you understand these steps before continuing.
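
To make these steps concrete, here is a minimal Python sketch of a single enciphering step, working directly on two-character card names so it can be checked against the worked example above. It's an illustration of the procedure, not a polished implementation; index 0 is the zenith and index 13 is the nadir.

NADIR = 13  # the 14th card, 0-indexed

def encipher_card(left, right, pt_card):
    """Encipher one plaintext card, permuting both piles in place."""
    pos = right.index(pt_card)     # steps 1-2: locate the plaintext in the black pile
    ct_card = left[pos]            # step 3: same position in the red pile

    # steps 4-5: cut both piles so the located cards come to the zenith
    left[:] = left[pos:] + left[:pos]
    right[:] = right[pos:] + right[:pos]

    # step 6: permute the red pile (zenith + 1 card moves to the nadir)
    left.insert(NADIR, left.pop(1))

    # step 7: permute the black pile (zenith to the bottom, then zenith + 2 to the nadir)
    right.append(right.pop(0))
    right.insert(NADIR, right.pop(2))

    return ct_card

red   = "7D 3H TH 2H JD AD 8H 8D 5H TD QH 9H JH 2D 6D KH QD 9D 5D KD AH 7H 6H 4H 4D 3D".split()
black = "TC 3S JS 2C 5S AC 4C KC 9S TS 9C 6S 7S 8S QS QC 7C JC 4S 3C 8C AS 2S 5C KS 6C".split()

print(encipher_card(red, black, "AS"))  # 7H, i.e. "A" encrypts to "G"
print(" ".join(red))                    # the permuted red pile shown above
print(" ".join(black))                  # the permuted black pile shown above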

Permuting with a Takeoff Pattern

John F. Byrne described a "takeoff pattern" in which the left and right alphabets are used for both the plaintext and ciphertext characters. In the simple method, the right alphabet (black pile) is used exclusively for all plaintext characters in the message. So, if the plaintext message was "ATTACKATDAWN", then you could think of using the right pile 12 times, or "RRRRRRRRRRRR" ("BBBBBBBBBBBB" if we're thinking "black pile").

However, suppose you would like to use both of the red and black piles (left and right alphabets respectively) for your plaintext message. Then you could create a "takeoff pattern" for encrypting your text. Suppose you used the following takeoff pattern: "RLRRLLRRRLLL" (right, left, right, right, left, left, right, right, right, left, left, left). This means that you would use the right alphabet for the first plaintext character, then the left alphabet for the second plaintext character, the right alphabet for the 3rd, the right alphabet for the 4th, etc. Or, if using playing cards, you could think of the same takeoff pattern as "BRBBRRBBBRRR" (black, red, black, black, red, red, black, black, black, red, red, red).

Personally, I don't care for the takeoff pattern for two main reasons: first, the takeoff pattern needs to be communicated along with the key. This may not be a problem if code books are distributed among field agents, as the takeoff pattern can be printed on the same page as the key. However, it does mean that the takeoff pattern needs to be as long as the message itself.

The second reason I don't care for the takeoff pattern is its unnecessary complexity: it greatly increases the chances of making a mistake. Already, the sender and recipient will be going back and forth frequently between the red and black piles of cards. A takeoff pattern makes that back and forth even more frequent. Further, if you are on the 3rd of 5 "L"s in the stream, but you think you are on the 4th "L", then the encryption or decryption will be wrong from there on out. The Chaocipher doesn't have the ability to correct itself from a mistake.

For these two reasons, I suggest that when using playing cards with the Chaocipher, you always use the black pile for the plaintext characters, and the red pile for the ciphertext characters. Then, the only thing you need to keep track of is the characters in the message itself.

Keying the Deck

Before executing the Chaocipher algorithm, the deck should be "keyed". This refers to the order of the deck. Both the sender and the recipient must have the same deck order in order to successfully encrypt and decrypt a message. The deck can be keyed either by sufficient shuffling and cutting, or with a key phrase. First, let's look at thoroughly shuffling and cutting a full 52-card deck.

Keying with Shuffling and Cutting

Suppose after thoroughly shuffling and cutting the deck, the deck order face-up is as follows:

|< - top                                                                                                                                         bottom ->|
3H 8D QH 4C 6S QS 8C 4D 9S 5D 8S QC 3C 6H JS 7H 5S TS QD 7C 4H JC KD TH 3S KS 6D 9C 9D 2C JD 2H 2D 6C 8H KC 9H JH 7S KH AS AH 5C AD TC 7D 4S 3D 2S TD 5H AC

We now need a deterministic algorithm for separating the red cards from the black cards. Holding the deck face-up in your hand, deal out two face-down piles, the left pile of red cards, and the right pile of black cards. Do this card-for-card, one-at-a-time. Do not grab a bunch of similarly-colored cards. This can introduce error into the keying process. Doing it one-at-a-time ensures exactness, and minimizes the chances for mistake.

After the full deck has been dealt into two face-down piles, turn the piles over, so they are face-up. Using the standard Chaocipher tokens of "+" to identify the zenith, or top of the pile, and the "*" to identify the nadir, or 14th card in the pile, your two piles should be in the following order:

            +                                      *
  red (ct): 3H 8D QH 4D 5D 6H 7H QD 4H KD TH 6D 9D JD 2H 2D 8H 9H JH KH AH AD 7D 3D TD 5H
black (pt): 4C 6S QS 8C 9S 8S QC 3C JS 5S TS 7C JC 3S KS 9C 2C 6C KC 7S AS 5C TC 4S 2S AC
            -----------------------------------------------------------------------------
  position:  1  2  3  4  5  6  7  8  9  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2
                                        0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6

Verify that you can do this by hand, and that it matches with the deck order above. Remember, the red pile is our "left alphabet" in the Chaocipher which contains all ciphertext English characters. The black pile is our "right alphabet" in the Chaocipher which contains all plaintext English characters. In other words, if we converted them to English characters, then the left and right alphabets would be as follows, using the same notation to identify the zenith and nadir:

            +                         *
 left (ct): C U L Q R F G Y D Z J S V X B O H I K M A N T P W E
right (pt): Q F L U I H Y P K E J T X C M V O S Z G A R W D B N
            ---------------------------------------------------
  position: 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
                              0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
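
If you would like to check the dealing in software, here is a quick sketch: filtering the shuffled deck above by suit color, preserving order, reproduces both piles.

deck = ("3H 8D QH 4C 6S QS 8C 4D 9S 5D 8S QC 3C 6H JS 7H 5S TS QD 7C 4H JC KD TH 3S KS "
        "6D 9C 9D 2C JD 2H 2D 6C 8H KC 9H JH 7S KH AS AH 5C AD TC 7D 4S 3D 2S TD 5H AC").split()

red   = [c for c in deck if c[-1] in "HD"]   # left alphabet (ciphertext)
black = [c for c in deck if c[-1] in "SC"]   # right alphabet (plaintext)

print(" ".join(red))    # matches the red (ct) pile above
print(" ".join(black))  # matches the black (pt) pile above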

Keying with a Key Phrase

Already knowing the algorithm prepares you for using a key phrase to key the deck. Basically, you just use the characters in your key phrase as the plaintext message, using the black pile to find each key phrase character, just as you would if encrypting a message. Both piles are permuted as normal. The only difference is that you will not be recording the ciphertext characters. Further, you will start with alphabetized piles.

Both piles will start with the following order:

            +                                      *
 left (ct): AH 2H 3H 4H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4D 5D 6D 7D 8D 9D TD JD QD KD
right (pt): AS 2S 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC

Suppose our key phrase is "CHAOCIPHER". Then, working through the steps character for character, the piles would progress as follows:

Locate "C" in the black pile:
            +                                      *
 left (ct): AH 2H 3H 4H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4D 5D 6D 7D 8D 9D TD JD QD KD
right (pt): AS 2S 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC

Bring both characters to zenith:
            +                                      *
 left (ct): 3H 4H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S

Permute the red pile. Remove the zenith + 1 card:
            +                                      *
 left (ct): 3H .. 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S

            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D .. 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S

Insert the card into the nadir:
            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4H 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 3S 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S

Permute the black pile. Move the top card to the bottom:
            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4H 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 4S 5S 6S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S 3S

Remove the zenith + 2 card:
            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4H 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 4S 5S .. 7S 8S 9S TS JS QS KS AC 2C 3C 4C 5C 6C 7C 8C 9C TC JC QC KC AS 2S 3S

            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4H 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 4S 5S 7S 8S 9S TS JS QS KS AC 2C 3C 4C .. 5C 6C 7C 8C 9C TC JC QC KC AS 2S 3S

Insert the card into the nadir:
            +                                      *
 left (ct): 3H 5H 6H 7H 8H 9H TH JH QH KH AD 2D 3D 4H 4D 5D 6D 7D 8D 9D TD JD QD KD AH 2H
right (pt): 4S 5S 7S 8S 9S TS JS QS KS AC 2C 3C 4C 6S 5C 6C 7C 8C 9C TC JC QC KC AS 2S 3S

Repeat for "H", "A", "O", "C", "I", "P", "H", "E", & "R".

When you are finished keying the deck with the key phrase "CHAOCIPHER", you should have the following order for the red and black piles:

            +                                      *
 left (ct): 6D JH 7D 8D 9D 2D KH JD QD 2H 3H 6H 7H 8H 9H TH QH AD KD AH TD 3D 4H 5H 4D 5D
right (pt): 4S AS 3S JS 5S 7S 9S QS AC 2C 3C 4C 6S 2S 6C 8S 7C 8C KS TS 9C TC JC QC KC 5C
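
For reference, the same keying process could be scripted by reusing encipher_card() from the sketch earlier, plus a small helper (my own, for illustration) that maps a key phrase letter to its black-pile card; the printed piles can then be compared against the keyed piles shown here.

RANKS = "A23456789TJQK"

def letter_to_black_card(ch):
    """Plaintext letters A-M live in Spades, N-Z in Clubs."""
    i = ord(ch) - ord("A")
    return RANKS[i % 13] + ("S" if i < 13 else "C")

red   = [r + "H" for r in RANKS] + [r + "D" for r in RANKS]   # AH..KH, AD..KD
black = [r + "S" for r in RANKS] + [r + "C" for r in RANKS]   # AS..KS, AC..KC

for ch in "CHAOCIPHER":
    encipher_card(red, black, letter_to_black_card(ch))       # discard the ciphertext

print(" ".join(red))    # keyed left (ct) pile
print(" ".join(black))  # keyed right (pt) pile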

Enhancements

Initialization Vectors

One thing that we have learned with modern computer encryption primitives is to prepend initialization vectors to the ciphertext. The initialization vector must be random and unpredictable. Its function is to create a unique state on the system before the plaintext is encrypted or before the ciphertext is decrypted. The point is to modify a secret state (our key, or pile orders) while ignoring the output. By adding an initialization vector to the system, we limit the effectiveness of attacks on the ciphertext. For example, if the initialization vector is 26 characters long (one character for each character in the English alphabet, or 26! total combinations), then an attacker would need 25 collisions on one initialization vector to launch an attack on the state (the last element can be determined by process of elimination).

Unfortunately, a 26-character initialization vector is not very practical to use by hand. Knowing that it is standard for field agents to break up their messages into blocks of five characters, it would seem reasonable to use a 5-character initialization vector. However, this doesn't seem to mix the state well enough.

For example, consider using an unkeyed deck to encrypt the text "AARON" 10 times with different initialization vectors at each round:

BSTPR VUYVS
LOKJY WJXTR
YFTLN WJLHQ
UAOZP UTVIV
YXTXU VILUH
WGQCJ UTLUE
LYPYZ WHYSS
QHESJ VHKEQ
CLZRN WVMRE
FEKEQ VUKDR

The first five characters in each ciphertext is the initialization vector, randomly generated. The second block of 5 characters is my name encrypted after the initialization vector keyed the deck. Notice that the first character in the second block seems to have a lot of "V"s and "W"s. If I do 100 rounds, and count the frequency of the first character, I get the following:

F:1, L:1, G:3, K:3, T:4, I:7, J:7, U:7, H:8, W:14, V:45

That is not a good distribution of characters for the first plaintext character being an "A" over 100 different initialization vectors. I would expect it to be much more diffuse. So, how about instead of a 5-character initialization vector, we bump it to 10? How does the frequency distribution look then?

A:1, I:1, U:1, V:1, X:1, Z:1, T:2, E:3, G:3, N:5, O:5, Q:5, B:6, C:6, F:7, D:10, S:11, R:13, P:18

That's a little bit better. A 26-character initialization vector would certainly show a flatter frequency distribution for the first ciphertext character in the message. However, as mentioned, that's cumbersome. So, at this point, it's up to you. Using a 5-character initialization vector would provide about 10 or 11 possible first ciphertext characters. Using a 10-character initialization vector increases that to about 18 with a flatter distribution.
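
Generating the initialization vector itself is trivial in software; the only requirement is that it be unpredictable, so draw from the operating system's random source rather than a seeded PRNG. A minimal sketch:

import random
import string

_rng = random.SystemRandom()   # OS-backed, unpredictable

def make_iv(length=10):
    """Return a random initialization vector of uppercase letters."""
    return "".join(_rng.choice(string.ascii_uppercase) for _ in range(length))

print(make_iv(5))    # e.g. "KEQPN"
print(make_iv(10))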

PKCS#7 Padding

As mentioned, it has become a field cipher standard to separate your ciphertext into blocks of 5 characters. This means that if your message is not a multiple of 5 characters, you add padding at the end until it is. However, when the recipient decrypts the message, it should be unambiguous exactly what is padding, and what is not. Padding in the style of PKCS#7 makes this possible.

We can define this easily enough by determining exactly how many characters must be added to pad the message into a multiple of 5 characters. So, we'll count:

  • If the message needs only one character appended, append a single "V".
  • If the message needs two characters appended, append "WW".
  • If the message needs three characters appended, append "XXX".
  • If the message needs four characters appended, append "YYYY".
  • If the message is already a multiple of five characters, append "ZZZZZ".

By using the padding described above, after decrypting the message, the recipient only needs to look at the last character to determine exactly how many characters make up the padding, and strip them from the plaintext.
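
In software, this padding rule (a letter-based variant in the spirit of PKCS#7) is only a few lines; a sketch:

PAD_CHAR = {1: "V", 2: "W", 3: "X", 4: "Y", 5: "Z"}
PAD_LEN  = {"V": 1, "W": 2, "X": 3, "Y": 4, "Z": 5}

def pad(msg):
    """Always append 1-5 padding characters so the length is a multiple of 5."""
    needed = 5 - (len(msg) % 5)          # 1 through 5, never 0
    return msg + PAD_CHAR[needed] * needed

def unpad(msg):
    """Strip the padding indicated by the final character."""
    return msg[:-PAD_LEN[msg[-1]]]

print(pad("ATTACKATDAWN"))       # ATTACKATDAWNXXX
print(unpad("ATTACKATDAWNXXX"))  # ATTACKATDAWN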

To illustrate this, let's take an unkeyed deck, add a 5-character initialization vector, and encrypt the message "ATTACK AT DAWN". This message is only 12 characters, so I would need to add "XXX" at the end of the message according to the definition above. Thus my message becomes (removing spaces) "ATTACKATDAWNXXX". Adding the 5-character initialization vector "KEQPN" then encrypting, I get the following result:

plaintext: ATTACK AT DAWN
initialization vector: KEQPN
padding: XXX

ciphertext: KEQPN XLHTT PRUCA FHUEC

Of course, decrypting "KEQPN XLHTT PRUCA FHUEC" and removing the initialization vector "KEQPN" will reveal "ATTACKATDAWNXXX". It's clear to the recipient that "XXX" is padding, and can be stripped without affecting the plaintext.

Conclusion

This has been a lengthy post, and I commend you for reading this far. The Chaocipher is an interesting algorithm, and I'll be studying its properties as time moves forward. I think the Chaocipher fits well as a playing card cipher, and gets as close to "bare metal" as you can without designing an actual mechanical mechanism with two rotating disks and removable character tiles. Playing cards are easy to carry around with you in your pocket, so its portability is nice.

Further, we can increase the strength of the algorithm, as mentioned, by adding an initialization vector at the start of the message, and by adding padding, we can stick with the standard of 5-character blocks in our ciphertext. Of course, this means adding 6-10 additional characters, but for a 160-character message, this doesn't seem too cumbersome.

There are some things that I have observed while using playing cards as the cipher hardware. First, encrypting and decrypting are slow. It takes me about a minute to encrypt or decrypt two to three characters. So, for a 160-character message, it could take the better part of an hour to work through.

Second, due to its slow speed, you may get tempted to try to speed things up a bit, so you can work through the message more quickly. However, this drastically opens you up to mistakes. I was encrypting the plaintext "JELLY LIKE ABOVE THE HIGHWIRE SIX QUACKING PACHYDERMS KEPT THE CLIMAX OF THE EXTRAVAGANZA IN A DAZZLING STATE OF FLUX" over and over. After about 10 ciphertexts, I wrote a Python script to automate the process for me. Only 1 of the ciphertexts was 100% accurate, character-for-character. And, unfortunately, 1 ciphertext was 0% accurate, with every character in the message incorrect. On the other 8 messages, I seemed to maintain accuracy for about 2/3 of the characters; on some, I made a mistake earlier on. Regardless, the point is, I was making frequent mistakes, despite my best effort not to. Only 1 out of 10 ciphertexts would decrypt cleanly. It might be worth having two decks, one for producing the ciphertext character, and one for double-checking your work. Of course, this slows you down further, but could be doable for minimizing mistakes.

However, the Chaocipher with playing cards is a fun cipher to work, and easy once you get the hang of it. I would recommend using plastic playing cards, such as the ones from Kem, Copag, or Bicycle Prestige. This way, the cards don't get gummed up like paper cards, are washable, and last longer due to their extra durability; overall, they are just better-feeling playing cards.

If you work the Chaocipher with playing cards, let me know what you think.

SHA512crypt Versus Bcrypt

On the Internet, mostly in crypto circles, you'll see something like the following in a comment, forum post, on a mailing list, or otherwise:

Do not use fast hashes to store passwords on disk. Use bcrypt.

In most cases, however, the understanding of why to use bcrypt isn't entirely clear. You'll hear the standard answer "It's slow", without a real understanding of how slow, or why it is slow. In this post, I want to explain why bcrypt is slow, some misconceptions about using fast hashes, and where the real strength of bcrypt lies (hint: it's not speed). Finally, I'll close with an alternative that many are starting to talk about as a possible replacement for bcrypt.

First, when people are talking about using bcrypt for password hashing, they are referring to the bcrypt cryptographic key derivation function, designed by Niels Provos and David Mazières. Bcrypt is designed to be intentionally slow and expensive. It was designed specifically with password storage in mind. The motivation is clear- if a password database of any kind is leaked to the Internet, it should be cost prohibitive for password crackers to make any sort of progress recovering the unknown passwords from the known hashes.

bcrypt algorithm
How does bcrypt work though? What is the algorithm? According to the paper, the core bcrypt function in pseudocode is as follows:

bcrypt(cost, salt, input)
    state = EksBlowfishSetup(cost, salt, input)
    ctext = "OrpheanBeholderScryDoubt" //three 64-bit blocks
    repeat (64)
        ctext = EncryptECB(state, ctext) //encrypt using standard Blowfish in ECB mode
    return Concatenate(cost, salt, ctext)

The first function, "EksBlowfishSetup(cost, salt, input)" in the algorithm is defined as follows:

EksBlowfishSetup(cost, salt, key)
    state = InitState()
    state = ExpandKey(state, salt, key)
    repeat (2^cost) // exponential cost by powers of 2
        state = ExpandKey(state, 0, key)
        state = ExpandKey(state, 0, salt)
    return state

In the "EksBlowfishSetup", you'll notice the "repeat" step uses a binary exponential parameter. As the cost is increased, the time it will take to finish the algorithm will take exponentially longer. Bcrypt was designed with this cost parameter to adjust for Moore's law. As computing strength continues to improve, bcrypt should be flexible in its design to adjust for those advancements. This is why the cost parameter is baked into bcrypt, and why people call it "slow".

Finally, you'll notice the "ExpandKey(state, salt, key)" function in the algorithm. It is defined as follows:

ExpandKey(state, salt, key)
    for(n = 1..18)
        P_n = key[32(n-1)..32n-1] XOR P_n //treat the key as cyclic
    ctext = Encrypt(salt[0..63])
    P_1 = ctext[0..31]
    P_2 = ctext[32..63]
    for(n = 2..9)
        ctext = Encrypt(ctext XOR salt[64(n-1)..64n-1]) //encrypt using the current key schedule and treat the salt as cyclic
        P_2n-1 = ctext[0..31]
        P_2n = ctext[32..63]
    for(i = 1..4)
        for(n = 0..127)
            ctext = Encrypt(ctext XOR salt[64(n-1)..64n-1]) //as above
            S_i[2n] = ctext[0..31]
            S_i[2n+1] = ctext[32..63]
    return state

Because bcrypt was based on Blowfish, the "ExpandKey(state, 0, key)" function used in the "EksBlowfishSetup" function is the same as regular Blowfish key schedule since all XORs with the all-zero salt value are ineffectual. The bcrypt "ExpandKey(state, 0, salt)" function is similar, but uses the salt as a 128-bit key.

Also, to clarify, a 128-bit salt is also baked into the algorithm, as you can see. This is to prevent the building of lookup tables for bcrypt, such as rainbow tables. Salts do not slow down crackers, and it's assumed that salts will be leaked with the database. All salts provide is the protection against using a hash lookup table to find the originating plaintext. Because salts are baked into bcrypt, bcrypt lookup tables will never exist. This forces password crackers to brute force the hash.

Understanding password security
There are a few key security elements related to passwords that you must understand. They are the following:

  1. The unpredictability measurement, aka "entropy", of the password provided by the user.
  2. The speed at which brute forcing passwords can commence.
  3. The cryptographic strength of the function itself.

I ordered these for a specific reason: the most likely "weak link" in the chain of password security is the password the user provides. The history of leaked password databases has shown us that. If users understood the real strength behind passwords, they would understand the basic concepts of entropy, even if they weren't familiar with the term itself. If entropy levels were high in all users' passwords, no matter what, then recovering passwords from hashes via brute force would be ineffective. But 70-80%, and more, of the passwords in leaked databases are recovered, because this simple concept is not applied. The speed at which password crackers brute force their way through the hashes in the database would no longer matter, because no amount of practical computing power would be able to work fast enough, before the death of the Universe, to recover the user's password.

Sadly, this just isn't the case. People suck at picking passwords.

Key stretching
So, we need to compensate for users picking bad passwords, and bcrypt makes a great leap in this regard. Because of the cost parameter which is part of the algorithm, we can adjust the cost to make password hashing intentionally slow. And, as computing power increases, the cost parameter can continue to be adjusted to compensate. This is what most people understand, when they claim that "bcrypt is slow".

The argument is that cryptographic hashes are designed to be fast, fast, fast. And they're right. Cryptographic hash functions are designed to provide data integrity regardless of the size of the input. If I have a 4.7 GB CD image, I should be able to calculate its digest in reasonable time, so when I transfer the image to another computer, I can recalculate the digest, and compare that the two digests match, in reasonable time.

This would seem like a Bad Thing for password storage, because passwords are short (much shorter than 4.7 GB at least), so password crackers would be able to guess millions or billions of passwords per second using a fast cryptographic hash. You're right, for the most part. Ars Technica ran a story on password cracking with a 25-GPU cluster. It achieves a speed of 350 billion NTLM passwords per second, which means every conceivable Windows NTLM password can be recovered in less than 6 hours using this behemoth. It can work MD5 at 180 billion per second, or 63 billion per second with SHA-1.

At these speeds, the argument against using fast cryptographic hashes to store passwords is sounding pretty strong. Except, that Ars Technica article, and most bcrypt fanboys seem to overlook one thing- key stretching. Just because cryptographic hashes are fast, doesn't mean we can't intentionally slow them down. This is key stretching.

Key stretching is the idea that you reuse internal state for calculating a new key. For cryptographic hash functions, this means "iterations" or "rounds". The idea is to take the cryptographic digest of the input, and use this digest as the input for another hashing round. In pseudocode, you could think of it like this:

salt = random() // programmatically determine a salt randomly
password = input() // get the password from the user
key = ''
cost = 5000

for ROUND in 1 to cost: do
    digest = SHA512(salt, password,  key)
    key = digest
done

If our 25-GPU cluster could work through 50 billion SHA-512 cryptographic hashes per second, by forcing 5,000 SHA-512 calculations before getting to the desired hash, our 25-GPU cluster can now only work through 10 million SHA-512 hashes per second. As the iterative count is increased, the time it takes to calculate the resulting digest increases. As such, we have created a "sha512crypt" that has a similar cost parameter as bcrypt. Now the question remains- does it hold up?

Practical examples
To see if this "key stretching" idea holds up, I wrote two Python scripts- one using SHA-512, and the other using bcrypt. In both cases, I increase the cost parameter from a reasonable starting point, and increased it well beyond a reasonable expectation.

Here is my Python code for "test-sha512.py":

#!/usr/bin/python
import hashlib
password = b'password'
cost = 5000
key = ''
m = hashlib.sha512()
for i in xrange(cost):
    m.update(key+password)
    key = m.digest()

And here is my Python code for "test-bcrypt.py":

#!/usr/bin/python
import bcrypt
cost = 6
password = b'password'
salt = bcrypt.hashpw(password,bcrypt.gensalt(cost))
hash = bcrypt.hashpw(password, salt)

In both cases, I incremented the cost, then timed re-running the script. Of course, Python is an interpreted language, so the absolute times would be much lower if this were implemented in C, or assembly. Further, this was done on my aging T61 laptop. Running it on my 8-core i7 workstation with triple-channel DDR3 would show improved times. It's not the absolute times that are critical, though. What is critical is seeing the exponential back-off as the cost is increased.

Here is a table detailing my findings. Notice that the bcrypt cost increments by a single digit. Because it's binary exponential back-off, the times increase by a power of 2 at each iteration. I also adjusted the sha512crypt cost a little to more closely match the bcrypt timings, even though it's not a strict doubling of each cost value.

        bcrypt                    sha512crypt
cost    time (s)        iterations        time (s)
6       0.032           5,000             0.036
7       0.045           10,000            0.047
8       0.064           20,000            0.064
9       0.114           40,000            0.105
10      0.209           80,000            0.191
11      0.384           160,000           0.368
12      0.745           320,000           0.676
13      1.451           640,000           1.346
14      2.899           1,280,000         2.696
15      5.807           2,560,000         5.347
16      11.497          5,500,000         11.322
17      22.948          11,000,000        22.546
18      45.839          22,000,000        45.252
19      91.95           44,000,000        90.14
20      187.27          88,000,000        187.52

In the Python bcrypt implementation, the default cost is "10". For most modern GNU/Linux operating systems, when storing the user password in /etc/shadow with sha512crypt (yes, I didn't come up with the name), the default cost is 5,000 iterations. In both cases, the cost can be adjusted. In the case of the Python bcrypt module, it's just a matter of passing the function a numerical argument. In the case of GNU/Linux, it's editing PAM by adding "rounds=" to a config file.
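
For example, on a Debian-based system the line to edit lives in /etc/pam.d/common-password; the exact options on your system may differ, but appending "rounds=" to the pam_unix entry is all that's needed (the value here is only illustrative):

password  [success=1 default=ignore]  pam_unix.so obscure sha512 rounds=65536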

As such, sha512crypt can be just as slow as bcrypt. Remember, we are trying to adjust for increased computing power that password crackers will have access to. In both cases, bcrypt and sha512crypt address that requirement.

Bcrypt's additional strength
So, if sha512crypt can operate with a cost parameter similar to bcrypt, and can provide that exponential back-off that we are looking for to slow down password brute force searching, then what's the point of bcrypt? Are there any advantages to running it? It turns out, there is, and I suspect this is a consequence of the design, and not something that was intentionally added.

What we would like is to prevent password crackers from using non-PC hardware to attack the password database. SHA-2 functions, such as SHA-512, work very well on GPUs. SHA-2 functions also work well on specialized hardware such as ASICs and FPGAs. As such, while we could make things slow for CPU or GPU crackers, those password crackers with specialized hardware would still have an upper hand on attacking the password database. Further, by addressing GPU cracking, and making it intentionally slow there, we make life more difficult for CPUs, which means hurting the honest user trying to log in to your web application. In other words, if I adjusted sha512crypt for GPU crackers, such that only 1,000 passwords per second could be achieved on a GPU, that might be a full second, or more, for the user logging in on your server's CPU. This may or may not be desirable.

So, how does bcrypt attack this problem? Part of the algorithm requires a lookup table stored in RAM that is constantly modified during execution. This turns out to work very well on a standard PC where the CPU has exclusive access to RAM. It turns out to work fairly poorly on a GPU, where the cores share the on-board memory, and each core must compete on the bus for access to it. As a result, any additional cracking speed is greatly minimized on a GPU when compared to the CPU. In other words, GPU password cracking with bcrypt isn't entirely effective either.

For SHA-2 functions like SHA-512, however, this is not the case. SHA-2 functions use only simple logical and arithmetic operations on fixed-size words, which GPUs excel at. By using a GPU over a CPU for our sha512crypt function, a password cracker can gain a couple to many orders of magnitude of additional cracking power.

So, the reason to use bcrypt isn't because "it's slow". The reason to use bcrypt is because "it's ineffective on GPUs".

A better alternative- scrypt
Unfortunately for bcrypt, however, due to its low memory requirement, bcrypt can be implemented in a field programmable gate array (FPGA) or custom ASICs. Understand that bcrypt was designed in 1999, when such specialized hardware had low gate counts, and was few and far between. 15 years later, times have drastically changed. Bitcoin ASICs and SHA-256 FPGAs are commonplace. Hardware AES is common in CPUs and embedded systems. The fact of the matter is, these FPGAs, with their fast onboard RAM, are well suited to bring bcrypt password cracking well into "fast" territory.

An alternative would be a solution that not only requires memory to be constantly modified during algorithm execution, but also exponentially bloats the memory requirement as the cost is increased. scrypt addresses this shortcoming in bcrypt. Scrypt is another password key derivation function that was initially designed in 2009 by Colin Percival as part of the Tarsnap online backup service.

Scrypt has all of the advantages that bcrypt provides, namely a baked-in salt, an exponential cost parameter, and ineffectiveness on GPUs, while also adding an exponential RAM requirement per the cost. Because of this RAM requirement, it is no longer cost efficient to build FPGAs with the necessary RAM.

Security, standards, and implementations
Scrypt is only 5 years young. This gives bcrypt a good 10 year head start. In terms of security, that head start is preferred. We want cryptographic functions that have withstood the test of time with cryptographers constantly attacking and analyzing their functions, primitives, and implementations. The longer they remain "unbroken", the more "secure" we deem the functions to be.

Bcrypt continues to be attacked and analyzed, and is showing no serious sign of weakness, 15 years later. This is good for the security of bcrypt. Scrypt however has seen less scrutiny, mostly due to its young age. However, it has been 5 years, and like bcrypt, no serious signs of weakness have been shown. By comparison, the SHA-2 family of functions was created in 2001, and has been scrutinized much more than bcrypt and scrypt combined, and also is not showing any serious signs of weakness. So, from a security standpoint, the SHA-2, bcrypt, and scrypt functions all seem to be fairly secure.

When looking at governing body standards, NIST has no paper on bcrypt or scrypt. They do recommend using PBKDF2 (another key derivation function, which I haven't explained here, but love) for password storage, and NIST has standardized on SHA-2 for data integrity. Personally, while I like the ideas of bcrypt and scrypt, I would recommend sticking with the NIST recommendations with high iteration counts, as shown above. Until we see more standards body interest in bcrypt and scrypt, IMO, you are taking a risk using them for password storage (less so for bcrypt than scrypt, at least).

Finally, because of the newness of scrypt, there are fewer tools for using it in programming languages than for bcrypt, and fewer still than for SHA-2. Further, most programming languages don't include either bcrypt or scrypt in their "standard library" of modules or functions, while SHA-2 is more generally found. And for the implementations that do exist, some 3rd party libraries are more trustworthy than others. Because you're dealing with password storage, it's critical you get this right.

Conclusion
While I love the algorithms behind bcrypt and scrypt, I've always advocated for using high iteration counts with the SHA-2 or PBKDF2 functions. Even more important is teaching people to understand the concepts behind password entropy, and to improve their own online security. That is the weakest link, and needs the most work, IMO.

So, if you ask me, I'll tell you to use either PBKDF2 or SHA-2 with high iterative counts. Or, if you must absolutely use bcrypt, then I'll recommend that you use scrypt instead.

Use /dev/random Instead Of /dev/null

While writing a shell script the other day, I was redirecting some output to /dev/null, as normal, when something dawned on me. Why don't I redirect my output to /dev/random instead? After all, both Linux random devices are writable by everyone on the system:

$ ls -l /dev/*random
crw-rw-rw- 1 root root 1, 8 Nov 13 15:14 /dev/random
crw-rw-rw- 1 root root 1, 9 Nov 13 15:14 /dev/urandom

Knowing what I know about the Linux cryptographically secure pseudorandom number generator (CSPRNG), I know that any bits put into the CSPRNG input pool are hashed with the SHA1 cryptographic hash function, 512 bits at a time. This includes any data you redirect to it from the shell, as well as output from itself. When data is fed into the CSPRNG input pool, the RNG is reseeded.

To understand this concept of seeding an RNG, let's assume for a moment that the only source of input for the RNG is its own output. If this were the case, we would only need a starting value to "seed" the RNG, then let it run by hashing its own digests. In this scenario, each digest is chosen deterministically, and if we know the input that seeded the RNG, we can predict all of its future outputs.

Think of this scenario like a progress bar. For the SHA1 cryptographic hash, there are 2^160 possible unique digests. Theoretically, our RNG should be able to work through all 2^160 digests only once before starting over, provided there is enough time to do so (we know now this isn't the case, which means SHA1 has been weakened in terms of collision attacks). However, when you change the input by providing something other than the next digest in the queue, you change the next starting point of the RNG. It's as though you've "skipped" to a non-sequential location in your progress bar.

Now, consider constantly reseeding your RNG. This is what your Linux system is actually doing. It's constantly processing timing events from disk IO, network packets, keyboard presses, mouse movements, etc. All these inputs get collected into an "entropy pool", which is then hashed with SHA1 512-bits at a time, as we mentioned earlier. This input changes the sequential ordering of the digest outputs, making the result unpredictable, and non-deterministic.

So, when working on the shell, by redirecting your output to /dev/random, you reseed the CSPRNG, meaning you have changed the digest output ordering to something different than what it would have been had you not redirected those bits. In fact, the more data you send to the CSPRNG, the more you reseed it, forever altering the path it takes on its "progress bar".

Now, you may ask why not have some userspace process running in the background that is always reseeding the CSPRNG? Sure, you can. In this case, I would recommend running Haveged on your system. Haveged will probe many more hardware events on the system than the default install will, and keep the entropy pool topped off at full. The CSPRNG will be constantly reseeded. However, for shell scripts, redirecting to /dev/random instead of /dev/null works.
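
For example, in a shell script the change is trivial (the command and path here are only illustrative):

$ grep -r 'some pattern' /etc/ > /dev/null 2>&1      # output is thrown away
$ grep -r 'some pattern' /etc/ > /dev/random 2>&1    # output is mixed into the CSPRNG input pool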

My only concern with redirecting to /dev/random would be throughput. Doing a simple and crude benchmark comparing /dev/null to /dev/random, I get the following on my workstation:

$ for I in {1..5}; do dd if=CentOS-6.5-x86_64-bin-DVD1.iso of=/dev/random; done
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 81.3842 s, 54.9 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 76.4597 s, 58.4 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 74.6036 s, 59.9 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 75.4946 s, 59.2 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 74.375 s, 60.1 MB/s
$ for I in {1..5}; do dd if=CentOS-6.5-x86_64-bin-DVD1.iso of=/dev/null; done  
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 59.325 s, 75.3 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 56.5847 s, 79.0 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 54.4541 s, 82.1 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 56.0187 s, 79.8 MB/s
8726528+0 records in
8726528+0 records out
4467982336 bytes (4.5 GB) copied, 57.0039 s, 78.4 MB/s

Seems I get slightly better throughput with /dev/null, which isn't surprising. So, unless you know you need the throughput, I would recommend sending your data to /dev/random over /dev/null.

Using The Bitmessage Storage Service

While hanging out on the "privacy" channel on Bitmessage, someone sent the following:

"You have no files saved. For instructions please send a message to BM-2cUqBbiJhTCQsTeocfocNP5WCRcH28saPU with the subject 'help'."

This is actually pretty cool. No doubt you should call a faceless storage provider into question, but I thought I would play around with it. When you send 'help' to the address, you get the following information:

"You send commands to the service via messages, the message subject is used to tell the server what you want to do. You are authenticated via the address you send from, so be sure to keep it safe. Below are the possible commands:

LIST - Lists all the files you currently have saved to your account.
NEW [file_title] - Save a new file to your account. e.g "NEW my top secret file", put the contents of your file into the message body.
UPDATE [file_id] [new_file_name (optional) ] - Update an existing file saved to your account. The new_file_name is optional and is only used to rename your file. e.g "UPDATE a567f My even more top secret file"
REMOVE [file_id] - Removes a file saved to your account. e.g "REMOVE a567f".

TOP TIP!!!
When sending the LIST command, type some random blurb into the message. If you send multiple LIST commands to the server in a short period of time the Bitmessage client see's your requests as duplicate messages and ignores them. The random blurb ensures the server always hears you."

Intrigued, I started thinking about this. Due to the limitation of Bitmessage sending messages of up to 256KB, your file can be no bigger than 256K, unless you set up a striped RAID array. Then you can chop up the file into multiple messages. However, the messages will likely be decrypted at the other end, and will very likely be stored unencrypted. So, they need to be encrypted, then converted into base64, before sending to the address.

Initially, I figured GnuPG would work for this. But then I thought that it's not a good fit, because I lose plausible deniability. So, instead, I'll use the 'dm-crypt' module with cryptsetup(8).

First, I need my block devices. I'll setup some files with dd(1), then add them to loopback devices. Notice that I'm building the files from /dev/urandom. This is critical, so no metadata, including encryption boundaries, is leaked:

$ dd if=/dev/urandom of=~/file1 bs=256k count=1
$ dd if=/dev/urandom of=~/file2 bs=256k count=1
$ ls -l file?
-rw-rw-r-- 1 user user 262144 Nov 29 07:57 file1
-rw-rw-r-- 1 user user 262144 Nov 29 07:57 file2
$ sudo losetup /dev/loop1 ~/file1
$ sudo losetup /dev/loop2 ~/file2
$ losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop1         0      0         0  0 /home/user/file1
/dev/loop2         0      0         0  0 /home/user/file2

Because I want plausible deniability, I'll use cryptsetup(8) first, _then_ create the RAID array. If I create the RAID array first, the two files will reveal information that they belong to an array. I don't want any metadata leaked at all.

$ sudo cryptsetup create aes-crypt-1 /dev/loop1
$ sudo cryptsetup create aes-crypt-2 /dev/loop2
$ ls -l /dev/mapper/aes-crypt-?
lrwxrwxrwx 1 root root 7 Nov 29 07:47 /dev/mapper/aes-crypt-1 -> ../dm-0
lrwxrwxrwx 1 root root 7 Nov 29 07:47 /dev/mapper/aes-crypt-2 -> ../dm-1

I can now create the RAID array, format it with ext2, and mount it. A couple notes: I'll first want to set the RAID chunk size to something low, otherwise I won't be able to put down a filesystem on the block device. So, I chose the minimum according to mdadm(8), which is 4KB. Then, when formatting with ext2, I'll only get a total of 64 inodes, which means a total of 64 files. I'm going to increase this to 256, so I can intentionally fragment the filesystem before putting down the data. I'll explain the reason for the intentional fragmentation in a second.

$ sudo mdadm --create /dev/md0 --level 0 --raid-devices 2 --chunk 4 /dev/mapper/aes-crypt-{1,2}
$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Nov 29 07:47:19 2014
     Raid Level : raid0
     Array Size : 464
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Nov 29 07:47:19 2014
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 4K

           Name : example:0  (local to host example)
           UUID : 2bca1bf9:3af4a5d1:1989bb34:9b46bb9c
         Events : 0

    Number   Major   Minor   RaidDevice State
       0     253        0        0      active sync   /dev/dm-0
       1     253        1        1      active sync   /dev/dm-1

Now the formatting. Notice I'm changing the number of inodes. Also, we don't need to set aside any reserved space for the root user; it would only occupy precious disk space.

$ sudo mkfs.ext2 -N 256 -m 0 /dev/md0
$ sudo mount /dev/md0 /mnt
$ ls /mnt
lost+found
$ df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        427K   15K  412K   4% /mnt
$ df -i /mnt
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/md0          256    11   245    5% /mnt

Now, before putting data into the filesystem, I'm going to fill it with small files of random data, then remove every n-th file to create enough space to put down my data. This will force my data to be intentionally fragmented. The reason for this is to avoid a snapshot attack on the files I'll be storing remotely. If I did not fragment the data, then as I update the filesystem, only small incremental changes would take place. This would allow the attacker to "subtract" a previous filesystem iteration from the current one, letting them learn where my stored data resides, as well as possibly figuring out what it contains. Because we're talking about a 512 KB filesystem here, even encrypted, disk I/O isn't a concern.

First, how big can each file be at a maximum? It appears to be 1721 bytes.

$ echo '(412*1024)/245' | bc -l # 245 available inodes, 412K available space
1721.99183673469387755102

Next, create the files, each with 1721 bytes:

$ for i in {1..245}; do sudo dd if=/dev/urandom of=/mnt/file$i bs=1721 count=1 2> /dev/null; done

Unfortunately, the filesystem filled up before completion. As such, we have empty files. So, we'll find those and remove them:

$ find /mnt -empty -type f | wc -l
25
$ sudo find /mnt -empty -type f -delete
$ ls /mnt/file* | wc -l
204

Now, I'm ready to fragment the filesystem. I know that one file I want to copy is 8820 bytes in size. So I need to free up 6 non-contiguous files. According to my math, I need to free up every 34th file:

$ echo '8820/1721' | bc -l
5.12492736780941313190
$ echo '204/6' | bc -l
34.00000000000000000000
$ sudo rm /mnt/file{$((34*1)),$((34*2)),$((34*3)),$((34*4)),$((34*5)),$((34*6))}
$ df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        427K  415K   12K  98% /mnt

I'm now ready to copy in my 8820-byte file:

$ sudo cp ~/secret-file.txt /mnt
$ df -h /mnt                      
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        427K  424K  3.0K 100% /mnt

Now I can tear everything down:

$ sudo umount /mnt
$ sudo mdadm --stop /dev/md0
$ sudo cryptsetup close aes-crypt-1
$ sudo cryptsetup close aes-crypt-2
$ sudo losetup -d /dev/loop1
$ sudo losetup -d /dev/loop2

Now I need to convert my two files to base64, copy, paste, and send the resulting output to BM-2cUqBbiJhTCQsTeocfocNP5WCRcH28saPU. The subject of the first message would be "NEW file1", while the subject of the second message would be "NEW file2". The body of each message would be the base64 output for file1 and file2.

$ base64 file1 | tr -d '\n'; echo
yAFtXPy7NgHl5Q0ueJiZgjOlYyrocWaxcvA6CjKF0rNd10sTfNvMCmQrL0cA79oO0 ...(snip) ...


$ base64 file2 | tr -d '\n'; echo
T3GTNvwiewXnnOHTjITNpukyLz4d8iv8wl/JP0YjY0v5s5euF1qv4WwMv9Ejl9AsNMm5NXoK/hFQK ...(snip)...

Of course, sending 2 messages of 256 KB in size will take some time to calculate the PoW, so be prepared for that (I don't know why the PoW is what it is. It should have been a double SHA256() like Bitcoin. Meh).

When you want to update your files, you will need to build everything back up, except you won't "create" the RAID array, you'll "assemble" it, and you won't format your filesystem, obviously:

$ sudo losetup /dev/loop1 file1
$ sudo losetup /dev/loop2 file2
$ sudo cryptsetup create aes-crypt-1 /dev/loop1
$ sudo cryptsetup create aes-crypt-2 /dev/loop2
$ sudo mdadm --assemble /dev/md0 /dev/mapper/aes-crypt-1 /dev/mapper/aes-crypt-2
$ sudo mount /dev/md0 /mnt

You'll also need to move your file off the filesystem and rewrite all the random files, before copying the new data on. This is important, because we're trying to prevent a snapshot attack against our encrypted filesystem. So, on every commit to the BM storage service, as best as we can control, every bit on the filesystem needs to be changing.
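
In practice, that re-randomizing step can look something like the following rough sketch (the file count and size match the setup above; adjust them to your filesystem, and save an updated copy of your real data elsewhere first):

$ sudo rm /mnt/secret-file.txt /mnt/file*
$ for i in {1..245}; do sudo dd if=/dev/urandom of=/mnt/file$i bs=1721 count=1 2> /dev/null; done
$ sudo find /mnt -empty -type f -delete

Then delete every n-th filler file and copy the updated data back on, exactly as before.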

As the help mentions, sending frequent "LIST" commands can cause your message to be ignored, unless the body of the message includes unique random data for each message sent. So, the "TOP TIP" of sending random data in the body of the message is a good idea. As such, I'll run the following for each "LIST", practically guaranteeing that I'll see a unique string, and the query will succeed:

$ dd if=/dev/urandom bs=2k count=1 2> /dev/null | rhash --sha3-512 -
4b5b19c9a8a61cf724cd710c6fd0c54edb46de2bfa55f2ec9a179a590b808993593d333f95dd6c607e9d366c385037cc0a600d262898e3b4c5be26479d3c962c  (stdin)

I'll copy and paste the result into the message body before sending.

I don't know that I'm ready to store sensitive data with it, but I do think it's pretty cool. The 256KB limit for the file size is pretty damning, so this isn't something to use for intense media storage, but instead, mostly for documents, such as password storage, browsing bookmarks, student grades, etc. It works, although it's definitely a niche product for a niche group. I wouldn't expect my wife to use this.

What's The Matter With PGP?

http://blog.cryptographyengineering.com/2014/08/whats-matter-with-pgp.html has been making its way around crypto circles recently. Enough so that I figured I would comment by posting a reply on my own blog.

This post is written by cryptographer Matthew D. Green. As a result, even though I've never heard of Dr. Green, when I read the post, I was expecting something of high caliber. After all, this isn't some random dude on the Internet blowing his horn. THIS IS A GENUINE CRYPTOGRAPHER. Turns out, the post is complete junk. It's about as thoughtful and informative as the Nigerian 419 scam. Let's analyze the post, top-to-bottom.

First off, as a TL;DR, Dr. Green thinks PGP is Teh Fail. After reading the post, it almost seems to me that he just barely installed PGP on his desktop, couldn't figure most of it out, got confused by a great deal of the terminology (isn't he a cryptographer?), and threw his hands up in defeat. When you read it a 2nd time, you notice almost immediately that he is confusing a few things:

  • The OpenPGP specification.
  • The GnuPG implementation of the OpenPGP specification.
  • Mail client frontends to GnuPG.

Are they all Teh Fail, or just the OpenPGP specification? After reading it a few times, I'm convinced he is rolling up anything and everything to do with PGP, past, present, and future, as total failure. It's just unfortunate he can't set things straight between the OpenPGP specification, the GnuPG implementation, or a mail client.

Meh. Color me unimpressed, Dr. Green.

1. PGP Keys Suck
My favorite line out of his post is this:

"For historical reasons [PGP keys] tend to be large and contain lots of extraneous information, which it difficult to print them a business card or manually compare."

Yes, they are definitely large. But printing the entire key on a business card? Really?? I've been using GnuPG since 2004, and I've never heard of anyone wanting to even compare public asymmetric keys character-by-character, let alone print a full asymmetric key on a business card. This is what the SHA-1 fingerprint is for. It's been generally accepted that comparing keys using the 40-character hexadecimal fingerprint is sufficient to know that each person has the same key. Key signing parties are based on this principle. Fingerprints, as a result, have been printed on business cards plenty of times.

Regardless, asymmetric keys in general are nasty to deal with by hand. This isn't just a problem with GnuPG, but OpenSSH, OpenSSL, or any other front end implementation that uses public key cryptography. This shouldn't come as a surprise. This is why we have tools to work with them. Things like Gnu Privacy Assistant, KGPG, Enigmail, GnuPG, and others. There is nothing in the OpenPGP specification that defines how you should implement such a front end. As a result, this is not the fault of OpenPGP. If you don't like the way GnuPG handles transferring public keys from one destination to another, propose a patch?

Without taking a breath, he then complains that your request for a GnuPG key is over plaintext HTTP. Well, okay. Sure. HTTPS/HKPS would be optimal for transferring the key, and he rightfully identifies a bug where GnuPG doesn't verify that the correct key was downloaded. But, again, this is not a problem with OpenPGP. OpenPGP does not define anything about infrastructure. Maybe the key servers are poorly designed, and maybe OpenPGP front ends like GnuPG have bugs, but that is hardly a problem with OpenPGP itself. If you don't like the key server implementations, maybe submit a patch?

Then his rant turns to OpenPGP infrastructure, by bitching about public key servers (it actually seems like he's comparing everything to miniLock, and setting miniLock as the Gold Standard). His first gripe is getting keys via Twitter:

"If we must exchange things via Twitter, why not simply exchange keys?"

Really? Twitter? Do we want to rely on Twitter for getting public keys? We can't get them from an email, or in person, or from a website, or from a business card, or some other form of informational transfer? Just, Twitter?

Finally, he opens up ECC keys, with the following:

"Modern EC public keys are tiny."

Seems Dr. Green has some sort of obsession with ECC keys. Earlier he mentions that you can fit ECC keys on business cards. He continues comparing a lot of OpenPGP to ECC keys. I'll address that later.

2. PGP key management sucks
There really isn't much to say here. Every point I argued previously, in that OpenPGP provides no infrastructure, applies here. However, he does hit one topic which strikes me as odd: the Web of Trust. He says the following:

"Most OpenPGP implementations do a lousy job of presenting any of this data to their users anyway."

OpenPGP implementations should never define trust for the user. That should always be at the discretion of the user. OpenPGP implementations should only define accuracy, such as correctly decrypting an encrypted piece of text, or successfully verifying a digital signature, because that's all the OpenPGP specification defines. I decide whether or not to trust the data given to me by someone else. And, most of the tools I have dealt with properly work this way. A trust database exists on the computer, allowing me to say "I trust this key", "I don't trust this key", and so forth. I define how those trust relationships are built, and I can put them into the database, so OpenPGP front ends can notify me whether or not I am dealing with trustworthy data. But notifications should be the extent of the Web of Trust in OpenPGP implementations. And just because I signed a key doesn't necessarily mean I trust data from that key owner, so the tool should never make that assumption.

3. No forward secrecy
Okay. I get it. People like perfect forward secrecy (PFS), even to the point of obsession. I'm not one of those people. I use OTR, and I implement PFS on SSL where appropriate, but I'm not going to lose any sleep over the fact that PFS isn't defined in OpenPGP. OpenPGP has its problems. PFS isn't one of them, for a couple of reasons.

First, email is generally looked upon as long-term communication. Most people want to read their email, not destroy it. Many use long-term archival to go back to email replies in a thread, or to look up purchase orders, or recall sensitive employee information. Email has value in multiple views. PFS just doesn't fit well here. If you want PFS in your communication, look at something where there is not high value in retrieving old communication, such as instant chat. Off-the-record (OTR) messaging works well here, but even then, it has its problems. As someone who logs every chat message that comes in, I want to have the ability to decrypt old communication messages sent to me, for a number of reasons. OTR makes this impossible. OpenPGP makes this doable.

Second, GnuPG cryptographically signs encrypted data by default. Thus, if you were to use an OpenPGP implementation for end-to-end communication, all messages have been cryptographically signed. Enabling PFS is at odds with signed data: prior messages should not only be un-decryptable, but they should also not identify who sent the message. Because OpenPGP implementations build everything around a Web of Trust, I find it difficult to see where PFS would fit at all.

However, this sentence downright scares me:

"If the NSA is your adversary just forget about PGP."

What does that mean? The NSA has beaten OpenPGP? I doubt it. Strongly doubt it, actually. So much so, I'm really beginning to wonder about Dr. Green's competence as a cryptographer. Dude, look at the specification, and tell me the NSA has beaten every cryptographic cipher and hash in that RFC. I don't buy into the conspiracy theories that the NSA has some super-duper-uber-mega-cracking-datacenter, and AES-256 is child's play to the NSA. I trust the mathematics, and I trust academia. Maybe the NSA has their own microprocessor manufacturing plant? That has outpaced Intel? Yeah, not buying it.

However, maybe Dr. Green means to say that if you're targeted by the NSA, then before you even use your OpenPGP implementation to encrypt some communication, the NSA has a copy. But then, how is this a problem of OpenPGP and its implementations?

As a cryptographer, to wave your hand like a Jedi, claiming that PGP is worthless in front of the NSA, is frightening. Not because there might be some truth to it (there isn't), but because that's just amateur FUD slinging. For someone with a Ph.D. in the field, I would expect better.

4. The OpenPGP format and defaults suck
Hahahahahahahahahahahah. Uh, no. Not even close. I literally laughed out loud when I read that section. So much so, I showed a couple of mathematicians and amateur cryptographers, and we had a good time.

Chapter 9 of RFC 4880 (the OpenPGP RFC) defines the following constants:

9.1. Public-Key Algorithms

ID Algorithm
-- ---------
1 - RSA (Encrypt or Sign) [HAC]
2 - RSA Encrypt-Only [HAC]
3 - RSA Sign-Only [HAC]
16 - Elgamal (Encrypt-Only) [ELGAMAL] [HAC]
17 - DSA (Digital Signature Algorithm) [FIPS186] [HAC]
18 - Reserved for Elliptic Curve
19 - Reserved for ECDSA
20 - Reserved (formerly Elgamal Encrypt or Sign)
21 - Reserved for Diffie-Hellman (X9.42,
as defined for IETF-S/MIME)
100 to 110 - Private/Experimental algorithm

Implementations MUST implement DSA for signatures, and Elgamal for
encryption. Implementations SHOULD implement RSA keys (1). RSA
Encrypt-Only (2) and RSA Sign-Only are deprecated and SHOULD NOT be
generated, but may be interpreted. See Section 13.5. See Section
13.8 for notes on Elliptic Curve (18), ECDSA (19), Elgamal Encrypt or
Sign (20), and X9.42 (21). Implementations MAY implement any other
algorithm.

9.2. Symmetric-Key Algorithms

ID Algorithm
-- ---------
0 - Plaintext or unencrypted data
1 - IDEA [IDEA]
2 - TripleDES (DES-EDE, [SCHNEIER] [HAC] -
168 bit key derived from 192)
3 - CAST5 (128 bit key, as per [RFC2144])
4 - Blowfish (128 bit key, 16 rounds) [BLOWFISH]
5 - Reserved
6 - Reserved
7 - AES with 128-bit key [AES]
8 - AES with 192-bit key
9 - AES with 256-bit key
10 - Twofish with 256-bit key [TWOFISH]
100 to 110 - Private/Experimental algorithm

Implementations MUST implement TripleDES. Implementations SHOULD
implement AES-128 and CAST5. Implementations that interoperate with
PGP 2.6 or earlier need to support IDEA, as that is the only
symmetric cipher those versions use. Implementations MAY implement
any other algorithm.

9.3. Compression Algorithms

ID Algorithm
-- ---------
0 - Uncompressed
1 - ZIP [RFC1951]
2 - ZLIB [RFC1950]
3 - BZip2 [BZ2]
100 to 110 - Private/Experimental algorithm

Implementations MUST implement uncompressed data. Implementations
SHOULD implement ZIP. Implementations MAY implement any other
algorithm.

9.4. Hash Algorithms

ID Algorithm Text Name
-- --------- ---------
1 - MD5 [HAC] "MD5"
2 - SHA-1 [FIPS180] "SHA1"
3 - RIPE-MD/160 [HAC] "RIPEMD160"
4 - Reserved
5 - Reserved
6 - Reserved
7 - Reserved
8 - SHA256 [FIPS180] "SHA256"
9 - SHA384 [FIPS180] "SHA384"
10 - SHA512 [FIPS180] "SHA512"
11 - SHA224 [FIPS180] "SHA224"
100 to 110 - Private/Experimental algorithm

Implementations MUST implement SHA-1. Implementations MAY implement
other algorithms. MD5 is deprecated.

Please identify what is wrong with this specification. GnuPG implements IDEA, CAST5, 3DES, AES, AES-192, and AES-256 as ciphers, and RIPEMD-160, SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512 as cryptographic hashes. What is wrong with these defaults? Please, enlighten us, Dr. Green.

However, Dr. Green is mixing what the OpenPGP specification requires with what implementations actually use. The RFC exists as a core minimum set of standards, defining what is required as a base, with other things optional or deprecated. GnuPG (as well as PGP) has gone well beyond the defaults that the RFC requires.
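
As a quick check of what your own build supports (the exact list will vary with the GnuPG version installed):

$ gpg --version    # the output lists the supported Pubkey, Cipher, Hash, and Compression algorithms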

But then there's this little gem:

"Elliptic curve crypto is (still!) barely supported."

He's back at ECC. Of course, this shouldn't come as much of a surprise, but there is a good reason most academic and open source cryptographers have steered clear of ECC- software patents:

According to Bruce Schneier as of May 31, 2007, "Certicom certainly can claim ownership of ECC. The algorithm was developed and patented by the company's founders, and the patents are well written and strong. I don't like it, but they can claim ownership."

Dr. Thomas Pornin (one cryptographer on Stack Exchange I have great respect for) answers the ECC patents this way:

"To my knowledge (but I am not entitled, in any way, to give law-related advice), parts of ECC cryptography which may still be patented are:

  • implementation of curves over binary fields using normal bases;
  • point compression;
  • acceleration of Koblitz curves using the Frobenius endomorphism;
  • various optimization tricks on dedicated hardware architectures (FPGA, ASIC)."

However, there is finally an RFC for ECC in OpenPGP. It defines three curves, which are already standardized by NIST- Curve P-256, Curve P-384, and Curve P-521. This document was standardized only two years ago, in June 2012. Unfortunately, these three curves do not meet the "safe curves" definition as defined by Dr. Daniel J. Bernstein ("djb").

GnuPG will likely use EdDSA and Ed25519 as the default for signatures (the same algorithms supported by OpenSSH, actually). Werner did implement Brainpool P256r1, P384r1, and P512r1, even though they are not defined in the RFC.

5. Terrible mail client implementations
Again, mixing the OpenPGP standard, which does nothing to define how an MUA should behave, with a front end to a specific OpenPGP implementation. Regardless, there are some hidden gems in this section.

First, his claim that we're storing the passphrase to the private key in memory:

"[Mail clients] demand you encrypt your key with a passphrase, but routinely bug you to enter that passphrase in order to sign outgoing mail -- exposing your decryption keys in memory even when you're not reading secure email."

Hahahahahaha. Oh, boy. I guess we haven't heard of "GnuPG Agent"? Yeah, there's this thing that sets up a "token"-based system. If you can successfully decrypt your private key, GnuPG will set up a token which is cached in RAM. Then, the MUA just needs to check if such a token exists. If so, it can use the full features of decryption and signing, without ever asking the user for the passphrase.

Regardless, if an attacker has access to your RAM to read its contents, then you have bigger fish to fry, and there is nothing in the OpenPGP specification to help you.

Now this little gem:

"Most of these problems stem from the fact that PGP was designed to retain compatibility with standard (non-encrypted) email. If there's one lesson from the past ten years, it's that people are comfortable moving past email."

Really, I'm fairly confident that I don't need to say much here, but I'll just end this section with my own personal thought:

"If there's one lesson from the past 50 years, it's that people are not comfortable moving past email."

What's cute is his pointing to vendor-controlled messaging systems provided by Apple, Facebook, and WhatsApp as a way to do secure communication. All proprietary, all vendor lock-in, all reliant on 3rd-party infrastructures, all encrypted and decrypted client-server rather than end-to-end. OpenPGP, by contrast, is end-to-end encryption, decentralized, and ad hoc. More importantly, the largest implementation (GnuPG) is Free and Open Source Software.

6. So what should we be doing?
Well for one, not listening to Dr. Green. OpenPGP has legitimate concerns, no doubt. But they aren't what are described in the article. He followed a white rabbit down a hole, when he should have just been studying the specification, and identifying its real problems.

For example, if using OpenPGP in email, all of the email headers remain in plaintext. Only the body is encrypted. This metadata can reveal a great deal of information, such as the sender and recipient, the SMTP servers used in the exchange, the subject content, additional recipients on the carbon copy, etc.

Further, because OpenPGP encrypts data in packets, "header packets" exist in every OpenPGP encrypted file. These header packets can also reveal a great deal of information, such as who encrypted the data, who the recipient of the encrypted data is, when the encrypted data was created, and other things.
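
You can inspect this packet metadata yourself. For example (the recipient key ID here is purely hypothetical):

$ gpg --encrypt --recipient 0xDEADBEEF somefile.txt
$ gpg --list-packets somefile.txt.gpg    # dumps the recipient key ID, algorithms, and other header fields in the clear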

In other words, one problem with OpenPGP is plausible deniability. If you want your encrypted data to be indistinguishable from true random, you're going to fail miserably using OpenPGP. OpenPGP concerns itself with message confidentiality and integrity- deniability just isn't in its scope.

For me, throwing away PGP, because of a number of misguided personal opinions, seems like a Big Mistake. GnuPG has been in development since 1999. OpenPGP itself has undergone revisions, improving the specification. PGP, which OpenPGP built itself upon, is even older, entering the arena in 1991. Throwing 23 years of development, bug fixing, and revisions out the window, just because one troubled cryptographer doesn't like it, seems akin to the very tardy, and incomplete, Netscape 6.0 release.

SCALE 12x PGP Keysigning Party

This year, at SCALE 12x, I'll be hosting the PGP keysigning party. What is a keysigning party, and why should you attend? Maybe this will clear things up.

What is a keysigning party?

A PGP keysigning party is an event where PGP users meet together to exchange identity information and PGP fingerprints. Typically, at a conference such as SCALE 12x, PGP parties are common. The whole idea of the party is to expand the global "Web of Trust". In reality, however, you attend a keysigning party, because you share interests in cryptography, or you are interested in communicating privately with others. Usually, you can expect 10-20 people to attend keysigning parties at Linux conferences, sometimes more, sometimes less.

What is the Web of Trust?

The Web of Trust is just a logical mapping of exchanged identities. When at the keysigning party, you will exchange photo identification with each other, as well as PGP fingerprints. You do this for two reasons, and two reasons only: to verify that you have the correct key and to verify that the owner of that key is who they say they are. That's it. When you leave the party, you will sign every key that you have personally identified. By doing so, you bring that key into your own personal Web of Trust. An arrow from you to them indicates that you have signed their key. If they return the favor, and sign your key, then a return arrow from them to you will result.

It's not too difficult to create a personal Web of Trust that you can view. I have blogged about it in the past at https://pthree.org/2013/03/01/create-your-own-graphical-web-of-trust-updated/. It's interesting to see the large groupings of signatures. It's clear that there was a PGP keysigning party in those large groups.

What do I bring to a keysigning party?

You really only need three things with you when you come to the party:

  1. Something to write with, like a pen or pencil.
  2. A government issued photo identification at a minimum. Additional identification is appreciated.
  3. Your own printout of your PGP key fingerprint

That last item is important. Very important. If you didn't bring item #3, then most PGP keysigning party organizers will not allow you to participate. In order to print out your PGP key fingerprint, run the following command at your terminal:

$ gpg -K --fingerprint

Print that out on a piece of paper, and bring it with you to the party. Some conferences, such as SCALE 12x, will print your PGP fingerprint on your conference badge. This will allow you to sign keys anywhere anytime. All you need to do is verify that the fingerprint on your personal printout matches the fingerprint on the conference badge. Then you may use your conference badge fingerprint at the party.

It's important that you bring your own copy of your PGP key fingerprint, however. The keysigning party organizer will hand out a printout of all the PGP key fingerprints for everyone in attendance. This is to verify that the organizer downloaded your correct and current key(s). You will read off your fingerprint from your personal printout, and everyone else will verify the fingerprint on their printout.

What happens before the party?

All that needs to happen is for every attendee to submit their public PGP key to the party organizer. Typically, there is a deadline for when keys can be submitted. It's important that you adhere to that deadline. The party organizer then prints out on paper a list of every PGP key fingerprint of those who are attending. If you submit your key late, it will not be on the paper, and as such, many party organizers will not let you participate.

What happens at the party?

The key party organizer will pass out sheets of paper with every PGP key fingerprint that has been submitted. Typically, party organizers will also explain the importance of the party, why cryptography, and other things related to crypto. Then, he will explain how the party will proceed, at which point every attendee will read their PGP key fingerprint from their own printout. Everyone else in attendance will verify that the organizer has the correct key by following along on the handout. This is done for everyone in attendance.

After fingerprints have been verified, we then get into two equal lines, facing each other. While facing the person in the line opposite of you, you introduce yourself, explain where your key is on the handout, and exchange government issued identification. After identification has been exchanged, everyone in both lines takes a half step to their right. This will put you in front of a new person to repeat the process. Those at the ends turn around, facing the opposite direction, and continue shuffling to their right. Think of the whole process as one large conveyor belt.

Once you are facing the person you started with, then everyone should have successfully verified everyone else's identity and key. At this point, the party is typically over.

What happens after the party?

This is the most critical step of the whole process, and for some reason, not everyone does it. Now that you have your handout with all the keys printed on it, you need to sign every key that you have personally identified. What you attended was a KEYSIGNING party. This means that you must SIGN KEYS. I know I'm putting a lot of emphasis on this, but I have personally attended close to a dozen PGP keysigning parties, and I would say the rate of signing keys is about 70%; if I annoy people week in and week out to sign my key, I'll get a return closer to 90%. It blows my mind that people spend a great amount of time at a PGP keysigning party, then don't actually do anything about it.

There are a lot of utilities out there for keysigning party events that make attempts at making the whole process easier. In all reality, the only "difficult" or troublesome part about it is converting the fingerprints you have on paper to your computer. Some PGP keysigning organizers will email out a party public keyring afterward, containing only the keys of those who attended. If that's the case, you have it easy. Otherwise, you must do something like the following:

$ gpg --recv-keys 0x<KEYID>

Where "<KEYID>" is the last 16 characters of the person's fingerprint. After you have their public key, then you can sign it with the following command for each key:

$ gpg --default-cert-level 3 --sign-key 0x<KEYID>

It's important that you add the "--default-cert-level 3" as part of the signing process. This certification level says that you have done very careful checking, and you are 100% confident that the key in question belongs to the owner, and that you have personally verified the owner's identity.

After you have signed the key, it is courtesy to email them their public key with your signature. As such, you will need to export their key from your public keyring. You should do this for each key:

$ gpg --armor --output /tmp/0x<KEYID>.asc --export 0x<KEYID>

Attach "/tmp/0x<KEYID>.asc" to your encrypted email, and send it off.

Additional Information

  • Should I bring a computer to the keysigning party? No. It's not necessary, and many party organizers consider it a security liability. Things like swapping around USB sticks could spread viruses or other badware. If secret keys are on the computer, it's possible they could be compromised. Even worse, the physical computer itself could get damaged. It's generally best to just leave the computer in your backpack or hotel room, and attend the party without it.
  • Why should I care about signing PGP keys? Have you ever stopped to think about certificate authorities? When you create an SSL certificate signing request, you ship the CSR off to the CA, along with $200, and they return a signed certificate. You then install that certificate on your web server, and browsers automatically trust data encrypted with it. All because you paid someone money. PGP keys do not have a centralized authority. As such, signing keys is done in an ad hoc manner. Further, money is not exchanged when signing keys. Instead, signing keys is done in person, with face-to-face contact. When you sign a key, you are ultimately saying that you have verified the owner is who they claim to be, and that the key you just signed belongs to them. As a result, a decentralized Web of Trust is built.
  • What is the Web of Trust really? The PGP Web of Trust is a decentralized web of connected keys, where the connection is made by cryptographically signing keys. The larger and more connected the Web of Trust is, the stronger the trust becomes for people to send private data to each other within that web. It sounds all geeky and mathematical, but if you just sit back and think about it, it makes sense. No money is involved, like using CAs. No blind trust is involved, such as the behavior of browsers. It's you and me, meeting face-to-face, claiming we are who we say we are, and claiming we own the key we have. Now, after this meeting, I can rest assured that if you send me cryptographically signed data from your key, I know it came from you, and no one else. If you send me encrypted data, I have a copy of your public key, and can decrypt it knowing that you were the one encrypting it. The Web of Trust is just that- establishing trust.
  • What is the PGP Strong Set? The largest and most connected Web of Trust is called the PGP Strong Set. There are more keys in this Web of Trust than any other. The way you get into the PGP Strong Set is by having someone in the Strong Set sign your key. A great deal of analysis has been done on the Strong Set. You can read about that analysis at http://pgp.cs.uu.nl/plot/. You can get key statistics such as mean signature distance (MSD), and calculate the distance from one key in the strong set to another key, such as yours that may not be in the strong set. My key, 0x8086060F is in the Strong Set. If ever I am at a keysigning party, and I sign your key, your key will also be included in the Strong Set.

ZFS Administration, Appendix C- Why You Should Use ECC RAM

Table of Contents

Zpool Administration:
  0. Install ZFS on Debian GNU/Linux
  1. VDEVs
  2. RAIDZ
  3. The ZFS Intent Log (ZIL)
  4. The Adjustable Replacement Cache (ARC)
  5. Exporting and Importing Storage Pools
  6. Scrub and Resilver
  7. Getting and Setting Properties
  8. Best Practices and Caveats

ZFS Administration:
  9. Copy-on-write
  10. Creating Filesystems
  11. Compression and Deduplication
  12. Snapshots and Clones
  13. Sending and Receiving Filesystems
  14. ZVOLs
  15. iSCSI, NFS and Samba
  16. Getting and Setting Properties
  17. Best Practices and Caveats

Appendices:
  A. Visualizing The ZFS Intent Log (ZIL)
  B. Using USB Drives
  C. Why You Should Use ECC RAM
  D. The True Cost Of Deduplication

Introduction

With the proliferation of ZFS into FreeBSD, Linux, FreeNAS, Illumos, and many other operating systems, and with the introduction of OpenZFS to unify all the projects under one collective whole, more and more people are beginning to tinker with ZFS in many different situations. Some install it on their main production servers, others install it on large back-end storage arrays, and even yet, some install it on their workstations or laptops. As ZFS grows in popularity, you'll see more and more ZFS installations on commodity hardware, rather than enterprise hardware. As such, you'll see more and more installations of ZFS on hardware that does not support ECC RAM.

The question I pose here: Is this a bad idea? If you spend some time searching the Web, you'll find posts over and over on why you should choose ECC RAM for your ZFS install, with great arguments, and for good reason too. In this post, I wish to reiterate those points, and make the case for ECC RAM. Your chain is only as strong as your weakest link, and if that link is non-ECC RAM, you lose everything ZFS developers have worked so hard to achieve on keeping your data from corruption.

Good RAM vs Bad RAM vs ECC RAM

To begin, let's make a clear distinction between "Good RAM" and "Bad RAM" and how that compares to "ECC RAM":

  • Good RAM- High quality RAM modules with a low failure rate.
  • Bad RAM- Low quality RAM modules with a high failure rate.
  • ECC RAM- RAM modules with error correcting capabilities.

"Bad RAM" isn't necessarily non-ECC RAM. I've deployed bad ECC RAM in the past, where even though they are error correcting, they fail frequently, and need to be replaced. Further, ECC RAM isn't necessarily "Good RAM". I've deployed non-ECC RAM that has been in production for years, and have yet to see a corrupted file due to not having error correction in the hardware. The point is, you can have exceptional non-ECC "Good RAM" that will never fail you, and you can have horrid ECC "Bad RAM" that still creates data corruption.

What you need to realize is the rate of failure. An ECC RAM module can fail just as frequently as a non-ECC module of the same build quality. Hopefully, the failure rate is such that ECC can fix the errors it detects, and still function without data corruption. But just to beat a dead horse, ECC and hardware failure rates are independent of each other. Just because it's ECC RAM does not mean that the hardware fails less frequently. All it means is that it detects the failures, and attempts to correct them.

ECC RAM

Failure rates are hard to get a handle on. If you read the Wikipedia article on ECC RAM, it mentions a couple studies that have been attempted to get a handle on how often bit errors occur in DIMM modules:

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10^(−10) [to] 10^(−17) error/bit-h[ours], roughly one bit error, per hour, per gigabyte of memory to one bit error, per millennium, per gigabyte of memory. A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance'09 conference. The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7 x 10^(−11) error/bit-h[ours]) (i.e. about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate), and more than 8% of DIMM memory modules affected by errors per year.

So roughly, from what Google was seeing in their datacenters, that's about 5 bit errors in 8 GB of RAM per hour at the top-end rate, with more than 8% of their DIMMs affected by errors each year. If you don't think this is significant, you're fooling yourself. Most of these bit errors are caused by background radiation affecting the installed DIMMs, due to neutrons from cosmic rays. But voltage fluctuations, bad circuitry, and just poor build quality can also come in as factors to "bit flips" in your RAM.

ECC RAM works by storing extra check bits alongside your data. On a typical ECC DIMM, every 64 bits of data are accompanied by 8 additional bits of error-correcting code, which is why ECC modules are 72 bits wide, a 9/8 ratio. So, for a DIMM module registering itself as 64 GB to the system, there is roughly 72 GB of DRAM physically installed on the module to hold those check bits. However, it's important to note that standard ECC can only correct a single flipped bit per 64-bit word. If two bits flip in the same word, ECC RAM can detect the error, but it will not be able to recover the data.
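
If you're not sure whether a box is actually running ECC RAM, dmidecode(8) can usually tell you (the output line below is only illustrative; non-ECC systems typically report "None"):

$ sudo dmidecode --type memory | grep -i 'error correction'
	Error Correction Type: Multi-bit ECC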

ZFS was designed to detect silent data errors that happen due to hardware and other factors. ZFS checksums your data from top to bottom to ensure that you do not have data corruption. If you've read this series from the beginning, you'll know how ZFS is architected, and how data integrity is first priority for ZFS. People who use ZFS use it because they cannot stand data corruption anywhere in their filesystem, at any time. However, if your RAM is not ECC RAM, then you do not have the guarantee that your file is not corrupt when stored to disk. If the file was corrupted in RAM, due to a frozen bit in RAM, then when stored to ZFS, it will get checksummed with this bad bit, as ZFS will assume the data it is receiving is good. As such, you'll have corrupted data in your ZFS dataset, and it will be checksummed corrupted, with no way to fix the error internally.

This is bad.

A scenario

To drive the point home further about ECC RAM in ZFS, let's create a scenario. Let's suppose that you are not using ECC RAM. Maybe this is installed on your workstation or laptop, because you like the ZFS userspace tools, and you like the idea behind ZFS. So, you want to use it locally. However, let's assume that you have non-ECC "Bad RAM" as defined above. For whatever reason, you have a "frozen bit" in one of your modules. The DIMM is only storing a "0" or a "1" in a specific location. Let's say it always reports a "0" due to the hardware failure, no matter what should be written there. To keep things simple, we'll look at 8 bits, or 1 byte, in our example. In the example below, the bad bit is the fifth bit from the left.

Your application wishes to write "11001011", but due to your Bad RAM, you end up with "11000011". As a result, "11000011" is sent to ZFS to be stored. ZFS adds a checksum to "11000011" and stores it in the pool. You have data corruption, and ZFS doesn't know any different. ZFS assumes that the data coming out of RAM is intentional, so parity and checksums are calculated based on that result.

But what happens when you read the data off disk and store it back in your faulty non-ECC RAM? Things get ugly at this point. So, you read back "11000011" to RAM. However, it lands in almost, but not quite, the same physical location as before it was sent to disk. Assume it is stored only 4 bits later, so the frozen cell now lines up with the first bit of the byte. Then, you get back "01000011". Not only was your file corrupt on disk, but you've made things worse by storing it back into RAM where the faulty hardware is. But, ZFS is designed to correct this, right? So, we can fix the bad bit back to "11000011", but the problem is that the data is still corrupted!

Things go downhill from here. Because this is a physical hardware failure, we actually can't set that first bit to "1". So, any attempt at doing so, will immediately revert it back to "0". So, while the data is stored in our faulty non-ECC RAM, the byte will remain as "01000011". Now, suppose we're ready to flush the data in RAM to disk, we've compounded our errors by storing "01000011" on platter. ZFS calculates a new checksum based on the newly corrupted data, again assuming our DIMM modules are telling us the truth, and we've further corrupted our data.

As you can see, the more we read and write data to and from non-ECC RAM, the more we have a chance of corrupting data on the filesystem. ZFS was designed to protect us against this, but our chain is only as strong as the weakest link, which in this case is non-ECC RAM corrupting our data.

You might think that backups can save you. If you have a non-corrupted file you can restore from, great. However, if your ZFS snapshot, or rsync(1) copied over the corrupted bit to your backups, then you're sunk. And ZFS scrubbing won't help you here either. As already mentioned, you stored corrupted data on disk correctly, meaning the checksum for the corrupted byte is correct. However, if you have additional bad bits in RAM, scrubbing will actually try to "fix" the bad bits. But, because it's a hardware failure, causing a frozen bit, the scrub will continue, and continue, thrashing your pool. So, scrubbing will actually bring performance to its knees, trying to fix that one bad bit in RAM. Further, scrubbing won't fix bad bits already stored in your pool that have been properly checksummed.

No matter how you slice and dice it, you trusted your non-ECC RAM, and your trusty RAM failed you, with no recourse to fall back on.

ECC Pricing

Due to the extra hardware on the DIMM, ECC RAM is certainly more costly than its non-ECC counterparts, but not by much. In fact, because ECC DIMMs have 9/8 the hardware, the price pretty closely reflects that: in my experience, 64 GB of ECC RAM costs roughly 9/8 as much as 64 GB of non-ECC RAM. Many general purpose motherboards will support unbuffered ECC RAM also, although you should choose a motherboard that supports active ECC scrubbing, to keep bit corruption minimized.

You can get high quality ECC DDR3 SDRAM off of Newegg for about $50 per 4 GB. Non-ECC DDR3 SDRAM retails for almost exactly the same price. To me, it just seems obvious. All you need is a motherboard supporting it, and Supermicro motherboards supporting ECC RAM can also be relatively inexpensive. I know this is subjective, but I recently built a two-node KVM hypervisor shared storage cluster with 32 GB of registered ECC RAM in each box with Tyan motherboards. The RAM was certainly more costly than anything else in each system, but I was able to get it at ~$150 per 16 GB, or $600 for all 64 GB total. The boards were ~$250 each, or $500 total for two boards. So, in total, for two very beefy servers, I spent ~$1100, minus CPU, disk, etc. To me, this is a small investment to ensure data integrity, and I would not have saved much going the non-ECC route.

The very small extra investment was very much worth it, to make sure I have data integrity from top-to-bottom.

Conclusion

ZFS was built from the ground up with parity, mirroring, checksums and other mechanisms to protect your data. If a checksum fails, ZFS can make attempts at loading good data based on redundancy in the pool, and fix the corrupted bit. But ZFS is assuming that a correct checksum means the bits were correct before the checksum was applied. This is where ECC RAM is so critical. ECC RAM can greatly reduce the risk that your bits are not correct before they get stored into the pool.

So, some lessons you should take away from this article:

  • ZFS checksums assume the data is correct from RAM.
  • Regular ZFS scrubs will greatly reduce the risk of corrupted bits, but can be your worst enemy with non-ECC RAM hardware failures.
  • Backups are only as good as the data they store. If the backup is corrupted, it's not a backup.
  • ZFS parity data, checksums, and physical data all need to match. When they don't, repairs start taking place. If the data is corrupted out of the gate, due to non-ECC RAM, why are you using ZFS again?

Thanks to "cyberjock" on the FreeBSD forums for inspiring this post.

OpenSSH Keys and The Drunken Bishop

Introduction
Have you ever wondered what the "randomart" or "visual fingerprint" is all about when creating OpenSSH keys or connecting to OpenSSH servers? Surely, you've seen them. When generating a key on OpenSSH version 5.1 or later, you will see something like this:

$ ssh-keygen -f test-rsa
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in test-rsa.
Your public key has been saved in test-rsa.pub.
The key fingerprint is:
18:ff:18:d7:f4:a6:d8:ce:dd:d4:07:0e:e2:c5:f8:45 aaron@kratos
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|      .     . E  |
|       +   = o   |
|      . S + = =  |
|         * * * ..|
|        . + + . +|
|           o . o.|
|            o . .|
+-----------------+

I'm sure you've noticed this, and probably thought, "What's the point?" or "What's the algorithm in generating the visual art?" Well, I'm going to answer those questions for you in this post.

This post is an explanation of the algorithm as explained by Dirk Loss, Tobias Limmer, and Alexander von Gernler in their PDF "The drunken bishop: An analysis of the OpenSSH fingerprint visualization algorithm". You can find their PDF at http://www.dirk-loss.de/sshvis/drunken_bishop.pdf. In the event that link is no longer available, I've archived the PDF at http://aarontoponce.org/drunken_bishop.pdf.

Motivations

Bishop Peter finds himself in the middle of an ambient atrium. There are walls on all four sides and apparently there is no exit. The floor is paved with square tiles, strictly alternating between black and white. His head heavily aching—probably from too much wine he had before—he starts wandering around randomly. Well, to be exact, he only makes diagonal steps—just like a bishop on a chess board. When he hits a wall, he moves to the side, which takes him from the black tiles to the white tiles (or vice versa). And after each move, he places a coin on the floor, to remember that he has been there before. After 64 steps, just when no coins are left, Peter suddenly wakes up. What a strange dream!

When creating OpenSSH key pairs, or when connecting to an OpenSSH server, you are presented with the fingerprint of the keypair. It may look something like this:

$ ssh example.com
The authenticity of host 'example.com (10.0.0.1)' can't be established.
RSA key fingerprint is d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9.
Are you sure you want to continue connecting (yes/no)?

At this point, as a responsible citizen of the community, you call up the system administrator of the host "example.com", and verify that the fingerprint you are being presented with is the same fingerprint he has on the server for the RSA key. If the fingerprints match, you type "yes", and continue the connection. If the fingerprints do not match, you suspect a man-in-the-middle attack, and type "no". If the server is a server under your control, then rather than calling up the system administrator for that domain, you physically go to the box, pull up a console, and print the server's RSA fingerprint.

In either case, verifying a 32-character hexadecimal string is cumbersome. If we could have a better visual on the fingerprint, it might be easier to verify that we've connected to the right server. This is where the "randomart" comes from. Now, when connecting to the server, I can be presented with something like this:

The authenticity of host 'example.com (10.0.0.1)' can't be established.
RSA key fingerprint is d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9.
+--[ RSA 2048]----+
|             o . |
|         . .o.o .|
|        . o .+.. |
|       .   ...=o+|
|        S  . .+B+|
|            oo+o.|
|           o  o. |
|          .      |
|           E     |
+-----------------+
Are you sure you want to continue connecting (yes/no)?

Because I have a visual representation of the server's fingerprint, it will be easier for me to verify that I am connecting to the correct server. Further, after connecting to the server many times, the visual fingerprint will become familiar. So, upon connection, when the visual fingerprint is displayed, I can think "yes, that is the same picture I always see, this must be my server". If a man-in-the-middle attack is in progress, a different visual fingerprint will probably be displayed, at which point I can avoid connecting, because I have noticed that the picture changed.

The picture is created by applying an algorithm to the fingerprint, such that different fingerprints should display different pictures. Turns out, there can be some visual collisions that I'll lightly address at the end of this post. However, this visual display should work in "most cases", and cause you to start verifying fingerprints of OpenSSH keys.

The Board
Because the bishop finds himself in a room with no exits, only walls, we need to create a room on the terminal that looks visually square. This is done by creating a room with 9 rows and 17 columns, for a total of 153 squares the bishop can travel. The bishop must start in the exact center of the room, which is why the number of rows and columns is odd.

Our board setup then looks like this, where "S" is the starting location of the drunk bishop:

             1111111
   01234567890123456
  +-----------------+x (column)
 0|                 |
 1|                 |
 2|                 |
 3|                 |
 4|        S        |
 5|                 |
 6|                 |
 7|                 |
 8|                 |
  +-----------------+
  y
(row)

Each square on the board can be thought of as a numerical position derived from its Cartesian coordinates. As mentioned, there are 153 squares on the board, so each square gets a numerical value through the equation "p = x + 17y". So, p=0 for (0,0); p=76 for (8,4), the starting location of our bishop; and p=152 for (16,8), the lower right-hand corner of the board. Having a unique numerical value for each position on the board will allow us to do some simple math when the bishop begins his random walk.
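
To make the coordinate math concrete, here's a quick Python sketch of that equation (my own illustration; the coordinates are the ones from the board above):

def position(x, y):
    """Convert (column, row) coordinates into a board position: p = x + 17y."""
    return x + 17 * y

print(position(0, 0))    # 0, the north-west corner
print(position(8, 4))    # 76, the bishop's starting square
print(position(16, 8))   # 152, the south-east corner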

The Movement
In order to define movement, we need to understand the fingerprint that is produced from an OpenSSH key. An OpenSSH fingerprint is an MD5 checksum. As such, it has a 16-byte output. An example fingerprint could be "d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9".

Because the bishop can only move one of four valid ways, we can represent this in binary.

  • "00" means our bishop takes one move diagonally to the north-west.
  • "01" means our bishop takes one move diagonally to the north-east.
  • "10" means our bishop takes one move diagonally to the south-west.
  • "11" means our bishop takes one move diagonally to the south-east.

With the bishop in the center of the room, his first move will take him off square 76. Because a diagonal step changes x by ±1 and y by ±1, it changes the position p by ±17 ±1. So, after his first move, his new position will be one of the following:

  • "00" will place him on square 58, a difference of -18.
  • "01" will place him on square 60, a difference of -16.
  • "10" will place him on square 92, a difference of +16.
  • "11" will place him on square 94, a difference of +18.

We must now convert our hexadecimal string to binary, so we can begin making movements based on our key. Our key:

d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9

would be converted to

11010100:11010011:11111101:11001010:...snip...:00111011:11100100:10111010:11101001

When reading the binary, we read each 8-bit word from left to right, but we read the bit-pairs within each word from right to left (little-endian). Thus our bishop's first 16 moves would be:

00 01 01 11 11 00 01 11 01 11 11 11 10 10 00 11

Or, you could think of it in terms of steps, if looking at the binary directly:

4,3,2,1:8,7,6,5:12,11,10,9:16,15,14,13:...snip...:52,51,50,49:56,55,54,53:60,59,58,57:64,63,62,61
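
To make the bit fiddling concrete, here is a short Python sketch (my own illustration, not the OpenSSH source) that turns the example fingerprint into its sequence of 2-bit moves:

fingerprint = "d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9"

moves = []
for byte in (int(octet, 16) for octet in fingerprint.split(":")):
    # Read the four bit-pairs of each byte from least significant to most
    for _ in range(4):
        moves.append(byte & 0b11)   # 0=NW, 1=NE, 2=SW, 3=SE
        byte >>= 2

print(moves[:16])   # the bishop's first 16 moves, as listed above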

Board Coverage
All is well and good if our drunk bishop remains in the center of the room, but what happens when he slams into the wall, or walks himself into a corner? We need to take into account these situations, and how to handle them in our algorithm. First, let us define every square on the board:

+-----------------+
|aTTTTTTTTTTTTTTTb|   a = NW corner
|LMMMMMMMMMMMMMMMR|   b = NE corner
|LMMMMMMMMMMMMMMMR|   c = SW corner
|LMMMMMMMMMMMMMMMR|   d = SE corner
|LMMMMMMMMMMMMMMMR|   T = Top edge
|LMMMMMMMMMMMMMMMR|   B = Bottom edge
|LMMMMMMMMMMMMMMMR|   R = Right edge
|LMMMMMMMMMMMMMMMR|   L = Left edge
|cBBBBBBBBBBBBBBBd|   M = Middle pos.
+-----------------+

Now, let us define every move for every square on the board:

Pos  Bits  Heading  Adjusted  Offset
 a    00     NW     No Move      0
      01     NE     E           +1
      10     SW     S          +17
      11     SE     SE         +18
 b    00     NW     W           -1
      01     NE     No Move      0
      10     SW     SW         +16
      11     SE     S          +17
 c    00     NW     N          -17
      01     NE     NE         -16
      10     SW     No Move      0
      11     SE     E           +1
 d    00     NW     NW         -18
      01     NE     N          -17
      10     SW     W           -1
      11     SE     No Move      0
 T    00     NW     W           -1
      01     NE     E           +1
      10     SW     SW         +16
      11     SE     SE         +18
 B    00     NW     NW         -18
      01     NE     NE         -16
      10     SW     W           -1
      11     SE     E           +1
 R    00     NW     NW         -18
      01     NE     N          -17
      10     SW     SW         +16
      11     SE     S          +17
 L    00     NW     N          -17
      01     NE     NE         -16
      10     SW     S          +17
      11     SE     SE         +18
 M    00     NW     NW         -18
      01     NE     NE         -16
      10     SW     SW         +16
      11     SE     SE         +18

How much of the board will our bishop walk? With a 16-byte fingerprint, each byte provides four 2-bit moves, so the bishop makes 64 moves in total. The most board he could cover is if every square he lands on is visited only once: 64 moves plus the starting square is 65 squares, and 65/153 ~= 42.48%, which is less than half of the board.
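
The movement table above is equivalent to taking the diagonal step and then clamping the bishop's coordinates to the walls of the room. Here is a minimal Python sketch of that step rule, using the (x, y) coordinates defined earlier (again, an illustration rather than the OpenSSH implementation):

def step(x, y, move):
    """Apply one 2-bit move, keeping the bishop inside the 17x9 room."""
    dx = 1 if move & 1 else -1     # moves 01 and 11 head east; 00 and 10 head west
    dy = 1 if move & 2 else -1     # moves 10 and 11 head south; 00 and 01 head north
    x = min(max(x + dx, 0), 16)    # clamp to the left and right walls
    y = min(max(y + dy, 0), 8)     # clamp to the top and bottom walls
    return x, y

For example, from the north-east corner (16, 0), a "00" move lands on (15, 0), matching the "W, -1" entry for position "b" in the table above.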

Position Values
Remember that our bishop is making a random walk around the room, dropping coins on every square he's visited. If he's visited a square in the room more than once, we need a way to represent that in the art. As such, we will use a different ASCII character as the count increases.

Unfortunately, in my opinion, the OpenSSH developers picked the wrong characters. They mention in their PDF that the intention of the characters they picked was to increase the density of the character as the visitation count of a square increases. Personally, I don't think these developers have spent much time working on ASCII art. Had I been on the development team, I would have picked a different set. However, here is the set they picked for each count:

Freq  Char
0     (blank)
1     .
2     o
3     +
4     =
5     *
6     B
7     O
8     X
9     @
10    %
11    &
12    #
13    /
14    ^
      S
      E

The special characters "S" and "E" are to identify the starting and ending location of the bishop respectively.
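
Putting all of the pieces together, here is a short Python sketch that walks the bishop, counts the coins, and renders the board with the character set above. This is my own illustration of the algorithm, not the OpenSSH source, so details such as the header line are approximations:

def randomart(fingerprint, title="RSA 2048"):
    chars = " .o+=*BOX@%&#/^"               # coin counts 0 through 14
    counts = [[0] * 17 for _ in range(9)]   # the 9x17 room
    x, y = 8, 4                             # start in the center, square 76
    for byte in (int(octet, 16) for octet in fingerprint.split(":")):
        for _ in range(4):                  # four 2-bit moves per byte
            move = byte & 0b11
            byte >>= 2
            x = min(max(x + (1 if move & 1 else -1), 0), 16)
            y = min(max(y + (1 if move & 2 else -1), 0), 8)
            counts[y][x] = min(counts[y][x] + 1, 14)
    field = [[chars[c] for c in row] for row in counts]
    field[4][8] = "S"                       # mark the starting square...
    field[y][x] = "E"                       # ...and the bishop's final square
    lines = ["+--[ %s]----+" % title]
    lines += ["|" + "".join(row) + "|" for row in field]
    lines.append("+" + "-" * 17 + "+")
    return "\n".join(lines)

print(randomart("d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9"))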

Additional Thoughts
Now that you know how the random art is generated for a given key, you can begin to ask yourself some questions:

  • When addressing picture collisions, how many fingerprints produce the same picture (same values for all positions)?
  • How many fingerprints produce the same shape (same visited squares with different values)?
  • How many different visualizations can the algorithm produce?
  • Can a fingerprint be derived by looking only at the random art?
  • How many different visualizations can a person easily distinguish?
  • What happens to the visualizations when changing the board size, either smaller or larger?
  • What visualizations can be produced with different chess pieces? How would the rules change?
  • What visualizations can be produced if the board were a torus (like Pac-man)?
  • Could this be extended to other fingerprints, such as OpenPGP keys? If so, could it simplify verifying keys at keysigning parties (and as a result, speed them up)?

Conclusion
Even though this post discussed the algorithm for generating the random art, it did not address the security model of those visualizations. There are many questions that need answers. Obviously, there are collisions. So, how many collisions are there? Can the number of visual collisions be predicted from the cryptographic hash and the size of the board? Does this weaken security when verifying OpenSSH keys, and if so, how?

These questions, and many others, should be addressed. But, for the time being, it seems to be working "well enough", and most people using OpenSSH probably only ever see the visualization when creating their own private and public key pair. You can enable it for every OpenSSH server you connect to, by setting "VisualHostKey yes" in your ssh_config(5).

In the meantime, I'll be working on a Python implementation for this on OpenPGP keys. It will use a larger board (11x19), and a different set of characters for the output, but the algorithm will remain the same. I'm interested to see if this can improve verifying people have the right OpenPGP key, by just checking the ASCII art, rather than reading out a 20-byte random string using the NATO alphabet. Keep an eye on https://github.com/atoponce/scripts/blob/master/art.py.

Strengthen Your Private Encrypted SSH Keys

Recently, on Hacker News, a post came through about improving the security of your encrypted private OpenSSH keys. I want to re-blog that post here (I'm actually jealous he blogged it first), in my own words, and provide a script at the end that will automate the process for you.

First off, Martin goes into great detail about the storage format of your unencrypted private OpenSSH keys. The unencrypted key is stored in a format known as Abstract Syntax Notation One (ASN.1) (for you web nerds, it's similar in function to JSON). However, when you encrypt the key with your passphrase, it is no longer valid ASN.1. So, Martin then takes you through the process of how the key is encrypted. The big take-away from that introduction is that, by default:

  • Encrypted OpenSSH keys use MD5- a horribly broken cryptographic hash.
  • OpenSSH keys are encrypted with AES-128-CBC, which is fast, fast, fast.

It would be nice if our OpenSSH keys used a stronger cryptographic hash like SHA1, SHA2 or SHA3 in the encryption process, rather than MD5. Further, it would be nice if we could cause attackers who get our private encrypted OpenSSH keys to expend more computing resources when trying to brute force our passphrase. So, rather than using the speedy AES algorithm, how about 3DES or Blowfish?
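
You can see these defaults for yourself: a traditionally-encrypted private key advertises its cipher in the PEM headers at the top of the file. A quick Python peek (just a sketch; point it at your own key):

# An encrypted private key in the traditional PEM format carries a
# "Proc-Type: 4,ENCRYPTED" header and a "DEK-Info:" header naming the
# cipher, e.g. "DEK-Info: AES-128-CBC,<random IV in hex>".
with open("/home/user/.ssh/id_rsa") as key:     # hypothetical path; use your own
    for line in key:
        if line.startswith(("Proc-Type:", "DEK-Info:")):
            print(line.rstrip())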

This is where PKCS#8 comes into play. "PKCS" stands for "Public-Key Cryptography Standards". There are currently 15 standards, with 2 withdrawn and 2 under development. Standard #8 defines how private key certificates are to be handled, both in unencrypted and encrypted form. Because OpenSSH uses public key cryptography, and stores private keys, it would be nice if it adhered to the standard. Turns out, it does. From the ssh-keygen(1) man page:

     -m key_format
             Specify a key format for the -i (import) or -e (export) conver‐
             sion options.  The supported key formats are: “RFC4716” (RFC
             4716/SSH2 public or private key), “PKCS8” (PEM PKCS8 public key)
             or “PEM” (PEM public key).  The default conversion format is
             “RFC4716”.

As mentioned, the supported key formats are RFC4716, PKCS8 and PEM. Seeing as though PKCS#8 is supported, it seems like we can take advantage of it in OpenSSH. So, the question then comes, what does PKCS#8 offer me in terms of security that I don't already have? Well, Martin answers this question in his post as well. Turns out, there are 2 versions of PKCS#8 that we need to address:

  • The version 1 option specifies a PKCS#5 v1.5 or PKCS#12 algorithm to use. These algorithms only offer 56-bits of protection, since they both use DES.
  • The version 2 option specifies that PKCS#5 v2.0 algorithms are used which can use any encryption algorithm such as 168 bit triple DES or 128 bit RC2.

As I mentioned earlier, we want SHA1 (or better) and 3DES (or slower). Turns out, the OpenSSL implementation of PKCS#8 version 2 uses the following algorithms:

  • PBE-SHA1-RC4-128
  • PBE-SHA1-RC4-40
  • PBE-SHA1-3DES
  • PBE-SHA1-2DES
  • PBE-SHA1-RC2-128
  • PBE-SHA1-RC2-40

PBE-SHA1-3DES is our target. So, the only question remaining, is can we convert our private OpenSSH keys to this format? If so, how? Well, because OpenSSH relies heavily on OpenSSL, we can use the openssl(1) utility to make the conversion to the new format, and due to the ssh-keygen(1) manpage quoted above, we know OpenSSH supports the PKCS#8 format for our private keys, so we should be good.

Before we go further though, why 3DES? Why not stick with the default AES? DES is slow, slow, slow. 3DES is DES chained together 3 times. Compared to AES, it's a snail racing a hare. With 3DES, the data is encrypted with a first 56-bit DES key, then encrypted with a second 56-bit DES key, and finally encrypted with a third 56-bit DES key. The result is a 168-bit key, although meet-in-the-middle attacks reduce the effective security to about 112 bits. There are no known practical attacks against 3DES, and NIST considers it secure through 2030. It's certainly appropriate to use for encrypted storage of our private OpenSSH keys.

To convert our private key, all we need to do is rename it, run openssl(1) on the keys, then test. Here are the steps:

$ mv ~/.ssh/id_rsa{,.old}
$ umask 0077
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa     # dsa and ecdsa are also supported

Now login to a remote OpenSSH server where the public portion of that key is installed, and see if it works. If so, remove the old key. To simplify the process, I created a script where you provide your private OpenSSH key as an argument, and it does the conversion for you. You can find that script at https://github.com/atoponce/scripts/blob/master/ssh-to-pkcs8.zsh

What's the point? Basically, you should think of it the following way:

  • We're using SHA1 rather than MD5 as part of the encryption process.
  • By using 3DES rather than AES, we've slowed down brute force attacks to a crawl. This should buy us 2-3 extra characters of entropy in our passphrase.
  • Using PKCS#8 gives us the flexibility to use other algorithms in the future, as old ones are replaced.

I agree with Martin that it's a shame OpenSSH isn't using this by default. Why stick with the original OpenSSH storage format? Compatibility isn't a concern, as the support relies solely on the client, not the server. Because every client should have a different keypair installed, there is no worry about new versus old client. Extra security is purchased through the use of SHA1 and 3DES. Computing time to create the keys was trivial, and the performance difference when using them is not noticeable compared to the traditional format. Of course, if your passphrase protecting your keys is strong, with lots and lots of entropy, then an attacker will be foiled with a brute force attack anyway. Regardless, why not make it more difficult for him by slowing him down?

Martin's post is a great read, and as such, I've converted my OpenSSH keys to the new format. I'd encourage you to do the same.

ZFS Administration, Part XV- iSCSI, NFS and Samba

Table of Contents

Zpool Administration ZFS Administration Appendices
0. Install ZFS on Debian GNU/Linux 9. Copy-on-write A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs 10. Creating Filesystems B. Using USB Drives
2. RAIDZ 11. Compression and Deduplication C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL) 12. Snapshots and Clones D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC) 13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools 14. ZVOLs
6. Scrub and Resilver 15. iSCSI, NFS and Samba
7. Getting and Setting Properties 16. Getting and Setting Properties
8. Best Practices and Caveats 17. Best Practices and Caveats

I spent the previous week celebrating the Christmas holiday with family and friends, and as a result, took a break from blogging. However, other than the New Year, I'm finished with holidays for a while, and eager to get back to blogging and finishing off this series. Only a handful of posts left to go. So, let's continue our discussion of ZFS administration on GNU/Linux by discussing sharing datasets.

Disclaimer

I have been trying to keep these ZFS posts as operating system agnostic as possible. Even though they have had a slant towards Linux kernels, you should be able to take much of this to BSD or any of the Solaris derivatives, such as OpenIndiana or Nexenta. This post, however, is going to be Linux-kernel specific, and even Ubuntu and Debian specific at that. The reason is that iSCSI support is not compiled into ZFS on Linux as of the writing of this post, but sharing via NFS and SMB is. Further, the implementation details for sharing via NFS and SMB will be specific to Debian and Ubuntu in this post. So, you may need to make adjustments if using Fedora, openSUSE, Gentoo, Arch, et cetera.

Motivation

You are probably asking why you would want to use ZFS specific sharing of datasets rather than using the "tried and true" methods with standard software. The reason is simple. When the system boots up, and goes through its service initialization process (typically by executing the shell scripts found in /etc/init.d/), it has a method to the madness on which service starts first. Loosely speaking, filesystems are mounted first, networking is enabled, then services are started at last. Some of these are tied together, such as NFS exports, which requires the filesystem to be mounted, a firewall in place, networking started, and the NFS daemon running. But, what happens when the filesystem is not mounted? If the directory is still accessible, it will be exported via NFS, and applications could begin dumping data into the export. This could lead to all sorts of issues, such as data inconsistencies. As such, administrators have put checks into place, such as exporting only nested directories in the mount point, which would not be available if the filesystem fails to mount. These are clever hacks, but certainly not elegant.

Tying the export directly into the filesystem solves this beautifully, which is exactly what ZFS does. In the case of ZFS, you can share a specific dataset via NFS, for example. However, if the dataset does not mount, then the export will not be available to the application, and the NFS client will block. Because the network share is inherent to the filesystem, there is no concern for data inconsistencies, and no need for silly check hacks or scripts. As a result, ZFS from Oracle has the ability to share a dataset via NFS, SMB (CIFS or Samba) and iSCSI. ZFS on Linux only supports NFS and SMB currently, with iSCSI support on the way.

In each case, you still must install the necessary daemon software to make the share available. For example, if you wish to share a dataset via NFS, then you need to install the NFS server software, and it must be running. Then, all you need to do is flip the sharing NFS switch on the dataset, and it will be immediately available.

Sharing via NFS

To share a dataset via NFS, you first need to make sure the NFS daemon is running. On Debian and Ubuntu, this is the "nfs-kernel-server" package. Further, with Debian and Ubuntu, the NFS daemon will not start unless there is an export in the /etc/exports file. So, you have two options: you can create a dummy export, only available to localhost, or you can edit the init script to start without checking for a current export. I prefer the former. Let's get that setup:

$ sudo aptitude install -R nfs-kernel-server
$ echo '/mnt localhost(ro)' >> /etc/exports
$ sudo /etc/init.d/nfs-kernel-server start
$ showmount -e hostname.example.com
Export list for hostname.example.com:
/mnt localhost

With our NFS daemon running, we can now start sharing ZFS datasets. I'll assume already that you have created your dataset, it's mounted, and you're ready to start committing data to it. You'll notice in the zfs(8) manpage, that for the "sharenfs" property, it can be "on", "off" or "opts", where "opts" are valid NFS export options. So, if I wanted to share my "pool/srv" dataset, which is mounted to "/srv" to the 10.80.86.0/24 network, I could do something like:

# zfs set sharenfs="rw=@10.80.86.0/24" pool/srv
# zfs share pool/srv
# showmount -e hostname.example.com
Export list for hostname.example.com:
/srv 10.80.86.0/24
/mnt localhost

If you want your ZFS datasets to be shared on boot, then you need to install the /etc/default/zfs config file. If using the Ubuntu PPA, this will be installed by default for you. If compiling from source, this will not be provided. Here are the contents of that file. I've added emphasis to the two lines that should be modified for persistence across boots, if you want to enable sharing via NFS. Default is 'no':

$ cat /etc/default/zfs
# /etc/default/zfs
#
# Instead of changing these default ZFS options, Debian systems should install
# the zfs-mount package, and Ubuntu systems should install the zfs-mountall
# package. The debian-zfs and ubuntu-zfs metapackages ensure a correct system
# configuration.
#
# If the system runs parallel init jobs, like upstart or systemd, then the
# `zfs mount -a` command races in a way that causes sporadic mount failures.

# Automatically run `zfs mount -a` at system start. Disabled by default.
ZFS_MOUNT='yes'
ZFS_UNMOUNT='no'

# Automatically run `zfs share -a` at system start. Disabled by default.
# Requires nfsd and/or smbd. Incompletely implemented for Linux.
ZFS_SHARE='yes'
ZFS_UNSHARE='no'

As mentioned in the comments, running a parallel init system creates problems for ZFS. This is something I recently banged my head against, as my /var/log/ and /var/cache/ datasets were not mounting on boot. To fix the problem, and run a serialized boot, thus ensuring that everything gets executed in the proper order, you need to touch a file:

# touch /etc/init.d/.legacy-bootordering

This will add time to your bootup, but given the fact that my system is up months at a time, I'm not worried about the extra 5 seconds this puts on my boot. This is documented in the /etc/init.d/rc script, setting the "CONCURRENCY=none" variable.

You should now be able to mount the NFS export from an NFS client:

(client)# mount -t nfs hostname.example.com:/srv /mnt

Sharing via SMB

Currently, SMB integration is not working 100%. See bug #1170 I reported on Github. However, when things get working, this will likely be the way.

As with NFS, to share a ZFS dataset via SMB/CIFS, you need to have the daemon installed and running. Recently, the Samba development team released Samba version 4. This release gives the Free Software world a Free Software implementation of Active Directory running on GNU/Linux systems, SMB 2.1 file sharing support, clustered file servers, and much more. Currently, Debian testing has the beta 2 packages. Debian experimental has the stable release, and it may make its way up the chain for the next stable release. One can hope. Samba v4 is not needed to share ZFS datasets via SMB/CIFS, but it's worth mentioning. We'll stick with version 3 of the Samba packages, until version 4 stabilizes.

# aptitude install -R samba samba-client samba-doc samba-tools samba-doc-pdf
# ps -ef | grep smb
root     22413     1  0 09:05 ?        00:00:00 /usr/sbin/smbd -D
root     22423 22413  0 09:05 ?        00:00:00 /usr/sbin/smbd -D
root     22451 21308  0 09:06 pts/1    00:00:00 grep smb

At this point, all we need to do is share the dataset, and verify that it's been shared. It is also worth noting that Microsoft Windows machines are not case sensitive, unlike Unix. As such, if you are in a heterogeneous environment, it may be worth disabling case sensitivity on the ZFS dataset. This value can only be set at creation time. So, you may wish to issue the following when creating that dataset:

# zfs create -o casesensitivity=mixed pool/srv

Now you can continue with configuring the rest of the dataset:

# zfs set sharesmb=on pool/srv
# zfs share pool/srv
# smbclient -U guest -N -L localhost
Domain=[WORKGROUP] OS=[Unix] Server=[Samba 3.6.6]

        Sharename       Type      Comment
        ---------       ----      -------
        print$          Disk      Printer Drivers
        sysvol          Disk      
        netlogon        Disk      
        IPC$            IPC       IPC Service (eightyeight server)
        Canon-imageRunner-3300 Printer   Canon imageRunner 3300
        HP-Color-LaserJet-3600 Printer   HP Color LaserJet 3600
        salesprinter    Printer   Canon ImageClass MF7460
        pool_srv        Disk      Comment: /srv
Domain=[WORKGROUP] OS=[Unix] Server=[Samba 3.6.6]

        Server               Comment
        ---------            -------
        EIGHTYEIGHT          eightyeight server

        Workgroup            Master
        ---------            -------
        WORKGROUP            EIGHTYEIGHT

You can see that in this environment (my workstation hostname is 'eightyeight'), there are some printers being shared, and a couple disks. I've emphasized the disk that we are sharing in my output, to verify that it is working correctly. So, we should be able to mount that share as a CIFS mount, and access the data:

# aptitude install -R cifs-utils
# mount -t cifs -o username=USERNAME //localhost/srv /mnt
# ls /mnt
foo

Sharing via iSCSI

Unfortunately, sharing ZFS datasets via iSCSI is not yet supported with ZFS on Linux. However, it is available in the Illumos source code upstream, and work is being done to get it working in GNU/Linux. As with SMB and NFS, you will need the iSCSI daemon installed and running. When support is enabled, I'll finish writing up this post on demonstrating how you can access iSCSI targets that are ZFS datasets. In the meantime, you would do something like the following:

# aptitude install -R open-iscsi
# zfs set shareiscsi=on pool/srv

At which point, from the iSCSI client, you would access the target, format it, mount it, and start working with the data.

ZFS Administration, Part XIV- ZVOLS

Table of Contents

Zpool Administration ZFS Administration Appendices
0. Install ZFS on Debian GNU/Linux 9. Copy-on-write A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs 10. Creating Filesystems B. Using USB Drives
2. RAIDZ 11. Compression and Deduplication C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL) 12. Snapshots and Clones D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC) 13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools 14. ZVOLs
6. Scrub and Resilver 15. iSCSI, NFS and Samba
7. Getting and Setting Properties 16. Getting and Setting Properties
8. Best Practices and Caveats 17. Best Practices and Caveats

What is a ZVOL?

A ZVOL is a "ZFS volume" that has been exported to the system as a block device. So far, when dealing with the ZFS filesystem, other than creating our pool, we haven't dealt with block devices at all, even when mounting the datasets. It's almost like ZFS is behaving like a userspace application more than a filesystem. I mean, on GNU/Linux, when working with filesystems, you're constantly working with block devices, whether they be full disks, partitions, RAID arrays or logical volumes. Yet somehow, we've managed to escape all that with ZFS. Well, not any longer. Now we get our hands dirty with ZVOLs.

A ZVOL is a ZFS block device that resides in your storage pool. This means that the single block device gets to take advantage of your underlying RAID array, such as mirrors or RAID-Z. It gets to take advantage of the copy-on-write benefits, such as snapshots. It gets to take advantage of online scrubbing, compression and data deduplication. It gets to take advantage of the ZIL and ARC. Because it's a legitimate block device, you can do some very interesting things with your ZVOL. We'll look at three of them here- swap, ext4, and VM storage. First, we need to learn how to create a ZVOL.

Creating a ZVOL

To create a ZVOL, we use the "-V" switch with our "zfs create" command, and give it a size. For example, if I wanted to create a 1 GB ZVOL, I could issue the following command. Notice further that a couple of new symlinks now exist in /dev/zvol/tank/ and /dev/tank/, which point to a new block device in /dev/:

# zfs create -V 1G tank/disk1
# ls -l /dev/zvol/tank/disk1
lrwxrwxrwx 1 root root 11 Dec 20 22:10 /dev/zvol/tank/disk1 -> ../../zd144
# ls -l /dev/tank/disk1
lrwxrwxrwx 1 root root 8 Dec 20 22:10 /dev/tank/disk1 -> ../zd144

Because this is a full fledged, 100% bona fide block device that is 1 GB in size, we can do anything with it that we would do with any other block device, and we get all the benefits of ZFS underneath. Plus, creating a ZVOL is near instantaneous, regardless of size. Now, I could create a block device with GNU/Linux from a file on the filesystem. For example, if running ext4, I can create a 1 GB file, then make a block device out of it as follows:

# fallocate -l 1G /tmp/file.img
# losetup /dev/loop0 /tmp/file.img

I now have the block device /dev/loop0 that represents my 1 GB file. Just as with any other block device, I can format it, add it to swap, etc. But it's not as elegant, and it has severe limitations. First off, by default you only have 8 loopback devices for your exported block devices. You can change this number, however. With ZFS, you can create 2^64 ZVOLs by default. Also, it requires a preallocated image, on top of your filesystem. So, you are managing three layers of data: the block device, the file, and the blocks on the filesystem. With ZVOLs, the block device is exported right off the storage pool, just like any other dataset.

Let's look at some things we can do with this ZVOL.

Swap on a ZVOL

Personally, I'm not a big fan of swap. I understand that it's a physical extension of RAM, but swap is only used when RAM fills, spilling the cache. If this is happening regularly and consistently, then you should probably look into getting more RAM. It can act as part of a healthy system, keeping RAM dedicated to what the kernel actively needs. But, when active RAM starts spilling over to swap, then you have "the swap of death", as your disks thrash, trying to keep up with the demands of the kernel. So, depending on your system and needs, you may or may not need swap.

First, let's create a 1 GB block device for our swap. We'll call the dataset "tank/swap" to make it easy to identify its intention. Before we begin, let's check out how much swap we currently have on our system with the "free" command:

# free
             total       used       free     shared    buffers     cached
Mem:      12327288    8637124    3690164          0     175264    1276812
-/+ buffers/cache:    7185048    5142240
Swap:            0          0          0

In this case, we do not have any swap enabled. So, let's create 1 GB of swap on a ZVOL, and add it to the kernel:

# zfs create -V 1G tank/swap
# mkswap /dev/zvol/tank/swap
# swapon /dev/zvol/tank/swap
# free
             total       used       free     shared    buffers     cached
Mem:      12327288    8667492    3659796          0     175268    1276804
-/+ buffers/cache:    7215420    5111868
Swap:      1048572          0    1048572

It worked! We have a legitimate Linux kernel swap device on top of ZFS. Sweet. As is typical with swap devices, they don't have a mountpoint. They are either enabled, or disabled, and this swap device is no different.

Ext4 on a ZVOL

This may sound wacky, but you could put another filesystem on top of a ZVOL, and mount it. In other words, you could have an ext4-formatted ZVOL mounted at /mnt. You could even partition your ZVOL, and put multiple filesystems on it. Let's do that!

# zfs create -V 100G tank/ext4
# fdisk /dev/tank/ext4
( follow the prompts to create 2 partitions- the first 1 GB in size, the second to fill the rest )
# fdisk -l /dev/tank/ext4

Disk /dev/tank/ext4: 107.4 GB, 107374182400 bytes
16 heads, 63 sectors/track, 208050 cylinders, total 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disk identifier: 0x000a0d54

          Device Boot      Start         End      Blocks   Id  System
/dev/tank/ext4p1            2048     2099199     1048576   83  Linux
/dev/tank/ext4p2         2099200   209715199   103808000   83  Linux

Let's create some filesystems, and mount them:

# mkfs.ext4 /dev/zd0p1
# mkfs.ext4 /dev/zd0p2
# mkdir /mnt/zd0p{1,2}
# mount /dev/zd0p1 /mnt/zd0p1
# mount /dev/zd0p2 /mnt/zd0p2

Enable compression on the ZVOL, copy over some data, then take a snapshot:

# zfs set compression=lzjb tank/ext4
# tar -cf /mnt/zd0p1/files.tar /etc/
# tar -cf /mnt/zd0p2/files.tar /etc /var/log/
# zfs snapshot tank/ext4@001

You probably didn't notice, but you just enabled transparent compression and took a snapshot of your ext4 filesystem. These are two things you can't do with ext4 natively. You also have all the other benefits of ZFS that ext4 normally couldn't give you. So now you can regularly snapshot your data, perform online scrubs, and send it offsite for backup. Most importantly, your data is consistent.

ZVOL storage for VMs

Lastly, you can use these block devices as the backend storage for VMs. It's not uncommon to create logical volume block devices as the backend for VM storage. After having the block device available for Qemu, you attach the block device to the virtual machine, and from its perspective, you have a "/dev/vda" or "/dev/sda" depending on the setup.

If using libvirt, you would have a /etc/libvirt/qemu/vm.xml file. In that file, you could have the following, where "/dev/zd0" is the ZVOL block device:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/zd0'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

At this point, your VM gets all the ZFS benefits underneath, such as snapshots, compression, deduplication, data integrity, drive redundancy, etc.

Conclusion

ZVOLs are a great way to get to block devices quickly while taking advantage of all of the underlying ZFS features. Using the ZVOLs as the VM backing storage is especially attractive. However, I should note that when using ZVOLs, you cannot replicate them across a cluster. ZFS is not a clustered filesystem. If you want data replication across a cluster, then you should not use ZVOLs, and use file images for your VM backing storage instead. Other than that, you get all of the amazing benefits of ZFS that we have been blogging about up to this point, and beyond, for whatever data resides on your ZVOL.

ZFS Administration, Part X- Creating Filesystems

Table of Contents

Zpool Administration ZFS Administration Appendices
0. Install ZFS on Debian GNU/Linux 9. Copy-on-write A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs 10. Creating Filesystems B. Using USB Drives
2. RAIDZ 11. Compression and Deduplication C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL) 12. Snapshots and Clones D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC) 13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools 14. ZVOLs
6. Scrub and Resilver 15. iSCSI, NFS and Samba
7. Getting and Setting Properties 16. Getting and Setting Properties
8. Best Practices and Caveats 17. Best Practices and Caveats

We now begin down the path that is the "bread and butter" of ZFS, known as "ZFS datasets", or filesystems. Up to this point, we've been discussing how to manage our storage pools. But storage pools are not meant to store data directly. Instead, we should create filesystems that share the same storage pool. From now on we'll refer to these filesystems as datasets.

Background

First, we need to understand how traditional filesystems and volume management work in GNU/Linux before we can get a thorough understanding of ZFS datasets. To treat this fairly, we need to assemble Linux software RAID, LVM, and ext4 or another Linux kernel supported filesystem together.

This is done by creating a redundant array of disks, and exporting a block device to represent that array. Then, we initialize that exported block device as an LVM physical volume. If we have multiple RAID arrays, we do the same for each of them. We then add all of these block devices to a "volume group", which represents my pooled storage. If I had five exported RAID arrays of 1 TB each, then I would have 5 TB of pooled storage in this volume group. Now, I need to decide how to divide up the volume group, to create logical volumes of a specific size. If this was for an Ubuntu or Debian installation, maybe I would give 100 GB to one logical volume for the root filesystem. That 100 GB is now marked as occupied by the volume group. I then give 500 GB to my home directory, and so forth. Each operation exports a block device representing my logical volume. It's these block devices that I format with ext4, or a filesystem of my choosing.

Linux RAID, LVM, and filesystem stack. Each filesystem is limited in size.

In this scenario, each logical volume is a fixed size in the volume group. It cannot address the full pool. So, when formatting the logical volume block device, the filesystem is a fixed size. When that device fills, you must resize the logical volume and the filesystem together. This typically requires a myriad of commands, and it's tricky to get just right without losing data.

ZFS handles filesystems a bit differently. First, there is no need to create this stacked approach to storage. We've already covered how to pool the storage; now we will cover how to use it. This is done by creating a dataset in the pool. By default, this dataset will have full access to the entire storage pool. If our storage pool is 5 TB in size, as previously mentioned, then our first dataset will have access to all 5 TB in the pool. If I create a second dataset, it too will have full access to all 5 TB in the pool. And so on and so forth.

Each ZFS dataset can use the full underlying storage.

Now, as files are placed in a dataset, the pool marks that storage as unavailable to all datasets. This means that each dataset is aware of how much space is available in the pool, and how much has been consumed by all of the other datasets. There is no need to create logical volumes of limited size. Each dataset will continue to place files in the pool, until the pool is filled. As the cards fall, they fall. You can, of course, put quotas on datasets, limiting their size, or export ZVOLs, topics we'll cover later.

So, let's create some datasets.

Basic Creation

In these examples, we will assume our ZFS shared storage is named "tank". Further, we will assume that the pool is created with 4 preallocated files of 1 GB in size each, in a RAIDZ-1 array. Let's create some datasets.

# zfs create tank/test
# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank         175K  2.92G  43.4K  /tank
tank/test   41.9K  2.92G  41.9K  /tank/test

Notice that the dataset "tank/test" is mounted to "/tank/test" by default, and that it has full access to the entire pool. Also notice that it is occupying only 41.9 KB of the pool. Let's create 4 more datasets, then look at the output:

# zfs create tank/test2
# zfs create tank/test3
# zfs create tank/test4
# zfs create tank/test5
# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank         392K  2.92G  47.9K  /tank
tank/test   41.9K  2.92G  41.9K  /tank/test
tank/test2  41.9K  2.92G  41.9K  /tank/test2
tank/test3  41.9K  2.92G  41.9K  /tank/test3
tank/test4  41.9K  2.92G  41.9K  /tank/test4
tank/test5  41.9K  2.92G  41.9K  /tank/test5

Each dataset is automatically mounted to its respective mount point, and each dataset has full unfettered access to the storage pool. Let's fill up some data in one of the datasets, and see how that affects the underlying storage:

# cd /tank/test3
# for i in {1..10}; do dd if=/dev/urandom of=file$i.img bs=1024 count=$RANDOM &> /dev/null; done
# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank         159M  2.77G  49.4K  /tank
tank/test   41.9K  2.77G  41.9K  /tank/test
tank/test2  41.9K  2.77G  41.9K  /tank/test2
tank/test3   158M  2.77G   158M  /tank/test3
tank/test4  41.9K  2.77G  41.9K  /tank/test4
tank/test5  41.9K  2.77G  41.9K  /tank/test5

Notice that in my case, "tank/test3" is occupying 158 MB of disk, so according to the rest of the datasets, there is only 2.77 GB available in the pool, where previously there was 2.92 GB. So as you can see, the big advantage here is that I do not need to worry about preallocated block devices, as I would with LVM. Instead, ZFS manages the entire stack, so it understands how much data has been occupied, and how much is available.

Mounting Datasets

It's important to understand that when creating datasets, you aren't creating exportable block devices by default. This means you don't have something to mount directly. As a result, there is nothing to add to your /etc/fstab file for persistence across reboots.

So, if there is nothing to add to the /etc/fstab file, how do the filesystems get mounted? This is done by importing the pool, if necessary, then running the "zfs mount" command. Similarly, we have a "zfs unmount" command to unmount datasets, or we can use the standard "umount" utility:

# umount /tank/test5
# mount | grep tank
tank/test on /tank/test type zfs (rw,relatime,xattr)
tank/test2 on /tank/test2 type zfs (rw,relatime,xattr)
tank/test3 on /tank/test3 type zfs (rw,relatime,xattr)
tank/test4 on /tank/test4 type zfs (rw,relatime,xattr)
# zfs mount tank/test5
# mount | grep tank
tank/test on /tank/test type zfs (rw,relatime,xattr)
tank/test2 on /tank/test2 type zfs (rw,relatime,xattr)
tank/test3 on /tank/test3 type zfs (rw,relatime,xattr)
tank/test4 on /tank/test4 type zfs (rw,relatime,xattr)
tank/test5 on /tank/test5 type zfs (rw,relatime,xattr)

By default, the mount point for the dataset is "/<pool-name>/<dataset-name>". This can be changed, by changing the dataset property. Just as storage pools have properties that can be tuned, so do datasets. We'll dedicate a full post to dataset properties later. We only need to change the "mountpoint" property, as follows:

# zfs set mountpoint=/mnt/test tank/test
# mount | grep tank
tank on /tank type zfs (rw,relatime,xattr)
tank/test2 on /tank/test2 type zfs (rw,relatime,xattr)
tank/test3 on /tank/test3 type zfs (rw,relatime,xattr)
tank/test4 on /tank/test4 type zfs (rw,relatime,xattr)
tank/test5 on /tank/test5 type zfs (rw,relatime,xattr)
tank/test on /mnt/test type zfs (rw,relatime,xattr)

Nested Datasets

Datasets don't need to be isolated. You can create nested datasets within each other. This allows you to create namespaces, tuning a nested directory structure without affecting the others. For example, maybe you want compression on /var/log, but not on the parent /var. There are other benefits as well, with some caveats that we will look at later.

To create a nested dataset, create it like you would any other, by providing the parent storage pool and dataset. In this case we will create a nested log dataset in the test dataset:

# zfs create tank/test/log
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank            159M  2.77G  47.9K  /tank
tank/test      85.3K  2.77G  43.4K  /mnt/test
tank/test/log  41.9K  2.77G  41.9K  /mnt/test/log
tank/test2     41.9K  2.77G  41.9K  /tank/test2
tank/test3      158M  2.77G   158M  /tank/test3
tank/test4     41.9K  2.77G  41.9K  /tank/test4
tank/test5     41.9K  2.77G  41.9K  /tank/test5

Additional Dataset Administration

Along with creating datasets, when you no longer need them, you can destroy them. This frees up the blocks for use by other datasets, and cannot be reverted without a previous snapshot, which we'll cover later. To destroy a dataset:

# zfs destroy tank/test5
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank            159M  2.77G  49.4K  /tank
tank/test      41.9K  2.77G  41.9K  /mnt/test
tank/test/log  41.9K  2.77G  41.9K  /mnt/test/log
tank/test2     41.9K  2.77G  41.9K  /tank/test2
tank/test3      158M  2.77G   158M  /tank/test3
tank/test4     41.9K  2.77G  41.9K  /tank/test4

We can also rename a dataset if needed. This is handy when the purpose of the dataset changes, and you want the name to reflect that purpose. The command takes the existing dataset as its first argument and the new name as its second. To rename the tank/test3 dataset to tank/music:

# zfs rename tank/test3 tank/music
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank            159M  2.77G  49.4K  /tank
tank/music      158M  2.77G   158M  /tank/music
tank/test      41.9K  2.77G  41.9K  /mnt/test
tank/test/log  41.9K  2.77G  41.9K  /mnt/test/log
tank/test2     41.9K  2.77G  41.9K  /tank/test2
tank/test4     41.9K  2.77G  41.9K  /tank/test4

Conclusion

This will get you started with understanding ZFS datasets. There are many more subcommands with the "zfs" command that are available, with a number of different switches. Check the manpage for the full listing. However, even though this isn't a deeply thorough examination of datasets, many more principles and concepts will surface as we work through the series. By the end, you should be familiar enough with datasets that you will be able to manage your entire storage infrastructure with minimal effort.

ZFS Administration, Part IX- Copy-on-write

Table of Contents

Zpool Administration ZFS Administration Appendices
0. Install ZFS on Debian GNU/Linux 9. Copy-on-write A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs 10. Creating Filesystems B. Using USB Drives
2. RAIDZ 11. Compression and Deduplication C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL) 12. Snapshots and Clones D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC) 13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools 14. ZVOLs
6. Scrub and Resilver 15. iSCSI, NFS and Samba
7. Getting and Setting Properties 16. Getting and Setting Properties
8. Best Practices and Caveats 17. Best Practices and Caveats

Before we can get into the practical administration of ZFS datasets, we need to understand exactly how ZFS is storing our data. So, this post will be theoretical, and cover a couple of concepts that you will want to understand, namely the Merkle tree and copy-on-write. I'll try to keep it at a high level that is easy to understand, without getting into C code, system calls, and memory allocations.

Merkle Trees

Merkle trees are nothing more than cryptographic hash trees, invented by Ralph Merkle. In order to understand a Merkle tree, we will start from the bottom, and work our way up the tree. Suppose you have 4 data blocks: block 0, block 1, block 2 and block 3. Each block is cryptographically hashed, and its hash is stored in a node, or "hash block". Each data block has a one-to-one relationship with its hash block. Let's assume SHA-256 is the cryptographic hashing algorithm used. Suppose further that our four blocks hash to the following:

  • Block 0- 888b19a43b151683c87895f6211d9f8640f97bdc8ef32f03dbe057c8f5e56d32 (hash block 0-0)
  • Block 1- 4fac6dbe26e823ed6edf999c63fab3507119cf3cbfb56036511aa62e258c35b4 (hash block 0-1)
  • Block 2- 446e21f212ab200933c4c9a0802e1ff0c410bbd75fca10168746fc49883096db (hash block 1-0)
  • Block 3- 0591b59c1bdd9acd2847a202ddd02c3f14f9b5a049a5707c3279c1e967745ed4 (hash block 1-1)

The reason we cryptographically hash each block is to ensure data integrity. If the block changes, whether intentionally or through corruption, then its SHA-256 hash also changes. Thus, we can hash the block with the SHA-256 algorithm, and check whether the result matches the hash stored in its parent hash block. If it does, then we can be certain with a high degree of probability that the block was not corrupted. However, if the hash does not match, then under the same premise, the block is likely corrupted.

Hash trees are typically binary trees (one node has at most 2 children), although there is no requirement to make them so. Suppose in our example, our hash tree is a binary tree. In this case, hash blocks 0-0 and 0-1 would have a single parent, hash block 0. And hash blocks 1-0 and 1-1 would have a single parent, hash block 1 (as shown below). Hash block 0 is a SHA-256 hash of concatenating hash block 0-0 and hash block 0-1 together. Similarly for hash block 1. Thus, we would have the following output:

  • Hash block 0- 8a127ef29e3eb8079aca9aa5fc0649e60edcd0a609dd0285d1f8b7ad9e49c74d
  • Hash block 1- 69f1a2768dd44d11700ef08c1e4ece72c3b56382f678e6f20a1fe0f8783b12cf

We continue in a like manner, climbing the Merkle tree until we get to the super hash block, super node or uber block. This is the parent node for the entire tree, and it is nothing more than a SHA-256 hash of its concatenated children. Thus, for our uber block, we would have the following output by concatenating hash blocks 0 and 1 together:

  • Uber block- 6b6fb7c2a8b73d24989e0f14ee9cf2706b4f72a24f240f4076f234fa361db084

This uber block is responsible for verifying the integrity of the entire Merkle tree. If a data block changes, all the parent hash blocks should change as well, including the uber block. If at any point, a hash does not match its corresponding children, we have inconsistencies in the tree or data corruption.
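
To make the construction concrete, here is a small Python sketch of the same four-block hash tree. This is only an illustration of the idea; ZFS stores its checksums in block pointers alongside plenty of other metadata that this ignores:

import hashlib

def sha256(data):
    return hashlib.sha256(data).hexdigest()

# Four data blocks at the bottom of the tree
blocks = [b"block 0", b"block 1", b"block 2", b"block 3"]

# Leaf nodes: one hash block per data block
hash_0_0, hash_0_1, hash_1_0, hash_1_1 = (sha256(b) for b in blocks)

# Interior nodes: the hash of the concatenated child hashes
hash_0 = sha256((hash_0_0 + hash_0_1).encode())
hash_1 = sha256((hash_1_0 + hash_1_1).encode())

# The uber block: the hash of its concatenated children, covering the whole tree
uber = sha256((hash_0 + hash_1).encode())

# Verifying block 2: recompute its hash and walk up the tree to the uber block
assert sha256(blocks[2]) == hash_1_0
assert sha256((hash_1_0 + hash_1_1).encode()) == hash_1
assert sha256((hash_0 + hash_1).encode()) == uber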

Image showing our binary hash tree example

ZFS uses a Merkle tree to verify the integrity of the entire filesystem and all of the data stored in it. When you scrub your storage pool, ZFS is verifying every SHA-256 hash in the Merkle tree to make sure there is no corrupted data. If there is redundancy in the storage pool, and a corrupted data block is found, ZFS will look elsewhere in the pool for a good data block at the same location by using these hashes. If one is found, it will use that block to fix the corrupted one, then reverify the SHA-256 hash in the Merkle tree.

Copy-on-write (COW)

Copy-on-write (COW) is a data storage technique in which you make a copy of the data block that is going to be modified, rather than modify the data block directly. You then update your pointers to look at the new block location rather than the old, and the old block is marked free, so the filesystem can reuse it later. Thus, you don't use any more disk space than if you were to modify the original block. However, you do severely fragment the underlying data. But the COW model of data storage opens up new features for our data that were previously either impossible or very difficult to implement.

The biggest feature of COW is taking snapshots of your data. Because we've made a copy of the block elsewhere on the filesystem, the old block still remains, even though it's been marked as free by the filesystem. With COW, the filesystem works its way slowly toward the end of the disk, so it may take a long time before the old freed-up blocks are rewritten. If a snapshot has been taken, it's treated as a first-class filesystem. If a block is overwritten after it has been snapshotted, the old block is not actually reclaimed; it remains referenced by the snapshot. This is possible because the snapshot is a copy of the hash tree at that exact moment. As such, snapshots are super cheap, and unless the snapshotted blocks are overwritten, they take up barely any space.
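
As a toy illustration of the idea (this is not how ZFS lays anything out on disk), think of the filesystem as a table of pointers into a pile of blocks; a snapshot is nothing more than a saved copy of that pointer table:

blocks = {}        # block address -> contents
next_addr = 0

def write_block(data):
    """A COW write: always allocate a new block, never modify one in place."""
    global next_addr
    addr = next_addr
    blocks[addr] = data
    next_addr += 1
    return addr

live = {"fileA": write_block("version 1")}   # the live pointer table
snapshot = dict(live)                        # a snapshot: a copy of the pointers only

live["fileA"] = write_block("version 2")     # an update allocates a new block

print(blocks[live["fileA"]])      # version 2
print(blocks[snapshot["fileA"]])  # version 1, the old block is still reachable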

In the following image, when a data block is updated, the hash tree must also be updated. All hashes starting with the child block, and all its parent nodes must be updated with the new hash. The yellow blocks are a copy of the data, elsewhere on the filesystem, and the parent hash nodes are updated to point to the new block locations.

Image showing the COW model when a data block is created, and the hash tree is updated. Courtesy of dest-unreachable.net.

As mentioned, COW heavily fragments the disk. This can have massive performance impacts. So, some work needs to be taken to allocate blocks in advance to minimize fragmentation. There are two basic approaches: using a b-tree to pre-allocate extents, or using a slab approach, marking slabs of disk for the copy. ZFS uses the slab approach, where Btrfs uses the b-tree approach.

Normally, filesystems write data in 4 KB blocks. ZFS writes data in 128 KB blocks (its default recordsize), which reduces fragmentation by a factor of 32. Second, the slab allocator will allocate a slab, then chop up the slab into 128 KB blocks. Third, ZFS batches its writes, syncing data to disk every 5 seconds, with any remaining data flushed after 30 seconds. That means a lot of data is flushed to disk at once, into the same slab, which greatly increases the probability that similar data will end up in the same slab. So, in practice, even though COW fragments the filesystem, there are a few things we can do to greatly minimize that fragmentation.

Not only does ZFS use the COW model; so do Btrfs, NILFS, WAFL and the new Microsoft filesystem ReFS. Many virtualization technologies use COW for storing VM images, such as Qemu. COW filesystems are the future of data storage. We'll discuss snapshots in much more intimate detail in a later post, and how the COW model plays into that.