
Identification Versus Authentication

Recently, Apple announced and released the iPhone 5S. Part of the hardware specifications on the phone is a new fingerprint scanner, coupled with their TouchID software. Immediately upon the announcement, I wondered how they would utilize the fingerprint. It is unfortunate, but not surprising, that they are using your fingerprint incorrectly.

To understand how, we first need to understand the difference between "identification" and "authentication". Your fingerprint should be used as an identifying token, not an authenticating one. Unfortunately, most fingerprint scanner vendors don't follow this advice. In other words, when you scan your fingerprint, the software should identify you from a list of users. After identifying who you are, you then provide a token to authenticate that you are indeed that person. This is generally how usernames and passwords work: you provide a username to the login form to claim that you are a specific person, then you provide a password or some other token to prove that claim. Your fingerprint should be used as the identifying token, like the username in a login form, rather than as the authenticating token, like the password.

Why? Here are some concerns with using fingerprints as authentication tokens:

  • Fingerprints can't be changed easily. Once someone has compromised your account by lifting your print off of a surface, you can't just "change your fingerprint".
  • Fingerprints are easy low-hanging fruit for Big Brother. If you are ever forced to turn over your authentication tokens, it's much easier for Big Brother to take your fingerprint than it is to extract a long password from your head.
  • Fingerprint scanners are easily fooled with lifted prints, so they provide very little security. Further, your fingerprints are everywhere, especially on your phone. If your iPhone 5S is lost or stolen, the bad guys now have your fingerprints.

To illustrate how easy that last bullet point is, the Chaos Computer Club posted a YouTube video on breaking the TouchID software with little difficulty. And they're hardly the first. Over, and over, and over again, fingerprint scanners are quickly broken. While the tech is certainly cool, it's hardly secure.

While I like to throw jabs and punches at Apple, Inc., I expected much more from them. This seems like such a n00b mistake, it's almost hard to take seriously. A fingerprint scanner on a phone would make sense where multiple users can use the device independently of each other, such as on Android 4.2, where multiuser support was added. Scanning your finger would identify you to the device, which would then present a password, pattern, or PIN entry dialog, asking you to authenticate. That's appropriate use of a fingerprint scanner.

RIP Microsoft Tag

So yesterday, Microsoft announced, apparently on Facebook, that it is ending support for Tag, its proprietary barcode format. To me, this doesn't come as a major surprise. Here's why:

  1. Tag arrived very late in the game, after QR Codes were pretty much establishing themselves as "the norm".
  2. Tag is a proprietary barcode. No way around it.
  3. Tag is a specific implementation of HCCB, and the specification for HCCB was never released to developers.
  4. Tag requires a data connection to retrieve the data out of the barcode.
  5. Tag requires a Microsoft account to create Tags.
  6. Tag requires a commercial non-free license for using Tags in a commercial space, or for selling products as a result of scanning the Tag.
  7. Tags can only be scanned using Microsoft Tag readers.
  8. Generating a Tag barcode means accepting its Terms of Service. A barcode has a TOS.

Just over a year ago, I gave credit where credit is due. Microsoft had something awesome with HCCB. It's technically superior to most 2D barcodes in many ways. But Microsoft never made the specification for creating HCCB codes available. In fact, I even had an email discussion with a Microsoft engineer regarding HCCB and Tag:

From: Aaron Toponce
To: Microsoft Tag Support
Subject: HCCB Specification

I'm familiar with Microsoft Tag, an HCCB implementation, but I'm interested in the specifications for HCCB. Specifically, I am interested in generating HCCB codes that are not Microsoft Tag. I have an account on http://tag.microsoft.com, but don't see any way to generate HCCB codes outside of Microsoft Tags.

Any help would be great.

--------------------
From: Microsoft Tag Support
To: Aaron Toponce
Subject: Re: HCCB Specification

Hello Aaron,

Thank you for contacting Microsoft Tag Support. Could you please elaborate the issue as we are not able to understand the below request. High Capacity Color Barcode (HCCB) is the name coined by Microsoft for its technology of encoding data in a 2D barcode using clusters of colored triangles instead of the square pixels. Microsoft Tag is an implementation of HCCB using 4 colors in a 5 x 10 grid. Apart from Tag barcode you can also create QR code and NFC URL using Microsoft Tag Manager.

For more information http://tag.microsoft.com/what-is-tag/home.aspx

Attached is the document on specification of Tag, hope this will help you.

If you have any questions, comments or suggestions, please let us know.

Thank you for your feedback and your interest in Microsoft Tag.

--------------------
From: Aaron Toponce
To: Microsoft Tag Support
Subject: Re: HCCB Specification

I'm not interested in Tag. I'm only interested in the High Capacity Color Barcode specification, that Tag uses. For example, how do I create offline HCCB codes? How do I implement digital signatures? How do I implement error correction? I would be interested in developing a Python library to create and decode HCCB codes.

I am already familiar with HCCB, what it is, how it works, and some of the features. I want to develop libraries for creating and decoding them. But I can't seem to find any APIs, libraries or documentation in encoding and decoding HCCB.

I never got a followup reply. Basically, I'm allowed to create Tags, but not HCCB codes. I'm sure I'm not the only developer denied access to HCCB. I've thoroughly scanned the Internet looking for anything regarding building HCCB barcodes. It just doesn't exist. Plenty of sites detailing how to create a Microsoft account to create Tags, but nothing for standard offline HCCB codes.

So, citing the reasons above, it's no surprise to me that Tag failed. Maybe Microsoft will finally release the HCCB specification to the market, so people can create their own offline HCCB codes and develop apps for scanning, encoding and decoding them. Time will tell, and I'm not losing any sleep over it.

RIP Tag.

The NSA and Numbers Stations- An Historical Perspective

With all the latest news about PRISM and the United States government violating citizens' 4th Amendment rights, I figured I would throw in a blog post about it. However, I'm not going to add anything really new about how to subvert warrantless government spying. Instead, I figured I would offer an historical perspective on how some avoid being spied on.

The One-time Pad

In order to understand this post, we first need to understand the one-time pad, or OTP for short. The OTP is a mathematically unbreakable encryption algorithm, which uses a unique, random key for every message sent. The OTP must be the same length as the message being sent, or longer. The plaintext is XOR'd with the OTP to create the ciphertext. The recipient on the other end has a copy of the OTP, which is used to XOR the ciphertext and get back the original plaintext. The system is extremely elegant, but it's not without its flaws.
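To make the XOR step concrete, here is a toy illustration in Python (nothing you should use for real traffic; the pad here is generated on the spot from the operating system's randomness source):

import os

message = b"ATTACK AT DAWN"
pad = os.urandom(len(message))          # the one-time pad: random, at least as long as the message

ciphertext = bytes(m ^ k for m, k in zip(message, pad))
recovered = bytes(c ^ k for c, k in zip(ciphertext, pad))

print(ciphertext.hex())                 # gibberish, different every run
print(recovered)                        # b'ATTACK AT DAWN'

XOR is its own inverse, which is why the same pad both encrypts and decrypts.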

First, the OTP must be communicated securely to the recipient. One argument against the OTP is that if you can communicate your key securely, then why not just communicate the message in that manner? That's a fine question, except it misses one critical point: more than one OTP can be handed over at the first meeting. The recipient might have 20 or 50 OTPs in their possession, and know the order in which to use them.

Second, if the same OTP key is used for two or more messages, and those messages are intercepted, the ciphertexts can be XOR'd together to cancel out the key and recover the plaintexts. It is exceptionally critical that every message be encrypted with its own unique, random OTP. This is not trivial.

One major advantage of the OTP is the lack of incriminating evidence. OTPs have been found on rice paper, bars of soap, and microfilm, or hidden in plain sight, such as in the words of a book or a crossword puzzle. Once the key has been used, it can be destroyed with minimal effort. Burning rice paper or shuffling a deck of cards is much easier than securely destroying data on a computer.

Field Agents

Enter spies and field agents. Suppose a government wishes to communicate with a field agent in a remote country. The message they wish to send is "ATTACK AT DAWN". How do you get this message delivered to your agent securely and anonymously? More importantly, how can your field agent intercept the message without raising suspicion, or without any incriminating evidence against them?

This turns out to be a difficult problem to solve. If you meet at a specific location at a specific time, how do you communicate it without raising suspicion? Maybe you mail a package or envelope to your agent, but then how do you know it won't be intercepted and examined? Many totalitarian states, such as North Korea, examine all inbound and outbound mail.

Numbers Stations

Enter radio. First, in developed countries, just about everyone owns a radio. You can purchase them just about everywhere, and carrying one around, or having one in your room, is not incriminating enough to convict you as a spy. Second, your field agent already has a set of OTPs on hand. So, even if the encrypted message is intercepted over the air, it doesn't matter; only the agent holding the pad can decrypt it.

So, roughly around the time of World War 2, governments started communicating with field agents over the radio. Now, this can neither be confirmed nor denied, but numbers stations have been on the air for decades. Numbers stations are illegal transmissions, usually on the edges of shortwave bands. Normally this sort of thing is referred to as "pirate radio", and governments are very effective at finding it and shutting it down. Most numbers stations keep very rigid schedules; so rigid, you could set your watch to them. If they were not transmitted by government agencies, they would be shut down fast. The length of time they've been on the air, the sheer number of them, and their rigid schedules all suggest that government agencies are the best bet for the source of the transmissions.

So, what does a numbers station sound like? Typically, most of them have some sort of "header" transmission, before getting into the "body" of the encrypted text. This header could be a series of digits repeated over and over, a musical melody, a sequence of tones, or nothing. Then the body is delivered. Typically, it's given in sets of 5 numbers, which is common in cryptography circles. Something like "51237 65500 81734", etc. The transmissions are usually short, roughly 3-5 minutes in length. Some transmissions will end with a "footer", like "000 000" or "end transmission" for the agent to identify the transmission is over. There is never any sort of station identification. They are one-way anonymous transmissions. Almost always, the voice reading the numbers is computer generated. They can be transmitted in many different languages: Spanish, English, German, Chinese, etc. And if that's not enough, some are verbally spoken, some in Morse code, some digital.

Want to hear what one sounds like? Here is a transmission from the "Lincolnshire Poacher" (it has its own Wikipedia page). Some numbers stations have been given names by the enthusiasts who listen to and record them frequently. This one is named after an English folk song, because the song is played as the header to every transmission. The station wasn't actually located in England, however; it was stationed in Cyprus.

Don't think that sounds eerie enough? There is a German numbers station called the "Swedish Rhapsody", which starts by ringing church bells as the header. Then a female child's voice reads the numbers. You'd swear it could be something out of a horror movie.

Not all stations stay on the air, either. Many disappear over time, some quickly, some after many years. The Lincolnshire Poacher was on the air for about 20 years before it went silent. Numbers stations also don't always have rigid schedules. Some appear seemingly out of nowhere, and never come back on the air. And because these are shortwave transmissions, they can travel hundreds or thousands of miles, so your field agent could literally be almost anywhere in the world. So long as he has his radio with him, a decent antenna, and a clear sky overhead, he'll pick it up.

The NSA

So, where does that bring us? Well, with the NSA spying on us, numbers stations sound like an attractive alternative to phone and email conversations. But, as already mentioned, numbers stations are illegal, especially in the United States. So, it doesn't seem like such an attractive alternative after all, even if they are still on the air.

However, the OTP can be an effective and practical way to send messages securely. I mentioned almost a year ago a way to create a USB hard drive with an OTP on the drive. Both the sender and the recipient have an exact copy of the drive, along with a software utility for encrypting and decrypting the data, as well as destroying the bits used for the OTP. Once the bits on the drives are all used up, the sender and the recipient meet to rebuild the OTP on the drive.
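The utility itself isn't reproduced here, but a minimal sketch of the idea might look like the following. The file name and the "destroy used bits" strategy are my own illustration; a real tool would also need to worry about secure deletion on flash media, and about both parties consuming the pad in exactly the same order.

import os

def consume_pad(pad_path, length):
    """Take 'length' key bytes from the front of the pad file, then
    remove them so they can never be reused."""
    with open(pad_path, "r+b") as f:
        key = f.read(length)
        if len(key) < length:
            raise ValueError("one-time pad exhausted")
        rest = f.read()          # the unused remainder of the pad
        f.seek(0)
        f.write(rest)            # slide the remainder to the front
        f.truncate(len(rest))    # drop the used portion off the end
    return key

def crypt(data, pad_path="/media/otp/pad.bin"):
    """Encrypt or decrypt: XOR the data with the next unused pad bytes."""
    key = consume_pad(pad_path, len(data))
    return bytes(d ^ k for d, k in zip(data, key))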

OpenSSH Keys and The Drunken Bishop

Introduction
Have you ever wondered what the "randomart" or "visual fingerprint" is all about when creating OpenSSH keys or connecting to OpenSSH servers? Surely, you've seen them. When generating a key on OpenSSH version 5.1 or later, you will see something like this:

$ ssh-keygen -f test-rsa
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in test-rsa.
Your public key has been saved in test-rsa.pub.
The key fingerprint is:
18:ff:18:d7:f4:a6:d8:ce:dd:d4:07:0e:e2:c5:f8:45 aaron@kratos
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|      .     . E  |
|       +   = o   |
|      . S + = =  |
|         * * * ..|
|        . + + . +|
|           o . o.|
|            o . .|
+-----------------+

I'm sure you've noticed this, and probably thought, "What's the point?" or "What's the algorithm in generating the visual art?" Well, I'm going to answer those questions for you in this post.

This post is an explanation of the algorithm as explained by Dirk Loss, Tobias Limmer, and Alexander von Gernler in their PDF "The drunken bishop: An analysis of the OpenSSH fingerprint visualization algorithm". You can find their PDF at http://www.dirk-loss.de/sshvis/drunken_bishop.pdf. In the event that link is no longer available, I've archived the PDF at http://aarontoponce.org/drunken_bishop.pdf.

Motivations

Bishop Peter finds himself in the middle of an ambient atrium. There are walls on all four sides and apparently there is no exit. The floor is paved with square tiles, strictly alternating between black and white. His head heavily aching—probably from too much wine he had before—he starts wandering around randomly. Well, to be exact, he only makes diagonal steps—just like a bishop on a chess board. When he hits a wall, he moves to the side, which takes him from the black tiles to the white tiles (or vice versa). And after each move, he places a coin on the floor, to remember that he has been there before. After 64 steps, just when no coins are left, Peter suddenly wakes up. What a strange dream!

When creating OpenSSH key pairs, or when connecting to an OpenSSH server, you are presented with the fingerprint of the keypair. It may look something like this:

$ ssh example.com
The authenticity of host 'example.com (10.0.0.1)' can't be established.
RSA key fingerprint is d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9.
Are you sure you want to continue connecting (yes/no)?

At this point, as a responsible citizen of the community, you call up the system administrator of the host "example.com", and verify that the fingerprint you are being presented with is the same fingerprint he has on the server for the RSA key. If the fingerprints match, you type "yes", and continue the connection. If the fingerprints do not match, you suspect a man-in-the-middle attack, and type "no". If the server is under your own control, then rather than calling up the system administrator for that domain, you physically go to the box, pull up a console, and print the server's RSA fingerprint.

In either case, verifying a 32-character hexadecimal string is cumbersome. If we could have a better visual on the fingerprint, it might be easier to verify that we've connected to the right server. This is where the "randomart" comes from. Now, when connecting to the server, I can be presented with something like this:

The authenticity of host 'example.com (10.0.0.1)' can't be established.
RSA key fingerprint is d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9.
+--[ RSA 2048]----+
|             o . |
|         . .o.o .|
|        . o .+.. |
|       .   ...=o+|
|        S  . .+B+|
|            oo+o.|
|           o  o. |
|          .      |
|           E     |
+-----------------+
Are you sure you want to continue connecting (yes/no)?

Because I have a visual representation of the server's fingerprint, it will be easier for me to verify that I am connecting to the correct server. Further, after connecting to the server many times, the visual fingerprint will become familiar. So, upon connection, when the visual fingerprint is displayed, I can think "yes, that is the same picture I always see, this must be my server". If a man-in-the-middle attack is in progress, a different visual fingerprint will probably be displayed, at which point I can avoid connecting, because I have noticed that the picture changed.

The picture is created by applying an algorithm to the fingerprint, such that different fingerprints should display different pictures. It turns out there can be some visual collisions, which I'll lightly address at the end of this post. However, this visual display should work in "most cases", and hopefully it will encourage you to start verifying the fingerprints of OpenSSH keys.

The Board
Because the bishop finds himself in a room with no exits, only walls, we need to create a visually square room on the terminal. This is done by creating a room with 9 rows and 17 columns, for a total of 153 squares the bishop can travel. The bishop must start in the exact center of the room, thus the reason for odd-numbered rows and columns.

Our board setup then looks like this, where "S" is the starting location of the drunk bishop:

             1111111
   01234567890123456
  +-----------------+x (column)
 0|                 |
 1|                 |
 2|                 |
 3|                 |
 4|        S        |
 5|                 |
 6|                 |
 7|                 |
 8|                 |
  +-----------------+
  y
(row)

Each square on the board can be thought of as a numerical position, derived from its Cartesian coordinates. As mentioned, there are 153 squares on the board, so each square gets a numerical value through the equation "p = x + 17y". So, p=0 for (0,0); p=76 for (8,4), the starting location of our bishop; and p=152 for (16,8), the lower right-hand corner of the board. Having a unique numerical value for each position on the board will allow us to do some simple math when the bishop begins his random walk.
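A quick sanity check of that mapping in Python (a tiny illustrative helper, not anything from OpenSSH):

def position(x, y):
    """Map board coordinates to a square number using p = x + 17y."""
    return x + 17 * y

print(position(0, 0))     # 0, the upper left-hand corner
print(position(8, 4))     # 76, the bishop's starting square
print(position(16, 8))    # 152, the lower right-hand corner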

The Movement
In order to define movement, we need to understand the fingerprint that is produced from an OpenSSH key. An OpenSSH fingerprint is an MD5 checksum. As such, it has a 16-byte output. An example fingerprint could be "d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9".

Because the bishop can only move one of four valid ways, we can represent this in binary.

  • "00" means our bishop takes one move diagonally to the north-west.
  • "01" means our bishop takes one move diagonally to the north-east.
  • "10" means our bishop takes one move diagonally to the south-west.
  • "11" means our bishop takes one move diagonally to the south-east.

With the bishop in the center of the room, his first move will take him off square 76. After his first move, his new position will be as follows:

  • "00" will place him on square 58, a difference of -18.
  • "01" will place him on square 60, a difference of -16.
  • "10" will place him on square 92, a difference of +16.
  • "11" will place him on square 94, a difference of +18.

We must now convert our hexadecimal string to binary, so we can begin making movements based on our key. Our key:

d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9

would be converted to

11010100:11010011:11111101:11001010:...snip...:00111011:11100100:10111010:11101001

When reading the binary, we read each binary word (8-bits) from left-to-right, but we read each bit-pair in each word right-to-left (little endian). Thus our bishop's first 16 moves would be:

00 01 01 11 11 00 01 11 01 11 11 11 10 10 00 11

Or, you could think of it in terms of steps, if looking at the binary directly:

4,3,2,1:8,7,6,5:12,11,10,9:16,15,14,13:...snip...:52,51,50,49:56,55,54,53:60,59,58,57:64,63,62,61
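Here is a small Python sketch of that parsing order, as I read it from the paper: bytes are consumed left-to-right, and the bit-pairs within each byte are consumed from the least significant pair up.

def bit_pairs(fingerprint):
    """Parse an MD5 fingerprint into 2-bit moves: bytes left-to-right,
    bit-pairs within each byte read from the least significant end."""
    for byte in bytes.fromhex(fingerprint.replace(":", "")):
        for shift in (0, 2, 4, 6):
            yield format((byte >> shift) & 0x3, "02b")

fp = "d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9"
print(" ".join(list(bit_pairs(fp))[:16]))
# 00 01 01 11 11 00 01 11 01 11 11 11 10 10 00 11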

Board Coverage
All is well and good if our drunk bishop remains in the center of the room, but what happens when he slams into a wall, or walks himself into a corner? We need to take these situations into account, and handle them in our algorithm. First, let us define every square on the board:

+-----------------+
|aTTTTTTTTTTTTTTTb|   a = NW corner
|LMMMMMMMMMMMMMMMR|   b = NE corner
|LMMMMMMMMMMMMMMMR|   c = SW corner
|LMMMMMMMMMMMMMMMR|   d = SE corner
|LMMMMMMMMMMMMMMMR|   T = Top edge
|LMMMMMMMMMMMMMMMR|   B = Bottom edge
|LMMMMMMMMMMMMMMMR|   R = Right edge
|LMMMMMMMMMMMMMMMR|   L = Left edge
|cBBBBBBBBBBBBBBBd|   M = Middle pos.
+-----------------+

Now, let us define every move for every square on the board:

Pos  Bits  Heading  Adjusted  Offset
a    00    NW       No Move     0
a    01    NE       E          +1
a    10    SW       S         +17
a    11    SE       SE        +18
b    00    NW       W          -1
b    01    NE       No Move     0
b    10    SW       SW        +16
b    11    SE       S         +17
c    00    NW       N         -17
c    01    NE       NE        -16
c    10    SW       No Move     0
c    11    SE       E          +1
d    00    NW       NW        -18
d    01    NE       N         -17
d    10    SW       W          -1
d    11    SE       No Move     0
T    00    NW       W          -1
T    01    NE       E          +1
T    10    SW       SW        +16
T    11    SE       SE        +18
B    00    NW       NW        -18
B    01    NE       NE        -16
B    10    SW       W          -1
B    11    SE       E          +1
R    00    NW       NW        -18
R    01    NE       N         -17
R    10    SW       SW        +16
R    11    SE       S         +17
L    00    NW       N         -17
L    01    NE       NE        -16
L    10    SW       S         +17
L    11    SE       SE        +18
M    00    NW       NW        -18
M    01    NE       NE        -16
M    10    SW       SW        +16
M    11    SE       SE        +18

How much of the board will our bishop walk? With our fingerprints having a 16-byte output, there are 64 total moves for the bishop to make. The most board he could cover is his starting square plus 64 newly visited squares, if no square is ever revisited. Thus 65/153 ~= 42.48%, which is less than half of the board.

Position Values
Remember that our bishop is making a random walk around the room, dropping coins on every square he's visited. If he's visited a square in the room more than once, we need a way to represent that in the art. As such, we will use a different ASCII character as the count increases.

Unfortunately, in my opinion, the OpenSSH developers picked the wrong characters. They mention in their paper that the intention of the characters they picked was to increase the density of the characters as the visitation count of a square increases. Personally, I don't think these developers have spent much time working with ASCII art. Had I been on the development team, I would have picked a different set. However, here is the set they picked for each count:

Freq  0        1  2  3  4  5  6  7  8  9  10  11  12  13  14+
Char  (blank)  .  o  +  =  *  B  O  X  @  %   &   #   /   ^

A square the bishop never visits is left blank.

The special characters "S" and "E" identify the starting and ending locations of the bishop, respectively.
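Putting the pieces together, here is a minimal Python sketch of the whole algorithm as I understand it from the paper: parse the fingerprint into bit-pairs, walk the 17x9 board while clamping at the walls (which reproduces the wall and corner rules from the table above), count the coins dropped on each square, and render the frame. The function names and formatting are my own; treat this as an illustration, not a byte-for-byte copy of the OpenSSH implementation.

# Drunken bishop sketch: walk a 17x9 board using an MD5 fingerprint
# and render OpenSSH-style random art. Illustration only.

FLDW, FLDH = 17, 9                       # board dimensions
CHARS = " .o+=*BOX@%&#/^"                # coin counts 0..14 (0 = never visited)

def moves(fingerprint):
    """Yield 2-bit moves: bytes left-to-right, bit-pairs LSB-first."""
    for byte in bytes.fromhex(fingerprint.replace(":", "")):
        for shift in (0, 2, 4, 6):
            yield (byte >> shift) & 0x3

def randomart(fingerprint, title="RSA 2048"):
    board = [[0] * FLDW for _ in range(FLDH)]
    x, y = FLDW // 2, FLDH // 2          # start in the center: (8, 4)
    start = (x, y)
    for move in moves(fingerprint):
        x += 1 if move & 0x1 else -1     # low bit: east or west
        y += 1 if move & 0x2 else -1     # high bit: south or north
        x = max(0, min(FLDW - 1, x))     # clamping at the walls gives the
        y = max(0, min(FLDH - 1, y))     # corner and edge rules in the table
        board[y][x] = min(board[y][x] + 1, len(CHARS) - 1)
    lines = ["+--[ %s]----+" % title]
    for ry in range(FLDH):
        row = ""
        for rx in range(FLDW):
            if (rx, ry) == start:
                row += "S"
            elif (rx, ry) == (x, y):
                row += "E"
            else:
                row += CHARS[board[ry][rx]]
        lines.append("|" + row + "|")
    lines.append("+" + "-" * FLDW + "+")
    return "\n".join(lines)

print(randomart("d4:d3:fd:ca:c4:d3:e9:94:97:cc:52:21:3b:e4:ba:e9"))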

Additional Thoughts
Now that you know how the random art is generated for a given key, you can begin to ask yourself some questions:

  • When addressing picture collisions, how many fingerprints produce the same picture (same values for all positions)?
  • How many fingerprints produce the same shape (same visited squares with different values)?
  • How many different visualizations can the algorithm produce?
  • Can a fingerprint be derived by looking only at the random art?
  • How many different visualizations can a person easily distinguish?
  • What happens to the visualizations when changing the board size, either smaller or larger?
  • What visualizations can be produced with different chess pieces? How would the rules change?
  • What visualizations can be produced if the board were a torus (like Pac-man)?
  • Could this be extended to other fingerprints, such as OpenPGP keys? If so, could it simplify verifying keys at keysigning parties (and as a result, speed them up)?

Conclusion
Even though this post discussed the algorithm for generating the random art, it did not address the security model of the visualizations. There are many questions that need answers. Obviously, there are collisions. So, how many collisions are there? Can the number of visual collisions be predicted from the cryptographic hash and the size of the board? Does this weaken security when verifying OpenSSH keys, and if so, how?

These questions, and many others, should be addressed. But, for the time being, it seems to be working "well enough", and most people using OpenSSH probably only ever see the visualization when creating their own private and public key pair. You can enable it for every OpenSSH server you connect to, by setting "VisualHostKey yes" in your ssh_config(5).

In the meantime, I'll be working on a Python implementation for this on OpenPGP keys. It will use a larger board (11x19), and a different set of characters for the output, but the algorithm will remain the same. I'm interested to see if this can improve verifying people have the right OpenPGP key, by just checking the ASCII art, rather than reading out a 20-byte random string using the NATO alphabet. Keep an eye on https://github.com/atoponce/scripts/blob/master/art.py.

Strengthen Your Private Encrypted SSH Keys

Recently, on Hacker News, a post came through about improving the security of your encrypted private OpenSSH keys. I want to re-blog that post here (I'm actually jealous he blogged it first), in my own words, and provide a script at the end that will automate the process for you.

First off, Martin goes into great detail about the storage format of your unencrypted private OpenSSH keys. The unencrypted key is stored in a format known as Abstract Syntax Notation One (ASN.1) (for you web nerds, it's similar in function to JSON). However, when you encrypt the key with your passphrase, it is no longer valid ASN.1. So, Martin then takes you through the process of how the key is encrypted. The big take-away from that introduction is that, by default:

  • Encrypted OpenSSH keys use MD5, a horribly broken cryptographic hash, as part of the encryption process.
  • OpenSSH keys are encrypted with AES-128-CBC, which is fast, fast, fast.

It would be nice if our OpenSSH keys used a stronger cryptographic hash like SHA1, SHA2 or SHA3 in the encryption process, rather than MD5. Further, it would be nice if we could cause attackers who get our private encrypted OpenSSH keys to expend more computing resources when trying to brute force our passphrase. So, rather than using the speedy AES algorithm, how about 3DES or Blowfish?

This is where PKCS#8 comes into play. "PKCS" stands for "Public-Key Cryptography Standards". There are currently 15 standards, with 2 withdrawn and 2 under development. Standard #8 defines how private key certificates are to be handled, both in unencrypted and encrypted form. Because OpenSSH uses public key cryptography, and private keys are stored on disk, it would be nice if it adhered to the standard. Turns out, it does. From the ssh-keygen(1) man page:

     -m key_format
             Specify a key format for the -i (import) or -e (export) conver‐
             sion options.  The supported key formats are: “RFC4716” (RFC
             4716/SSH2 public or private key), “PKCS8” (PEM PKCS8 public key)
             or “PEM” (PEM public key).  The default conversion format is
             “RFC4716”.

As mentioned, the supported key formats are RFC4716, PKCS8 and PEM. Seeing as PKCS#8 is supported, it seems we can take advantage of it in OpenSSH. So the question becomes: what does PKCS#8 offer me in terms of security that I don't already have? Well, Martin answers this question in his post as well. Turns out, there are 2 versions of PKCS#8 that we need to address:

  • The version 1 option specifies a PKCS#5 v1.5 or PKCS#12 algorithm to use. These algorithms only offer 56-bits of protection, since they both use DES.
  • The version 2 option specifies that PKCS#5 v2.0 algorithms are used which can use any encryption algorithm such as 168 bit triple DES or 128 bit RC2.

As I mentioned earlier, we want SHA1 (or better) and 3DES (or slower). Turns out, the OpenSSL implementation of PKCS#8 version 2 uses the following algorithms:

  • PBE-SHA1-RC4-128
  • PBE-SHA1-RC4-40
  • PBE-SHA1-3DES
  • PBE-SHA1-2DES
  • PBE-SHA1-RC2-128
  • PBE-SHA1-RC2-40

PBE-SHA1-3DES is our target. So, the only remaining question is: can we convert our private OpenSSH keys to this format? If so, how? Well, because OpenSSH relies heavily on OpenSSL, we can use the openssl(1) utility to make the conversion to the new format, and from the ssh-keygen(1) manpage quoted above, we know OpenSSH supports the PKCS#8 format for our private keys, so we should be good.

Before we go further, though, why 3DES? Why not stick with the default AES? DES is slow, slow, slow, and 3DES is DES chained together 3 times. Compared to AES, it's a snail racing a hare. With 3DES, the data is encrypted with a first 56-bit DES key, then encrypted with a second 56-bit DES key, then finally encrypted with a third 56-bit DES key. The result is a 168-bit key (meet-in-the-middle attacks reduce the effective strength to roughly 112 bits, which is still far beyond plain DES). There are no known practical attacks against 3DES, and NIST considers it secure through 2030. It's certainly appropriate to use for encrypted storage of our private OpenSSH keys.

To convert our private key, all we need to do is rename it, run openssl(1) on the keys, then test. Here are the steps:

$ mv ~/.ssh/id_rsa{,.old}
$ umask 0077
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa     # dsa and ecdsa are also supported

Now login to a remote OpenSSH server where the public portion of that key is installed, and see if it works. If so, remove the old key. To simplify the process, I created a script where you provide your private OpenSSH key as an argument, and it does the conversion for you. You can find that script at https://github.com/atoponce/scripts/blob/master/ssh-to-pkcs8.zsh

What's the point? Basically, you should think of it the following way:

  • We're using SHA1 rather than MD5 as part of the encryption process.
  • By using 3DES rather than AES, we've slowed down brute force attacks to a crawl. This should buy us 2-3 extra characters of entropy in our passphrase.
  • Using PKCS#8 gives us the flexibility to use other algorithms in the future, as old ones are replaced.

I agree with Martin that it's a shame OpenSSH isn't using this by default. Why stick with the original OpenSSH storage format? Compatibility isn't a concern, as the support relies solely on the client, not the server. Because every client should have a different keypair installed, there is no worry about new versus old clients. Extra security is purchased through the use of SHA1 and 3DES. The computing time to create the keys is trivial, and the performance difference when using them is not noticeable compared to the traditional format. Of course, if the passphrase protecting your keys is strong, with lots and lots of entropy, then an attacker will be foiled in a brute force attack anyway. Regardless, why not make it more difficult for him by slowing him down?

Martin's post is a great read, and as such, I've converted my OpenSSH keys to the new format. I'd encourage you to do the same.

ZFS Administration, Appendix B- Using USB Drives

Table of Contents

Zpool Administration                         ZFS Administration                       Appendices
0. Install ZFS on Debian GNU/Linux           9. Copy-on-write                         A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs                                     10. Creating Filesystems                 B. Using USB Drives
2. RAIDZ                                     11. Compression and Deduplication        C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL)                  12. Snapshots and Clones                 D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC)    13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools     14. ZVOLs
6. Scrub and Resilver                        15. iSCSI, NFS and Samba
7. Getting and Setting Properties            16. Getting and Setting Properties
8. Best Practices and Caveats                17. Best Practices and Caveats

Introduction

This comes from the "why didn't I think of this before?!" department. I have lying around my home and office a ton of USB 2.0 thumb drives. I have six 16GB drives and eight 8GB drives. So, 14 drives in total. I have two hypervisors in a GlusterFS storage cluster, and I just happen to have two USB squids, that support 7 USB drives each. Perfect! So, why not put these to good use, and add them as L2ARC devices to my pool?

Disclaimer

USB 2.0 is limited to about 40 MBps per controller. A standard 7200 RPM hard drive can do 100 MBps. So, adding USB 2.0 drives to your pool as a cache is not going to increase your read bandwidth, at least not for large sequential reads. However, the seek latency of a NAND flash device is typically around 1 to 3 milliseconds, whereas a platter HDD is around 12 milliseconds. If you do a lot of small random IO, like I do, then the USB drives will actually provide an overall performance increase that HDDs cannot.

Also, because there are no moving parts with NAND flash, every read served from the cache is data that doesn't need to be read from the HDD, which means less movement of the actuator arm, which means less power consumed in the long term. So, not only are the drives better for small random IO, they're saving you power at the same time! Yay for going green!

Lastly, the L2ARC should be read intensive. However, it can also become write intensive if you don't have enough room in your ARC and L2ARC to store all the requested data. If that's the case, you'll be constantly writing to your L2ARC. For USB drives without wear-leveling algorithms, you'll chew through the drives quickly, and they'll be dead in no time. If this describes your workload, you could store only metadata, rather than the actual data block pages, in the L2ARC. You can do this with the following:

# zfs set secondarycache=metadata pool

You can set this pool-wide, or per dataset. In the case outlined above, I would certainly do it pool-wide, and each dataset will inherit the setting by default.

Implementation

Setting this up is rather straightforward. Just identify the drives by their unique identifiers, then add them to the pool:

# ls /dev/disk/by-id/usb-* | grep -v part
/dev/disk/by-id/usb-Kingston_DataTraveler_G3_0014780D8CEBEBC145E80163-0:0@
/dev/disk/by-id/usb-Kingston_DataTraveler_SE9_00187D0F567FEC2090007621-0:0@
/dev/disk/by-id/usb-Kingston_DataTraveler_SE9_00248121ABD5EC2070002E70-0:0@
/dev/disk/by-id/usb-Kingston_DataTraveler_SE9_00D0C9CE66A2EC2070002F04-0:0@
/dev/disk/by-id/usb-_USB_DISK_Pro_070B2605FA99D033-0:0@
/dev/disk/by-id/usb-_USB_DISK_Pro_070B2607A029C562-0:0@
/dev/disk/by-id/usb-_USB_DISK_Pro_070B2608976BFD58-0:0@

So, there are seven of the drives I outlined at the beginning of the post. To add them to the pool as L2ARC devices, just run the following command:

# zpool add -f pool cache usb-Kingston_DataTraveler_G3_0014780D8CEBEBC145E80163-0:0\
usb-Kingston_DataTraveler_SE9_00187D0F567FEC2090007621-0:0\
usb-Kingston_DataTraveler_SE9_00248121ABD5EC2070002E70-0:0\
usb-Kingston_DataTraveler_SE9_00D0C9CE66A2EC2070002F04-0:0\
usb-_USB_DISK_Pro_070B2605FA99D033-0:0\
usb-_USB_DISK_Pro_070B2607A029C562-0:0\
usb-_USB_DISK_Pro_070B2608976BFD58-0:0

Of course, these are the unique identifiers for my USB drives. Change them as necessary for your drives. Now that they are installed, are they filling up?

# zpool iostat -v
                                                               capacity     operations    bandwidth
pool                                                          alloc   free   read  write   read  write
------------------------------------------------------------  -----  -----  -----  -----  -----  -----
pool                                                           695G  1.13T     21     59  53.6K   457K
  mirror                                                       349G   579G     10     28  25.2K   220K
    ata-ST1000DM003-9YN162_S1D1TM4J                               -      -      4     21  25.8K   267K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50708780                       -      -      4     21  27.9K   267K
  mirror                                                       347G   581G     11     30  28.3K   237K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50713154                       -      -      4     22  16.7K   238K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50710024                       -      -      4     22  19.4K   238K
logs                                                              -      -      -      -      -      -
  mirror                                                         4K  1016M      0      0      0      0
    ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part1                  -      -      0      0      0      0
    ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part1                  -      -      0      0      0      0
cache                                                             -      -      -      -      -      -
  ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part2                52.2G    16M      4      2  51.3K   291K
  ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part2                52.2G    16M      4      2  52.6K   293K
  usb-Kingston_DataTraveler_G3_0014780D8CEBEBC145E80163-0:0    465M  6.80G      0      0    319  72.8K
  usb-Kingston_DataTraveler_SE9_00187D0F567FEC2090007621-0:0  1.02G  13.5G      0      0  1.58K  63.0K
  usb-Kingston_DataTraveler_SE9_00248121ABD5EC2070002E70-0:0  1.17G  13.4G      0      0    844  72.3K
  usb-Kingston_DataTraveler_SE9_00D0C9CE66A2EC2070002F04-0:0   990M  13.6G      0      0  1.02K  59.9K
  usb-_USB_DISK_Pro_070B2605FA99D033-0:0                      1.08G  6.36G      0      0  1.18K  67.0K
  usb-_USB_DISK_Pro_070B2607A029C562-0:0                      1.76G  5.68G      0      1  2.48K   109K
  usb-_USB_DISK_Pro_070B2608976BFD58-0:0                      1.20G  6.24G      0      0    530  38.8K
------------------------------------------------------------  -----  -----  -----  -----  -----  -----

Something important to understand here is that the drives do not all need to be the same size. You can mix and match what you have on hand. Of course, the more space you can give to the cache, the better off you'll be.

Conclusion

While this certainly isn't designed for speed, it can be used to lower random IO latencies, and it will reduce power draw in the datacenter. Further, what else are you going to do with those USB drives just lying around? Might as well put them to good use, especially seeing as "the cloud" is making it trivial to get all of your files online.

Password Attacks, Part III- The Combination Attack

Introduction

It's important to understand that most password attacks against offline databases, where only hashes are stored, are extensions of either the brute force attack or the dictionary attack, or a hybrid combination of both. There isn't really anything new outside of those two basic attacks. The combination attack is one such attack, where we combine two dictionaries to create a much larger one. This larger dictionary becomes the basis for creating passphrase dictionaries.

Combination Attack

Suppose we have two (very small) dictionaries. The first dictionary is a list of sizes while the second dictionary is a list of animals:

Dictionary 1:

tiny
small
medium
large
huge

Dictionary 2:

cat
dog
monkey
mouse
elephant

In order to combine these two dictionaries, we take a standard cross product between them. This means there will be a total of 25 words in our combined dictionary:

tinycat
tinydog
tinymonkey
tinymouse
tinyelephant
smallcat
smalldog
smallmonkey
smallmouse
smallelephant
mediumcat
mediumdog
mediummonkey
mediummouse
mediumelephant
largecat
largedog
largemonkey
largemouse
largeelephant
hugecat
hugedog
hugemonkey
hugemouse
hugeelephant

We have begun to assemble some crude, rudimentary phrases. They may not make a lot of sense, but building that dictionary was cheap. I could create a script, similar to the following, to build that list for me:

#!/bin/sh
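# Cross product: for every word A in dict1, append every word B from dict2.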
while read A; do
    while read B; do
        echo "${A}${B}"
    done < /tmp/dict2.txt
done < /tmp/dict1.txt > /tmp/comb-dict.txt

Personal Attack Building

Let's extend the combination attack a bit to target personal data. Suppose dictionary one is a list of male names and dictionary two is a list of four-digit years, starting from year 0 through year 2013. We can then create a combined dictionary of names paired with birth (or death) years. For those passwords that don't fall to simple dictionary attacks, maybe they record when a kid was born, or when a dad died. I've shoulder surfed a number of different people, watched as they typed in their passwords, and you would be surprised how many passwords meet this exact criteria. Something like "Christian1995".

Now that I have you thinking, it should be obvious that we can create all sorts of personalized dictionaries. How about a list of male names, a list of female names, a list of surnames, and a list of dates in MMDDYY format? Or how about lists of cities, states, universities, or sports teams? Lists of "last 4 digits of your SSN", fully qualified domain names of popular websites, and common keyboard patterns, such as "qwert", "asdf" and "zxcv", or "741852963" (I've seen this hundreds of times, where people just swipe their finger down their 10-key).
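As a rough sketch of how cheaply one of these personalized dictionaries can be built (the word lists and file name below are placeholders, not real data):

from itertools import product

names = ["christian", "Christian", "amy", "Amy"]       # placeholder name list
years = [str(y) for y in range(1900, 2014)]            # four-digit years
dates = ["%02d%02d%02d" % (m, d, y % 100)              # MMDDYY dates
         for y in range(1970, 2014)
         for m in range(1, 13)
         for d in range(1, 29)]

with open("personal-dict.txt", "w") as out:
    for name, suffix in product(names, years + dates):
        out.write(name + suffix + "\n")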

Even though the success rate of getting a personalized password out of one of these dictionaries might be low compared to the common dictionary attack, it's still far more efficient than brute force. And it's so trivial to make these dictionaries that I could have one server continuously creating combination dictionaries, while another works through the lists that are produced.

Extending the Attack

There are always sites forcing numbers on you, as well as uppercase characters and non-alphanumeric characters. Believe it or not, just adding a "0" or a "1" at the beginning or end of the words in these dictionaries can greatly improve my chances of discovering your password. Or even just adding an exclamation point at the beginning or end might be all that's needed to extend a standard dictionary.

However, we can get a bit more creative with little cost. Rather than just concatenating the words together, we can insert a dash "-" or an underscore "_" between them. These are common characters to use when forced to include non-alphanumeric characters in passwords that must be 8 characters or longer. Passwords such as "i-love-you" or "Alice_Bob" are common ways to satisfy the requirement while keeping the password easy to remember. And, of course, we can append "!" or "0" to every entry in our dictionary for the numerical requirement.
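A sketch of that kind of mutation pass, reusing the two small dictionaries from the script above (the output file name is my own placeholder):

from itertools import product

separators = ["", "-", "_"]           # common "special character" choices
decorations = ["", "!", "0", "1"]     # cheap ways to satisfy number/symbol rules

dict1 = open("/tmp/dict1.txt").read().split()
dict2 = open("/tmp/dict2.txt").read().split()

with open("/tmp/mutated-dict.txt", "w") as out:
    for a, sep, b, deco in product(dict1, separators, dict2, decorations):
        out.write(deco + a + sep + b + deco + "\n")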

According to Wikipedia, the Oxford English Dictionary, the go-to reference for English definitions, is the following size:

As of 30 November 2005, the Oxford English Dictionary contained approximately 301,100 main entries.

This is smaller than the word lists most Unix-like operating systems ship, but it's targeted enough that most people relying on dictionary words for their passwords won't deviate much from it. So, creating a full two-word passphrase dictionary from it would consist of 301,100 x 301,100 = 90,661,210,000 entries. Assuming an average word length of 9 characters, each entry is roughly 19 bytes with the newline, which puts the file somewhere around 1.7 TB. Certainly nothing to laugh at, but with compression, we should be able to get it down to a few hundred GB, give or take. Now let's generate two or three of those files with various combinations of word separators or prepended/appended numbers and non-alphanumeric characters, and we have a great attack vector starting point.

It's important to note that these files only need to be generated once, then stored for long-term use. The initial cost of generating them will be high, of course. For only $11,000 USD, you can purchase a Backblaze storage pod with 180 TB of raw disk to store these behemoth files, and many combinations of them. It may be a large initial expense, but the potential value of a cracked password may be worth it in the end, something many attackers and their parent companies might just consider. Further, you may even find some of these dictionaries online.

Conclusion

We must admit that if we have gotten to this point in the game, we are getting desperate for the password. This attack is certainly more effective than brute force, and it won't take us long to exhaust these combined dictionaries. However, the rate at which we pull passwords out of them will certainly be much lower than the roughly 70% success rate of a plain dictionary attack. That doesn't mean it's not worth it. Even if we find only 10% of the passwords in our hashed database, that's a great return for the effort put into it.

ZFS Administration, Appendix A- Visualizing The ZFS Intent Log (ZIL)

Table of Contents

Zpool Administration                         ZFS Administration                       Appendices
0. Install ZFS on Debian GNU/Linux           9. Copy-on-write                         A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs                                     10. Creating Filesystems                 B. Using USB Drives
2. RAIDZ                                     11. Compression and Deduplication        C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL)                  12. Snapshots and Clones                 D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC)    13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools     14. ZVOLs
6. Scrub and Resilver                        15. iSCSI, NFS and Samba
7. Getting and Setting Properties            16. Getting and Setting Properties
8. Best Practices and Caveats                17. Best Practices and Caveats

Background

While taking a walk around the city with the rest of the system administration team at work today (we have our daily "admin walk"), a discussion came up about asynchronous writes and the contents of the ZFS Intent Log. Previously, as shown in the Table of Contents, I blogged about the ZIL at great length. However, I didn't really discuss what the contents of the ZIL were, and to be honest, I didn't fully understand it myself. Thanks to Andrew Kuhnhausen, this was clarified. So, based on the discussion we had during our walk, as well as some pretty graphs on the whiteboard, I'll give you the breakdown here.

Let's start at the beginning. ZFS behaves more like an ACID-compliant RDBMS than a traditional filesystem. Its writes are transactions, meaning there are no partial writes, and they are fully atomic, meaning you get all or nothing. This is true whether the write is synchronous or asynchronous. So, the best case is you have all of your data. The worst case is you missed the last transactional write, and your data is 5 seconds old (by default). So, let's look at those two cases: the synchronous write and the asynchronous write. With synchronous writes, we'll consider the case both with and without a separate logging device (SLOG).

The ZIL Function

The primary, and only, function of the ZIL is to replay lost transactions in the event of a failure. When a power outage, crash, or other catastrophic failure occurs, pending transactions in RAM may not have been committed to slow platter disk. So, when the system recovers, ZFS will notice the missing transactions. At that point, the ZIL is read to replay those transactions and commit the data to stable storage. While the system is up and running, the ZIL is never read; it is only written to. You can verify this by doing the following (assuming you have a SLOG in your system). Pull up two terminals. In one terminal, run an IOZone benchmark. Do something like the following:

$ iozone -ao

This will run a whole series of tests to see how your disks perform. While this benchmark is running, in the other terminal, as root, run the following command:

# zpool iostat -v 1

This will clearly show you that when the ZIL resides on a SLOG, the SLOG devices are only written to. You never see any numbers in the read columns. This is because the ZIL is never read, unless transactions need to be replayed after a crash. Here is one of those seconds illustrating the writes:

                                                            capacity     operations    bandwidth
pool                                                     alloc   free   read  write   read  write
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
pool                                                     87.7G   126G      0    155      0   601K
  mirror                                                 87.7G   126G      0    138      0   397K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9421741-part5          -      -      0     69      0   727K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9755779-part5          -      -      0     68      0   727K
logs                                                         -      -      -      -      -      -
  mirror                                                 2.43M   478M      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part1      -      -      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part1      -      -      0      8      0   108K
  mirror                                                 2.57M   477M      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part1      -      -      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part1      -      -      0      7      0  95.9K
cache                                                        -      -      -      -      -      -
  scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part5    26.6G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part5    26.5G  56.8G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part5    26.7G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part5    26.7G  56.7G      0      0      0      0
-------------------------------------------------------  -----  -----  -----  -----  -----  -----

The ZIL should always be on non-volatile stable storage! You want your data to remain consistent across power outages. Putting your ZIL on a SLOG that is built from TMPFS, RAMFS, or RAM drives that are not battery-backed means you will lose any pending transactions. This doesn't mean you'll have corrupted data. It only means you'll have old data. With the ZIL on volatile storage, you'll never be able to recover the new data that was pending a write to stable storage. Depending on how busy your servers are, this could be a Big Deal. SSDs, such as those from Intel or OCZ, are good, cheap ways to have a fast, low latency SLOG that is reliable when power is cut.

Synchronous Writes without a SLOG

When you do not have a SLOG, the application only interfaces with RAM and slow platter disk. As previously discussed, the ZFS Intent Log (ZIL) can be thought of as a file that resides on the slow platter disk. When the application needs to make a synchronous write, the contents of that write are sent to RAM, where the application is currently living, as well as to the ZIL. So, the data blocks of your synchronous write at this exact moment in time have two homes: RAM and the ZIL. Once the data has been written to the ZIL, the platter disk sends an acknowledgement back to the application letting it know that it has the data, at which point the data is flushed from RAM to slow platter disk.

This isn't ALWAYS the case, however. With slow platter disk, ZFS can actually store the transaction group (TXG) on platter immediately, with pointers in the ZIL to the locations on platter. When the disk ACKs back that the ZIL contains the pointers to the data, then the write TXG is closed in RAM, and the space in the ZIL is opened up for future transactions. So, in essence, you could think of the synchronous TXG write commit happening in three ways:

  1. All data blocks are synchronously written to both the RAM ARC and the ZIL.
  2. All data blocks are synchronously written to both the RAM ARC and the VDEV, with pointers to the blocks written in the ZIL.
  3. All data blocks are synchronously written to disk, where the ZIL is completely ignored.

In the image below, I tried to capture a simplified view of the first process. The pink arrows, labeled as number one, show the application committing its data to both RAM and the ZIL. Technically, the application is running in RAM already, but I took it out to make the image a bit more clean. After the blocks have been committed to RAM, the platter ACKs the write to the ZIL, noted by the green arrow labeled as number two. Finally, ZFS flushes the data blocks out of RAM to disk as noted by the gray arrow labeled as number three.

[Image: a synchronous write with ZFS and the ZIL on platter, without a SLOG]

Synchronous Writes with a SLOG

The advantage of a SLOG, as previously outlined, is the ability to use low latency, fast disk to send the ACK back to the application. Notice that the ZIL now resides on the SLOG, and no longer resides on platter. The SLOG will catch all synchronous writes (well those called with O_SYNC and fsync(2) at least). Just as with platter disk, the ZIL will contain the data blocks the application is trying to commit to stable storage. However, the SLOG, being a fast SSD or NVRAM drive, ACKs the write to the ZIL, at which point ZFS flushes the data out of RAM to slow platter.

Notice that ZFS is not flushing the data out of the ZIL to platter. This is what confused me at first. The data is flushed from RAM to platter. Just like an ACID-compliant RDBMS, the ZIL is only there to replay the transactions, should a failure occur and the data be lost. Otherwise, the data is never read from the ZIL. So really, the write operation doesn't change at all. Only the location of the ZIL changes; otherwise, the operation is exactly the same.

As shown in the image, again the pink arrows labeled number one show the application committing its data to both the RAM and the ZIL on the SLOG. The SLOG ACKs the write, as identified by the green arrow labeled number two, then ZFS flushes the data out of RAM to platter as identified by the gray arrow labeled number three.

[Image: a synchronous write with ZFS and the ZIL on a SLOG]

Asynchronous Writes

Asynchronous writes have a history of being "unstable". You have been taught that you should avoid asynchronous writes, and that if you decide to go down that path, you should prepare for corrupted data in the event of a failure. For most filesystems, that is good counsel. With ZFS, however, it's nothing to be afraid of. Because of the architectural design of ZFS, all data is committed to disk in transaction groups. Further, the transactions are atomic, meaning you get it all, or you get none. You never get partial writes. This holds for asynchronous writes, too. So, your data is ALWAYS consistent on disk, even with asynchronous writes.

So, if that's the case, then what exactly is going on? Well, there actually resides a ZIL in RAM when you enable "sync=disabled" on your dataset. As with the previous synchronous architectures, the data blocks of the application are sent to the ZIL located in RAM. As soon as the data is in the ZIL, RAM acknowledges the write, and then flushes the data to disk, as would be standard with synchronous writes.

I know what you're thinking: "Now wait a minute! There are no acknowledgements with asynchronous writes!" Not always true. With ZFS, there is most certainly an acknowledgement; it's just one coming from very fast, extremely low latency volatile storage. The ACK is near instantaneous. Should there be a crash or some other failure that causes RAM to lose power before the write was saved to non-volatile storage, then the write is lost. However, all this means is that you lost new data, and you're stuck with old but consistent data. Remember, with ZFS, data is committed in atomic transactions.

The image below illustrates an asynchronous write. Again, the pink number one arrow shows the application data blocks being initially written to the ZIL in RAM. RAM ACKs back with the green number two arrow. ZFS then flushes the data to disk, as per every previous implementation, as noted by the gray number three arrow. Notice in this image that even if you have a SLOG, with asynchronous writes it's bypassed, and never used.

Show how an asynchronous write works with ZFS and the ZIL.
Image showing an asynchronous write with ZFS.
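
If you want to play with this yourself, the sync behavior is a per-dataset property. The pool and dataset names here are just placeholders:

# zfs set sync=disabled tank/dataset
# zfs get sync tank/dataset
# zfs set sync=standard tank/dataset

Setting "sync=standard" returns the dataset to the default behavior, where only the writes the application explicitly asks to be synchronous go through the ZIL.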

Disclaimer

This is how I and my coworkers understand the ZIL. This is after reading loads of documentation, understanding a bit of computer science theory, and understanding how an ACID compliant RDBMS works, which is architected in a similar manner. If you think this is not correct, please let me know in the comments, and we can have a discussion about the architecture.

There are certainly some details I am glossing over, such as how much data the ZIL will hold before it's no longer utilized, the timing of the transaction group writes, and other things. However, it should also be noted that aside from some obscure documentation, there don't seem to be any solid examples of exactly how the ZIL functions. So, I thought it would be best to illustrate that here, so others aren't left confused like I was. For me, images always make things easier to understand.

Password Attacks, Part II - The Dictionary Attack

Introduction
Before we start delving into the obscure attacks, it probably makes the most sense to get introduced to the most common attacks. The dictionary attack is one such attack. Previously we talked about the brute force attack, which is highly ineffective, exceptionally slow, and expensive to maintain. Here, we'll introduce a much more effective attack that opens up the ability to crack passwords of 15 characters and longer with ease.

Dictionary Attack
The dictionary attack is another dumb search, except for one thing: an assumption is made that people choose passwords based on dictionary words, because adding mutations to the password requires more work, is more difficult to remember, and is more difficult to type. Because humans are largely lazy by default, we take the lazy approach to password creation- base it on a dictionary word, and be done with it. After all, no one is really going to hack my account. Right?

A couple years ago, during the height of the Sony PlayStation 3 hacking saga, 77 million PlayStation Network accounts were leaked. These were accounts from all over the globe. Worse, SONY STORED 1 MILLION OF THOSE PASSWORDS IN PLAINTEXT! I'll let that sink in for a minute. These 1 million passwords were leaked to Bittorrent. So, we can do some analysis on the passwords themselves, such as length and difficulty. Troy Hunt did some amazing work on the analysis of the passwords, so this data should be credited to him, but let's look it over:

  • 93% of the passwords were between 6 and 10 characters long.
  • 50% of the passwords were less than 8 characters.
  • Of the following character sets, only 4% of the passwords had 3 or more: numbers, uppercase, lowercase, everything else.
  • 45% of the passwords were lowercase only.
  • 99% of the passwords did not contain non-alphanumeric characters.
  • 65% of the passwords can be found in a dictionary.
  • Within the Sony data, there were two separate password databases: "Beauty" and "Delboca". Where there was a common email address between the two, 92% of those accounts used the same password in both.
  • Comparing the Sony and Gawker hacks, where there was a common email address, 67% of those accounts used the same password.
  • 82% of the passwords would fall victim to a Rainbow Table Attack (something we'll cover later).

In the cryptography circles I run in, these numbers are not very surprising. In fact, 65% of passwords found in a dictionary is a bit low. I've seen the average sit closer to 70%, which is troubling. This means that a dictionary attack is extremely effective, no matter how long your password is. If it can be found in a dictionary, you'll fall victim.

Creating a Dictionary
So, what exactly is a dictionary that can be used for this attack? Generally, it's nothing more than a word list, with one word on each line. Standard Unix operating systems have a dictionary installed when a spell checking utility is installed. This can be found in /usr/share/dict/words. In the case of my Debian GNU/Linux system, I have about 100,000 words in the dictionary:

$ wc -l /usr/share/dict/words 
99171 /usr/share/dict/words

But, I can install a much larger wordlist:

$ sudo aptitude install wamerican-insane
$ sudo select-default-wordlist
$ wc -l /usr/share/dict/words
650722

Even though my word list has grown to more than six times its previous size, this still pales in comparison to some dictionaries you can download online. The Openwall word list contains 40 million entries, and is over 500 MB in size. It consists of words from more than 20 languages, and includes passwords generated with pwgen(1). It will cost you $27.95 USD for the download, however. There are plenty of other word lists all over the Internet. Spend some time searching, and you can generate a decently sized word list on your own.
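
To give you an idea of what running such an attack actually looks like, here is roughly how you would point John the Ripper or Hashcat at a word list. The "hashes.txt" file is a hypothetical dump of hashed passwords, and the Hashcat mode shown (-m 100) assumes raw SHA1 hashes; adjust it for whatever the database actually stores:

$ john --wordlist=/usr/share/dict/words hashes.txt
$ hashcat -a 0 -m 100 hashes.txt /usr/share/dict/words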

Precomputed Dictionary
This will open the discussion for Rainbow Tables, something we'll discuss later on. However, with a precomputed dictionary attack, I can spend the time hashing all the values in my dictionary, and store them as key/value pairs, where the key is the hash, and the value is the password. This can save considerable time for the password cracking utility when doing the lookup. However, it comes at a cost; I must spend the time precomputing all the values in the dictionary. Once they are computed, though, I can use them over and over for my lookups as needed. Disk space is also a concern. For a SHA1 hash, you'll be adding 40 bytes to every entry. For the Openwall word list, this means your dictionary will grow from 500 MB to roughly 2 GB. Not a problem for today's storage, but certainly something you should be aware of.
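
As a rough sketch of the idea (the table file name is made up), precomputing a SHA1 lookup table from a word list is nothing more than a loop, after which every captured hash becomes a text search instead of another hashing run:

#!/bin/sh
# Build a "hash password" table, one entry per line, sorted for fast lookups
while IFS= read -r WORD; do
    HASH=`printf '%s' "$WORD" | sha1sum | awk '{print $1}'`
    printf '%s %s\n' "$HASH" "$WORD"
done < /usr/share/dict/words | sort > sha1-table.txt

With the table built, looking up a captured hash is instant. The hash below is the SHA1 of "password"; assuming that word is in the list, grep returns the plaintext immediately:

$ grep '^5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8 ' sha1-table.txt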

Rainbow tables are a version of the precomputed dictionary attack we'll look at later. The advantage of a rainbow table is savings on disk space at the cost of slightly longer lookup times. We still have precomputed hashes for dictionary words, but they don't occupy as much space.

Thwarting precomputed hashes can be accomplished by salting your password. I discuss password salts on my blog when discussing the shadowed password on Unix systems. Because hashing functions produce the same output for a given input, if that input changes, such as by adding a salt, the output will change. Even though your password was the same, by appending a salt to your password, the computed hash will be completely different. Even if I have your salt in my possession, precomputed dictionary attacks are of no use, because each salt for each account will likely be different, which means I need to precompute different dictionaries with different salts, a very costly task for both CPU and disk space.
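
You can see this for yourself on any Unix machine. The password and the two salts below are made up; the point is that the same password with two different salts produces two completely unrelated digests, so a single precomputed table is useless:

$ printf '%s' 'batterystaple' | sha1sum
$ printf '%s' '3f1c$batterystaple' | sha1sum
$ printf '%s' '9a7e$batterystaple' | sha1sum

Run all three yourself and compare the digests.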

However, if I have the salt in my possession, I can still use the salt in conjunction with a standard word list to compute the desired hash. If I find a match, I have found your password in my word list, even if I needed the salt to help me get there.

Conclusion
Because 65%-70% of people use dictionary words for their passwords, this makes the dictionary attack extremely attractive for attackers who have offline password databases. Even with the Openwall word list of 40 million words, most CPUs can exhaust that word list in seconds, meaning roughly 70% of the passwords will be found in very little time with very little effort. Further, because 67% of the population or more use the same password across multiple accounts, if we know something about the accounts we've just attacked, we can now use that information to log in to their bank, Facebook, email, Twitter and other accounts. For the effort, dictionary attacks are very valuable, and a first pick for many attackers.
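
If you ever get your hands on a leaked password list and want to verify those percentages yourself, the check is only a few commands. The file names here are hypothetical, and this naive comparison is case sensitive and ignores simple mutations, so it will actually undercount:

$ sort -u leaked-passwords.txt > leaked.sorted
$ sort -u /usr/share/dict/words > words.sorted
$ comm -12 leaked.sorted words.sorted | wc -l

comm -12 prints only the lines common to both sorted files, so dividing that count by the number of unique leaked passwords gives you the percentage found in the dictionary.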

Password Attacks, Part I - The Brute Force Attack

Introduction

Those who follow my blog know I have blogged about password security in the past. No matter how you spin it, how you argue it, or what your opinions are on password security, entropy is everything. If you don't think entropy matters, think again. I've blogged about entropy from information theory in one way or another. Just search my blog for the word "entropy". It should keep you occupied for a while.

I'm not going to cover entropy here. Instead, I wish to put my focus on attacking passwords, rather than building them. Entropy is certainly important, but let's discuss it from the perspective of actually using it against the owner of that password. For simplicity, we'll assume best-case scenarios with entropy. That is, even though humans have a tendency to weaken the entropy of their passwords, we'll assume that hasn't happened. Let's first start with the brute force attack, which is the most simplistic, though least efficient, attack on passwords.

Brute Force
Brute force is a dumb search. Literally. It means we know absolutely nothing about the search space. We don't know the character restrictions. We don't know length restrictions. We don't know anything about the password. All we have is a file of hashed passwords. We have enough data that if I use a utility like John the Ripper or Hashcat, I can start incrementing through a search space. My search may look something like this:

a, b, c, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ..., bx, by, bz, ...

This is only encompassing lowercase letters in the English alphabet, but it illustrates the point. I start with a common denominator, and increment its value by one until I've exhausted the search space for that single character. Then I concatenate an additional character, and increment until all possibilities in the two-character space are exhausted. Then I move to the three-character space, and so forth, as fast as my hardware will allow.
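
In practice, you rarely write that loop yourself; the cracking utilities handle the incrementing. As a hedged sketch (the hash file is hypothetical, and I'm assuming NTLM hashes to match the Ars Technica numbers further down), the equivalent runs look something like this:

$ john --incremental hashes.txt
$ hashcat -a 3 -m 1000 --increment hashes.txt ?a?a?a?a?a?a?a?a

Hashcat's ?a placeholder covers the full printable character set, and --increment walks the candidate length upward one character at a time, exactly as described above.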

On the United States English keyboard, there are 94 printable case sensitive characters. Because this is a brute force search, we can make very few assumptions about our search space. We can assume a lot of non-printable characters will not be valid, such as the BELL, the TAB and the ENTER. In fact, looking at the ASCII table, we can safely assume that ASCII decimal values 0 through 31 won't be used. However, it is safe to assume that decimal values 32 through 126 will be used, where decimal value 32 is the SPACE. This gives 95 total characters for my search space.

Hashing Algorithms
Before we get into speed, we must also address hashing. Many password databases will hash their passwords, and we're pretending that I have access to one of these databases offline. Most of these databases will also store the salt, if the password is salted, and the algorithm of the hashing type, if it supports more than one hash. So, let's assume that I know the hashing algorithm. This is important, because it will play into how fast I can brute force these passwords.

A designer of a password database should keep the following things in mind:

  • Deliberately slow algorithms, such as the Blowfish-based bcrypt, will hinder brute force attacks. However, they can also be a burden for busy server farms.
  • Algorithms such as SHA256 and SHA512 should be used when possible, due to the lack of practical cryptanalytic attacks against them.
  • Rotating the password hash thousands of times before storing the resulting hash on disk makes brute force attacks very, very ineffective.

What do I mean by "rotating passwords"? Let's assume the password is the text "pthree.org". Let's further assume that the password is being hashed with SHA1. The SHA1 of "pthree.org" is "bea133441136e7a7c992aa2e961002c170e54c93". Now, let's take that hash, and use it as the input to SHA1 again. The SHA1 of "bea133441136e7a7c992aa2e961002c170e54c93" is "cec33ee7686fc2a91b167e096b1f3182ae7828b7". Now let's take that hash, and use it as the input to SHA1 again, continuing until we have called the SHA1 function 1,000 times. The resulting hash should be "4f019154eaefe82946d833f1b05aa1968f1f202a".

A basic Unix shell script could look like this:

#!/bin/sh
# Feed each SHA1 digest back into SHA1, 1,000 times
PASSWORD="pthree.org"
for i in `seq 1000`; do
    PASSWORD=`echo -n "$PASSWORD" | sha1sum - | awk '{print $1}'`
done
echo "$PASSWORD"

MD5 is a fast algorithm, and it is also horribly broken, so it should be avoided. SHA1 is only marginally slower than MD5, practical cryptanalysis is approaching, and it should probably be avoided as well. Blowfish-based bcrypt, however, is exceptionally slow by design, and no practical attack against the algorithm has been shown. So, it makes for a perfect candidate for storing passwords. If the password is hashed and rotated 1,000 times before storing on disk, this cripples brute force attacks to an absolute crawl. Other candidates, such as SHA256, SHA512 and the newly NIST approved SHA3 algorithms, are also very fast, so their passwords should definitely be rotated 1,000 or even 10,000 times before storing to disk to slow the brute force search.

Raw Speed
Now the question remains- how fast can I search through passwords using a brute force method? Well, this will depend largely on hardware, and as just discussed, the chosen algorithm. Let us suppose that I have a rack of 25 of the latest AMD Radeon graphics cards. According to Ars Technica, I can achieve a mind blowing speed of 350 BILLION passwords per second when attacking the NTLM hash used to store Windows passwords. If every password in the database was 8 characters in length, then my search space is:

95 * 95 * 95 * 95 * 95 * 95 * 95 * 95 = 95^8 = 6,634,204,312,890,625 passwords

Six-and-a-half QUADRILLION passwords. That should give me a search space large enough to not worry about, right? Wrong. At a crazy pace of 350 billion passwords per second:

6634204312890625 / 350000000000 = 18955 seconds = 316 minutes = 5.3 hours

In just under 6 hours, I can exhaust the 8-character search space using this crazy machine. But what about 9 characters? What about 10 characters? What about 15 characters? These are legitimate questions, and they are why entropy is EVERYTHING with regards to your password and the search space to which it belongs. Think of your password as a needle, and entropy as the haystack. The larger your entropy is (the total number of possible passwords), the harder it will be to find that needle in the haystack.

The Exponential Wall
Let's do some math, and explain what is going on. I'll give you a table of values, showing password lengths from 8 through 15 characters in the first column. In the second column will be the search space, and in the third column will be the time it takes to fully exhaust that search space. Before continuing, remember that it may not be necessary to exhaust the search space. As soon as I find the password, I stop; there is no need to continue. I may find the password very early in the search, very late in the search, or anywhere in between. So, the third column is merely a maximum time.

Length  Search Space                             Max time at 350 billion passwords per second
8       6,634,204,312,890,625                    5.3 hours
9       630,249,409,724,609,375                  20.8 days
10      59,873,693,923,837,890,625               5.4 years
11      5,688,000,922,764,599,609,375            5.1 centuries
12      540,360,087,662,636,962,890,625          48.9 millennia
13      51,334,208,327,950,511,474,609,375       4,650.1 millennia
14      4,876,749,791,155,298,590,087,890,625    441,830.6 millennia
15      463,291,230,159,753,366,058,349,609,375  41,973,910.1 millennia
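
If you want to check these numbers yourself, bc(1) handles the arbitrary precision arithmetic just fine. Here is a quick sketch that reproduces the search space column, along with the raw number of seconds needed at 350 billion guesses per second:

#!/bin/sh
# Print length, search space, and seconds to exhaust at 350 billion guesses/sec
RATE=350000000000
for LEN in `seq 8 15`; do
    SPACE=`echo "95^$LEN" | bc`
    SECS=`echo "$SPACE / $RATE" | bc`
    echo "$LEN $SPACE $SECS"
done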

What's going on here? By nearly doubling the length of my password, the maximum search time went from about 5 hours to nearly 42 million millennia. We've hit what's called "the exponential wall". What is happening is we are traveling the road of 95^x, where 95 is the size of our character set, and "x" is the password length. To put this visually, it looks exactly like this:

Graphic showing the exponential wall.

As you can see, it gets exceptionally expensive to continue searching past 9 characters, and absolutely prohibitive to search past 10, let alone 15. We've hit the exponential wall. Now, I've heard the argument from all of my conspiracy theorist friends that faceless, nameless government agencies with endlessly deep pockets (never mind that most countries are practically bankrupt), who have access to quantum supercomputers (never mind that practical quantum computing has yet to be implemented), can squash these numbers like the gnat you are. Let's put that to the test.

Just recently, the NSA built a super spy datacenter in Utah. Let's assume that the NSA has one of these spy datacenters in each of the 50 states in the USA. Further, let's assume that each datacenter holds not just 1 of these 25-GPU clusters, but 30 of them. That means 750 GPUs in one datacenter, and 37,500 GPUs total across the country. At 350 billion NTLM hashes per second per cluster, that's 1,500 clusters, or roughly 525 trillion NTLM hashes per second (assuming the communication between all 37,500 GPUs across all 50 states does not add any latency)! Surely we can exhaust the 15 character passwords with ease!

I'll do the math for you. It would take about 125 days to exhaust the 11 character passwords, about 33 years for 12 character passwords, 3.1 millennia for 13 character passwords, roughly 294 millennia for 14 character passwords, and nearly 28,000 millennia for 15 character passwords. Again, we've hit that exponential wall, where 11 characters is doable (although exceptionally expensive), 12 characters is challenging, and 13 characters is impractical. All this horsepower only bought us a couple of extra characters in the brute force search over the single 25-GPU rack. For an initial price tag that is in the billions, if not trillions, for this sort of setup (not to mention the power bill to keep it running, as well as the cooling bill), it just isn't practical for a nameless, faceless government agency. The math just isn't on their side. It doesn't pencil in financially, nor does it pencil in theoretically. So much for that conspiracy.

Conclusion
As clearly demonstrated, a truly random 12-character password would seem sufficient to thwart even the most wealthy and dedicated attackers. After all, they hit that exponential wall, so surely all bets are off! Not so. There are plenty of additional attacks the attacker can use, which are much more effective than brute force, and which make finding a 12-character password far more feasible. We'll be discovering each of those in this series: how effective they are, how they work, and the software utilities that take advantage of them.

Stay tuned!

Please Consider A Donation

I just received an anonymous $5 donation via PayPal for my series on ZFS. Thank you, whoever you are! I've received other donations in the past, and never once have I had a donation page or ads on the site, or asked for money. It's awesome to know that some people will financially support your effort to keep up a blog. Also, thanks to everyone who has bought me lunches, books and other things for the same reason.

The donation page can be found at http://pthree.org/donate/. As explained on the donation page, there will never be ads on this site. I'm not interested in writing for income. I write, because I love to. So, if you like what you've read, please consider a donation to the blog. I have a Bitcoin address, or you can use PayPal.

This will be the only post I'll make about the issue. You can follow the "Donate" link on the sidebar. Thanks a bunch!

Tiny Tiny RSS - The Google Reader Replacement

With all the weeping, wailing and gnashing of teeth about Google killing Reader, I figured I'd blog something productive. Rather than piss and moan, here is a valid solution you can build for at most two bucks, using entirely Free Software, running on your own server, under your control. The solution is to install Tiny Tiny RSS on your own server, and if you have an Android smartphone, the official Tiny Tiny RSS app ($2 for the unlock key (support the developer- this stuff rocks)). Here are the step-by-step installation directions that should get you an up-and-running Reader replacement in less than 30 minutes.

First, create a directory on your webserver where you will install Tiny Tiny RSS. As prerequisites, you will need Apache, lighttpd, Cherokee, or some other web server; PHP with the necessary modules, as well as the PHP CLI interpreter; and either MySQL or PostgreSQL:

# mkdir /var/www/rss
# wget https://github.com/gothfox/Tiny-Tiny-RSS/archive/1.7.4.tar.gz
# tar -xf 1.7.4.tar.gz -C /var/www/rss/ --strip-components=1
# chown -R root:www-data /var/www/rss/
# chmod -R g+w /var/www/rss/

Pull up the web interface by navigating to http://example.com/rss/ (replace "example.com" with your domain name). The default login credentials are "admin" and "password". Make sure to change the default password. Also, Tiny Tiny RSS uses a multiuser setup by default. You can add additional users, including one for yourself that isn't "admin", or you can change it to single user mode in the preferences.

After the setup is the way you want it, you'll want to get your Google Reader feeds into Tiny Tiny RSS. Navigate to Reader, and export your data. This will take you to Google Takeout, where you'll download a massive ZIP archive that contains an OPML file, as well as a ton of other data. Grab your "subscriptions.xml" from that ZIP file, and import it into your Tiny Tiny RSS installation.

One awesome benefit of Tiny Tiny RSS is that it has a built-in mobile version when browsing the install from a mobile browser. It looks good too.

The only thing left to do is navigate to the preferences and enable the external API. There are additional 3rd party desktop-based readers that have Tiny Tiny RSS support, such as Liferea and Newsbeuter. Even the official Android app will need this option enabled. This will give you full synchronization between the web interface, your Android app, and your desktop RSS reader.

Unfortunately, Tiny Tiny RSS doesn't update the feeds by default. You need to set up a script that manages this for you. The best solution is to write a proper init script that starts and stops the updating daemon. I didn't do this. Instead, I did the next best thing, and put the following into my /etc/rc.local configuration file:

#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

sudo -u www-data php /var/www/rss/update_daemon2.php > /dev/null&
exit 0

A couple of things to notice here. First, the redirection to /dev/null. Whichever terminal you execute that script from, the daemon will send a ton of data to STDOUT, and anything it prints there, including some error output, will be swallowed by that redirection. So, only after you are sure that everything is set up correctly should you be redirecting the output. Second is the fact that we are running the script as the "www-data" user (the default user for Apache on Debian/Ubuntu). The script should not run as root.

Now, execute the following, and you should be good to go:

# /etc/init.d/rc.local start
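
Alternatively, if you would rather not touch rc.local at all, an @reboot cron entry accomplishes the same thing. This is a sketch assuming Debian's cron and the same paths as above; the file name /etc/cron.d/ttrss is arbitrary:

@reboot www-data php /var/www/rss/update_daemon2.php > /dev/null 2>&1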

You now have a web-based RSS reader licensed under the GPL running under your domain that you control. If you have an Android smartphone, then install the official Tiny Tiny RSS app, along with its unlock key, and put in your username, password, and the URL to the installation. The Android app is also GPL licensed. Make sure you purchase the unlock key, or your app will only be good for 7 days (it's trialware).

Lastly, both Liferea and Newsbeuter support Tiny Tiny RSS. However, make sure you get the latest upstream versions of both, as Tiny Tiny RSS changed their API recently. For Newsbeuter, this means version 2.6 or later (I actually haven't tested Liferea). I'll show how to get Newsbeuter working with Tiny Tiny RSS. All you need to do is edit your ~/.config/newsbeuter/config file with the following contents:

auto-reload "yes"
reload-time 360
text-width 80
ttrss-flag-star "s"
ttrss-flag-publish "p"
ttrss-login "admin"
ttrss-mode "multi"
ttrss-password "password"
ttrss-url "http://example.com/rss/"
urls-source "ttrss"

Restart Newsbeuter, and you should be good to go.

You now have a full Google Reader replacement, with entirely Free Software. Synchronization between the web, your phone, and your desktop, including a mobile version of the page for mobile devices. And it only cost you two bux.

Changing RSS Source

Due to the recent news about Google shutting down a number of services, including Reader and the CalDAV API, I came to the realization that my RSS source for this blog should probably go back to the main feed. So, as of July 1, 2013, the same day Reader shuts down, the old feed at http://feeds2.feedburner.com/pthree will be deprecated in favor of http://pthree.org/feed.

I chose Feed Burner years ago for its data collection. Then Google purchased Feed Burner, and the data runs through their servers. Due to the killing off of Reader, it would not surprise me if they killed off Feed Burner as well. I don't look at the stats much, so they're not as important to me now, as they were when I set it up.

Many of you may be using the http://pthree.org/feed source already. If so, you're good. If you're using the http://feeds2.feedburner.com/pthree source, then you may want to make the switch before July 1, 2013. The Ubuntu Planet uses my Feed Burner source. I'll get it updated before the deadline.

Create Your Own Graphical Web Of Trust- Updated

A couple years ago, I wrote about how you can create a graphical representation of your OpenPGP Web of Trust. It's funny how I've been keeping mine up-to-date these past couple years as I attend keysigning parties, without really thinking about what it looks like. Well, I recently returned from the SCaLE 11x conference, which had a PGP keysigning party, so I've been keeping the graph up-to-date as new signatures come in. Then it hit me: am I graphing ONLY the signatures on my key, all the signatures in my public keyring, or something somewhere in between? It seemed to be closer to the whole keyring, so I decided to do something about it.

The following script assumes you have the signing-party, graphviz and imagemagick packages installed. It grabs only the signatures on your OpenPGP key, downloads any keys that have signed your key that you may not have downloaded, places them in their own public keyring, then uses that information to graph your Web of Trust. Here's the script:

#!/bin/bash
# Replace $KEY with your own KEYID
KEY="22EEE0488086060F"
echo "Getting initial list of signatures..."
gpg --with-colons --fast-list-mode --list-sigs $KEY | awk -F ':' '$1 ~ /sig|rev/ {print $5}' | sort -u > ${KEY}.ids
echo "Refreshing your keyring..."
gpg --recv-keys $(cat ${KEY}.ids) > /dev/null 2>&1
echo "Creating public keyring..."
gpg --export $(cat ${KEY}.ids) > ${KEY}.gpg
echo "Creating dot file..."
gpg --keyring ./${KEY}.gpg --no-default-keyring --list-sigs | sig2dot > ${KEY}.dot 2> ${KEY}.err
echo "Creating PostScript document..."
neato -Tps ${KEY}.dot > ${KEY}.ps
echo "Creating graphic..."
convert ${KEY}.ps ${KEY}.gif
echo "Finished."

It may take some time to download and refresh your keyring, and it may take some time generating the .dot file. Don't be surprised if it takes 5-10 minutes, or so. However, when it finishes, you should end up with something like what is below (it's obvious when you've attended keysigning parties by the clusters of strength in your web):

pubring-small
Click for a larger version

GlusterFS Linked List Topology

Lately, a few coworkers and I decided to put our workstations into a GlusterFS cluster. We wanted to test distributed replication. Our workstations are already running ZFS on Linux, so we created two datasets on each workstation, and made them the bricks for GlusterFS. We created a nested "brick" directory to prevent GlusterFS from sending data to the root ZFS mountpoint if the dataset is not mounted. Here is our setup on each of our workstations:

# zfs create -o sync=disabled pool/vol1
# zfs create -o sync=disabled pool/vol2
# mkdir /pool/vol1/brick /pool/vol2/brick

Notice that I've disabled synchronous writes. This is because GlusterFS is already synchronous by default. Since ZFS resides underneath the GlusterFS client mount, and GlusterFS is the one honoring synchronous semantics with the application, there is no need to increase write latencies by forcing synchronous writes on ZFS as well.

Now the question comes as to how to set the right topology for our storage cluster. I wish to maintain two copies of the data in a distributed manner. Meaning that the local peer has a copy of the data, and a remote peer also has a copy. Thus, distributed replication. But, how do you decide where the copies get distributed? I looked at two different topologies before making my decision, which I'll discuss here.

Paired Server Topology

Paired server GlusterFS topology

In this topology, servers are completely paired together. This means you always know where both copies of your data reside. You could think of it as a mirrored setup. The bricks on serverA will hold identical data to the bricks on serverB. This obviously simplifies administration and troubleshooting a great deal. And, it's easy to set up. Supposing we wish to create a volume named "testing", and assuming that we've peered with all the necessary nodes, we would proceed as follows:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol1/brick \
serverA:/pool/vol2/brick serverB:/pool/vol2/brick
# gluster volume info testing
 
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick

If we wish to add more storage to the volume, the commands are pretty straightforward:

# gluster volume add-brick testing \
serverC:/pool/vol1/brick serverD:/pool/vol1/brick \
serverC:/pool/vol2/brick serverD:/pool/vol2/brick
# gluster volume info testing
 
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol1/brick
Brick7: serverC:/pool/vol2/brick
Brick8: serverD:/pool/vol2/brick

The drawback to this setup, as should be obvious, is that when servers are added, they must be added in pairs. You cannot have an odd number of servers in this topology. However, as shown in both the image and the commands, this is fairly straightforward from an administration perspective, and from a storage perspective.

Linked List Topology

glusterfs-linked-list

In computer science, a "linked list" is a sequence of nodes, where the tail of one node points to the head of the next. In the case of our topology, the "head" is the first brick on a server, and the "tail" is the second brick. As a result, this creates a circular storage setup, as shown in the image above.

To set something like this up with say 3 peers, you would do the following:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol2/brick \
serverB:/pool/vol1/brick serverC:/pool/vol2/brick \
serverC:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverA:/pool/vol2/brick

Now, if you want to add a new server to the cluster, you can. You can add servers individually, unlike the paired server topology above. But the trick is replacing bricks as well as adding bricks, and it's not 100% intuitive how to proceed. Thus, if I wanted to add "serverD" with its two bricks to the setup, I would first need to recognize that "serverA:/pool/vol2/brick" is going to be replaced with "serverD:/pool/vol2/brick". Then, I will have two bricks available to add to the volume, namely "serverD:/pool/vol1/brick" and "serverA:/pool/vol2/brick". Armed with that information, and assuming that "serverD" has already peered with the others, let's proceed:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick start

I can run "gluster volume replace-brick testing status" to keep an eye on the brick replacement. When ready, I need to commit it:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick commit

Now we have two bricks to add to the cluster. However, the "serverA:/pool/vol2/brick" brick was previously part of the volume, so it contains metadata that is no longer relevant when adding it back as a new brick. We must clear that metadata off of the brick, so it starts from a clean slate; then we can add it without problems. Here are the steps we need to do next:

(serverA)# setfattr -x trusted.glusterfs.volume-id /pool/vol2/brick
(serverA)# setfattr -x trusted.gfid /pool/vol2/brick
(serverA)# rm -rf /pool/vol2/brick/.glusterfs/
(serverA)# service glusterfs-server restart

We are now ready to add the bricks cleanly:

# gluster volume add-brick testing \
serverD:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol2/brick
Brick7: serverD:/pool/vol1/brick
Brick8: serverA:/pool/vol2/brick

It should be obvious that this is a more complicated setup. It's more abstract from a topological perspective, and it is more difficult to implement and get right. There is certainly a strong argument for simplifying storage architectures and administration. However, the linked list topology has the advantage of adding and removing one server at a time, unlike the paired server setup. If this is something you need, or you have an odd number of servers in your cluster, the linked list topology will work well.

For our workstation cluster at the office, we went with the linked list topology, because it mimics our production setup needs more closely. There may also be other topologies that we haven't explored. We also added "geo-replication" by replicating our volume to a larger storage node in the server room. This ensures we still have a copy of our data, should two servers go down in our cluster.
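
For the curious, that geo-replication is a single command once passwordless SSH to the remote node is in place. The hostname and path below are placeholders, and the exact slave syntax has changed between GlusterFS releases, so treat this as a sketch rather than gospel:

# gluster volume geo-replication testing backup.example.com:/backup/testing start
# gluster volume geo-replication testing backup.example.com:/backup/testing status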