
Strengthen Your Private Encrypted SSH Keys

Recently, on Hacker News, a post came through about improving the security of your encrypted private OpenSSH keys. I want to re-blog that post here (I'm actually jealous he blogged it first), in my own words, and provide a script at the end that will automate the process for you.

First off, Martin goes into great detail about the storage format of your unencrypted private OpenSSH keys. The unencrypted key is stored in a format known as Abstract Syntax Notation One (ASN.1) (for you web nerds, it's similar in function to JSON). However, when you encrypt the key with your passphrase, it is no longer valid ASN.1. So, Martin then takes you through the process of how the key is encrypted. The big take-away from that introduction is the following, that by default:

  • Encrypted OpenSSH keys use MD5- a horribly broken cryptographic hash.
  • OpenSSH keys are encrypted with AES-128-CBC, which is fast, fast, fast.

It would be nice if our OpenSSH keys used a stronger cryptographic hash like SHA1, SHA2 or SHA3 in the encryption process, rather than MD5. Further, it would be nice if we could cause attackers who get our private encrypted OpenSSH keys to expend more computing resources when trying to brute force our passphrase. So, rather than using the speedy AES algorithm, how about 3DES or Blowfish?

This is where PKCS#8 comes into play. "PKCS" stands for "Public-key cryptography standards". There are currently 15 standards, with 2 withdrawn and 2 under development. Standard #8 defines how private keys are to be stored, both in unencrypted and encrypted form. Because OpenSSH uses public key cryptography, and private keys are stored, it would be nice if it adhered to the standard. Turns out, it does. From the ssh-keygen(1) man page:

     -m key_format
             Specify a key format for the -i (import) or -e (export) conver‐
             sion options.  The supported key formats are: “RFC4716” (RFC
             4716/SSH2 public or private key), “PKCS8” (PEM PKCS8 public key)
             or “PEM” (PEM public key).  The default conversion format is

As mentioned, the supported key formats are RFC4716, PKCS8 and PEM. Seeing as though PKCS#8 is supported, it seems like we can take advantage of it in OpenSSH. So, the question then comes, what does PKCS#8 offer me in terms of security that I don't already have? Well, Martin answers this question in his post as well. Turns out, there are 2 versions of PKCS#8 that we need to address:

  • The version 1 option specifies a PKCS#5 v1.5 or PKCS#12 algorithm to use. These algorithms only offer 56-bits of protection, since they both use DES.
  • The version 2 option specifies that PKCS#5 v2.0 algorithms are used which can use any encryption algorithm such as 168 bit triple DES or 128 bit RC2.

As I mentioned earlier, we want SHA1 (or better) and 3DES (or slower). Turns out, the OpenSSL implementation of PKCS#8 version 2 uses the following algorithms:

  • PBE-SHA1-RC4-128
  • PBE-SHA1-RC4-40
  • PBE-SHA1-3DES
  • PBE-SHA1-2DES
  • PBE-SHA1-RC2-128
  • PBE-SHA1-RC2-40

PBE-SHA1-3DES is our target. So, the only question remaining, is can we convert our private OpenSSH keys to this format? If so, how? Well, because OpenSSH relies heavily on OpenSSL, we can use the openssl(1) utility to make the conversion to the new format, and due to the ssh-keygen(1) manpage quoted above, we know OpenSSH supports the PKCS#8 format for our private keys, so we should be good.

Before we go further though, why 3DES? Why not stick with the default AES? DES is slow, slow, slow. 3DES is DES chained together 3 times. Compared to AES, it's a snail racing a hare. With 3DES, the data is encrypted with a first 56-bit DES key, then encrypted with a second 56-bit DES key, then finally encrypted with a third 56-bit DES key. The result is a 168-bit key, although meet-in-the-middle attacks reduce the effective strength to about 112 bits. There are no known practical attacks against 3DES, and NIST considers it secure through 2030. It's certainly appropriate to use as an encrypted storage for our private OpenSSH keys.

To convert our private key, all we need to do is rename it, run openssl(1) on the keys, then test. Here are the steps:

$ mv ~/.ssh/id_rsa{,.old}
$ umask 0077
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa     # dsa and ecdsa are also supported

Now log in to a remote OpenSSH server where the public portion of that key is installed, and see if it works. If so, remove the old key. To simplify the process, I created a script where you provide your private OpenSSH key as an argument, and it does the conversion for you. You can find that script at

What's the point? Basically, you should think of it the following way:

  • We're using SHA1 rather than MD5 as part of the encryption process.
  • By using 3DES rather than AES, we've slowed down brute force attacks to a crawl. This should buy us 2-3 extra characters of entropy in our passphrase.
  • Using PKCS#8 gives us the flexibility to use other algorithms in the future, as old ones are replaced.
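
To put the "2-3 extra characters" estimate in perspective, here is a rough sketch of the arithmetic. It assumes (my assumption, not a figure from Martin's post) a passphrase drawn uniformly at random from the 94 printable ASCII characters, and it only converts between a per-guess slowdown factor and the equivalent number of passphrase characters. The function names are made up for illustration.

```python
import math

def equivalent_extra_chars(slowdown, alphabet_size=94):
    """Extra random characters that cost an attacker as much as `slowdown`."""
    return math.log(slowdown) / math.log(alphabet_size)

def slowdown_for_chars(chars, alphabet_size=94):
    """Per-guess slowdown factor that is worth `chars` extra random characters."""
    return alphabet_size ** chars

for chars in (2, 3):
    print(f"{chars} extra characters ~ a {slowdown_for_chars(chars):,}x slowdown per guess")
```

A smaller alphabet, such as lowercase letters only, needs a much smaller slowdown to be worth the same number of characters.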

I agree with Martin that it's a shame OpenSSH isn't using this by default. Why stick with the original OpenSSH storage format? Compatibility isn't a concern, as the support relies solely on the client, not the server. Because every client should have a different keypair installed, there is no worry about new versus old client. Extra security is purchased through the use of SHA1 and 3DES. Computing time to create the keys was trivial, and the performance difference when using them is not noticeable compared to the traditional format. Of course, if your passphrase protecting your keys is strong, with lots and lots of entropy, then an attacker will be foiled with a brute force attack anyway. Regardless, why not make it more difficult for him by slowing him down?

Martin's post is a great read, and as such, I've converted my OpenSSH keys to the new format. I'd encourage you to do the same.

ZFS Administration, Appendix B- Using USB Drives

Table of Contents

Zpool Administration
0. Install ZFS on Debian GNU/Linux
1. VDEVs
2. RAIDZ
3. The ZFS Intent Log (ZIL)
4. The Adjustable Replacement Cache (ARC)
5. Exporting and Importing Storage Pools
6. Scrub and Resilver
7. Getting and Setting Properties
8. Best Practices and Caveats

ZFS Administration
9. Copy-on-write
10. Creating Filesystems
11. Compression and Deduplication
12. Snapshots and Clones
13. Sending and Receiving Filesystems
14. ZVOLs
15. iSCSI, NFS and Samba
16. Getting and Setting Properties
17. Best Practices and Caveats

Appendices
A. Visualizing The ZFS Intent Log (ZIL)
B. Using USB Drives
C. Why You Should Use ECC RAM
D. The True Cost Of Deduplication


This comes from the "why didn't I think of this before?!" department. I have lying around my home and office a ton of USB 2.0 thumb drives. I have six 16GB drives and eight 8GB drives. So, 14 drives in total. I have two hypervisors in a GlusterFS storage cluster, and I just happen to have two USB squids, that support 7 USB drives each. Perfect! So, why not put these to good use, and add them as L2ARC devices to my pool?


USB 2.0 is limited to roughly 40 MBps per controller in practice (the bus itself tops out at 480 Mbps). A standard 7200 RPM hard drive can do 100 MBps. So, adding USB 2.0 drives to your pool as a cache is not going to increase the read bandwidth. At least not for large sequential reads. However, the seek latency of a NAND flash device is typically around 1 to 3 milliseconds, whereas a platter HDD is around 12 milliseconds. If you do a lot of small random IO, like I do, then your USB drives will actually provide an overall performance increase that HDDs cannot provide.
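
As a back-of-the-envelope sketch of that trade-off, the snippet below plugs the numbers from the paragraph above into two simple formulas. It is a deliberately crude model: it assumes requests are serviced one at a time, that latency dominates small random reads, and that bandwidth dominates large sequential ones. The request count and file size are arbitrary values chosen for illustration.

```python
def random_read_seconds(requests, latency_ms):
    """Time to service `requests` back-to-back random reads, ignoring transfer time."""
    return requests * latency_ms / 1000.0

def sequential_read_seconds(megabytes, mb_per_second):
    """Time to stream `megabytes` at a sustained rate, ignoring seek time."""
    return megabytes / mb_per_second

REQUESTS = 10_000   # arbitrary number of small random reads
FILE_MB = 1_000     # arbitrary large sequential read

print("random reads, HDD (12 ms):  ", random_read_seconds(REQUESTS, 12), "s")
print("random reads, USB (3 ms):   ", random_read_seconds(REQUESTS, 3), "s")
print("sequential, HDD (100 MBps): ", sequential_read_seconds(FILE_MB, 100), "s")
print("sequential, USB (40 MBps):  ", sequential_read_seconds(FILE_MB, 40), "s")
```

Even at the slow end of the flash latency range the USB drive wins on random reads, while the hard drive wins on the sequential read.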

Also, NAND flash has no moving parts, and every block served from the cache is a block that does not need to be read from the HDD. That means less movement of the actuator arm, which means consuming less power in the long term. So, not only are they better for small random IO, they're saving you power at the same time! Yay for going green!

Lastly, the L2ARC should be read intensive. However, it can also be write intensive if you don't have enough room in your ARC and L2ARC to store all the requested data. If this is the case, you'll be constantly writing to your L2ARC. For USB drives without wear leveling algorithms, you'll chew through the drive quickly, and it will be dead in no time. If this is your case, you could store only metadata, rather than the actual data block pages in the L2ARC. You can do this with the following:

# zfs set secondarycache=metadata pool

You can set this pool-wide, or per dataset. In the case outlined above, I would certainly do it pool-wide, which each dataset will inherit by default.


To set this up, it's rather straightforward. Just identify what the drives are, by using their unique identifiers, then add them to the pool:

# ls /dev/disk/by-id/usb-* | grep -v part

So, there are my seven drives that I outlined at the beginning of the post. So, to add them to the system as L2ARC drives, just run the following command:

# zpool add -f pool cache usb-Kingston_DataTraveler_G3_0014780D8CEBEBC145E80163-0:0\

Of course, these are the unique identifiers for my USB drives. Change them as necessary for your drives. Now that they are installed, are they filling up?

# zpool iostat -v
pool                                                          alloc   free   read  write   read  write
------------------------------------------------------------  -----  -----  -----  -----  -----  -----
pool                                                           695G  1.13T     21     59  53.6K   457K
  mirror                                                       349G   579G     10     28  25.2K   220K
    ata-ST1000DM003-9YN162_S1D1TM4J                               -      -      4     21  25.8K   267K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50708780                       -      -      4     21  27.9K   267K
  mirror                                                       347G   581G     11     30  28.3K   237K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50713154                       -      -      4     22  16.7K   238K
    ata-WDC_WD10EARS-00Y5B1_WD-WMAV50710024                       -      -      4     22  19.4K   238K
logs                                                              -      -      -      -      -      -
  mirror                                                         4K  1016M      0      0      0      0
    ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part1                  -      -      0      0      0      0
    ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part1                  -      -      0      0      0      0
cache                                                             -      -      -      -      -      -
  ata-OCZ-REVODRIVE_OCZ-33W9WE11E9X73Y41-part2                52.2G    16M      4      2  51.3K   291K
  ata-OCZ-REVODRIVE_OCZ-X5RG0EIY7MN7676K-part2                52.2G    16M      4      2  52.6K   293K
  usb-Kingston_DataTraveler_G3_0014780D8CEBEBC145E80163-0:0    465M  6.80G      0      0    319  72.8K
  usb-Kingston_DataTraveler_SE9_00187D0F567FEC2090007621-0:0  1.02G  13.5G      0      0  1.58K  63.0K
  usb-Kingston_DataTraveler_SE9_00248121ABD5EC2070002E70-0:0  1.17G  13.4G      0      0    844  72.3K
  usb-Kingston_DataTraveler_SE9_00D0C9CE66A2EC2070002F04-0:0   990M  13.6G      0      0  1.02K  59.9K
  usb-_USB_DISK_Pro_070B2605FA99D033-0:0                      1.08G  6.36G      0      0  1.18K  67.0K
  usb-_USB_DISK_Pro_070B2607A029C562-0:0                      1.76G  5.68G      0      1  2.48K   109K
  usb-_USB_DISK_Pro_070B2608976BFD58-0:0                      1.20G  6.24G      0      0    530  38.8K
------------------------------------------------------------  -----  -----  -----  -----  -----  -----
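
If you would rather see a fill percentage than eyeball the alloc and free columns, a small parser will do. This is only a sketch: it assumes the column layout shown above (device name, alloc, free, then the I/O columns), treats the size suffixes as powers of 1024, and only handles lines that carry numeric alloc and free values (not the "-" placeholders). The helper names are my own.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def parse_size(text):
    """Convert a zpool-style size such as '465M' or '6.80G' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

def cache_fill(line):
    """Return (device, percent_full) for one device line of `zpool iostat -v`."""
    fields = line.split()
    alloc = parse_size(fields[1])
    free = parse_size(fields[2])
    return fields[0], 100.0 * alloc / (alloc + free)

# One cache device line copied from the output above.
sample = "  usb-_USB_DISK_Pro_070B2607A029C562-0:0    1.76G  5.68G   0   1  2.48K   109K"
device, percent = cache_fill(sample)
print(f"{device}: {percent:.1f}% full")
```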

Something important to understand here is that the drives do not all need to be the same size. You can mix and match as you have on hand. Of course, the more space you can give to the cache, the better off you'll be.


While this certainly isn't designed for speed, it can be used for lower random IO latencies, and it will reduce power in the datacenter. Further, what else are you going to do with those USB devices just lying around? Might as well put them to good use. Especially seeing as "the cloud" is making it trivial to get all of your files online.

Password Attacks, Part III- The Combination Attack


It's important to understand that most of the password attacks to offline databases where only hashes are stored are extensions of either the brute force attack or the dictionary attack, or a hybrid combination of both. There isn't really anything new outside of those two basic attacks. The combination attack is one such attack where we combine two dictionaries to create a much larger one. This larger one becomes the basis for creating passphrase dictionaries.

Combination Attack

Suppose we have two (very small) dictionaries. The first dictionary is a list of sizes while the second dictionary is a list of animals:

Dictionary 1:


Dictionary 2:


In order to combine these two dictionaries, we use a standard cross product between the two dictionaries. This means that there will be a total list of 25 words in our combined dictionary:


We have begun to assemble some crude phrases. They may not make a lot of sense, but building that dictionary was cheap. I could create a script, similar to the following, to have it build me that list:

# Join every word in dict1 with every word in dict2 (the cross product).
while IFS= read -r A; do
    while IFS= read -r B; do
        printf '%s%s\n' "$A" "$B"
    done < /tmp/dict2.txt
done < /tmp/dict1.txt > /tmp/comb-dict.txt
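
The same cross product can be sketched in Python with itertools.product, which generates the pairs lazily so the combined list never has to sit in memory. The two word lists below are hypothetical stand-ins, five sizes and five animals, so the result has the 25 entries mentioned above.

```python
from itertools import product

def combine(dict1, dict2):
    """Yield every word of dict1 concatenated with every word of dict2."""
    for a, b in product(dict1, dict2):
        yield f"{a}{b}"

# Hypothetical sample dictionaries, used only for illustration.
sizes = ["tiny", "small", "medium", "large", "huge"]
animals = ["cat", "dog", "horse", "mouse", "whale"]

combined = list(combine(sizes, animals))
print(len(combined), "entries, starting with", combined[:3])
```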

Personal Attack Building

Let's extend the combination attack a bit to target personal data. Suppose dictionary one is a list of male names and dictionary two is a list of four-digit years, starting from year 0 through year 2013. We can then create a combined dictionary that has a list of males and their birth (or death) years. For those passwords that don't pass simple dictionary attacks, maybe these hashes are passwords of when their kid was born, or when their dad died. I've shoulder surfed a number of different people, and watched as they type in their password, and you would be surprised how many passwords meet this exact criteria. Something like "Christian1995".

Now that I have you thinking, it should be obvious now that we can create all sorts of personalized dictionaries. How about a list of male names, a list of female names, a list of surnames, and a list of dates in MMDDYY format? Or how about a list of cities, states, universities, sports teams? Lists of "last 4 digits of your SSN", fully qualified domain names of popular websites, and common patterns on keyboards, such as "qwert", "asdf" and "zxcv", or "741852963" (I've seen this hundreds of times, where people just swipe their finger down their 10-key).

Even though the success rate of getting a personalized password out of one of these dictionaries might be low compared to the common dictionary attack, it's still far more efficient than the brute force, and it's so trivial to make these dictionaries, that I could have one server continuing to create combination dictionaries, while another works on the dictionary lists that are produced.

Extending the Attack

There are always sites that are forcing numbers on you, as well as upper case characters, and non-alphanumeric characters. Believe it or not, but just adding a "0" or a "1" at the beginning and end of these dictionaries can greatly improve my chances of discovering your password. Or, even just adding an exclamation point at the beginning or end might be all that's needed to extend a standard dictionary.

However, we can get a bit more creative with little cost. Rather than just concatenating the words together, we can insert the dash "-" and the underscore "_" between our concatenated words. These are common characters to use when forced to use non-alphanumeric characters in passwords that must be 8 characters or longer. Passwords such as "i-love-you", or "Alice_Bob" are common ways to satisfy the requirement, as well as keeping the password easy to remember. And, of course, we can append "!" or "0" to our password dictionary for the numerical requirement.
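
Here is a minimal sketch of those mutations, assuming just the separators and affixes mentioned in this section. The constant and function names are my own, and a real cracking tool would use a far richer rule set than this.

```python
from itertools import product

SEPARATORS = ["", "-", "_"]   # plain concatenation, dash, underscore
AFFIXES = ["0", "1", "!"]     # characters people add to satisfy password rules

def mutate(words_a, words_b):
    """Yield combined words, each bare and with every affix prepended or appended."""
    for a, b, sep in product(words_a, words_b, SEPARATORS):
        base = f"{a}{sep}{b}"
        yield base
        for affix in AFFIXES:
            yield affix + base
            yield base + affix

candidates = list(mutate(["Alice"], ["Bob"]))
print(len(candidates), "candidates from a single pair of words")
```

Each pair of words turns into 21 candidates here (3 separators times 7 variants), so the mutated dictionary is 21 times the size of the plain combined one.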

According to Wikipedia, the Oxford English Dictionary, which is the go-to for all things English definitions, is the following size:

As of 30 November 2005, the Oxford English Dictionary contained approximately 301,100 main entries.

This is smaller than what most Unix-like operating systems will ship, but it's targeted enough that most people who rely on dictionary words for their passwords won't deviate much from it. So, creating a full two-word passphrase dictionary would consist of only 90,661,210,000 entries (301,100 squared). Assuming the average word length is 9 characters, each entry is 18 characters plus the newline character, so this would be about a 1.7 TB file. Certainly nothing to laugh at, but with compression, we should be able to get this down to a fraction of that. Now let's generate two or three of those files with various combinations of word separation or prepending/appending numbers or non-alphanumeric characters, and we have a great attack vector starting point.
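
Numbers like these are easy to sanity-check. The sketch below assumes entries are plain concatenations of words averaging 9 characters, one entry per line, and uses the 301,100 figure quoted above. Note that 301,100 squared is 90,661,210,000, so that count corresponds to two-word combinations. The three-word count is 301,100 cubed, which is several orders of magnitude larger.

```python
OED_ENTRIES = 301_100   # main entries, per the Wikipedia figure quoted above
AVG_WORD_LEN = 9        # average word length assumed in this post

def dictionary_stats(words_per_entry):
    """Return (entry_count, uncompressed_bytes) for an n-word combined dictionary."""
    entries = OED_ENTRIES ** words_per_entry
    bytes_per_entry = AVG_WORD_LEN * words_per_entry + 1  # +1 for the newline
    return entries, entries * bytes_per_entry

for n in (2, 3):
    entries, size = dictionary_stats(n)
    print(f"{n} words: {entries:,} entries, about {size / 1e12:,.1f} TB uncompressed")
```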

It's important to note that these files only need to be generated once, then stored for long-term use. The cost of generating these files initially will be high, of course. For only $11,000 USD, you can purchase a Backblaze storage pod with 180 TB of raw disk to store these behemoth files, and many combinations of them. It may be a large initial expense, but the potential value of a cracked password may be worth it in the end- something many attackers and their parent companies might just consider. Further, you may even find some of these dictionaries online.


We must admit, that if we have gotten to this point in the game, we are getting desperate for the password. This attack is certainly more effective than the brute force, and it won't take us long to exhaust these combined dictionaries. However, the rate at which we pull passwords out of them will certainly be much smaller than our success for finding 70% of the passwords with a plain dictionary. That doesn't mean it's not worth it. Even if we find only 10% of the passwords out of our hashed database, that's great success for the effort put into it.

ZFS Administration, Appendix A- Visualizing The ZFS Intent Log (ZIL)



While taking a walk around the city with the rest of the system administration team at work today (we have our daily "admin walk"), a discussion came up about asynchronous writes and the contents of the ZFS Intent Log. Previously, as shown in the Table of Contents, I blogged about the ZIL in great length. However, I didn't really discuss what the contents of the ZIL were, and to be honest, I didn't fully understand it myself. Thanks to Andrew Kuhnhausen, this was clarified. So, based on the discussion we had during our walk, as well as some pretty graphs on the whiteboard, I'll give you the breakdown here.

Let's start at the beginning. ZFS behaves more like an ACID compliant RDBMS than a traditional filesystem. Its writes are transactions, meaning there are no partial writes, and they are fully atomic, meaning you get all or nothing. This is true whether the write is synchronous or asynchronous. So, best case is you have all of your data. Worst case is you missed the last transactional write, and your data is 5 seconds old (by default). So, let's look at those two cases- the synchronous write and the asynchronous write. With synchronous, we'll consider the write both with and without a separate logging device (SLOG).

The ZIL Function

The primary, and only, function of the ZIL is to replay lost transactions in the event of a failure. When a power outage, crash, or other catastrophic failure occurs, pending transactions in RAM may not have been committed to slow platter disk. So, when the system recovers, ZFS will notice the missing transactions. At this point, the ZIL is read to replay those transactions, and commit the data to stable storage. While the system is up and running, the ZIL is never read. It is only written to. You can verify this by doing the following (assuming you have a SLOG in your system). Pull up two terminals. In one terminal, run an IOZone benchmark. Do something like the following:

$ iozone -ao

This will run a whole series of tests to see how your disks perform. While this benchmark is running, in the other terminal, as root, run the following command:

# zpool iostat -v 1

This will clearly show you that when the ZIL resides on a SLOG, the SLOG devices are only written to. You never see any numbers in the read columns. This is because the ZIL is never read, unless transactions need to be replayed after a crash. Here is one of those seconds illustrating the write:

                                                            capacity     operations    bandwidth
pool                                                     alloc   free   read  write   read  write
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
pool                                                     87.7G   126G      0    155      0   601K
  mirror                                                 87.7G   126G      0    138      0   397K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9421741-part5          -      -      0     69      0   727K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9755779-part5          -      -      0     68      0   727K
logs                                                         -      -      -      -      -      -
  mirror                                                 2.43M   478M      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part1      -      -      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part1      -      -      0      8      0   108K
  mirror                                                 2.57M   477M      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part1      -      -      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part1      -      -      0      7      0  95.9K
cache                                                        -      -      -      -      -      -
  scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part5    26.6G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part5    26.5G  56.8G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part5    26.7G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part5    26.7G  56.7G      0      0      0      0
-------------------------------------------------------  -----  -----  -----  -----  -----  -----

The ZIL should always be on non-volatile stable storage! You want your data to remain consistent across power outages. Putting your ZIL on a SLOG that is built from TMPFS, RAMFS, or RAM drives that are not battery backed means you will lose any pending transactions. This doesn't mean you'll have corrupted data. It only means you'll have old data. With the ZIL on volatile storage, you'll never be able to get the new data that was pending a write to stable storage. Depending on how busy your servers are, this could be a Big Deal. SSDs, such as from Intel or OCZ, are good cheap ways to have a fast, low latency SLOG that is reliable when power is cut.

Synchronous Writes without a SLOG

When you do not have a SLOG, the application only interfaces with RAM and slow platter disk. As previously discussed, the ZFS Intent Log (ZIL) can be thought of as a file that resides on the slow platter disk. When the application needs to make a synchronous write, the contents of that write are sent to RAM, where the application is currently living, as well as sent to the ZIL. So, the data blocks of your synchronous write at this exact moment in time have two homes- RAM and the ZIL. Once the data has been written to the ZIL, the platter disk sends an acknowledgement back to the application letting it know that it has the data, at which point the data is flushed from RAM to slow platter disk.

This isn't ALWAYS the case, however. In the case of slow platter disk, ZFS can actually store the transaction group (TXG) on platter immediately, with pointers in the ZIL to the locations on platter. When the disk ACKs back that the ZIL contains the pointers to the data, then the write TXG is closed in RAM, and the space in the ZIL opened up for future transactions. So, in essence, you could think of the TXG SYNCHRONOUS write commit happening in three ways:

  1. All data blocks are synchronously written to both the RAM ARC and the ZIL.
  2. All data blocks are synchronously written to both the RAM ARC and the VDEV, with pointers to the blocks written in the ZIL.
  3. All data blocks are synchronously written to disk, where the ZIL is completely ignored.

In the image below, I tried to capture a simplified view of the first process. The pink arrows, labeled as number one, show the application committing its data to both RAM and the ZIL. Technically, the application is running in RAM already, but I took it out to make the image a bit more clean. After the blocks have been committed to RAM, the platter ACKs the write to the ZIL, noted by the green arrow labeled as number two. Finally, ZFS flushes the data blocks out of RAM to disk as noted by the gray arrow labeled as number three.

[Image: a synchronous write with ZFS and the ZIL on platter, without a SLOG]

Synchronous Writes with a SLOG

The advantage of a SLOG, as previously outlined, is the ability to use low latency, fast disk to send the ACK back to the application. Notice that the ZIL now resides on the SLOG, and no longer resides on platter. The SLOG will catch all synchronous writes (well those called with O_SYNC and fsync(2) at least). Just as with platter disk, the ZIL will contain the data blocks the application is trying to commit to stable storage. However, the SLOG, being a fast SSD or NVRAM drive, ACKs the write to the ZIL, at which point ZFS flushes the data out of RAM to slow platter.

Notice that ZFS is not flushing the data out of the ZIL to platter. This is what confused me at first. The data is flushed from RAM to platter. Just like an ACID compliant RDBMS, the ZIL is only there to replay the transaction, should a failure occur, and the data is lost. Otherwise, the data is never read from the ZIL. So really, the write operation doesn't change at all. Only the location of the ZIL changes. Otherwise, the operation is exactly the same.
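
To make that flow concrete, here is a toy model in Python. It is only a sketch of the behavior as described in this post, not of real ZFS internals: the class, its attributes, and the dictionary standing in for disk are all made up for illustration. The point it shows is that the log is written on every synchronous write, but only read during recovery.

```python
class ToyPool:
    """Toy model of the synchronous write path described above (not real ZFS)."""

    def __init__(self):
        self.disk = {}       # pool data on slow platter
        self.ram = {}        # pending transaction group, lost on power failure
        self.zil = []        # intent log on stable storage (platter or SLOG)
        self.zil_reads = 0   # how many times the log has been read

    def sync_write(self, key, value):
        self.ram[key] = value
        self.zil.append((key, value))
        return "ACK"         # acknowledged once the log record is stable

    def flush_txg(self):
        """Commit pending data from RAM (not from the ZIL) to disk."""
        self.disk.update(self.ram)
        self.ram.clear()
        self.zil.clear()     # the log records are no longer needed

    def crash_and_recover(self):
        self.ram.clear()     # volatile contents are gone
        self.zil_reads += 1  # the only time the log is ever read
        for key, value in self.zil:
            self.disk[key] = value
        self.zil.clear()

pool = ToyPool()
pool.sync_write("a", 1)
pool.flush_txg()             # normal operation: the log is never read
pool.sync_write("b", 2)
pool.crash_and_recover()     # failure: the log is replayed
print(pool.disk, "log reads:", pool.zil_reads)
```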

As shown in the image, again the pink arrows labeled number one show the application committing its data to both the RAM and the ZIL on the SLOG. The SLOG ACKs the write, as identified by the green arrow labeled number two, then ZFS flushes the data out of RAM to platter as identified by the gray arrow labeled number three.

[Image: a synchronous write with ZFS and the ZIL on a SLOG]

Asynchronous Writes

Asynchronous writes have a history of being "unstable". You have been taught that you should avoid asynchronous writes, and if you decide to go down that path, you should prepare for corrupted data in the event of a failure. For most filesystems, there is good counsel there. However, with ZFS, it's nothing to be afraid of. Because of the architectural design of ZFS, all data is committed to disk in transaction groups. Further, the transactions are atomic, meaning you get it all, or you get none. You never get partial writes. This is true with asynchronous writes. So, your data is ALWAYS consistent on disk- even with asynchronous writes.

So, if that's the case, then what exactly is going on? Well, there actually resides a ZIL in RAM when you set "sync=disabled" on your dataset. As is standard with the previous synchronous architectures, the data blocks of the application are sent to a ZIL located in RAM. As soon as the data is in the ZIL, RAM acknowledges the write, and then flushes the data to disk, as would be standard with synchronous data.

I know what you're thinking: "Now wait a minute! There are no acknowledgements with asynchronous writes!" Not always true. With ZFS, there is most certainly an acknowledgement, it's just one coming from very, very fast and extremely low latency volatile storage. The ACK is near instantaneous. Should there be a crash or some other failure that causes RAM to lose power, and the write was not saved to non-volatile storage, then the write is lost. However, all this means is you lost new data, and you're stuck with old but consistent data. Remember, with ZFS, data is committed in atomic transactions.
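
The "old but consistent" outcome can be sketched with another toy model. As before, this is a simplification of the behavior described here rather than real ZFS code, and every name in it is hypothetical. Writes are acknowledged from RAM, committed to disk as a whole group, and simply vanish if power is lost before the commit.

```python
class ToyAsyncPool:
    """Toy model of asynchronous writes as described above (not real ZFS)."""

    def __init__(self):
        self.disk = {}      # committed, always-consistent state
        self.ram_txg = {}   # open transaction group held in volatile RAM

    def async_write(self, key, value):
        self.ram_txg[key] = value
        return "ACK"        # acknowledged immediately, from RAM

    def commit_txg(self):
        """Apply the whole transaction group to disk at once."""
        self.disk.update(self.ram_txg)
        self.ram_txg.clear()

    def power_loss(self):
        self.ram_txg.clear()  # uncommitted writes are gone; disk is untouched

pool = ToyAsyncPool()
pool.async_write("balance", 100)
pool.commit_txg()
pool.async_write("balance", 50)   # acknowledged, but not yet on disk
pool.power_loss()
print(pool.disk)                   # the older, consistent value survives
```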

The image below illustrates an asynchronous write. Again, the pink number one arrow shows the application data blocks being initially written to the ZIL in RAM. RAM ACKs back with the green number two arrow. ZFS then flushes the data to disk, as per every previous implementation, as noted by the gray number 3 arrow. Notice in this image, even if you have a SLOG, with asynchronous writes, it's bypassed, and never used.

[Image: an asynchronous write with ZFS and the ZIL]


This is how I and my coworkers understand the ZIL. This is after reading loads of documentation, understanding a bit of computer science theory, and understanding how an ACID compliant RDBMS works, which is architected in a similar manner. If you think this is not correct, please let me know in the comments, and we can have a discussion about the architecture.

There are certainly some details I am glossing over, such as how much data the ZIL will hold before it's no longer utilized, timing of the transaction group writes, and other things. However, it should also be noted that aside from some obscure documentation, there don't seem to be any solid examples of exactly how the ZIL functions. So, I thought it would be best to illustrate that here, so others aren't left confused like I was. For me, images always make things clearer to understand.

Password Attacks, Part II - The Dictionary Attack

Before we start delving into the obscure attacks, it probably makes the most sense to get introduced to the most common attacks. The dictionary attack is one such attack. Previously we talked about the brute force attack, which is highly ineffective, and exceptionally slow and expensive to maintain. Here, we'll introduce a much more effective attack that will open up the ability to crack 15 character passwords, and longer, with ease.

Dictionary Attack
The dictionary attack is another dumb search, except for one thing: an assumption is made that people choose passwords that are based on dictionary words, because adding mutations to the password requires more work, is more difficult to remember, and more difficult to type. Because humans are largely lazy by default, we take the lazy approach to password creation- base it on a dictionary word, and be done with it. After all, no one is really going to hack my account. Right?

A couple years ago, during the height of the Sony PlayStation 3 hacking saga, 77 million PlayStation Network accounts were leaked. These were accounts from all over the globe. Worse, SONY STORED 1 MILLION OF THOSE PASSWORDS IN PLAINTEXT! I'll let that sink in for a minute. These 1 million passwords were leaked to Bittorrent. So, we can do some analysis on the passwords themselves, such as length and difficulty. Troy Hunt did some amazing work on the analysis of the passwords, so this data should be credited to him, but let's look it over:

  • 93% of the passwords were between 6 and 10 characters long.
  • 50% of the passwords were less than 8 characters.
  • Of the following character sets, only 4% of the passwords had 3 or more: numbers, uppercase, lowercase, everything else.
  • 45% of the passwords were lowercase only.
  • 99% of the passwords did not contain non-alphanumeric characters.
  • 65% of the passwords can be found in a dictionary.
  • Within Sony, there were two separate accounts: "Beauty" and "Delboca". Where there was a common email address between the accounts, 92% of the accounts used the same password between both.
  • Comparing the Sony and Gawker hacks, where there was a common email address, 67% of those accounts used the same password.
  • 82% of the passwords would fall victim to a Rainbow Table Attack (something we'll cover later).

With the cryptography circles I run in, these numbers are not very surprising. Finding 65% of the passwords in a dictionary is actually a bit low. I've seen the average sit closer to 70%, which is troubling. This means that a dictionary attack is extremely effective, no matter how long your password is. If it can be found in a dictionary, you'll fall victim.
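If you have a password list of your own to study, a rough way to measure that percentage is to count how many entries appear verbatim in a word list. This is only a sketch: the function name and both file arguments are my own placeholders, and it assumes one entry per line in each file.

```sh
# Sketch: estimate what share of a password list appears verbatim in a
# word list. Both file names are hypothetical, one entry per line.
dictionary_hit_rate() {
    passwords="$1"
    wordlist="$2"
    total=$(wc -l < "$passwords")
    if [ "$total" -eq 0 ]; then
        echo "0%"
        return 0
    fi
    # -F: fixed strings, -x: match whole lines only, -c: count matching lines
    hits=$(grep -c -F -x -f "$wordlist" "$passwords" || true)
    echo "$(( hits * 100 / total ))%"
}

# Example (hypothetical files):
#   dictionary_hit_rate leaked-passwords.txt /usr/share/dict/words
```

This only catches exact matches, so it will undercount passwords that are a dictionary word with a small mutation added.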

Creating a Dictionary
So, what exactly is a dictionary that can be used for this attack? Generally, it's nothing more than a word list, with one word on each line. Standard Unix operating systems have a dictionary installed when a spell checking utility is installed. This can be found in /usr/share/dict/words. For the case of my Debian GNU/Linux system, I have about 100,000 words in the dictionary:

$ wc -l /usr/share/dict/words 
99171 /usr/share/dict/words

But, I can install a much larger wordlist:

$ sudo aptitude install wamerican-insane
$ sudo select-default-wordlist
$ wc -l /usr/share/dict/words

Even though my word list has grown by 6x the previous size, this still pales in comparison to some dictionaries you can download online. The Openwall word list contains 40 million entries, and is over 500 MB in size. It consists of words from more than 20 languages, and includes passwords generated with pwgen(1). It will cost you $27.95 USD for the download, however. There are plenty of other word lists all over the Internet. Spend some time searching, and you can generate a decently sized word list on your own.
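With a word list in hand, the attack itself is only a loop. Here is a minimal sketch, assuming the target is a single unsalted SHA1 hash in hex; the function name is my own, and real utilities like John the Ripper are far faster than a shell loop.

```sh
# Sketch of an offline dictionary attack against one unsalted SHA1 hash.
# $1 = target hash (hex), $2 = word list file, one candidate per line.
dictionary_attack() {
    target="$1"
    wordlist="$2"
    while IFS= read -r word; do
        # printf avoids hashing a trailing newline along with the word
        candidate=$(printf '%s' "$word" | sha1sum | awk '{print $1}')
        if [ "$candidate" = "$target" ]; then
            printf '%s\n' "$word"
            return 0
        fi
    done < "$wordlist"
    # Word list exhausted without a match
    return 1
}

# Example:
#   dictionary_attack "$HASH" /usr/share/dict/words
```

The function prints the matching word and returns 0 when the hash is found, and returns 1 when the word list is exhausted.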

Precomputed Dictionary
This will open the discussion for Rainbow Tables, something we'll discuss later on. However, with a precomputed dictionary attack, I can spend the time hashing all the values in my dictionary, and store them as a key/value pair, where the key is the hash, and the value is the password. This can save considerable time for the password cracking utility when doing the lookup. It comes at a cost, though: I must spend the time precomputing all the values in the dictionary. Once they are computed, I can use them over and over for my lookups as needed. Disk space is also a concern. For a SHA1 hash, you'll be adding 40 bytes to every entry. For the Openwall word list, this means your dictionary will grow from 500 MB to 2 GB. Not a problem for today's storage, but certainly something you should be aware of.
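To make the key/value idea concrete, here is a small sketch that builds the table once and then answers lookups with a single search. It assumes unsalted SHA1 and a plain text table with one "hash word" pair per line; both function names are mine.

```sh
# Build the lookup table once: one "hash word" pair per line.
# $1 = word list file, one word per line. Output goes to STDOUT.
precompute_table() {
    while IFS= read -r word; do
        hash=$(printf '%s' "$word" | sha1sum | awk '{print $1}')
        printf '%s %s\n' "$hash" "$word"
    done < "$1"
}

# Look up a hash in the table and print the matching word, if any.
# $1 = hash (hex only, so it is safe inside a grep pattern), $2 = table file.
lookup_hash() {
    grep "^$1 " "$2" | cut -d ' ' -f 2-
}

# Example:
#   precompute_table /usr/share/dict/words > table.txt
#   lookup_hash "$HASH" table.txt
```

Each table line carries the 40 hex characters of the hash in front of the word, which is where the extra disk space goes.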

Rainbow tables are a version of the precomputed dictionary attack we'll look at later. The advantage of a rainbow table is savings on disk space for the cost of a bit longer lookup times. We still have precomputed hashes for dictionary words, but they don't occupy as much space.

Thwarting precomputed hashes can be accomplished by salting your password. I discuss password salts on my blog when discussing the shadowed password on Unix systems. Because hashing functions produce the same output for a given input, if that input changes, such as by adding a salt, the output will change. Even though your password was the same, by appending a salt to your password, the computed hash will be completely different. Even if I have your salt in my possession, precomputed dictionary attacks are of no use, because each salt for each account will likely be different, which means I need to precompute different dictionaries with different salts, a very costly task for both CPU and disk space.

However, if I have the salt in my possession, I can still use the salt in conjunction with a standard word list, to compute the desired hash. If a computed hash matches the stored hash, I have found your password in my word list, even if I needed the salt to help me get there.
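Here is a sketch of both points, under a deliberately simple assumption: the stored hash is the SHA1 of the password with the salt appended. Real systems such as crypt(3) combine the salt differently, and the function names are my own.

```sh
# Assumed scheme for illustration only: SHA1(password + salt).
# $1 = password, $2 = salt
salted_hash() {
    printf '%s%s' "$1" "$2" | sha1sum | awk '{print $1}'
}

# Dictionary attack when the salt is known.
# $1 = target hash, $2 = salt, $3 = word list file
salted_dictionary_attack() {
    while IFS= read -r word; do
        if [ "$(salted_hash "$word" "$2")" = "$1" ]; then
            printf '%s\n' "$word"
            return 0
        fi
    done < "$3"
    return 1
}
```

The same password hashed with two different salts gives two unrelated hashes, so a table precomputed for one salt is useless for the other. The attacker has to recompute every word for each account's salt.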

Because 65%-70% of people use dictionary words for their passwords, this makes the dictionary attack extremely attractive for attackers who have offline password databases. Even with the Openwall word list of 40 million words, most CPUs can exhaust that word list in seconds, meaning 70% of the passwords will be found in very little time with very little effort. Further, because 67% of the population or more use the same password across multiple accounts, if we know something about the accounts we've just attacked, we can now use that information to login to their bank, Facebook, email, Twitter and other accounts. For the effort, dictionary attacks are very valuable, and a first pick for many attackers.

Password Attacks, Part I - The Brute Force Attack


Those who follow my blog know I have blogged about password security in the past. No matter how you spin it, no matter how you argue it, no matter what your opinions are on password security, if you don't think entropy matters, think again. Entropy is everything. Now, I've blogged about entropy from information theory in one way or another. Just search my blog for the word "entropy". It should keep you occupied for a while.

I'm not going to cover entropy here. Instead, I wish to put my focus on attacking the passwords, rather than building them. Entropy is certainly important, but let's discuss it from the perspective of actually using it against the owner of that password. For simplicity, we'll assume best-case scenarios with entropy. That is, we'll assume that even though the human has influence to weaken the entropy of their password, we'll assume it hasn't been weakened. Let's first start with the brute force attack, which is the most simplistic, although most inefficient attack on passwords.

Brute Force
Brute force is a dumb search. Literally. It means we know absolutely nothing about the search space. We don't know the character restrictions. We don't know length restrictions. We don't know anything about the password. All we have is a file of hashed passwords. We have enough data that if I use a utility like John the Ripper or Hashcat, I can start incrementing through a search space. My search may look something like this:

a, b, c, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ..., bx, by, bz, ...

This is only encompassing lowercase letters in the English alphabet, but it illustrates the point. I start with a common denominator, and increment its value by one until I've exhausted the search space for that single character. Then I concatenate an additional character, and increment until all possibilities in the two-character space are exhausted. Then I move to the three-character space, and so forth, as fast as my hardware will allow.
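The incrementing search can be sketched in a few lines. This version only prints the candidates in the order described above; feeding each one to a hash function and comparing against the target is left out. The function name is mine, and it uses awk because awk handles the recursion more cleanly than plain shell.

```sh
# Print every candidate from 1 character up to a maximum length, shortest
# first, in the order the characters appear in the set.
# $1 = character set (avoid backslashes, awk -v would interpret them)
# $2 = maximum length
brute_force_candidates() {
    awk -v chars="$1" -v maxlen="$2" '
        # i is a local variable (the usual awk idiom of an extra parameter)
        function expand(prefix, remaining,    i) {
            if (remaining == 0) {
                print prefix
                return
            }
            for (i = 1; i <= length(chars); i++)
                expand(prefix substr(chars, i, 1), remaining - 1)
        }
        BEGIN {
            for (n = 1; n <= maxlen; n++)
                expand("", n)
        }'
}

# Example:
#   brute_force_candidates abcdefghijklmnopqrstuvwxyz 2
```

With a 26-letter set and a maximum length of 2, this prints 26 + 26*26 = 702 candidates. Every extra character multiplies the work by the size of the set.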

On the United States English keyboard, there are 94 printable case sensitive characters. Because this is a brute force search, we can make very few assumptions about our search space. We can assume a lot of non-printable characters will not be valid, such as the BELL, the TAB and the ENTER. In fact, looking at the ASCII table, we can safely assume that ASCII decimal values 0 through 31 won't be used. However, it is safe to assume that decimal values 32 through 126 will be used, where decimal value 32 is the SPACE. Counting the SPACE along with the 94 printable characters gives 95 total characters for my search space.

Hashing Algorithms
Before we get into speed, we must also address hashing. Many password databases will hash their passwords, and we're pretending that I have access to one of these databases offline. Most of these databases will also store the salt, if the password is salted, and the algorithm of the hashing type, if it supports more than one hash. So, let's assume that I know the hashing algorithm. This is important, because it will play into how fast I can brute force these passwords.

A designer of a password database store should keep the following things in mind:

  • Slower algorithms, such as Blowfish and MD5, will hinder brute force attacks. However, it can also be slow for busy server farms.
  • Algorithms such as SHA256 and SHA512 should be used when possible due to their lack of practical cryptanalytic attacks.
  • Rotating the password thousands of times before storing the resulting hash on disk makes brute force attacks very, very ineffective.

What do I mean by "rotating passwords"? Let's assume the password is the text "". Let's further assume that the password is being hashed with SHA1. The SHA1 of "" is "bea133441136e7a7c992aa2e961002c170e54c93". Now, let's take that hash, and use it as the input to SHA1 again. The SHA1 of "bea133441136e7a7c992aa2e961002c170e54c93" is "cec33ee7686fc2a91b167e096b1f3182ae7828b7". Now let's take that hash, and use it as the input to SHA1 again, continuing until we have called the SHA1 function 1,000 times. The resulting hash should be "4f019154eaefe82946d833f1b05aa1968f1f202a".

A basic Unix shell script could look like this:

# Feed each hash back into SHA1 until it has been called 1,000 times
for i in `seq 1000`; do
    PASSWORD=`echo -n "$PASSWORD" | sha1sum - | awk '{print $1}'`
done
echo "$PASSWORD"

MD5 is a slow(er) algorithm, but it is also horribly broken, and should be avoided as a result. SHA1 is faster than MD5, but still slower than others. Practical cryptanalysis of SHA1 is approaching, so it should probably be avoided as well. However, Blowfish is exceptionally slow, even slower than MD5, and no practical attack against the algorithm has been shown. So, it makes for a perfect candidate for storing passwords. If the password is hashed and rotated 1,000 times before storing on disk, this cripples brute force attacks to an absolute crawl. Other candidates, such as SHA256, SHA512 and the newly NIST approved SHA3 algorithms are much faster, so their passwords should definitely be rotated 1,000 or even 10,000 times before storing to disk to slow the brute force search.

Raw Speed
Now the question remains: how fast can I search through passwords using a brute force method? Well, this will depend largely on hardware, and as just discussed, the chosen algorithm. Let us suppose that I have a rack of 25 of the latest AMD Radeon graphics cards. According to Ars Technica, I can achieve a mind-blowing speed of 350 BILLION passwords per second if attacking the NTLM hash used to store Windows passwords. If every password in the database was 8 characters in length, then my search space is:

95 * 95 * 95 * 95 * 95 * 95 * 95 * 95 = 95^8 = 6,634,204,312,890,625 passwords

Six-and-a-half QUADRILLION passwords. That should give me a search space large enough to not worry about, right? Wrong. At a crazy pace of 350 billion passwords per second:

6634204312890625 / 350000000000 = 18955 seconds = 316 minutes = 5.3 hours

In just under 6 hours, I can exhaust the 8-character search space using this crazy machine. But what about 9 characters? What about 10 characters? What about 15 characters? These are legitimate questions, and they are why entropy is EVERYTHING with regards to your password and the search space to which it belongs. Think of your password as a needle, and entropy as the haystack. The larger your entropy is (total number of possible passwords), the harder it will be to find that needle in the haystack.

The Exponential Wall
Let's do some math, and explain what is going on. I'll give you a table of values, showing passwords that start with 8 characters through 15 characters in the first column. In the second column will be the search space, and in the 3rd column will be the time it takes to fully exhaust that search space. Before continuing, remember that it may not be necessary to exhaust the search space. Whenever I find the password, I should stop. There is no need to continue. So, I may find that password very early in the search, and I may find it very late in the search, or anywhere between. So, the 3rd column is merely a maximum time.

Length  Search Space                    Max time at 350 billion passwords per second
8       6634204312890625                5.3 hours
9       630249409724609375              20.8 days
10      59873693923837890625            5.4 years
11      5688000922764599609375          5.1 centuries
12      540360087662636962890625        48.9 millennia
13      51334208327950511474609375      4,650.1 millennia
14      4876749791155298590087890625    441,830.6 millennia
15      463291230159753366058349609375  41,973,910.1 millennia
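If you want to check the third column yourself, or try a different guessing rate, the arithmetic is short. A minimal sketch, assuming the 95-character set from earlier; the function name and its three arguments are mine. It uses awk's floating point, so the results are approximations rather than exact integers.

```sh
# Maximum time to exhaust every password of a given length.
# $1 = password length
# $2 = guesses per second
# $3 = seconds in the unit you want (3600 = hours, 86400 = days)
exhaust_time() {
    awk -v len="$1" -v rate="$2" -v unit="$3" \
        'BEGIN { printf "%.1f\n", (95 ^ len) / rate / unit }'
}

# Examples at 350 billion guesses per second:
#   exhaust_time 8 350000000000 3600     # hours for 8 characters
#   exhaust_time 9 350000000000 86400    # days for 9 characters
```

Raising the length by one always multiplies the result by 95, which is all the exponential wall really is.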

What's going on here? By nearly doubling the length of my password, my maximum search time went from 5 hours to 41 million millennia. We've hit what's called "the exponential wall". What is happening is we are traveling the road of 95^x, where 95 is the size of our character set, and "x" is our length. To put this visually, it looks exactly like this:

Graphic showing the exponential wall.

As you can see, it gets exceptionally expensive to continue searching past 9 characters, and absolutely prohibitive to search past 10, let alone 15. We've hit the exponential wall. Now, I've heard the argument from all of my conspiracy theorist friends that faceless nameless government agencies with endlessly deep pockets (never mind most countries are practically bankrupt), who have access to quantum supercomputers (never mind quantum computing has yet to be implemented) can squash these numbers like the gnat you are. Let's put that to the test.

Just recently, the NSA built a super spy datacenter in Utah. Let's assume that the NSA has one of these spy datacenters in each of the 50 states in the USA. Further, let's assume that they have not only 1 of these GPU clusters, but there are 30 of them in each datacenter. That means we have 750 GPUs in one datacenter, which means they have 37,500 total across the country. At 350 billion NTLM hashes per second for each 25-GPU cluster, those 1,500 clusters manage 525 trillion NTLM hashes per second (assuming the communication between all 37,500 GPUs across all 50 states does not add any latencies)! Surely we can exhaust the 15 length passwords with ease!

I'll do the math for you. It would take about 1.3 days to exhaust the 10 character length passwords, about 125 days for 11 character passwords, 32.6 years for 12 character passwords, 31 centuries for 13 character passwords, 294 millennia for 14 character passwords and nearly 28,000 millennia for 15 character passwords. Again, we've hit that exponential wall, where 10 characters is doable (although exceptionally expensive), 11 characters is challenging, and 12 characters is impractical. All this horsepower bought us fewer than 2 extra characters over the brute force search using only 25 GPUs. For an initial pricetag that is in the billions, if not trillions for this sort of setup (not to mention the power bill to keep it running as well as the cooling bill), it just isn't practical for a nameless faceless government agency. The math just isn't on their side. It just doesn't pencil in financially, nor does it pencil in theoretically. So much for that conspiracy.

As is clearly demonstrated, truly random 12-character passwords would be sufficient to thwart even the most wealthy and dedicated attackers. After all, you hit that exponential wall. Surely, all bets are off! Not so. There are plenty of additional attacks that the attacker can use, which are much more effective than the brute force, and which makes finding a 12-character password more feasible. So, we'll be discovering each of those in this series, how effective they are, how they work, and software utilities that take advantage of them.

Stay tuned!

Please Consider A Donation

I just received an anonymous $5 donation via PayPal for my series on ZFS. Thank you, whoever you are! I've received other donations in the past, and never once have I had a donation page or ads on the site or asked for money. It's awesome to know that some people will financially support your effort to keep up a blog. Also, thanks to everyone who has bought me lunches, books and other things for the same reason.

The donation page can be found at As explained on the donation page, there will never be ads on this site. I'm not interested in writing for income. I write, because I love to. So, if you like what you've read, please consider a donation to the blog. I have a Bitcoin address, or you can use PayPal.

This will be the only post I'll make about the issue. You can follow the "Donate" link on the sidebar. Thanks a bunch!

Tiny Tiny RSS - The Google Reader Replacement

With all the weeping, wailing and gnashing of teeth about Google killing Reader, I figured I'd blog something productive. Rather than piss and moan, here is a valid solution you can build for at most two bucks, using entirely Free Software, running on your own server, under your control. The solution is to install Tiny Tiny RSS on your own server, and if you have an Android smartphone, the official Tiny Tiny RSS app ($2 for the unlock key (support the developer- this stuff rocks)). Here are the step-by-step installation directions that should get you an up-and-running Reader replacement in less than 30 minutes.

First, create a directory on your webserver where you will install Tiny Tiny RSS. You will need Apache, lighttpd, Cherokee, or some other web server, PHP with the necessary modules as well as the PHP CLI interpreter, and either MySQL or PostgreSQL as prerequisites:

# mkdir /var/www/rss
# wget
# tar -xf 1.7.4.tar.gz -C /var/www/rss/
# chown -R root:www-data /var/www/rss/
# chmod -R g+w,o+ /var/www/rss/

Pull up the web interface by navigating to (replace "" with your domain name). The default login credentials are "admin" and "password". Make sure to change the default password. Also, Tiny Tiny RSS uses a multiuser setup by default. You can add additional users, including one for yourself that isn't "admin", or you can change it to single user mode in the preferences.

After the setup is the way you want it, you'll want to get your Google Reader feeds into Tiny Tiny RSS. Navigate to Reader, and export your data. This will take you to Google Takeout, and you'll download a massive ZIP archive, that contains an OPML file, as well as a ton of other data. Grab your "subscriptions.xml" from that ZIP file, and import them into your Tiny Tiny RSS installation.

One awesome benefit of Tiny Tiny RSS, is that it has a built-in mobile version, if browsing the install from a mobile browser. It looks good too.

The only thing left to do, is navigate to the preferences, and enable the external API. There are additional 3rd party desktop-based readers that have Tiny Tiny RSS support, such as Liferea and Newsbeuter. Even the official Android app will need the option enabled. This will give you full synchronization between the web interface, your Android app, and your desktop RSS reader.

Unfortunately, Tiny Tiny RSS doesn't update the feeds by default. You need to set up a script that manages this for you. The best solution is to write a proper init script that starts and stops the updating daemon. I didn't do this. Instead, I did the next best thing. I put the following into my /etc/rc.local configuration file:

#!/bin/sh -e
# rc.local
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
# In order to enable or disable this script just change the execution
# bits.
# By default this script does nothing.

sudo -u www-data php /var/www/rss/update_daemon2.php > /dev/null&
exit 0

A couple of things to notice here. First, the redirection to /dev/null. Without it, the script sends a ton of data to STDOUT on whichever terminal you execute it from. However, the redirection can also hide output you would want to see if the daemon doesn't start. So, only after you are sure that everything is set up correctly should you be redirecting the output. Second, we are running the script as the "www-data" user (the default user for Apache on Debian/Ubuntu). The script should not run as root.

Now, execute the following, and you should be good to go:

# /etc/init.d/rc.local start

You now have a web-based RSS reader licensed under the GPL running under your domain that you control. If you have an Android smartphone, then install the official Tiny Tiny RSS app, along with its unlock key, and put in your username, password, and URL to the installation. The Android app is also GPL licensed. Make sure you purchase the unlock key, or your app will only be good for 7 days (it's trialware).

Lastly, both Liferea and Newsbeuter support Tiny Tiny RSS. However, make sure you get the latest upstream versions from both, as Tiny Tiny RSS changed their API recently. For Newsbeuter, this means version 2.6 or later (I actually haven't tested Liferea). I'll show how to get Newsbeuter working with Tiny Tiny RSS. All you need to do, is edit your ~/.config/newsbeuter/config file with the following contents:

auto-reload "yes"
reload-time 360
text-width 80
ttrss-flag-star "s"
ttrss-flag-publish "p"
ttrss-login "admin"
ttrss-mode "multi"
ttrss-password "password"
ttrss-url ""
urls-source "ttrss"

Restart Newsbeuter, and you should be good to go.

You now have a full Google Reader replacement, with entirely Free Software. Synchronization between the web, your phone, and your desktop, including a mobile version of the page for mobile devices. And it only cost you two bux.

Changing RSS Source

Due to the recent news about Google shutting down a number of services, including Reader and the CalDAV API, I came to the realization that my RSS source for this blog should probably go back to the main feed. So, as of July 1, 2013, the same day Reader shuts down, the old feed at will be deprecated in favor of

I chose Feed Burner years ago for its data collection. Then Google purchased Feed Burner, and the data runs through their servers. Due to the killing off of Reader, it would not surprise me if they killed off Feed Burner as well. I don't look at the stats much, so they're not as important to me now, as they were when I set it up.

Many of you may be using the source already. If so, you're good. If you're using the source, then you may want to make the switch before July 1, 2013. The Ubuntu Planet uses my Feed Burner source. I'll get it updated before the deadline.

Create Your Own Graphical Web Of Trust- Updated

A couple years ago, I wrote about how you can create a graphical representation of your OpenPGP Web of Trust. It's funny how I've been keeping mine up-to-date for these past couple years as I attend keysigning parties, without really thinking about what it looks like. Well, I recently returned from the SCaLE 11x conference, which had a PGP keysigning party. So, I've been keeping the graph up-to-date as new signatures would come in. Then it hit me: am I graphing ONLY the signatures on my key, or all the signatures in my public keyring, or something somewhere in between? It seemed to be the latter, so I decided to do something about it.

The following script assumes you have the signing-party, graphviz and imagemagick packages installed. It grabs only the signatures on your OpenPGP key, downloads any keys that have signed your key that you may not have downloaded, places them in their own public keyring, then uses that information to graph your Web of Trust. Here's the script:

# Replace $KEY with your own KEYID
echo "Getting initial list of signatures..."
gpg --with-colons --fast-list-mode --list-sigs $KEY | awk -F ':' '$1 ~ /sig|rev/ {print $5}' | sort -u > ${KEY}.ids
echo "Refreshing your keyring..."
gpg --recv-keys $(cat ${KEY}.ids) > /dev/null 2>&1
echo "Creating public keyring..."
gpg --export $(cat ${KEY}.ids) > ${KEY}.gpg
echo "Creating dot file..."
gpg --keyring ./${KEY}.gpg --no-default-keyring --list-sigs | sig2dot > ${KEY}.dot 2> ${KEY}.err
echo "Creating PostScript document..."
neato -Tps ${KEY}.dot > ${KEY}.ps
echo "Creating graphic..."
convert ${KEY}.ps ${KEY}.gif
echo "Finished."

It may take some time to download and refresh your keyring, and it may take some time generating the .dot file. Don't be surprised if it takes 5-10 minutes, or so. However, when it finishes, you should end up with something like what is below (it's obvious when you've attended keysigning parties by the clusters of strength in your web):

Click for a larger version

GlusterFS Linked List Topology

Lately, a few coworkers and I decided to put our workstations into a GlusterFS cluster. We wanted to test distributed replication. Our workstations are already running ZFS on Linux, so we built two datasets on each of our workstations, and made them the bricks for GlusterFS. We created a nested "brick" directory to prevent GlusterFS from sending data to the root ZFS mountpoint, if the dataset is not mounted. Here is our setup on each of our workstations:

# zfs create -o sync=disabled pool/vol1
# zfs create -o sync=disabled pool/vol2
# mkdir /pool/vol1/brick /pool/vol2/brick

Notice that I've disabled synchronous writes. This is because GlusterFS is synchronous by default already. Because ZFS resides underneath the GlusterFS client mount, and GlusterFS is communicating with the application about synchronous data, there is no need to increase write latencies with synchronous writes on ZFS.

Now the question comes as to how to set the right topology for our storage cluster. I wish to maintain two copies of the data in a distributed manner. Meaning that the local peer has a copy of the data, and a remote peer also has a copy. Thus, distributed replication. But, how do you decide where the copies get distributed? I looked at two different topologies before making my decision, which I'll discuss here.

Paired Server Topology

Paired server GlusterFS topology

In this topology, servers are completely paired together. This means you always know where both copies of your data reside. You could think of it as a mirrored setup. The bricks on serverA will hold identical data to the bricks on serverB. This obviously simplifies administration and troubleshooting a great deal. And, it's easy to set up. Suppose we wish to create a volume named "testing". Assuming that we've peered with all the necessary nodes, we would proceed as follows:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol1/brick \
serverA:/pool/vol2/brick serverB:/pool/vol2/brick
# gluster volume info testing
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick

If we wish to add more storage to the volume, the commands are pretty straightforward:

# gluster volume add-brick testing \
serverC:/pool/vol1/brick serverD:/pool/vol1/brick \
serverC:/pool/vol2/brick serverD:/pool/vol2/brick
# gluster volume info testing
Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol1/brick
Brick3: serverA:/pool/vol2/brick
Brick4: serverB:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol1/brick
Brick7: serverC:/pool/vol2/brick
Brick8: serverD:/pool/vol2/brick

The drawback to this setup, as should be obvious, is that when servers are added, they must be added in pairs. You cannot have an odd number of servers in this topology. However, as shown in both the image, and the commands, this is fairly straightforward from an administration perspective, and from a storage perspective.

Linked List Topology


In computer science, a "linked list" is a data structure sequence, where the tail of one node points to the head of another. In the case of our topology, the "head" is the first brick, and the "tail" is the second brick. As a result, this creates a circular storage setup, as shown in the image above.

To set something like this up with say 3 peers, you would do the following:

# gluster volume create testing replica 2 \
serverA:/pool/vol1/brick serverB:/pool/vol2/brick \
serverB:/pool/vol1/brick serverC:/pool/vol2/brick \
serverC:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverA:/pool/vol2/brick
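
Typing the brick pairs by hand gets error prone as the ring grows, so here is a small helper that prints them in linked list order. It is only a sketch: the function name is mine, and it assumes every peer uses the same /pool/vol1/brick and /pool/vol2/brick paths shown above.

```sh
# Print the replica pairs for a linked list topology, one pair per line.
# Arguments: peer hostnames in ring order, e.g. serverA serverB serverC
linked_list_bricks() {
    first="$1"
    while [ "$#" -gt 0 ]; do
        vol1_peer="$1"
        shift
        # The last peer wraps around and pairs with the first peer.
        vol2_peer="${1:-$first}"
        printf '%s:/pool/vol1/brick %s:/pool/vol2/brick\n' "$vol1_peer" "$vol2_peer"
    done
}

# Example (word splitting of the output is intentional here):
#   gluster volume create testing replica 2 $(linked_list_bricks serverA serverB serverC)
```

With a single hostname, both copies would land on the same server, so this only makes sense for two or more peers.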

Now, if you wanted to add a new server to the cluster, you can. You can add servers individually, unlike the paired server topology above. But, the trick is replacing bricks, as well as adding bricks, and it's not 100% intuitive on how to proceed. Thus, if I wanted to add "serverD" with its two bricks to the setup, I would first need to recognize that "serverA:/pool/vol2/brick" is going to be replaced with "serverD:/pool/vol2/brick". Then, I will have two bricks available to add to the volume, namely "serverD:/pool/vol1/brick" and "serverA:/pool/vol2/brick". Armed with that information, and assuming that "serverD" has already peered with the others, let's proceed:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick start

I can run "gluster volume replace-brick testing status" to keep an eye on the brick replacement. When ready, I need to commit it:

# gluster volume replace-brick testing \
serverA:/pool/vol2/brick serverD:/pool/vol2/brick commit

Now we have two bricks to add to the cluster. However, the "serverA:/pool/vol2/brick" brick was previously part of the cluster. As such, it contains metadata that is no longer relevant when adding the new server. As such, we must clear the metadata off of the brick, so it starts from a clean slate, then we can add it without problem. Here are the steps we need to do next:

(serverA)# setfattr -x trusted.glusterfs.volume-id /pool/vol2/brick
(serverA)# setfattr -x trusted.gfid /pool/vol2/brick
(serverA)# rm -rf /pool/vol2/brick/.glusterfs/
(serverA)# service glusterfs-server restart

We are now ready to add the bricks cleanly:

# gluster volume add-brick testing \
serverD:/pool/vol1/brick serverA:/pool/vol2/brick
# gluster volume info testing

Volume Name: testing
Type: Distributed-Replicate
Volume ID: 8ee0a256-8da4-4d4b-ae98-3c9a5c62d1b8
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Brick1: serverA:/pool/vol1/brick
Brick2: serverB:/pool/vol2/brick
Brick3: serverB:/pool/vol1/brick
Brick4: serverC:/pool/vol2/brick
Brick5: serverC:/pool/vol1/brick
Brick6: serverD:/pool/vol2/brick
Brick7: serverD:/pool/vol1/brick
Brick8: serverA:/pool/vol2/brick

It should be obvious that this is a more complicated setup. It's more abstract from a topological perspective, and it is more difficult to implement and get right from an application perspective. And there is certainly a strong argument for simplifying storage architectures and administration. However, this linked list topology has the advantage of adding and removing one server at a time, unlike the paired server setup. If this is something you need, or you have an odd number of servers in your cluster, the linked list topology will work well.

For our workstation cluster at the office, we went with a linked list topology, because it will mimic our production setup needs more closely. There may also be other topologies that we haven't explored. We also added "geo-replication" by replicating our volume to a larger storage node in the server room. This allows us to ensure data integrity, should two servers go down in our cluster.

ZFS Administration, Part XVII- Best Practices and Caveats

Table of Contents

Zpool Administration
0. Install ZFS on Debian GNU/Linux
1. VDEVs
2. RAIDZ
3. The ZFS Intent Log (ZIL)
4. The Adjustable Replacement Cache (ARC)
5. Exporting and Importing Storage Pools
6. Scrub and Resilver
7. Getting and Setting Properties
8. Best Practices and Caveats

ZFS Administration
9. Copy-on-write
10. Creating Filesystems
11. Compression and Deduplication
12. Snapshots and Clones
13. Sending and Receiving Filesystems
14. ZVOLs
15. iSCSI, NFS and Samba
16. Getting and Setting Properties
17. Best Practices and Caveats

Appendices
A. Visualizing The ZFS Intent Log (ZIL)
B. Using USB Drives
C. Why You Should Use ECC RAM
D. The True Cost Of Deduplication

Best Practices

As with all recommendations, some of these guidelines carry a great amount of weight, while others might not. You may not even be able to follow them as rigidly as you would like. Regardless, you should be aware of them. I’ll try to provide a reason for each. They’re listed in no specific order. The idea of “best practices” is to optimize space efficiency and performance, and to ensure maximum data integrity.

  • Always enable compression. There is almost certainly no reason to keep it disabled. It hardly touches the CPU and hardly touches throughput to the drive, yet the benefits are amazing.
  • Unless you have the RAM, avoid using deduplication. Unlike compression, deduplication is very costly on the system. The deduplication table consumes massive amounts of RAM.
  • Avoid running a ZFS root filesystem on GNU/Linux for the time being. It's a bit too experimental for /boot and GRUB. However, do create datasets for /home/, /var/log/ and /var/cache/.
  • Snapshot frequently and regularly. Snapshots are cheap, and can keep a plethora of file versions over time. Consider using something like the zfs-auto-snapshot script.
  • Snapshots are not a backup. Use "zfs send" and "zfs receive" to send your ZFS snapshots to external storage.
  • If using NFS, use ZFS NFS rather than your native exports. This can ensure that the dataset is mounted and online before NFS clients begin sending data to the mountpoint.
  • Don't mix NFS kernel exports and ZFS NFS exports. This is difficult to administer and maintain.
  • For /home/ ZFS installations, consider setting up nested datasets for each user. For example, pool/home/atoponce and pool/home/dobbs. Consider using quotas on the datasets.
  • When using "zfs send" and "zfs receive", send incremental streams with the "zfs send -i" switch. This can be an exceptional time saver.
  • Consider using "zfs send" over "rsync", as the "zfs send" command can preserve dataset properties.
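
To make the incremental stream recommendation concrete, here is a minimal sketch that only prints the pipeline rather than running it. The helper name "incremental_send_cmd", the snapshot names "monday" and "tuesday", and the target dataset "backup/test" are all hypothetical:

```shell
# Build (but do not run) an incremental send/receive pipeline.
# Usage: incremental_send_cmd SOURCE_DATASET OLD_SNAP NEW_SNAP TARGET_DATASET
incremental_send_cmd() {
    printf 'zfs send -i %s@%s %s@%s | zfs receive %s\n' \
        "$1" "$2" "$1" "$3" "$4"
}

incremental_send_cmd tank/test monday tuesday backup/test
# prints: zfs send -i tank/test@monday tank/test@tuesday | zfs receive backup/test
```

Only the blocks that changed between the two snapshots are sent, which is where the time savings come from. The receiving side must already hold the older snapshot.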


Caveats

The point of the caveat list is by no means to discourage you from using ZFS. Instead, as a storage administrator planning out your ZFS storage server, these are things that you should be aware of, so as not to catch you with your pants down, and without your data. If you don’t heed these warnings, you could end up with corrupted data. The line may be blurred with the “best practices” list above. I’ve tried making this list all about data corruption if not heeded. Read and heed the caveats, and you should be good.

  • A "zfs destroy" can cause downtime for other datasets. A "zfs destroy" will touch every file in the dataset that resides in the storage pool. The larger the dataset, the longer this will take, and it will use all the possible IOPS out of your drives to make it happen. Thus, if it takes 2 hours to destroy the dataset, that's 2 hours of potential downtime for the other datasets in the pool.
  • Debian and Ubuntu will not start the NFS daemon without a valid export in the /etc/exports file. You must either modify the /etc/init.d/nfs init script to start without an export, or create a local dummy export.
  • Debian and Ubuntu, and probably other systems, use a parallelized boot. As such, init script execution order is no longer prioritized. This creates problems for mounting ZFS datasets on boot. For Debian and Ubuntu, touch the "/etc/init.d/.legacy-bootordering" file, and make sure that the /etc/init.d/zfs init script is the first to start, before all other services in that runlevel.
  • Do not create ZFS storage pools from files in other ZFS datasets. This will cause all sorts of headaches and problems.
  • When creating ZVOLs, make sure to set the block size as the same, or a multiple, of the block size that you will be formatting the ZVOL with. If the block sizes do not align, performance issues could arise.
  • When loading the "zfs" kernel module, make sure to set a maximum size for the ARC. Doing a lot of "zfs send" or snapshot operations will cache the data. If not set, RAM will slowly fill until the kernel invokes the OOM killer, and the system becomes unresponsive. I have set in my /etc/modprobe.d/zfs.conf file "options zfs zfs_arc_max=2147483648", which is a 2 GB limit for the ARC.
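
The "zfs_arc_max" value is given in bytes, so it helps to compute it rather than type it by hand. A minimal sketch, where the helper name "arc_max_option" is hypothetical and the size is passed as a whole number of gigabytes (counted in powers of 1024):

```shell
# Convert a size in GB (1024^3 bytes each) to bytes and print the
# matching line for /etc/modprobe.d/zfs.conf.
arc_max_option() {
    gib=$1
    bytes=$((gib * 1024 * 1024 * 1024))
    printf 'options zfs zfs_arc_max=%s\n' "$bytes"
}

arc_max_option 2
# prints: options zfs zfs_arc_max=2147483648
```

The function only prints the line; redirect its output into the configuration file yourself once you are happy with the value.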

ZFS Administration, Part XVI- Getting and Setting Properties


Just as with Zpool properties, datasets also contain properties that can be changed. Because datasets are where you actually store your data, there are quite a few more properties than with storage pools. Further, properties can be inherited from parent datasets. Again, not every property is tunable; many are read-only. But this again gives us the ability to tune our filesystem based on our storage needs. ZFS datasets also give you the ability to set your own custom properties. These are known as "user properties" and differ from "native properties".

Because there are just so many properties, I've decided to put the administration "above the fold", and put the properties, along with some final thoughts at the end of the post.

Getting and Setting Properties

As with getting and setting storage pool properties, there are a few ways that you can get at dataset properties as well- you can get all properties at once, only one property, or more than one, comma-separated. For example, suppose I wanted to get just the compression ratio of the dataset. I could issue the following command:

# zfs get compressratio tank/test
tank/test  compressratio  1.00x  -

If I wanted to get multiple settings, say the amount of disk used by the dataset, as well as how much is available, and the compression ratio, I could issue this command instead:

# zfs get used,available,compressratio tank/test
tank/test  used                  1.00G                  -
tank/test  available             975M                   -
tank/test  compressratio         1.00x                  -

And of course, if I wanted to get all the settings available, I could run:

# zfs get all tank/test
NAME       PROPERTY              VALUE                  SOURCE
tank/test  type                  filesystem             -
tank/test  creation              Tue Jan  1  6:07 2013  -
tank/test  used                  1.00G                  -
tank/test  available             975M                   -
tank/test  referenced            1.00G                  -
tank/test  compressratio         1.00x                  -
tank/test  mounted               yes                    -
tank/test  quota                 none                   default
tank/test  reservation           none                   default
tank/test  recordsize            128K                   default
tank/test  mountpoint            /tank/test             default
tank/test  sharenfs              off                    default
tank/test  checksum              on                     default
tank/test  compression           lzjb                   inherited from tank
tank/test  atime                 on                     default
tank/test  devices               on                     default
tank/test  exec                  on                     default
tank/test  setuid                on                     default
tank/test  readonly              off                    default
tank/test  zoned                 off                    default
tank/test  snapdir               hidden                 default
tank/test  aclinherit            restricted             default
tank/test  canmount              on                     default
tank/test  xattr                 on                     default
tank/test  copies                1                      default
tank/test  version               5                      -
tank/test  utf8only              off                    -
tank/test  normalization         none                   -
tank/test  casesensitivity       sensitive              -
tank/test  vscan                 off                    default
tank/test  nbmand                off                    default
tank/test  sharesmb              off                    default
tank/test  refquota              none                   default
tank/test  refreservation        none                   default
tank/test  primarycache          all                    default
tank/test  secondarycache        all                    default
tank/test  usedbysnapshots       0                      -
tank/test  usedbydataset         1.00G                  -
tank/test  usedbychildren        0                      -
tank/test  usedbyrefreservation  0                      -
tank/test  logbias               latency                default
tank/test  dedup                 off                    default
tank/test  mlslabel              none                   default
tank/test  sync                  standard               default
tank/test  refcompressratio      1.00x                  -
tank/test  written               0                      -


As you may have noticed in the output above, properties can be inherited from their parents. In that case, I set the compression algorithm to "lzjb" on the storage pool filesystem "tank" ("tank" is more than just a storage pool- it is a valid ZFS dataset). As such, any datasets created under the "tank" dataset will inherit that property. Let's create a nested dataset, and see how this comes into play:

# zfs create -o compression=gzip tank/test/one
# zfs get -r compression tank
tank           compression  lzjb      local
tank/test      compression  lzjb      inherited from tank
tank/test/one  compression  gzip      local

Notice that the "tank" and "tank/test" datasets are using the "lzjb" compression algorithm, where "tank/test" inherited it from its parent "tank". Whereas with the "tank/test/one" dataset, we chose a different compression algorithm. Let's now inherit the parent compression algorithm from "tank", and see what happens to "tank/test/one":

# zfs inherit compression tank/test/one
# zfs get -r compression tank
tank           compression  lzjb      local
tank/test      compression  lzjb      inherited from tank
tank/test/one  compression  lzjb      inherited from tank

In this case, we made the change from the "gzip" algorithm to the "lzjb" algorithm, by inheriting from its parent. Now, the "zfs inherit" command also supports recursion. I can set the "tank" dataset to be "gzip", and apply the property recursively to all children datasets:

# zfs set compression=gzip tank
# zfs inherit -r compression tank/test
# zfs get -r compression tank
tank           compression  gzip      local
tank/test      compression  gzip      inherited from tank
tank/test/one  compression  gzip      inherited from tank

Be very careful when using the "-r" switch. Suppose you quickly typed the command, and gave the "tank" dataset as your argument, rather than "tank/test":

# zfs inherit -r compression tank
# zfs get -r compression tank
tank           compression  off       default
tank/test      compression  off       default
tank/test/one  compression  off       default

What happened? All compression algorithms got reset to their defaults of "off". As a result, be very fearful of the "-r" recursive switch with the "zfs inherit" command. As you can see here, this is a way that you can clear dataset properties back to their defaults, and apply it to all children. This applies to datasets, volumes and snapshots.
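
Before running a recursive inherit, it is worth listing which datasets carry an explicit setting that would be thrown away. The sketch below filters "zfs get -r" output for rows whose source is "local". The helper name "local_overrides" is hypothetical, and it assumes the property value contains no spaces, so that the source is the fourth column:

```shell
# Read "zfs get -r <property> <dataset>" output on stdin and print the names
# of datasets whose SOURCE column is exactly "local". These are the explicit
# settings that "zfs inherit -r" would discard.
local_overrides() {
    awk 'NF == 4 && $4 == "local" { print $1 }'
}

# Sample input copied from the "zfs get -r compression tank" output above.
local_overrides <<'EOF'
tank           compression  lzjb      local
tank/test      compression  lzjb      inherited from tank
tank/test/one  compression  gzip      local
EOF
```

On a live system you would pipe "zfs get -r compression tank" into the function instead of the sample text. The "zfs get" command also accepts a "-s local" switch that restricts the output to locally set values.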

User Dataset Properties

Now that you understand about inheritance, you can understand setting custom user properties on your datasets. The goal of user properties is for applications designed around ZFS specifically, to take advantage of those settings. For example, poudriere is a tool for FreeBSD designed to test package production, and to build FreeBSD packages in bulk. If using ZFS with FreeBSD, you can create a dataset for poudriere, and then create some custom properties for it to take advantage of.

Custom user dataset properties have no effect on ZFS performance. Think of them merely as "annotation" for administrators and developers. User properties must use a colon ":" in the property name to distinguish them from native dataset properties. They may contain lowercase letters, numbers, the colon ":", dash "-", period "." and underscore "_". They can be at most 256 characters, and must not begin with a dash "-".
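
The naming rules above are easy to check in a script before calling "zfs set". This is a minimal sketch that encodes only the rules as stated in this post; the function name "valid_user_property" is hypothetical, and ZFS itself remains the final authority on what it accepts:

```shell
# Return success if the name satisfies the user-property rules listed above.
valid_user_property() {
    name=$1
    # At most 256 characters.
    [ "${#name}" -le 256 ] || return 1
    case $name in
        *:*) ;;            # must contain a colon
        *) return 1 ;;
    esac
    # Allowed characters only, and the first character must not be a dash.
    # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
    printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z0-9:._][a-z0-9:._-]*$'
}

valid_user_property poudriere:type && echo "poudriere:type is valid"
valid_user_property Poudriere:Type || echo "Poudriere:Type is rejected"
```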

To create a custom property, just use the "module:property" syntax. This is not enforced by ZFS, but is probably the cleanest approach:

# zfs set poudriere:type=ports tank/test/one
# zfs set poudriere:name=my_ports_tree tank/test/one
# zfs get all tank/test/one | grep poudriere
tank/test/one  poudriere:name        my_ports_tree          local
tank/test/one  poudriere:type        ports                  local

User properties can be removed with "zfs inherit". If no parent dataset defines the property, it is cleared from the dataset entirely; otherwise the dataset takes on the parent's value. All the standard utilities, such as "zfs set", "zfs get", "zfs list", et cetera will work with user properties.

With that said, let's get to the native properties.

Native ZFS Dataset Properties

  • aclinherit: Controls how ACL entries are inherited when files and directories are created. Currently, ACLs are not functioning in ZFS on Linux as of 0.6.0-rc13. Default is "restricted". Valid values for this property are:
    • discard: do not inherit any ACL properties
    • noallow: only inherit ACL entries that specify "deny" permissions
    • restricted: remove the "write_acl" and "write_owner" permissions when the ACL entry is inherited
    • passthrough: inherit all inheritable ACL entries without any modifications made to the ACL entries when they are inherited
    • passthrough-x: has the same meaning as passthrough, except that the owner@, group@, and everyone@ ACEs inherit the execute permission only if the file creation mode also requests the execute bit.
  • aclmode: Controls how the ACL is modified using the "chmod" command. The value "groupmask" is default, which reduces user or group permissions. The permissions are reduced, such that they are no greater than the group permission bits, unless it is a user entry that has the same UID as the owner of the file or directory. Valid values are "discard", "groupmask", and "passthrough".
  • acltype: Controls whether ACLs are enabled and if so what type of ACL to use. When a file system has the acltype property set to noacl (the default) then ACLs are disabled. Setting the acltype property to posixacl indicates Posix ACLs should be used. Posix ACLs are specific to Linux and are not functional on other platforms. Posix ACLs are stored as an xattr and therefore will not overwrite any existing ZFS/NFSv4 ACLs which may be set. Currently only posixacls are supported on Linux.
  • atime: Controls whether or not the access time of files is updated when the file is read. Default is "on". Valid values are "on" and "off".
  • available: Read-only property displaying the available space to that dataset and all of its children, assuming no other activity on the pool. Can be referenced by its shortened name "avail". Availability can be limited by a number of factors, including physical space in the storage pool, quotas, reservations and other datasets in the pool.
  • canmount: Controls whether the filesystem is able to be mounted when using the "zfs mount" command. Default is "on". Valid values can be "on", "off", or "noauto". When the noauto option is set, a dataset can only be mounted and unmounted explicitly. The dataset is not mounted automatically when the dataset is created or imported, nor is it mounted by the "zfs mount" command or unmounted with the "zfs unmount" command. This property is not inherited.
  • casesensitivity: Indicates whether the file name matching algorithm used by the file system should be case-sensitive, case-insensitive, or allow a combination of both styles of matching. Default value is "sensitive". Valid values are "sensitive", "insensitive", and "mixed". Using the "mixed" value would be beneficial in heterogenous environments where Unix POSIX and CIFS filenames are deployed. Can only be set during dataset creation.
  • checksum: Controls the checksum used to verify data integrity. The default value is "on", which automatically selects an appropriate algorithm. Currently, that algorithm is "fletcher2". Valid values are "on", "off", "fletcher2", "fletcher4", or "sha256". Changing this property will only affect newly written data, and will not apply retroactively.
  • clones: Read-only property for snapshot datasets. Displays in a comma-separated list datasets which are clones of this snapshot. If this property is not empty, then this snapshot cannot be destroyed (not even with the "-r" or "-f" options). Destroy the clone first.
  • compression: Controls the compression algorithm for this dataset. Default is "off". Valid values are "on", "off", "lzjb", "gzip", "gzip-N", and "zle". The "lzjb" algorithm is optimized for speed, while providing good compression ratios. The setting of "on" defaults to "lzjb". It is recommended that you use "lzjb", "gzip", "gzip-N", or "zle" rather than "on", as the ZFS developers or package maintainers may change the algorithm "on" uses. The gzip compression algorithm uses the same compression as the "gzip" command. You can specify the gzip level by using "gzip-N" where "N" is a valid number of 1 through 9. "zle" compresses runs of binary zeroes, and is very fast. Changing this property will only affect newly written data, and will not apply retroactively.
  • compressratio: Read-only property that displays the compression ratio achieved by the compression algorithm set on the "compression" property. Expressed as a multiplier. Does not take into account snapshots; see "refcompressratio". Compression is not enabled by default.
  • copies: Controls the number of copies to store in this dataset. Default value is "1". Valid values are "1", "2", and "3". These copies are in addition to any redundancy provided by the pool. The copies are stored on different disks, if possible. The space used by multiple copies is charged to the associated file and dataset. Changing this property only affects newly written data, and does not apply retroactively.
  • creation: Read-only property that displays the time the dataset was created.
  • defer_destroy: Read-only property for snapshots. This property is "on" if the snapshot has been marked for deferred destruction by using the "zfs destroy -d" command. Otherwise, the property is "off".
  • dedup: Controls whether or not data deduplication is in effect for this dataset. Default is "off". Valid values are "off", "on", "verify", and "sha256[,verify]". The default checksum used for deduplication is SHA256, which is subject to change. When the "dedup" property is enabled, it overrides the "checksum" property. If the property is set to "verify", then if two blocks have the same checksum, ZFS will do a byte-by-byte comparison with the existing block to ensure the blocks are identical. Changing this property only affects newly written data, and is not applied retroactively. Enabling deduplication in the dataset will dedupe data in that dataset against all data in the storage pool. Disabling this property does not destroy the deduplication table. Data will continue to remain deduped.
  • devices: Controls whether device nodes can be opened on this file system. The default value is "on". Valid values are "on" and "off".
  • exec: Controls whether processes can be executed from within this file system. The default value is "on". Valid values are "on" and "off".
  • groupquota@<group>: Limits the amount of space consumed by the specified group. Group space consumption is identified by the "groupused@<group>" property. Default value is "none". Valid values are "none", and a size in bytes.
  • groupused@<group>: Read-only property displaying the amount of space consumed by the specified group in this dataset. Space is charged to the group of each file, as displayed by "ls -l". See the userused@<user> property for more information.
  • logbias: Controls how to use the SLOG, if one exists. Provides a hint to ZFS on how to handle synchronous requests. Default value is "latency", which will use a SLOG in the pool if present. The other valid value is "throughput" which will not use the SLOG on synchronous requests, and go straight to platter disk.
  • mlslabel: The mlslabel property is a sensitivity label that determines if a dataset can be mounted in a zone on a system with Trusted Extensions enabled. Default value is "none". Valid values are a Solaris Zones label or "none". Note, Zones are a Solaris feature, and not relevant to GNU/Linux. However, this may be something that could be implemented with SELinux and Linux containers in the future.
  • mounted: Read-only property that indicates whether the dataset is mounted. This property will display either "yes" or "no".
  • mountpoint: Controls the mount point used for this file system. Default value is "/<pool>/<dataset>". Valid values are an absolute path on the filesystem, "none", or "legacy". When the "mountpoint" property is changed, the new destination must not contain any child files. The dataset will be unmounted and re-mounted to the new destination.
  • nbmand: Controls whether the file system should be mounted with non-blocking mandatory locks. This is used for CIFS clients. Default value is "off". Valid values are "on" and "off". Changing the property will only take effect after the dataset has been unmounted then re-mounted.
  • normalization: Indicates whether the file system should perform a unicode normalization of file names whenever two file names are compared and which normalization algorithm should be used. Default value is "none". Valid values are "none", "formC", "formD", "formKC", and "formKD". This property cannot be changed after the dataset is created.
  • origin: Read-only property for clones or volumes, which displays the snapshot from whence the clone was created.
  • primarycache: Controls what is cached in the primary cache (ARC). If this property is set to "all", then both user data and metadata is cached. If set to "none", then neither are cached. If set to "metadata", then only metadata is cached. Default is "all".
  • quota: Limits the amount of space a dataset and its descendents can consume. This property enforces a hard limit on the amount of space used. There is no soft limit. This includes all space consumed by descendents, including file systems and snapshots. Setting a quota on a descendant of a dataset that already has a quota does not override the ancestor's quota, but rather imposes an additional limit. Quotas cannot be set on volumes, as the volsize property acts as an implicit quota. Default value is "none". Valid values are a size in bytes or "none".
  • readonly: Controls whether this dataset can be modified. The default value is off. Valid values are "on" and "off". This property can also be referred to by its shortened column name, "rdonly".
  • recordsize: Specifies a suggested block size for files in the file system. This property is designed solely for use with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes according to internal algorithms optimized for typical access patterns. The size specified must be a power of two greater than or equal to 512 and less than or equal to 128 KB. Changing the file system's recordsize affects only files created afterward; existing files are unaffected. This property can also be referred to by its shortened column name, "recsize".
  • refcompressratio: Read-only property displaying the compression ratio achieved by the space occupied in the "referenced" property.
  • referenced: Read-only property displaying the amount of data that the dataset can access. Initially, this will be the same number as the "used" property. As snapshots are created and data is modified, however, those numbers will diverge. This property can be referenced by its shortened name "refer".
  • refquota: Limits the amount of space a dataset can consume. This property enforces a hard limit on the amount of space used. This hard limit does not include space used by descendents, including file systems and snapshots. Default value is "none". Valid values are "none", and a size in bytes.
  • refreservation: The minimum amount of space guaranteed to a dataset, not including its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by refreservation. Default value is "none". Valid values are "none" and a size in bytes. This property can also be referred to by its shortened column name, "refreserv".
  • reservation: The minimum amount of space guaranteed to a dataset and its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by its reservation. Reservations are accounted for in the parent datasets' space used, and count against the parent datasets' quotas and reservations. This property can also be referred to by its shortened column name, reserv. Default value is "none". Valid values are "none" and a size in bytes.
  • secondarycache: Controls what is cached in the secondary cache (L2ARC). If this property is set to "all", then both user data and metadata is cached. If this property is set to "none", then neither user data nor metadata is cached. If this property is set to "metadata", then only metadata is cached. The default value is "all".
  • setuid: Controls whether the set-UID bit is respected for the file system. The default value is on. Valid values are "on" and "off".
  • shareiscsi: Indicates whether a ZFS volume is exported as an iSCSI target. Currently, this is not implemented in ZFS on Linux, but is pending. Valid values will be "on", "off", and "type=disk". Other disk types may also be supported. Default value will be "off".
  • sharenfs: Indicates whether a ZFS dataset is exported as an NFS export, and what options are used. Default value is "off". Valid values are "on", "off", and a list of valid NFS export options. If set to "on", the export can then be shared with the "zfs share" command, and unshared with the "zfs unshare" command. An NFS daemon must be running on the host before the export can be used. Debian and Ubuntu require a valid export in the /etc/exports file before the daemon will start.
  • sharesmb: Indicates whether a ZFS dataset is exported as an SMB share. Default value is "off". Valid values are "on" and "off". Currently, a bug exists preventing this from being used. When fixed, it will require a running Samba daemon, just like with NFS, and will be shared and unshared with the "zfs share" and "zfs unshare" commands.
  • snapdir: Controls whether the ".zfs" directory is hidden or visible in the root of the file system. Default value is "hidden". Valid values are "hidden" and "visible". Even though the "hidden" value might be set, it is still possible to change directories into the ".zfs" directory, to access the shares and snapshots.
  • sync: Controls the behavior of synchronous requests (e.g. fsync, O_DSYNC). Default value is "standard", which is POSIX behavior to ensure all synchronous requests are written to stable storage and all devices are flushed to ensure data is not cached by device controllers. Valid values are "standard", "always", and "disabled". The value of "always" causes every file system transaction to be written and flushed before its system call returns. The value of "disabled" does not honor synchronous requests, which will give the highest performance.
  • type: Read-only property that displays the type of dataset, whether it be a "filesystem", "volume" or "snapshot".
  • used: Read-only property that displays the amount of space consumed by this dataset and all its children. When snapshots are created, the space is initially shared between the parent dataset and its snapshot. As data is modified in the dataset, space that was previously shared becomes unique to the snapshot, and is only counted in the "used" property for that snapshot. Further, deleting snapshots can free up space unique to other snapshots.
  • usedbychildren: Read-only property that displays the amount of space used by children of this dataset, which is freed if all of the children are destroyed.
  • usedbydataset: Read-only property that displays the amount of space used by this dataset itself, which would then be freed if this dataset is destroyed.
  • usedbyrefreservation: Read-only property that displays the amount of space used by a refreservation set on this dataset, which would be freed if the refreservation was removed.
  • usedbysnapshots: Read-only property that displays the amount of space consumed by snapshots of this dataset. In other words, this is the data that is unique to the snapshots. Note, this is not a sum of each snapshot's "used" property, as data can be shared across snapshots.
  • userquota@<user>: Limits the amount of space consumed by the specified user. Similar to the "refquota" property, the userquota space calculation does not include space that is used by descendent datasets, such as snapshots and clones. Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message. This property is not available on volumes, on file systems before version 4, or on pools before version 15. Default value is "none". Valid values are "none" and a size in bytes.
  • userrefs: Read-only property on snapshots that displays the number of user holds on this snapshot. User holds are set by using the zfs hold command.
  • userused@<user>: Read-only property that displays the amount of space consumed by the specified user in this dataset. Space is charged to the owner of each file, as displayed by "ls -l". The amount of space charged is displayed by du and ls -s. See the zfs userspace subcommand for more information. The "userused@<user>" properties are not displayed with "zfs get all". The user's name must be appended after the @ symbol, using one of the following forms:
    • POSIX name (for example, joe)
    • POSIX numeric ID (for example, 789)
    • SID name (for example, joe.smith@mydomain)
    • SID numeric ID (for example, S-1-123-456-789)
  • utf8only: Indicates whether the file system should reject file names that include characters that are not present in the UTF-8 character set. Default value is "off". Valid values are "on" and "off". This property cannot be changed after the dataset has been created.
  • version: The on-disk version of this file system, which is independent of the pool version. This property can only be set to later supported versions. Valid values are "current", "1", "2", "3", "4", or "5".
  • volblocksize: Read-only property for volumes that specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time. The default blocksize for volumes is 8 KB. Any power of 2 from 512 bytes to 128 KB is valid.
  • vscan: Controls whether regular files should be scanned for viruses when a file is opened and closed. In addition to enabling this property, the virus scan service must also be enabled for virus scanning to occur. The default value is "off". Valid values are "on" and "off".
  • written: Read-only property that displays the amount of referenced space written to this dataset since the previous snapshot.
  • written@<snapshot>: Read-only property on a snapshot that displays the amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot.
  • xattr: Controls whether extended attributes are enabled for this file system. The default value is "on". Valid values are "on" and "off".
  • zoned: Controls whether the dataset is managed from a non-global zone. Zones are a Solaris feature and are not relevant on Linux. Default value is "off". Valid values are "on" and "off".
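To make the per-user properties above a little more concrete, here is a minimal sketch of a shell helper that prints a user's quota next to their current usage. The function name "user_space_report", the dataset "pool/home" and the user "joe" are assumptions for illustration only; the only real interface used is "zfs get -H -o value", which prints a bare property value.

```shell
#!/bin/sh
# Hypothetical helper: show quota and usage for one user on one dataset.
# "zfs get -H -o value <property> <dataset>" prints only the value,
# without headers, which makes it convenient for scripting.
user_space_report() {
    dataset="$1"
    user="$2"
    quota=$(zfs get -H -o value "userquota@${user}" "$dataset") || return 1
    used=$(zfs get -H -o value "userused@${user}" "$dataset") || return 1
    printf '%s on %s: used %s of quota %s\n' "$user" "$dataset" "$used" "$quota"
}
```

After something like "zfs set userquota@joe=10G pool/home", calling "user_space_report pool/home joe" would print a one-line summary for that user.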

Final Thoughts

As you have probably noticed, some ZFS dataset properties are not fully implemented with ZFS on Linux, such as sharing a volume via iSCSI. Other dataset properties apply to the whole pool, as is the case with deduplication, even though they are set on specific datasets. Many properties only apply to newly written data, and are not retroactive. As such, be aware of each property, and the pros/cons of what it provides. Because the parent storage pool is also a valid ZFS dataset, any child datasets will inherit non-default properties, as seen. And, the same is true for nested datasets, snapshots and volumes.

With ZFS dataset properties, you now have all the tuning at your fingertips to set up a solid ZFS storage backend. And everything has been handled with the "zfs" command, and its necessary subcommands. In fact, up to this point, we've only learned two commands: "zpool" and "zfs", yet we've been able to build and configure powerful, large, redundant, consistent, fast and tuned ZFS filesystems. This is unprecedented in the storage world, especially with GNU/Linux. The only thing left to discuss is some best practices and caveats, and then a brief post on the "zdb" command (which you should never need), and we'll be done with this series. Hell, if you've made it this far, I commend you. This has been no small series (believe me, my fingers hate me).

ZFS Administration, Part XV- iSCSI, NFS and Samba

Table of Contents

Zpool Administration | ZFS Administration | Appendices
0. Install ZFS on Debian GNU/Linux | 9. Copy-on-write | A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs | 10. Creating Filesystems | B. Using USB Drives
2. RAIDZ | 11. Compression and Deduplication | C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL) | 12. Snapshots and Clones | D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC) | 13. Sending and Receiving Filesystems |
5. Exporting and Importing Storage Pools | 14. ZVOLs |
6. Scrub and Resilver | 15. iSCSI, NFS and Samba |
7. Getting and Setting Properties | 16. Getting and Setting Properties |
8. Best Practices and Caveats | 17. Best Practices and Caveats |

I spent the previous week celebrating the Christmas holiday with family and friends, and as a result, took a break from blogging. However, other than the New Year, I'm finished with holidays for a while, and eager to get back to blogging, and finishing off this series. Only a handful of posts left to go. So, let's continue our discussion of ZFS administration on GNU/Linux, by discussing sharing datasets.


I have been trying to keep these ZFS posts as operating system agnostic as possible. Even though they have had a slant towards Linux kernels, you should be able to take much of this to BSD or any of the Solaris derivatives, such as OpenIndiana or Nexenta. This post, however, is going to be Linux-kernel specific, and even Ubuntu and Debian specific at that. The reason is that iSCSI support is not compiled into ZFS on Linux as of the writing of this post, but sharing via NFS and SMB is. Further, the implementation details for sharing via NFS and SMB will be specific to Debian and Ubuntu in this post. So, you may need to make adjustments if using Fedora, openSUSE, Gentoo, Arch, et cetera.


You are probably asking why you would want to use ZFS specific sharing of datasets rather than using the "tried and true" methods with standard software. The reason is simple. When the system boots up, and goes through its service initialization process (typically by executing the shell scripts found in /etc/init.d/), it has a method to the madness on which service starts first. Loosely speaking, filesystems are mounted first, then networking is enabled, and services are started last. Some of these are tied together, such as NFS exports, which require the filesystem to be mounted, a firewall in place, networking started, and the NFS daemon running. But, what happens when the filesystem is not mounted? If the directory is still accessible, it will be exported via NFS, and applications could begin dumping data into the export. This could lead to all sorts of issues, such as data inconsistencies. As such, administrators have put checks into place, such as exporting only nested directories in the mount point, which would not be available if the filesystem fails to mount. These are clever hacks, but certainly not elegant.
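To illustrate the kind of check an administrator might bolt on, here is a minimal sketch of a guard that refuses to go ahead unless a directory really is a mount point. The function name "safe_to_export" is hypothetical; the only real tool used is "mountpoint -q" from util-linux, which exits 0 when the given path is a mount point.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the directory is an active mount
# point, so an empty, unmounted directory is never exported by mistake.
safe_to_export() {
    dir="$1"
    if mountpoint -q "$dir"; then
        echo "$dir is mounted; safe to export"
        return 0
    fi
    echo "$dir is NOT mounted; refusing to export" >&2
    return 1
}
```

An init script could call "safe_to_export /srv" before touching the NFS daemon. It works, but it is exactly the sort of extra moving part that tying the share to the filesystem makes unnecessary.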

When tying the export directly into the filesystem, you can solve this beautifully, which ZFS does. In the case of ZFS, you can share a specific dataset via NFS, for example. However, if the dataset does not mount, then the export will not be available to the application, and the NFS client will block. Because the network share is inherent to the filesystem, there is no concern for data inconsistencies, and no need for silly check hacks or scripts. As a result, ZFS from Oracle has the ability to share a dataset via NFS, SMB (CIFS or Samba) and iSCSI. ZFS on Linux only supports NFS and SMB currently, with iSCSI support on the way.

In each case, you still must install the necessary daemon software to make the share available. For example, if you wish to share a dataset via NFS, then you need to install the NFS server software, and it must be running. Then, all you need to do is flip the sharing NFS switch on the dataset, and it will be immediately available.

Sharing via NFS

To share a dataset via NFS, you first need to make sure the NFS daemon is running. On Debian and Ubuntu, this is the "nfs-kernel-server" package. Further, with Debian and Ubuntu, the NFS daemon will not start unless there is an export in the /etc/exports file. So, you have two options: you can create a dummy export, only available to localhost, or you can edit the init script to start without checking for a current export. I prefer the former. Let's get that set up:

$ sudo aptitude install -R nfs-kernel-server
$ echo '/mnt localhost(ro)' | sudo tee -a /etc/exports > /dev/null
$ sudo /etc/init.d/nfs-kernel-server start
$ showmount -e
Export list for
/mnt localhost

With our NFS daemon running, we can now start sharing ZFS datasets. I'll assume that you have already created your dataset, it's mounted, and you're ready to start committing data to it. You'll notice in the zfs(8) manpage that the "sharenfs" property can be "on", "off" or "opts", where "opts" are valid NFS export options. So, if I wanted to share my "pool/srv" dataset, which is mounted to "/srv", to the network, I could do something like:

# zfs set sharenfs="rw=@" pool/srv
# zfs share pool/srv
# showmount -e
Export list for
/mnt localhost
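If you want to confirm what ZFS itself thinks about the share, independent of what showmount reports, a small sketch like the following reads the property back. The function name "nfs_share_status" is an assumption for illustration; it relies only on "zfs get -H -o value sharenfs".

```shell
#!/bin/sh
# Hypothetical helper: report whether a dataset is flagged for NFS sharing.
# Any "sharenfs" value other than "off" (for example "on", or a string of
# export options) means the dataset is set to be shared.
nfs_share_status() {
    dataset="$1"
    value=$(zfs get -H -o value sharenfs "$dataset") || return 1
    if [ "$value" = "off" ]; then
        echo "$dataset: not shared via NFS"
    else
        echo "$dataset: shared via NFS (sharenfs=$value)"
    fi
}
```

Running "nfs_share_status pool/srv" after setting the property should echo back whatever options you passed to "zfs set".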

If you want your ZFS datasets to be shared on boot, then you need to install the /etc/default/zfs config file. If using the Ubuntu PPA, this will be installed by default for you. If compiling from source, this will not be provided. Here are the contents of that file. The two lines that should be modified for persistence across boots, if you want to enable sharing via NFS, are the ZFS_MOUNT and ZFS_SHARE lines. Default is 'no':

$ cat /etc/default/zfs
# /etc/default/zfs
# Instead of changing these default ZFS options, Debian systems should install
# the zfs-mount package, and Ubuntu systems should install the zfs-mountall
# package. The debian-zfs and ubuntu-zfs metapackages ensure a correct system
# configuration.
# If the system runs parallel init jobs, like upstart or systemd, then the
# `zfs mount -a` command races in a way that causes sporadic mount failures.

# Automatically run `zfs mount -a` at system start. Disabled by default.
ZFS_MOUNT='yes'

# Automatically run `zfs share -a` at system start. Disabled by default.
# Requires nfsd and/or smbd. Incompletely implemented for Linux.
ZFS_SHARE='yes'

As mentioned in the comments, running a parallel init system creates problems for ZFS. This is something I recently banged my head against, as my /var/log/ and /var/cache/ datasets were not mounting on boot. To fix the problem, and run a serialized boot, thus ensuring that everything gets executed in the proper order, you need to touch a file:

# touch /etc/init.d/.legacy-bootordering

This will add time to your bootup, but given the fact that my system is up months at a time, I'm not worried about the extra 5 seconds this puts on my boot. This behavior is documented in the /etc/init.d/rc script, which sets the "CONCURRENCY=none" variable.

You should now be able to mount the NFS export from an NFS client:

(client)# mount -t nfs /mnt

Sharing via SMB

Currently, SMB integration is not working 100%. See bug #1170, which I reported on GitHub. However, when things get working, this will likely be the way.

As with NFS, to share a ZFS dataset via SMB/CIFS, you need to have the daemon installed and running. Recently, the Samba development team released Samba version 4. This release gives the Free Software world a Free Software implementation of Active Directory running on GNU/Linux systems, SMB 2.1 file sharing support, clustered file servers, and much more. Currently, Debian testing has the beta 2 packages. Debian experimental has the stable release, and it may make its way up the chain for the next stable release. One can hope. Samba v4 is not needed to share ZFS datasets via SMB/CIFS, but it's worth mentioning. We'll stick with version 3 of the Samba packages, until version 4 stabilizes.

# aptitude install -R samba samba-client samba-doc samba-tools samba-doc-pdf
# ps -ef | grep smb
root     22413     1  0 09:05 ?        00:00:00 /usr/sbin/smbd -D
root     22423 22413  0 09:05 ?        00:00:00 /usr/sbin/smbd -D
root     22451 21308  0 09:06 pts/1    00:00:00 grep smb

At this point, all we need to do is share the dataset, and verify that it's been shared. It is also worth noting that Microsoft Windows machines are not case sensitive, as things are in Unix. As such, if you are in a heterogeneous environment, it may be worth disabling case sensitivity on the ZFS dataset. This value can only be set at creation time. So, you may wish to issue the following when creating that dataset:

# zfs create -o casesensitivity=mixed pool/srv

Now you can continue with configuring the rest of the dataset:

# zfs set sharesmb=on pool/srv
# zfs share pool/srv
# smbclient -U guest -N -L localhost
Domain=[WORKGROUP] OS=[Unix] Server=[Samba 3.6.6]

        Sharename       Type      Comment
        ---------       ----      -------
        print$          Disk      Printer Drivers
        sysvol          Disk      
        netlogon        Disk      
        IPC$            IPC       IPC Service (eightyeight server)
        Canon-imageRunner-3300 Printer   Canon imageRunner 3300
        HP-Color-LaserJet-3600 Printer   HP Color LaserJet 3600
        salesprinter    Printer   Canon ImageClass MF7460
        pool_srv        Disk      Comment: /srv
Domain=[WORKGROUP] OS=[Unix] Server=[Samba 3.6.6]

        Server               Comment
        ---------            -------
        EIGHTYEIGHT          eightyeight server

        Workgroup            Master
        ---------            -------
        WORKGROUP            EIGHTYEIGHT

You can see that in this environment (my workstation hostname is 'eightyeight'), there are some printers being shared, and a couple of disks. The disk that we are sharing is the "pool_srv" entry in the output, which verifies that it is working correctly. So, we should be able to mount that share as a CIFS mount, and access the data:

# aptitude install -R cifs-utils
# mount -t cifs -o username=USERNAME //localhost/pool_srv /mnt
# ls /mnt
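Judging from the smbclient listing earlier, the share for the "pool/srv" dataset shows up as "pool_srv", that is, the dataset name with the slash replaced by an underscore. Assuming that pattern holds for your datasets (it is an inference from that one listing, not a documented guarantee here), a tiny sketch for deriving the expected share name might look like this. The function name "smb_share_name" is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: guess the SMB share name for a ZFS dataset by
# replacing every "/" in the dataset name with "_".
# This mirrors the "pool/srv" -> "pool_srv" mapping seen in the listing;
# treat it as an assumption and confirm with "smbclient -L".
smb_share_name() {
    printf '%s\n' "$1" | tr '/' '_'
}
```

For example, "smb_share_name pool/srv" prints "pool_srv", which you can then look for in the smbclient output.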

Sharing via iSCSI

Unfortunately, sharing ZFS datasets via iSCSI is not yet supported with ZFS on Linux. However, it is available in the Illumos source code upstream, and work is being done to get it working in GNU/Linux. As with SMB and NFS, you will need the iSCSI daemon installed and running. When support is enabled, I'll finish writing up this post on demonstrating how you can access iSCSI targets that are ZFS datasets. In the meantime, you would do something like the following:

# aptitude install -R open-iscsi
# zfs set shareiscsi=on pool/srv

At which point, from the iSCSI client, you would access the target, format it, mount it, and start working with the data.

ZFS Administration, Part XIV- ZVOLs

What is a ZVOL?

A ZVOL is a "ZFS volume" that has been exported to the system as a block device. So far, when dealing with the ZFS filesystem, other than creating our pool, we haven't dealt with block devices at all, even when mounting the datasets. It's almost like ZFS is behaving like a userspace application more than a filesystem. I mean, on GNU/Linux, when working with filesystems, you're constantly working with block devices, whether they be full disks, partitions, RAID arrays or logical volumes. Yet somehow, we've managed to escape all that with ZFS. Well, not any longer. Now we get our hands dirty with ZVOLs.

A ZVOL is a ZFS block device that resides in your storage pool. This means that the single block device gets to take advantage of your underlying RAID array, such as mirrors or RAID-Z. It gets to take advantage of the copy-on-write benefits, such as snapshots. It gets to take advantage of online scrubbing, compression and data deduplication. It gets to take advantage of the ZIL and ARC. Because it's a legitimate block device, you can do some very interesting things with your ZVOL. We'll look at three of them here- swap, ext4, and VM storage. First, we need to learn how to create a ZVOL.

Creating a ZVOL

To create a ZVOL, we use the "-V" switch with our "zfs create" command, and give it a size. For example, if I wanted to create a 1 GB ZVOL, I could issue the following command. Notice further that there are a couple of new symlinks in /dev/zvol/tank/ and /dev/tank/ which point to a new block device in /dev/:

# zfs create -V 1G tank/disk1
# ls -l /dev/zvol/tank/disk1
lrwxrwxrwx 1 root root 11 Dec 20 22:10 /dev/zvol/tank/disk1 -> ../../zd144
# ls -l /dev/tank/disk1
lrwxrwxrwx 1 root root 8 Dec 20 22:10 /dev/tank/disk1 -> ../zd144
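If you would rather not read the "ls -l" output by eye, the symlink can be resolved directly. The following is a minimal sketch; the function name "zvol_device" is an assumption, and it uses only "readlink -f", which follows a symlink and prints the canonical path.

```shell
#!/bin/sh
# Hypothetical helper: resolve a ZVOL symlink (for example
# /dev/zvol/tank/disk1) to the underlying block device node.
zvol_device() {
    link="$1"
    if [ ! -L "$link" ]; then
        echo "$link is not a symlink" >&2
        return 1
    fi
    readlink -f "$link"
}
```

With the ZVOL created above, "zvol_device /dev/zvol/tank/disk1" should print the same /dev/zd* node that both symlinks point at.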

Because this is a full fledged, 100% bona fide block device that is 1 GB in size, we can do anything with it that we would do with any other block device, and we get all the benefits of ZFS underneath. Plus, creating a ZVOL is near instantaneous, regardless of size. Now, I could create a block device with GNU/Linux from a file on the filesystem. For example, if running ext4, I can create a 1 GB file, then make a block device out of it as follows:

# fallocate -l 1G /tmp/file.img
# losetup /dev/loop0 /tmp/file.img

I now have the block device /dev/loop0 that represents my 1 GB file. Just as with any other block device, I can format it, add it to swap, etc. But it's not as elegant, and it has severe limitations. First off, by default you only have 8 loopback devices for your exported block devices. You can change this number, however. With ZFS, you can create 2^64 ZVOLs by default. Also, it requires a preallocated image, on top of your filesystem. So, you are managing three layers of data: the block device, the file, and the blocks on the filesystem. With ZVOLs, the block device is exported right off the storage pool, just like any other dataset.

Let's look at some things we can do with this ZVOL.

Swap on a ZVOL

Personally, I'm not a big fan of swap. I understand that it's a physical extension of RAM, but swap is only used when RAM fills, spilling the cache. If this is happening regularly and consistently, then you should probably look into getting more RAM. It can act as part of a healthy system, keeping RAM dedicated to what the kernel actively needs. But, when active RAM starts spilling over to swap, then you have "the swap of death", as your disks thrash, trying to keep up with the demands of the kernel. So, depending on your system and needs, you may or may not need swap.

First, let's create a 1 GB block device for our swap. We'll call the dataset "tank/swap" to make it easy to identify its intention. Before we begin, let's check out how much swap we currently have on our system with the "free" command:

# free
             total       used       free     shared    buffers     cached
Mem:      12327288    8637124    3690164          0     175264    1276812
-/+ buffers/cache:    7185048    5142240
Swap:            0          0          0

In this case, we do not have any swap enabled. So, let's create 1 GB of swap on a ZVOL, and add it to the kernel:

# zfs create -V 1G tank/swap
# mkswap /dev/zvol/tank/swap
# swapon /dev/zvol/tank/swap
# free
             total       used       free     shared    buffers     cached
Mem:      12327288    8667492    3659796          0     175268    1276804
-/+ buffers/cache:    7215420    5111868
Swap:      1048572          0    1048572

It worked! We have a legitimate Linux kernel swap device on top of ZFS. Sweet. As is typical with swap devices, they don't have a mountpoint. They are either enabled, or disabled, and this swap device is no different.
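The "swapon" above only lasts until the next reboot. One common way to bring a swap device back at boot is an /etc/fstab entry; whether that ordering works with your ZFS init scripts is something you would need to verify, so treat the following as a sketch under that assumption. The function name "swap_fstab_entry" is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/fstab line for a swap device.
# Fields: device, mount point (none), type (swap), options, dump, pass.
swap_fstab_entry() {
    printf '%s none swap defaults 0 0\n' "$1"
}
```

As root, "swap_fstab_entry /dev/zvol/tank/swap >> /etc/fstab" would append the line, and "swapon -a" should then enable everything listed in the file.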

Ext4 on a ZVOL

This may sound wacky, but you could put another filesystem, and mount it, on top of a ZVOL. In other words, you could have an ext4 formatted ZVOL and mounted to /mnt. You could even partition your ZVOL, and put multiple filesystems on it. Let's do that!

# zfs create -V 100G tank/ext4
# fdisk /dev/tank/ext4
( follow the prompts to create 2 partitions- the first 1 GB in size, the second to fill the rest )
# fdisk -l /dev/tank/ext4

Disk /dev/tank/ext4: 107.4 GB, 107374182400 bytes
16 heads, 63 sectors/track, 208050 cylinders, total 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disk identifier: 0x000a0d54

          Device Boot      Start         End      Blocks   Id  System
/dev/tank/ext4p1            2048     2099199     1048576   83  Linux
/dev/tank/ext4p2         2099200   209715199   103808000   83  Linux
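If you want to sanity check the fdisk figures, the numbers line up with simple arithmetic. This short sketch recomputes the disk size and the "Blocks" column from the sector values printed above; the variable names are mine, and the inputs are copied straight from that output.

```shell
#!/bin/sh
# Recompute the fdisk figures above using shell arithmetic.
sector_size=512

# Whole disk: total sectors times bytes per sector.
disk_bytes=$((209715200 * sector_size))

# A partition spans Start..End inclusive, hence the "+ 1".
p1_sectors=$((2099199 - 2048 + 1))
p2_sectors=$((209715199 - 2099200 + 1))

# fdisk reports "Blocks" in units of 1024 bytes (two 512-byte sectors).
p1_blocks=$((p1_sectors * sector_size / 1024))
p2_blocks=$((p2_sectors * sector_size / 1024))

echo "disk bytes: $disk_bytes"
echo "p1 blocks:  $p1_blocks"
echo "p2 blocks:  $p2_blocks"
```

The results match the listing: 107374182400 bytes for the disk, 1048576 blocks for the first partition, and 103808000 blocks for the second.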

Let's create some filesystems, and mount them:

# mkfs.ext4 /dev/zd0p1
# mkfs.ext4 /dev/zd0p2
# mkdir /mnt/zd0p{1,2}
# mount /dev/zd0p1 /mnt/zd0p1
# mount /dev/zd0p2 /mnt/zd0p2

Enable compression on the ZVOL, copy over some data, then take a snapshot:

# zfs set compression=lzjb tank/ext4
# tar -cf /mnt/zd0p1/files.tar /etc/
# tar -cf /mnt/zd0p2/files.tar /etc /var/log/
# zfs snapshot tank/ext4@001

You probably didn't notice, but you just enabled transparent compression and took a snapshot of your ext4 filesystem. These are two things you can't do with ext4 natively. You also have all the benefits of ZFS that ext4 normally couldn't give you. So, now you regularly snapshot your data, you perform online scrubs, and send it offsite for backup. Most importantly, your data is consistent.
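To see that compression and snapshots really are in effect underneath ext4, you can ask ZFS directly. Here is a minimal sketch; the function name "zvol_report" is an assumption, and it uses only "zfs get -H -o value compressratio" and "zfs list -H -t snapshot -o name -r".

```shell
#!/bin/sh
# Hypothetical helper: summarize the compression ratio and the number of
# snapshots for a dataset or ZVOL.
zvol_report() {
    dataset="$1"
    ratio=$(zfs get -H -o value compressratio "$dataset") || return 1
    # One snapshot name per line, so counting lines counts snapshots.
    count=$(zfs list -H -t snapshot -o name -r "$dataset" | wc -l)
    echo "$dataset: compressratio=$ratio snapshots=$((count))"
}
```

Calling "zvol_report tank/ext4" after the steps above should show at least one snapshot, and a compression ratio that depends on the data you copied in.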

ZVOL storage for VMs

Lastly, you can use these block devices as the backend storage for VMs. It's not uncommon to create logical volume block devices as the backend for VM storage. After having the block device available for Qemu, you attach the block device to the virtual machine, and from its perspective, you have a "/dev/vda" or "/dev/sda" depending on the setup.

If using libvirt, you would have a /etc/libvirt/qemu/vm.xml file. In that file, you could have the following, where "/dev/zd0" is the ZVOL block device:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/zd0'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

At this point, your VM gets all the ZFS benefits underneath, such as snapshots, compression, deduplication, data integrity, drive redundancy, etc.


ZVOLs are a great way to get to block devices quickly while taking advantage of all of the underlying ZFS features. Using the ZVOLs as the VM backing storage is especially attractive. However, I should note that when using ZVOLs, you cannot replicate them across a cluster. ZFS is not a clustered filesystem. If you want data replication across a cluster, then you should not use ZVOLs, and use file images for your VM backing storage instead. Other than that, you get all of the amazing benefits of ZFS that we have been blogging about up to this point, and beyond, for whatever data resides on your ZVOL.