
ZFS Administration, Part II- RAIDZ

Table of Contents

Zpool Administration
  • 0. Install ZFS on Debian GNU/Linux
  • 1. VDEVs
  • 2. RAIDZ
  • 3. The ZFS Intent Log (ZIL)
  • 4. The Adjustable Replacement Cache (ARC)
  • 5. Exporting and Importing Storage Pools
  • 6. Scrub and Resilver
  • 7. Getting and Setting Properties
  • 8. Best Practices and Caveats

ZFS Administration
  • 9. Copy-on-write
  • 10. Creating Filesystems
  • 11. Compression and Deduplication
  • 12. Snapshots and Clones
  • 13. Sending and Receiving Filesystems
  • 14. ZVOLs
  • 15. iSCSI, NFS and Samba
  • 16. Getting and Setting Properties
  • 17. Best Practices and Caveats

Appendices
  • A. Visualizing The ZFS Intent Log (ZIL)
  • B. Using USB Drives
  • C. Why You Should Use ECC RAM
  • D. The True Cost Of Deduplication

The previous post introduced readers to the concept of VDEVs with ZFS. This post continues the topic, discussing the RAIDZ VDEVs in great detail.

Standard Parity RAID

To understand RAIDZ, you first need to understand parity-based RAID levels, such as RAID-5 and RAID-6. Let's discuss the standard RAID-5 layout. You need a minimum of 3 disks for a proper RAID-5 array. The data is striped across two of the disks. A parity bit is then calculated such that the XOR of all blocks in the stripe comes out to zero. The parity is then written to the third disk. This allows you to suffer one disk failure and recalculate the missing data. Further, in RAID-5, no single disk in the array is dedicated to the parity data. Instead, the parity is distributed throughout all of the disks. Thus, any disk can fail, and the data can still be restored.
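
To make the parity idea concrete, here is a tiny illustration you can run in a Bash shell. The byte values are made up for the demonstration, and real RAID-5 works on whole chunks of the stripe rather than single bytes; the point is only the arithmetic. The parity is simply the XOR of the data, so XOR-ing the parity with whatever survives gives back what was lost:

# d1=0xA5; d2=0x3C; parity=$(( d1 ^ d2 ))
# printf 'parity     = 0x%02X\n' "$parity"
parity     = 0x99
# printf 'rebuilt d2 = 0x%02X\n' $(( d1 ^ parity ))
rebuilt d2 = 0x3C

The same identity holds for any number of data blocks in the stripe, which is why a single missing block can always be rebuilt from the remaining blocks plus the parity.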

However, we have a problem. Suppose that you write the data out in the RAID-5 stripe, but a power outage occurs before you can write the parity. You now have inconsistent data. Jeff Bonwick, the creator of ZFS, refers to this as the "RAID-5 write hole". In reality, it's a problem, no matter how small, for all parity-based RAID arrays. If there exists any possibility that you can write the data blocks without writing the parity, then we have the "write hole". What sucks is that software-based RAID is not aware that a problem exists. Now, there are software work-arounds to identify that the parity is inconsistent with the data, but they're slow and not reliable. As a result, software-based RAID has fallen out of favor with storage administrators. Instead, expensive (and failure-prone) hardware cards, with battery backups on the card, have become commonplace.

There is also a big performance problem to deal with. If the data being written is smaller than the stripe size, then the rest of the stripe must be read back and the parity recalculated. This causes you to read and write data that is not pertinent to the application. Rather than reading only live, running data, you spend a great deal of time reading "dead" or old data. As a result, expensive battery-backed NVRAM hardware RAID cards are used to hide this latency from the user: the write sits in the NVRAM buffer while the card works on the stripe, until it has been flushed to disk.

In both cases (the RAID-5 write hole, and writing data to disk that is smaller than the stripe size), the atomic, transactional nature of ZFS cannot rely on the hardware solutions, and the existing software solutions open up the possibility of corrupted data. So, we need to rethink parity-based RAID.

ZFS RAIDZ

Enter RAIDZ. Rather than the stripe width being statically set at creation, the stripe width is dynamic. Every block transactionally flushed to disk is its own stripe width. Every RAIDZ write is a full stripe write. Further, the parity is flushed with the stripe simultaneously, completely eliminating the RAID-5 write hole. So, in the event of a power failure, you either have the latest flush of data, or you don't. But your disks will not be inconsistent.

[Image: the stripe layout differences between RAID-5 and RAIDZ-1, demonstrating the dynamic stripe width of RAIDZ]

There's a catch, however. With standardized parity-based RAID, the logic is as simple as "every disk XORs to zero". With a dynamic, variable stripe width, such as RAIDZ, this doesn't work. Instead, we must pull up the ZFS metadata to determine the RAIDZ geometry on every read. If you're paying attention, you'll notice that this is impossible if the filesystem and the RAID are separate products; your RAID card knows nothing of your filesystem, and vice-versa. This is what makes ZFS win.

Further, because ZFS knows about the underlying RAID, performance isn't an issue unless the disks are full. Reading filesystem metadata to construct the RAID stripe means only reading live, running data. There is no worry about reading "dead" data or unallocated space. So, metadata traversal of the filesystem can actually be faster in many respects. You don't need expensive NVRAM to buffer your writes, nor do you need it for battery backup in the event of the RAID write hole. So, ZFS comes back to the old promise of a "Redundant Array of Inexpensive Disks". In fact, it's highly recommended that you use cheap SATA disks, rather than expensive Fibre Channel or SAS disks, for ZFS.

Self-healing RAID

This brings us to the single largest reason why I've become such a ZFS fan. ZFS can detect silent errors, and fix them on the fly. Suppose for a moment that there is bad data on a disk in the array, for whatever reason. When the application requests the data, ZFS constructs the stripe as we just learned, and compares each block against its checksum stored in the metadata (fletcher4 by default; SHA-256 if you've enabled it). If the read stripe does not match the checksum, ZFS identifies the corrupted block, reads the parity, and fixes it through combinatorial reconstruction. It then returns good data to the application. This is all accomplished in ZFS itself, without the help of special hardware. Another aspect of the RAIDZ levels is that if the stripe is wider than the number of disks in the array and a disk fails, there may not be enough data plus parity left to reconstruct it. Thus, ZFS writes extra redundancy into such stripes to prevent this from happening.
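
You can exercise this self-healing on demand with a scrub, which walks every allocated block in the pool, verifies it against its checksum, and repairs anything that doesn't match. A minimal sketch follows; the "scan" line is illustrative output and the date is elided:

# zpool scrub tank
# zpool status tank | grep scan
 scan: scrub repaired 0 in 0h0m with 0 errors on ...

Scrubbing and resilvering get their own post later in this series.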

Again, if your RAID and filesystem are separate products, they are not aware of each other, so detecting and fixing silent data errors is not possible. So, with that out of the way, let's build some RAIDZ pools. As with my previous post, I'll be using five USB thumb drives (/dev/sde, /dev/sdf, /dev/sdg, /dev/sdh and /dev/sdi), each 8 GB in size.

RAIDZ-1

RAIDZ-1 is similar to RAID-5 in that there is a single parity block distributed across all the disks in the array. The stripe width is variable, and could cover the exact width of disks in the array, fewer disks, or more disks, as evident in the image above. The array can still survive one disk failure without losing data; two disk failures would result in data loss. A minimum of 3 disks should be used in a RAIDZ-1. The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus one disk for parity storage (there is a caveat to zpool storage sizes I'll get to in another post). So in my example, I should have roughly 16 GB of usable disk.

To set up a zpool with RAIDZ-1, we use the "raidz1" VDEV, in this case using only 3 USB drives:

# zpool create tank raidz1 sde sdf sdg
# zpool status tank
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
            sdg       ONLINE       0     0     0

errors: No known data errors
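
If you want to sanity-check that capacity math, compare what the pool reports for raw size against what the filesystem can actually use. A rough sketch with three 8 GB drives follows; the numbers shown are only illustrative, as label and metadata overhead will shave a bit off on a real system:

# zpool list tank
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank  22.2G   156K  22.2G     0%  1.00x  ONLINE  -
# zfs list tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  98.3K  14.6G  44.9K  /tank

Notice that "zpool list" reports the raw pool size with parity included, while "zfs list" shows the usable space; that difference is the caveat to zpool storage sizes mentioned above.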

Clean up before moving on, if you're following along in your terminal:

# zpool destroy tank

RAIDZ-2

RAIDZ-2 is similar to RAID-6 in that there are two parity blocks distributed across all the disks in the array. The stripe width is variable, and could cover the exact width of disks in the array, fewer disks, or more disks, as evident in the image above. The array can still survive two disk failures without losing data; three disk failures would result in data loss. A minimum of 4 disks should be used in a RAIDZ-2. The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus two disks for parity storage. So in my example, I should have roughly 16 GB of usable disk.

To set up a zpool with RAIDZ-2, we use the "raidz2" VDEV:

# zpool create tank raidz2 sde sdf sdg sdh
# zpool status tank
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
            sdg       ONLINE       0     0     0
            sdh       ONLINE       0     0     0

errors: No known data errors
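
To convince yourself that this pool really can ride out two failed disks, you can take two of the devices offline by hand, watch the pool degrade, and then bring them back. A quick sketch; the status output is abbreviated and only illustrative:

# zpool offline tank sdf sdg
# zpool status tank | grep -E 'state|OFFLINE'
 state: DEGRADED
            sdf       OFFLINE      0     0     0
            sdg       OFFLINE      0     0     0
# zpool online tank sdf sdg
# zpool status tank | grep state
 state: ONLINE

After the devices come back online, ZFS resilvers only the writes they missed, which should be nearly instantaneous on an idle test pool.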

Clean up before moving on, if you're following along in your terminal:

# zpool destroy tank

RAIDZ-3

RAIDZ-3 does not have a standardized RAID level to compare it to. However, it is the logical continuation of RAIDZ-1 and RAIDZ-2 in that there are three parity blocks distributed across all the disks in the array. The stripe width is variable, and could cover the exact width of disks in the array, fewer disks, or more disks, as evident in the image above. The array can still survive three disk failures without losing data; four disk failures would result in data loss. A minimum of 5 disks should be used in a RAIDZ-3. The capacity of your storage will be the number of disks in your array times the storage of the smallest disk, minus three disks for parity storage. So in my example, I should have roughly 16 GB of usable disk.

To set up a zpool with RAIDZ-3, we use the "raidz3" VDEV:

# zpool create tank raidz3 sde sdf sdg sdh sdi
# zpool status tank
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz3-0    ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
            sdg       ONLINE       0     0     0
            sdh       ONLINE       0     0     0
            sdi       ONLINE       0     0     0

errors: No known data errors

Clean up before moving on, if you're following along in your terminal:

# zpool destroy tank

Some final thoughts on RAIDZ

Various recommendations exist on when to use RAIDZ-1/2/3 and when not to. Some people say that RAIDZ-1 and RAIDZ-3 should use an odd number of disks: RAIDZ-1 should start with 3 and not exceed 7 disks in the array, while RAIDZ-3 should start at 7 and not exceed 15. RAIDZ-2 should use an even number of disks, starting with 6 disks and not exceeding 12. This is to ensure that you have an even number of disks that the data is actually being written to, and to maximize the performance of the array.

Instead, in my opinion, you should keep your RAIDZ array at a low power of 2, plus parity. For RAIDZ-1, this is 3, 5 and 9 disks. For RAIDZ-2, this is 4, 6, 10, and 18 disks. For RAIDZ-3, this is 5, 7, 11, and 19 disks. If going north of these recommendations, I would personally use RAID-1+0 setups. This is largely due to the time it will take to rebuild the data (called "resilvering", which a later post will cover). Because calculating the parity is so expensive, the more disks in the RAIDZ array, the more expensive this operation will be, as compared to RAID-1+0.
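
If the arithmetic behind those disk counts isn't obvious, the idea is a power-of-two number of data disks plus the parity disks for that RAIDZ level. A throwaway shell loop shows the pattern (the text above stops the RAIDZ-1 list at 9, since anything wider is better served by RAID-1+0):

# for p in 1 2 3; do
>     printf 'RAIDZ-%d:' "$p"
>     for n in 1 2 3 4; do printf ' %2d' $(( 2**n + p )); done
>     echo
> done
RAIDZ-1:  3  5  9 17
RAIDZ-2:  4  6 10 18
RAIDZ-3:  5  7 11 19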

Further, I've seen recommendations on the sizes that the disks should be, saying not to exceed 1 TB per disk for RAIDZ-1, 2 TB per disk for RAIDZ-2 and 3 TB per disk for RAIDZ-3. For sizes exceeding these values, you should use 2-way or 3-way mirrors with striping. Whether or not there is any validity to these claims, I cannot say. But I can tell you that with a smaller number of disks, you should use a RAID level that accommodates your shortcomings. In a 4-disk RAID array, as I have above, calculating multiple parity blocks can kill performance. Further, I could suffer at most two disk failures (if using RAID-1+0 or RAIDZ-2). RAIDZ-1 sits somewhere in the middle, where I can suffer a disk failure while still maintaining a decent level of performance. If I had, say, 12 disks in the array, then maybe RAIDZ-3 would be better suited, as the chance of suffering multiple disk failures increases.

Ultimately, you need to understand your storage problem and benchmark your disks. Put them in various RAID configurations, and use a utility such as IOzone 3 to benchmark and stress the array. You know what data you are going to store on the disks. You know what sort of hardware the disks are being installed into. You know what sort of performance you are looking for. It's your decision, and if you spend your time doing research, homework and sleuthing, you will arrive at the right decision. There may be "best practices", but they only go as far as your specific situation.
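
As a starting point for that kind of testing, a run along these lines against a file on the pool is one reasonable sketch; the flags and sizes are only suggestions, and the file size should comfortably exceed your RAM so you measure the disks rather than the ARC:

# iozone -s 16g -r 128k -i 0 -i 1 -i 2 -f /tank/iozone.tmp

Destroy and recreate the pool in each candidate layout (RAIDZ-1, RAIDZ-2, mirrors with striping, and so on) and rerun the same command, so the numbers are directly comparable.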

Lastly, in terms of performance, mirrors will always outperform the RAIDZ levels, on both reads and writes. Further, RAIDZ-1 will outperform RAIDZ-2, which in turn will outperform RAIDZ-3. The more parity you have to calculate, the longer it's going to take to both read and write the data. Of course, you can always stripe across multiple VDEVs to win back some of this performance. Nested RAID levels, such as RAID-1+0, are considered "the Cadillac of RAID levels" due to the flexibility with which you can lose disks without parity, and the throughput you get from the stripe. So, in a nutshell, from fastest to slowest, your non-nested RAID levels will perform as follows:

  • RAID-0 (fastest)
  • RAID-1
  • RAIDZ-1
  • RAIDZ-2
  • RAIDZ-3 (slowest)

{ 17 } Comments

  1. Jon using Firefox 17.0 on Windows 7 | December 5, 2012 at 9:11 am | Permalink

    Thanks for the pair of articles. I've started messing around with ZFS on one of the scrap servers that sits next to my desk. I've read the docs and FAQs, but it's good to see a different perspective on the basic setup.

    I look forward to your next article since, as of last night, one of the drives in the test server has started racking up SMART errors at an alarming rate. I guess I'll get to test resilvering in the real case and not just by faking a drive failure. :O

  2. Aaron Toponce using Debian IceWeasel 10.0.11 on GNU/Linux 64 bits | December 5, 2012 at 9:40 am | Permalink

    Np. However, the 3rd post will be covering more VDEVs (there is an order to my chaos). In this case, I'll be covering the L2ARC and the ZIL. Hope to have it up tomorrow morning. Might be a day late though.

  3. David using Firefox 17.0 on Ubuntu 64 bits | December 5, 2012 at 2:58 pm | Permalink

    Very helpful articles! I've been using ZFS for the past year, and have been extremely impressed by it. Looking forward to your L2ARC and ZIL article, as that's something we'll definitely be wanting to add in the near future.

  4. Mark using Internet Explorer 8.0 on Windows 7 | December 7, 2012 at 11:00 pm | Permalink

    Aaron, I've enjoyed reading the article. Is it really a bad idea to use 5 disks in a RAID-Z2 arrangement? I have 5 x 2TB disks that I want to use in my FreeNAS box, and prefer to have dual parity (rather than RAID-Z1).

  5. Aaron Toponce using Google Chrome 22.0.1229.94 on GNU/Linux 64 bits | December 8, 2012 at 7:43 am | Permalink

    "A bad idea", no. However, it's also not optimized. My hypervisors are using RAIDZ-1 with 4 disks, as I needed the space. My motherboard does not have enough SATA ports for 5 disks, and I need more space than what 3 disks would give. Thus, RAIDZ-1 on four disks it is. You do what you can.

  6. boneidol using Debian IceWeasel 15.0.1 on GNU/Linux 64 bits | December 28, 2012 at 7:36 pm | Permalink

    "In relatiy" <- trivial typo

  7. boneidol using Debian IceWeasel 15.0.1 on GNU/Linux 64 bits | December 28, 2012 at 7:42 pm | Permalink

    "Instead, in my opinion, you should keep your RAIDZ array at a low power of 2 plus parity. For RAIDZ-1, this is 3, 5 and 9 disks. For RAIDZ-2, this is 4, 8 and 16 disks. For RAIDZ-3, this is 5, 9 and 17 disks"

    hi I don't understand these numbers above

    Z1 = 2^1 + 1 , 2 ^2 + 1 , 2^3 +1 = 3,5,9
    Z2 = 2^1 + 2 , 2^2 +2 , 2^3 +2 = 4,6,10
    Z3 = 2^1 + 3 , 2^2 +3 , 2^3 +3 = 5,7,11

    Sorry!

  8. Aaron Toponce using Google Chrome 25.0.1364.5 on Mac OS | December 29, 2012 at 6:24 am | Permalink

    Fixed. Thanks!

  9. Alvin using Google Chrome 24.0.1312.57 on GNU/Linux 64 bits | February 2, 2013 at 9:53 pm | Permalink

    Okay here's one for you, I can't find ANY documentation ANYWHERE for using brackets (parentheses) to describe what drives to select when creating a zpool. For example, I am in a VERY sticky situation with money and physical drive constraints. I have figured out a method to make the best use of what I have but it results in a pretty unorthodox (yet completely redundant and failproof [1 drive]) way of getting it all to work AND maximize the use of my motherboard's ports to make it completely expandable in the future. I am basically creating a single-vdev pool containing a bunch of different raid levels, mirrors, and stripes.

    HOWEVER, this is how I have to do it, because of hardware constraints.
    If you were to imagine how to use the zpool create, this is how it would look USING BRACKETS. BUT THERE IS NO MENTION OF HOW TO USE BRACKETS PROPERLY in any zfs documentation. Basically either brackets, commas, &&s, etc, anything that would give me the desired effect.

    zpool create mycoolpool RAIDZ1 ((mirror A B) (mirror C D) (mirror E F) (G) (stripe H, I) (stripe J, K, L) (M))

    Yes I have 7 1TB 'blocks' or 'chunks' in a RAIDZ1, each consisting of different configurations.

    You see, if I were to do this without the brackets, it would create this mess:
    zpool create mycoolpool RAIDZ1 mirror a b mirror c d mirror e f g h i j k l m
    ^^Basically you see here that I would end up with a RAIDZ1 across 3 mirrors, the third of which consisting of a redundancy level such that 8 drives could fail... not what I want.

    And yes, I have indeed seen all the warnings and read countless people say "you shouldn't" but NEVER have I seen anyone deny that it could be done and NEVER have I seen anyone actually answer on HOW to do it.

    I've made up my mind that this is the method and approach that I need to take so please heed your warnings as much as you can as they will be said in vain.

    Thank you very much in advance for a response!!!

  10. Aaron Toponce using IceApe 2.7.11 on GNU/Linux 64 bits | February 7, 2013 at 10:24 am | Permalink

    No, this is not possible. Other than disks and files, you cannot nest VDEVs. ZFS stripes across RAIDZ and mirror VDEVs, and there's no way around it. You need to rethink your storage.

  11. ssl using Google Chrome 24.0.1312.57 on Mac OS | March 26, 2013 at 11:54 am | Permalink

    I don't quite understand how zfs could recover from certain single disk failures in your example (picture). Say, for example, you lost the last drive in your raidz-1 configuration as shown. For the long stripe (A), you lose the parity bit as well as the data in block A4... How could this possibly be recovered, unless zfs puts additional parity blocks in for all stripes whose length exceeds the number of disks??

  12. Aaron Toponce using Debian IceWeasel 10.0.12 on GNU/Linux 64 bits | March 27, 2013 at 1:44 pm | Permalink

    Correct. The image isn't 100% accurate. I may fix it, but yes. If you lose too much of a single stripe, then you can't recreate the data. For each stripe written, and this is where my image needs to be updated, a parity bit is written. So, if a stripe crosses the disks twice, then there will be extra parity bits.

    Thanks for pointing this out.

  13. Veniamin using Opera 9.80 on Windows 7 | April 30, 2013 at 12:54 am | Permalink

    Thanks for the article.
    I wonder how RAIDZ will work with two or more parity stripes.
    I think that in the case where the data is longer than recsize x n_data_disks, raidz splits it into several writes.

  14. Aaron Toponce using Google Chrome 31.0.1650.63 on GNU/Linux 64 bits | December 20, 2013 at 11:02 pm | Permalink

    I've updated the image (finally) to reflect the inconsistencies I had before.

  15. Heny using Google Chrome 21.0.1180.89 on Windows 7 | January 14, 2014 at 9:34 am | Permalink

    ZFS RAIDZ as declustered RAID, how to achieve it?

  16. Chris using Google Chrome 31.0.1650.63 on GNU/Linux 64 bits | April 3, 2014 at 2:01 pm | Permalink

    Great articles! Thanks a lot. I was wondering if you have any source for the comments on maximum drive size for the various raidz types? I am very interested in why someone thinks a maximum of 2TB for raidz-2 (as I want to create an array of 8 disks, each 4TB large, in a raidz-2 configuration).

  17. Aaron Toponce using Google Chrome 33.0.1750.152 on GNU/Linux 64 bits | April 12, 2014 at 3:36 pm | Permalink

    I haven't seen anything regarding maximum drive size. Of course, you need to benchmark your own system, but the more storage you have, the more storage you have. Generally speaking too, the more spindles you have, the better performance will be.

