ZFS Administration, Part VIII- Zpool Best Practices and Caveats

Table of Contents

Zpool Administration:
  0. Install ZFS on Debian GNU/Linux
  1. VDEVs
  2. RAIDZ
  3. The ZFS Intent Log (ZIL)
  4. The Adjustable Replacement Cache (ARC)
  5. Exporting and Importing Storage Pools
  6. Scrub and Resilver
  7. Getting and Setting Properties
  8. Best Practices and Caveats

ZFS Administration:
  9. Copy-on-write
  10. Creating Filesystems
  11. Compression and Deduplication
  12. Snapshots and Clones
  13. Sending and Receiving Filesystems
  14. ZVOLs
  15. iSCSI, NFS and Samba
  16. Getting and Setting Properties
  17. Best Practices and Caveats

Appendices:
  A. Visualizing The ZFS Intent Log (ZIL)
  B. Using USB Drives
  C. Why You Should Use ECC RAM
  D. The True Cost Of Deduplication

We now reach the end of ZFS storage pool administration, as this is the last post in that subtopic. After this, we move on to a few theoretical topics about ZFS that will lay the groundwork for ZFS Datasets. Our previous post covered the properties of a zpool. Without further ado, let's jump right in. First, we'll discuss the best practices for a ZFS storage pool, then we'll cover some of the caveats I think are important to know before building your pool.

Best Practices

As with all recommendations, some of these guidelines carry a great amount of weight, while others might not. You may not even be able to follow them as rigidly as you would like. Regardless, you should be aware of them, and I'll try to provide a reason for each. They're listed in no specific order. The idea of "best practices" is to optimize space efficiency and performance, and to ensure maximum data integrity.

  • Only run ZFS on 64-bit kernels. It has 64-bit specific code that 32-bit kernels cannot do anything with.
  • Install ZFS only on a system with lots of RAM. 1 GB is a bare minimum, 2 GB is better, 4 GB would be preferred to start. Remember, ZFS will use 7/8 of the available RAM for the ARC.
  • Use ECC RAM when possible for scrubbing data in registers and maintaining data consistency. The ARC is an actual read-only data cache of valuable data in RAM.
  • Use whole disks rather than partitions. ZFS can make better use of the on-disk cache as a result. If you must use partitions, back up the partition table, and take care when writing data to the other partitions, so you don't corrupt the data in your pool.
  • Keep each VDEV in a storage pool the same size. If VDEVs vary in size, ZFS will favor the larger VDEV, which could lead to performance bottlenecks.
  • Use redundancy when possible, as ZFS can and will want to correct data errors that exist in the pool. You cannot fix these errors if you do not have a redundant good copy elsewhere in the pool. Mirrors and RAID-Z levels accomplish this.
  • For the number of disks in the storage pool, use the "power of two plus parity" recommendation. This is for storage space efficiency and hitting the "sweet spot" in performance. So, for a RAIDZ-1 VDEV, use three (2+1), five (4+1), or nine (8+1) disks. For a RAIDZ-2 VDEV, use four (2+2), six (4+2), ten (8+2), or eighteen (16+2) disks. For a RAIDZ-3 VDEV, use five (2+3), seven (4+3), eleven (8+3), or nineteen (16+3) disks. For pools larger than this, consider striping across mirrored VDEVs.
  • Consider using RAIDZ-2 or RAIDZ-3 over RAIDZ-1. You've heard the phrase "when it rains, it pours". This is true for disk failures. If a disk fails in a RAIDZ-1, and the hot spare is getting resilvered, until the data is fully copied, you cannot afford another disk failure during the resilver, or you will suffer data loss. With RAIDZ-2, you can suffer two disk failures, instead of one, increasing the probability you have fully resilvered the necessary data before the second, and even third disk fails.
  • Perform regular (at least weekly) backups of the full storage pool. It's not a backup unless you have multiple copies. Redundant disks alone will not keep your data available through a power failure, hardware failure or disconnected cables.
  • Use hot spares to quickly recover from a damaged device. Set the "autoreplace" property to on for the pool (see the example commands after this list).
  • Consider using a hybrid storage pool with fast SSDs or NVRAM drives. Using a fast SLOG and L2ARC can greatly improve performance.
  • If using a hybrid storage pool with multiple devices, mirror the SLOG and stripe the L2ARC.
  • If using a hybrid storage pool, and partitioning the fast SSD or NVRAM drive, unless you know you will need it, 1 GB is likely sufficient for your SLOG. Use the rest of the SSD or NVRAM drive for the L2ARC. The more storage for the L2ARC, the better.
  • Keep pool capacity under 80% for best performance. Due to the copy-on-write nature of ZFS, the filesystem gets heavily fragmented. Email reports of capacity at least monthly.
  • Scrub consumer-grade SATA and SCSI disks weekly and enterprise-grade SAS and FC disks monthly.
  • Email reports of the storage pool health weekly for redundant arrays, and bi-weekly for non-redundant arrays.
  • When using advanced format disks that read and write data in 4 KB sectors, set the "ashift" value to 12 on pool creation for maximum performance. Default is 9 for 512-byte sectors.
  • Set "autoexpand" to on, so you can expand the storage pool automatically after all disks in the pool have been replaced with larger ones. Default is off.
  • Always export your storage pool when moving the disks from one physical system to another.
  • When considering performance, know that for sequential writes, mirrors will always outperform RAID-Z levels. For sequential reads, RAID-Z levels will perform more slowly than mirrors on smaller data blocks and faster on larger data blocks. For random reads and writes, mirrors and RAID-Z seem to perform in similar manners. Striped mirrors will outperform mirrors and RAID-Z in both sequential, and random reads and writes.
  • Compression is disabled by default, which doesn't make much sense with today's hardware. ZFS compression is extremely cheap, extremely fast, and barely adds any latency to reads and writes. In fact, in some scenarios, your disks will respond faster with compression enabled than disabled. A further benefit is the considerable space savings.
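
To tie a few of these recommendations together, here are some example commands. They assume a pool named "tank" and hypothetical device names (sde, sdf1, sdg2, and so on), so treat them as a sketch to adapt to your own hardware rather than something to run verbatim:

    # Add a hot spare and let ZFS swap it in automatically when a drive fails
    zpool add tank spare sde
    zpool set autoreplace=on tank

    # Build a hybrid pool: mirror the SLOG, stripe the L2ARC
    # (hypothetical layout: a small partition on each SSD for the log, the rest for cache)
    zpool add tank log mirror sdf1 sdg1
    zpool add tank cache sdf2 sdg2

    # Grow the pool automatically once every disk has been replaced with a larger one
    zpool set autoexpand=on tank

    # Enable compression (lz4 if your version supports it, otherwise compression=on)
    zfs set compression=lz4 tank

    # Keep an eye on capacity and health, and scrub on a regular schedule (a weekly cron job, for example)
    zpool list tank
    zpool status -x tank
    zpool scrub tank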

Caveats

The point of the caveat list is by no means to discourage you from using ZFS. Instead, as a storage administrator planning out your ZFS storage server, these are things that you should be aware of, so as not to catch you with your pants down, and without your data. If you don't heed these warnings, you could end up with corrupted data. The line may be blurred with the "best practices" list above. I've tried to keep this list to things that can lead to data corruption if ignored. Read and heed the caveats, and you should be good.

  • Your VDEVs determine the IOPS of the storage, and the slowest disk in that VDEV will determine the IOPS for the entire VDEV.
  • ZFS uses 1/64 of the available raw storage for metadata. So, if you purchased a 1 TB drive, the actual raw size is roughly 931 GiB. After ZFS takes its share, you will have roughly 917 GiB of available space. The "zfs list" command will show an accurate representation of your available storage. Plan your storage keeping this in mind.
  • ZFS wants to control the whole block stack. It checksums, resilvers live data instead of full disks, self-heals corrupted blocks, and a number of other unique features. If using a RAID card, make sure to configure it as a true JBOD (or "passthrough mode"), so ZFS can control the disks. If you can't do this with your RAID card, don't use it. Best to use a real HBA.
  • Do not use other volume management software beneath ZFS. ZFS will perform better, and ensure greater data integrity, if it has control of the whole block device stack. As such, avoid using dm-crypt, mdadm or LVM beneath ZFS.
  • Do not share a SLOG or L2ARC DEVICE across pools. Each pool should have its own physical DEVICE, not logical drive, as is the case with some PCI-Express SSD cards. Use the full card for one pool, and a different physical card for another pool. If you share a physical device, you will create race conditions, and could end up with corrupted data.
  • Do not share a single storage pool across different servers. ZFS is not a clustered filesystem. Use GlusterFS, Ceph, Lustre or some other clustered filesystem on top of the pool if you wish to have a shared storage backend.
  • Other than a spare, SLOG and L2ARC in your hybrid pool, do not mix VDEVs in a single pool. If one VDEV is a mirror, all VDEVs should be mirrors. If one VDEV is a RAIDZ-1, all VDEVs should be RAIDZ-1. Unless of course, you know what you are doing, and are willing to accept the consequences. ZFS attempts to balance the data across VDEVs. Having a VDEV of a different redundancy can lead to performance issues and space efficiency concerns, and make it very difficult to recover in the event of a failure.
  • Do not mix disk sizes or speeds in a single VDEV. Do mix fabrication dates, however, to prevent mass drive failure.
  • In fact, do not mix disk sizes or speeds in your storage pool at all.
  • Do not mix disk counts across VDEVs. If one VDEV uses 4 drives, all VDEVs should use 4 drives.
  • Do not put all the drives from a single controller in one VDEV. Plan your storage, such that if a controller fails, it affects only the number of disks necessary to keep the data online.
  • When using advanced format disks, you must set the ashift value to 12 at pool creation. It cannot be changed after the fact. Use "zpool create -o ashift=12 tank mirror sda sdb" as an example.
  • Hot spare disks will not be added to the VDEV to replace a failed drive by default. You MUST enable this feature. Set the autoreplace feature to on. Use "zpool set autoreplace=on tank" as an example.
  • The storage pool will not auto resize itself when all smaller drives in the pool have been replaced by larger ones. You MUST enable this feature, and you MUST enable it before replacing the first disk. Use "zpool set autoexpand=on tank" as an example.
  • ZFS does not restripe data in a VDEV nor across multiple VDEVs. Typically, when adding a new device to a RAID array, the RAID controller will rebuild the data, by creating a new stripe width. This will free up some space on the drives in the pool, as it copies data to the new disk. ZFS has no such mechanism. Eventually, over time, the disks will balance out due to the writes, but even a scrub will not rebuild the stripe width.
  • You cannot shrink a zpool, only grow it. This means you cannot remove VDEVs from a storage pool.
  • You can only remove drives from a mirrored VDEV, using the "zpool detach" command. You can, however, replace a drive with another drive in both RAIDZ and mirror VDEVs (see the example commands after this list).
  • Do not create a storage pool of files or ZVOLs from an existing zpool. Race conditions will be present, and you will end up with corrupted data. Always keep multiple pools separate.
  • The Linux kernel may not assign a drive the same drive letter at every boot. Thus, you should use the /dev/disk/by-id/ convention for your SLOG and L2ARC. If you don't, your zpool devices could end up as a SLOG device, which would in turn clobber your ZFS data.
  • Don't create massive storage pools "just because you can". Even though ZFS can create 78-bit storage pool sizes, that doesn't mean you need to create one.
  • Don't put production data directly into the root of the zpool. Use ZFS datasets instead.
  • Don't commit production data to file VDEVs. Only use file VDEVs for testing scripts or learning the ins and outs of ZFS.
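
A few of these caveats also come down to a handful of commands. The sketch below assumes a pool named "tank" and made-up disk identifiers under /dev/disk/by-id/; substitute your own device paths before running anything:

    # Create the pool with ashift=12 and stable by-id paths (ashift cannot be changed later)
    zpool create -o ashift=12 tank mirror \
        /dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_2

    # Detach a drive from a mirrored VDEV, or replace a drive in a mirror or RAIDZ VDEV
    zpool detach tank /dev/disk/by-id/ata-DISK_SERIAL_2
    zpool replace tank /dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_3

    # Compare raw pool space against usable space after the 1/64 metadata reservation
    zpool list tank
    zfs list tank

    # Export the pool before moving its disks to another system, then import it there
    zpool export tank
    zpool import tank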

If there is anything I missed, or something needs to be corrected, feel free to add it in the comments below.

{ 10 } Comments

  1. boneidol using Debian IceWeasel 15.0.1 on GNU/Linux 64 bits | December 28, 2012 at 8:10 pm | Permalink

    "For the number of disks in the storage pool, use the “power of two plus parity” recommendation. This is for storage space efficiency and hitting the “sweet spot” in performance. So, for a RAIDZ-1 VDEV, use three (2+1), five (4+1), or nine (8+1) disks. For a RAIDZ-2 VDEV, use four (2+2), six (4+2), ten (8+2), or eighteen (16+2) disks. For a RAIDZ-3 VDEV, use five (2+3), seven (4+3), eleven (8+3), or nineteen (16+3) disks. For pools larger than this, consider striping across mirrored VDEVs."

    this differs from http://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/ ( and agrees with my math ( and your instructions :-) )

  2. Aaron Toponce using Google Chrome 25.0.1364.5 on Mac OS | December 29, 2012 at 6:24 am | Permalink

    Fixed the numbers in the post. Thanks!

  3. Roger Hunwicks using Firefox 17.0 on Ubuntu 64 bits | January 12, 2013 at 11:00 am | Permalink

    When you say "zpool set auoresize=on tank”

    Do you really mean "zpool set autoexpand=on tank"

    I get an "invalid property" for both "set autoresize" and "set auoresize".

    Great series - thanks :-)

  4. Aaron Toponce using Google Chrome 22.0.1229.94 on GNU/Linux 64 bits | January 14, 2013 at 9:13 am | Permalink

    Yes. Typo. You know how when you get a word in your head, it seems to get applied to everything you type? Yeah. That happened here. Thanks for the notice. Fixed.

  5. Dzezik using Firefox 15.0.1 on Windows 7 | September 12, 2013 at 2:20 pm | Permalink

    raw size of 1TB drive is 931GiB
    -> 1TB is 10^12, GiB is 2^30
    -> (10^12)/(2^30)~=931

    so
    ZFS gives You 916,77GiB

  6. Ghat Yadav using Firefox 24.0 on GNU/Linux 64 bits | October 16, 2013 at 5:00 am | Permalink

    hi
    very nice and useful guide... however as a home user, I have a request for you to add one more section on how to add more drives to an existing pool.
    I started with a 4x4TB pool with RAIDZ-2. I have a 12-bay device, and just populated 4 slots for budget reasons; when I set it up I figured I would buy more disks in the future as they become cheaper... but it looks like that's not possible.
    once you create a zpool you cannot expand it (or am I wrong)...
    If I get 2 more 4TB disks, now if my budget allows, how do I best use them ?
    Ghat

  7. JohnC using Firefox 25.0 on Windows 7 | November 4, 2013 at 7:29 am | Permalink

    The Linux kernel may not assign a drive the same drive letter at every boot. Thus, you should use the /dev/disk/by-id/ convention for your SLOG and L2ARC. If you don't, your zpool devices "could end up as a SLOG device, which would in turn clobber your ZFS data."

    I think this has just happened to me. I had a controller fail; after a series of reboots, I acquired a new controller. Now the disks on the new controller are fine, but the other disks are "Faulted" with "corrupted data". I am sure the data is on them, but the order may be different. Loss of 8 of the 16 x 3TB drives in a RAIDZ-3 configuration is fatal.

    The status is Unavail with "the label is missing or invalid". How does one recover from this? Can it be done?

    Is there a fix for this

  8. Aaron Toponce using Google Chrome 30.0.1599.101 on GNU/Linux 64 bits | November 5, 2013 at 6:40 pm | Permalink

    Restore from backup. That's probably the best you've got at this point. Corrupted data, and too many lost drives out of a RAID are very likely not recoverable.

  9. Thumper advice using Google Chrome 27.0.1453.93 on GNU/Linux 64 bits | December 1, 2013 at 2:37 am | Permalink

    hi, I'm currently using a thumper (Sun X4500) and I'd like to give ZFS a try on my SL64 x86_64. I'd like to export 22 x 1 TB hard drives through NFS. I know that there are a lot of options, so basically I wanted to set up a RAIDZ-1 with 21 drives plus 1 spare drive. What do you think of that? What about dedicating drives to the ZIL and so on?
    Thanks.
    François

  10. Mark using Firefox 27.0 on Mac OS | March 12, 2014 at 4:56 am | Permalink

    "For the number of disks in the storage pool, use the "power of two plus parity" recommendation. "

    What is the rationale behind this? I have 5x4TB drives, which given the drive size means that I should be using RAIDZ-2, but that doesn't fit within the guidelines. What kind of performance hit can I expect to take? Is it mainly a CPU-bound issue?

    "Don't put production directly into the zpool. Use ZFS datasets instead. Don't commit production data to file VDEVs. Only use file VDEVs for testing scripts or learning the ins and outs of ZFS."

    Can you elaborate on this? I don't quite understand what you are saying in the above two points.
