
ZFS Administration, Part XVII- Best Practices and Caveats

Table of Contents

Zpool Administration:
  0. Install ZFS on Debian GNU/Linux
  1. VDEVs
  2. RAIDZ
  3. The ZFS Intent Log (ZIL)
  4. The Adjustable Replacement Cache (ARC)
  5. Exporting and Importing Storage Pools
  6. Scrub and Resilver
  7. Getting and Setting Properties
  8. Best Practices and Caveats

ZFS Administration:
  9. Copy-on-write
  10. Creating Filesystems
  11. Compression and Deduplication
  12. Snapshots and Clones
  13. Sending and Receiving Filesystems
  14. ZVOLs
  15. iSCSI, NFS and Samba
  16. Getting and Setting Properties
  17. Best Practices and Caveats

Appendices:
  A. Visualizing The ZFS Intent Log (ZIL)
  B. Using USB Drives
  C. Why You Should Use ECC RAM
  D. The True Cost Of Deduplication

Best Practices

As with all recommendations, some of these guidelines carry a great amount of weight, while others might not. You may not even be able to follow them as rigidly as you would like. Regardless, you should be aware of them, and I'll try to provide a reason for each. They're listed in no specific order. The idea of "best practices" is to optimize space efficiency and performance, and to ensure maximum data integrity.

  • Always enable compression. There is almost certainly no reason to keep it disabled. It barely touches the CPU or throughput to the drive, yet the benefits are amazing. (Example commands for this and a few of the other items follow this list.)
  • Unless you have the RAM, avoid using deduplication. Unlike compression, deduplication is very costly on the system. The deduplication table consumes massive amounts of RAM.
  • Avoid running a ZFS root filesystem on GNU/Linux for the time being. It's a bit too experimental for /boot and GRUB. However, do create datasets for /home/, /var/log/ and /var/cache/.
  • Snapshot frequently and regularly. Snapshots are cheap, and can keep a plethora of file versions over time. Consider using something like the zfs-auto-snapshot script.
  • Snapshots are not a backup. Use "zfs send" and "zfs receive" to send your ZFS snapshots to external storage.
  • If using NFS, use ZFS NFS rather than your native exports. This can ensure that the dataset is mounted and online before NFS clients begin sending data to the mountpoint.
  • Don't mix NFS kernel exports and ZFS NFS exports. This is difficult to administer and maintain.
  • For /home/ ZFS installations, set up nested datasets for each user. For example, pool/home/atoponce and pool/home/dobbs. Consider using quotas on the datasets.
  • When using "zfs send" and "zfs receive", send incremental streams with the "zfs send -i" switch. This can be an exceptional time saver.
  • Consider using "zfs send" over "rsync", as the "zfs send" command can preserve dataset properties.

Caveats

The point of the caveat list is by no means to discourage you from using ZFS. Instead, as a storage administrator planning out your ZFS storage server, these are things you should be aware of so you aren't caught with your pants down, and without your data. If you don't heed these warnings, you could end up with corrupted data. The line may be blurred with the "best practices" list above; I've tried to keep this list to items that can lead to data corruption if ignored. Read and heed the caveats, and you should be good.

  • A "zfs destroy" can cause downtime for other datasets. A "zfs destroy" will touch every file in the dataset that resides in the storage pool. The larger the dataset, the longer this will take, and it will use all the possible IOPS out of your drives to make it happen. Thus, if it take 2 hours to destroy the dataset, that's 2 hours of potential downtime for the other datasets in the pool.
  • Debian and Ubuntu will not start the NFS daemon without a valid export in the /etc/exports file. You must either modify the /etc/init.d/nfs init script to start without an export, or create a local dummy export.
  • Debian and Ubuntu, and probably other systems, use a parallelized boot. As such, init script execution order is no longer prioritized. This creates problems for mounting ZFS datasets on boot. For Debian and Ubuntu, touch the "/etc/init.d/.legacy-bootordering" file, and make sure that the /etc/init.d/zfs init script is the first to start, before all other services in that runlevel.
  • Do not create ZFS storage pools from files in other ZFS datasets. This will cause all sorts of headaches and problems.
  • When creating ZVOLs, make sure the block size is the same as, or a multiple of, the block size that you will be formatting the ZVOL with. If the block sizes do not align, performance issues could arise (see the sketch after this list).
  • When loading the "zfs" kernel module, make sure to set a maximum number for the ARC. Doing a lot of "zfs send" or snapshot operations will cache the data. If not set, RAM will slowly fill until the kernel invokes OOM killer, and the system becomes responsive. I have set in my /etc/modprobe.d/zfs.conf file "options zfs zfs_arc_max=2147483648", which is a 2 GB limit for the ARC.

{ 14 } Comments

  1. Asif using Firefox 17.0 on Windows 7 | January 8, 2013 at 12:09 am | Permalink

    Awesome series! You just helped me to learn and plan deployment of ZFS for my Home NAS in a day. I have gone through the whole series and it's been easy to follow while also providing details on necessary parts! Thank you!

  2. aasche using Firefox 18.0 on GNU/Linux | January 30, 2013 at 3:57 pm | Permalink

    18 Parts - enough stuff for a small book. Thank you very much for your efforts :)

  3. ovigia using Firefox 18.0 on GNU/Linux 64 bits | February 25, 2013 at 12:27 pm | Permalink

    great tips...

    thank you very much!

  4. Michael using Firefox 19.0 on Windows 7 | March 20, 2013 at 2:03 pm | Permalink

    Nice series !!!

    I looked and my rhel6/oel6 box doesn't have a "/etc/modprobe.d/zfs.conf " file anywhere. Is that something you added & just put that one command in (options zfs zfs_arc_max=2147483648)?

    I was also curious how you came up with 2GB as your limit & how much RAM your storage box has and whether you are using the box for anything else?

    My box is currently dedicated to just ZFS & currently has 16GB and I was considering expanding to 32GB. In that scenario, any idea what a good arc max is?

    Thanks again !!!

  5. Aaron Toponce using Google Chrome 25.0.1364.160 on GNU/Linux 64 bits | March 20, 2013 at 2:39 pm | Permalink

    Yeah, the /etc/modprobe.d/zfs.conf file is manually created. The 2 GB is just an example; it depends on how much RAM you have in your system. You should keep it under 1/4 of your RAM size, IMO.

  6. Mike using Safari 8536.25 on Mac OS | April 9, 2013 at 10:48 am | Permalink

    Just want to add my thanks for a great series and all the obvious effort that went into it. While I have enough desktop experience, I am a complete newbie to servers in general and ZFS in particular. You've given me the confidence to proceed.

  7. Graham Perrin using Safari 537.73.11 on Mac OS | December 2, 2013 at 11:48 am | Permalink

    Please: is the ' zfs-auto-snapshot script' link correct? Unless I'm missing something, it doesn't lead to the script.

  8. Aaron Toponce using Debian IceWeasel 24.1.0 on GNU/Linux 64 bits | December 4, 2013 at 4:23 pm | Permalink

    Fixed. Sorry about that. I don't know what caused it to change.

  9. Joshua Zitting using Google Chrome 31.0.1650.63 on Mac OS | January 7, 2014 at 9:32 pm | Permalink

    This is an AWESOME Tutorial!!! I read every word and added it to bookmarks for safe keeping! Great work!!! My next project is Postgres... You haven't done a tutorial on it, have you? If so, you should start charging!

  10. Scott Burson using Firefox 31.0 on Mac OS | August 7, 2014 at 10:28 am | Permalink

    I think you're understating the importance of regular snapshots. They're not just a good idea; they're a critical part of ZFS's redundancy strategy. Data loss with ZFS is rare, but it can be caused by faulty hardware. Regular snapshots significantly decrease the likelihood of data loss in such cases. This is especially important for home and small business users, who tend to back up irregularly at best.

  11. Paul using Google Chrome 38.0.2125.104 on Mac OS | October 16, 2014 at 5:55 pm | Permalink

    Thank you very much for this informative and well-written series. I used Freenas for ~2 years before deciding to re-roll my home NAS using Linux, but I knew I wanted to use ZFS. I have learned more about ZFS from reading this guide than I did from using it over the past 2 years.

    Some confusion remains regarding Freenas's ZFS implementation in that it used 2Gb from each drive as "swap". Do I still need to do this and if so, could I just use a small (8-10Gb) partition on a separate platter disk instead of multiple partitions across all drives in the zpool? Would I assign this to the zpool as L2ARC or something else?

    Thank you!

  12. John using Firefox 31.0 on Windows 7 | February 17, 2015 at 10:47 am | Permalink

    I'm new to ZFS so your series is awesome! I still have a couple questions though if you have time:

    I want to use ZFS for a home NAS. I currently use Unraid (virtualized) but I'm getting hit with silent corruption.

    I have 15 2TB disks (media/PC backups/office files) and a 240GB SSD running ESXi with VMs (no redundancy - just backed up to Unraid).

    I like unraid as it will spin down disks when not in use. I have primarily media that gets accessed maybe a couple times a day in the evening, so my disks are down most of the day.

    I'm thinking of moving to an all linux based server (ditch esxi) with all content in zfs (inc the VMs - I'll move to host them in kvm/xen). I would expect to use the SSD as a ZIL/ARC drive.

    Topics...

    Spin down:
    1) Does zfs ever spin down disks?
    2) If yes, at what level is it managed? pool, dataset, vol, vdev?
    3) Therefore, is there a way to arrange the configuration to ensure those holding media spin down when no demand for them?

    VM:
    4) What performance will I expect to get from my VM's when moving from SSD to ZFS (spinners for content, SSD for ZIL/ARC) - e.g. will it be a noticeable degradation or will the SSD ZIL/ARC mask the slower performance of the spinners? I'll take 'a' hit if I have to, as long as these busy VM's don't grind to a halt!

    Pool config:
    5) depending on the spin down, I had been considering either multiple raidz1 or z2 vdevs, but as I understand it any data written to a dataset or zvol in the pool will spread across all vdevs in a pool for performance - What happens if I lose a vdev (i.e never to return)? do I lose the whole pool or just the vdev
    6) Any suggestions on how to config my 15 2TB disks taking into account the spin down question with 'always on VMs' and 'nearly always off media'? I had considered 2 x raidz1 (for media, PC backups etc.) 1 x raidz2 for VMs and Important office docs but if I can't separate data in a pool is there any benefit!? Would I need multiple pools to achieve this?
    7) you mention a revo drive presenting as two disks/partitions and then striping them for the ARC, any benefit in creating two partitions on my regular SATA SSD for the arc stripe?

    Prioritization:
    8) If I use zVOLS for each VM, can I assign any level of IO weighting in ZFS to ensure my high priority VM gets first dibs at ZFS access? Or is this negated by using SSD ZIL/ARC?

    Thank you in advance!!
    John

  13. Aaron Toponce using Debian IceWeasel 31.4.0 on GNU/Linux 64 bits | February 17, 2015 at 11:44 am | Permalink

    To answer your questions directly:

    1) Does zfs ever spin down disks?

    Not to my knowledge. ZFS was designed with datacenter storage in mind, and not home use.

    3) Therefore, is there a way to arrange the configuration to ensure those holding media spin down when no demand for them?

    From just brief testing on my workstation, trying to spin down my 250GB drives, ZFS immediately spins them right back up. Also, a quick search for spinning down ZFS disks turns up this post on the FreeBSD forums: https://forums.freebsd.org/threads/spinning-disks-down-with-zfs.23973/#post-135480. So, I don't know if it's FreeBSD specific, and not working with Linux as a result, or something else. Long story short, I just don't know.

    4) What performance will I expect to get from my VM's when moving from SSD to ZFS (spinners for content, SSD for ZIL/ARC) - e.g. will it be a noticeable degradation or will the SSD ZIL/ARC mask the slower performance of the spinners? I'll take 'a' hit if I have to, as long as these busy VM's don't grind to a halt!

    Well, if you are migrating from SSDs to spinning rust, then you will most definitely notice a degradation in throughput. If the SLOG is a fast disk, such as a modern SSD, and it is also partitioned for the L2ARC, you will notice that read/write access latencies are the same, but raw read/write throughput is limited by the throughput of the spinning rust. I don't know anything about your VM architecture, but I wouldn't expect your "VM's [to] grind to a halt". I am running two KVM hypervisors sharing disk via GlusterFS for live migrations. The underlying ZFS pool consists of 4x1TB drives in a RAID-1+0 with OCZ SSDs for the L2ARC. I am not using my SSDs for a SLOG, but instead writing data asynchronously (because GlusterFS is already fully synchronous). While performance isn't "AMAZING", it's satisfactory. This blog is running in one of the VMs in this 2-node cluster.

    5) depending on the spin down, I had been considering either multiple raidz1 or z2 vdevs, but as I understand it any data written to a dataset or zvol in the pool will spread across all vdevs in a pool for performance - What happens if I lose a vdev (i.e never to return)? do I lose the whole pool or just the vdev?

    Unless you know you need the space, I would advise against parity-based RAIDZ. Parity RAID is always slower than RAID-1+0. In terms of performance, RAID-1+0 > RAIDZ1 > RAIDZ2 > RAIDZ3. It's expensive, but I would highly recommend it. However, if you have enough disks, you could probably get away with a RAIDZ1+0 or a RAIDZ2+0, to help keep performance up. Regardless, to answer your question: if you lose a disk in a redundant pool, the pool will continue operating, although in "degraded" mode. Whether or not you can suffer another disk failure is entirely dependent on the RAID array.

    6) Any suggestions on how to config my 15 2TB disks taking into account the spin down question with 'always on VMs' and 'nearly always off media'? I had considered 2 x raidz1 (for media, PC backups etc.) 1 x raidz2 for VMs and Important office docs but if I can't separate data in a pool is there any benefit!? Would I need multiple pools to achieve this?

    For 15 disks, I would recommend setting up 5xRAIDZ1 VDEVs of 3 disks each. IE:

    # zpool status pthree
      pool: pthree
     state: ONLINE
      scan: none requested
    config:
    
    	NAME             STATE     READ WRITE CKSUM
    	pthree           ONLINE       0     0     0
    	  raidz1-0       ONLINE       0     0     0
    	    /tmp/file1   ONLINE       0     0     0
    	    /tmp/file2   ONLINE       0     0     0
    	    /tmp/file3   ONLINE       0     0     0
    	  raidz1-1       ONLINE       0     0     0
    	    /tmp/file4   ONLINE       0     0     0
    	    /tmp/file5   ONLINE       0     0     0
    	    /tmp/file6   ONLINE       0     0     0
    	  raidz1-2       ONLINE       0     0     0
    	    /tmp/file7   ONLINE       0     0     0
    	    /tmp/file8   ONLINE       0     0     0
    	    /tmp/file9   ONLINE       0     0     0
    	  raidz1-3       ONLINE       0     0     0
    	    /tmp/file10  ONLINE       0     0     0
    	    /tmp/file11  ONLINE       0     0     0
    	    /tmp/file12  ONLINE       0     0     0
    	  raidz1-4       ONLINE       0     0     0
    	    /tmp/file13  ONLINE       0     0     0
    	    /tmp/file14  ONLINE       0     0     0
    	    /tmp/file15  ONLINE       0     0     0
    
    errors: No known data errors

    You lose 5 disks of storage space (one-third of the raw size), but then you have 5xRAIDZ VDEVs striped. This will help keep performance up, at minimal cost, and you could suffer a drive failure in each VDEV (5 failures total), and still have an operational pool.
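
    For reference, a pool laid out like the above could be created with something along these lines (same temporary file vdevs as in the example; a real deployment would use whole disks, ideally by their /dev/disk/by-id/ paths):

     # zpool create pthree \
         raidz1 /tmp/file1  /tmp/file2  /tmp/file3 \
         raidz1 /tmp/file4  /tmp/file5  /tmp/file6 \
         raidz1 /tmp/file7  /tmp/file8  /tmp/file9 \
         raidz1 /tmp/file10 /tmp/file11 /tmp/file12 \
         raidz1 /tmp/file13 /tmp/file14 /tmp/file15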

    7) you mention a revo drive presenting as two disks/partitions and then striping them for the ARC, any benefit in creating two partitions on my regular SATA SSD for the arc stripe?

    Striping across a single disk will be limited to the throughput of the SATA controller connected to the drive. It will certainly improve performance, but only up to what the controller can sustain as the upper bound. Also, I should modify this post. When adding drives to the cache (L2ARC), they aren't actually "striped" in the true sense of the word. The drives are balanced evenly, of course, but there actually isn't any striping going on. IE- the pages aren't split into multiple chunks, and placed on each drive in the cache.
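
    Concretely, adding two partitions of the same SSD as cache devices would look something like this ("tank", sdb1 and sdb2 are hypothetical placeholders):

     # zpool add tank cache /dev/sdb1 /dev/sdb2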

    8) If I use zVOLS for each VM, can I assign any level of IO weighting in ZFS to ensure my high priority VM gets first dibs at ZFS access? Or is this negated by using SSD ZIL/ARC?

    I'm not aware of any such prioritization for ZVOLs. If it exists, I'll certainly add it to the series, but I'm not aware of it.

  14. John using Firefox 31.0 on Windows 7 | February 18, 2015 at 2:23 am | Permalink

    Aaron - a big thanks for the speedy response, some food for thought!

