
ZFS Administration, Part VI- Scrub and Resilver

Table of Contents

Zpool Administration
  0. Install ZFS on Debian GNU/Linux
  1. VDEVs
  2. RAIDZ
  3. The ZFS Intent Log (ZIL)
  4. The Adjustable Replacement Cache (ARC)
  5. Exporting and Importing Storage Pools
  6. Scrub and Resilver
  7. Getting and Setting Properties
  8. Best Practices and Caveats

ZFS Administration
  9. Copy-on-write
  10. Creating Filesystems
  11. Compression and Deduplication
  12. Snapshots and Clones
  13. Sending and Receiving Filesystems
  14. ZVOLs
  15. iSCSI, NFS and Samba
  16. Getting and Setting Properties
  17. Best Practices and Caveats

Appendices
  A. Visualizing The ZFS Intent Log (ZIL)
  B. Using USB Drives
  C. Why You Should Use ECC RAM
  D. The True Cost Of Deduplication

Standard Validation

In GNU/Linux, we have a number of filesystem checking utilities for verifying data integrity on the disk, through the "fsck" utility. However, it has a couple of major drawbacks. First, if you intend to fix data errors, you must fsck the disk offline. This means downtime. So, you must use the "umount" command to unmount your disks before the fsck. For root partitions, this further means booting from another medium, like a CDROM or USB stick. Depending on the size of the disks, this downtime could take hours. Second, the filesystem, such as ext3 or ext4, knows nothing of the underlying storage structures, such as LVM or RAID. You may have a bad block on one disk, but a good copy of that block on another disk. Unfortunately, Linux software RAID has no idea which is good or bad. From the perspective of ext3 or ext4, it will get good data if it reads from the disk containing the good block, and corrupted data if it reads from the disk containing the bad block, with no control over which disk to pull the data from, and no way to fix the corruption. These errors are known as "silent data errors", and there is really nothing you can do about them with the standard GNU/Linux filesystem stack.

ZFS Scrubbing

With ZFS on Linux, detecting and correcting silent data errors is done through scrubbing the disks. This is similar in technique to ECC RAM, where if an error resides in the ECC DIMM, the good data can be reconstructed and used to fix the bad bits. This technique has been around for a while, so it's surprising that it's not available in the standard suite of journaled filesystems. Further, just like you can scrub ECC RAM on a live running system, without downtime, you should be able to scrub your disks without downtime as well. With ZFS, you can.

While ZFS is performing a scrub on your pool, it is checking every block in the storage pool against its known checksum. Every block from top-to-bottom is checksummed, using an appropriate algorithm by default. Currently, this is the "fletcher4" algorithm, which produces a 256-bit checksum, and it's fast. This can be changed to the SHA-256 algorithm, although it may not be recommended, as calculating the SHA-256 checksum is more costly than fletcher4. With SHA-256, however, there is only a 1 in 2^256, or about 1 in 10^77, chance that a corrupted block hashes to the same SHA-256 checksum. This is a 0.00000000000000000000000000000000000000000000000000000000000000000000000000001% chance. For reference, uncorrected ECC memory errors will happen about 50 orders of magnitude more frequently, even with the most reliable hardware on the market. So, when scrubbing your data, either the checksum will match, and you have a good data block, or it won't match, and you have a corrupted data block.
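
The checksum algorithm is controlled per dataset through the "checksum" property. A minimal sketch of inspecting and changing it, using the example "tank" pool from this series (keep in mind that only newly written blocks get the new algorithm; existing blocks keep the checksum they were written with):

# zfs get checksum tank
# zfs set checksum=sha256 tank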

Scrubbing ZFS storage pools is not something that happens automatically. You need to do it manually, and it's highly recommended that you do it on a regularly scheduled interval. The recommended frequency at which you should scrub the data depends on the quality of the underlying disks. If you have SAS or FC disks, then once per month should be sufficient. If you have consumer-grade SATA or SCSI disks, you should scrub once per week. You can start a scrub easily with the following command:

# zpool scrub tank
# zpool status tank
  pool: tank
 state: ONLINE
 scan: scrub in progress since Sat Dec  8 08:06:36 2012
    32.0M scanned out of 48.5M at 16.0M/s, 0h0m to go
    0 repaired, 65.99% done
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0

errors: No known data errors

As you can see, you can get the status of the scrub while it is in progress. Doing a scrub can severely impact the performance of the disks and of the applications that need them. So, if for any reason you need to stop the scrub, you can pass the "-s" switch to the scrub subcommand. However, whenever possible, you should let the scrub run to completion.

# zpool scrub -s tank

You should put something similar to the following in root's crontab, which will execute a scrub every Sunday morning at 02:00:

0 2 * * 0 /sbin/zpool scrub tank
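
If you manage more than one pool on the box, a hypothetical variation of that crontab entry loops over every imported pool instead of naming each one individually:

0 2 * * 0 for pool in $(/sbin/zpool list -H -o name); do /sbin/zpool scrub "$pool"; done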

Self Healing Data

If your storage pool is using some sort of redundancy, then ZFS will not only detect silent data errors on a scrub, but it will also correct them if good data exists on a different disk. This is known as "self healing", and is demonstrated in the following image. In my RAIDZ post, I discussed how the data is self-healed with RAIDZ, using the parity and a reconstruction algorithm. I'm going to simplify it a bit here, and use just a two-way mirror. Suppose that an application needs some data blocks, and one of those blocks is corrupted. How does ZFS know the data is corrupted? By checking the SHA-256 checksum of the block, as already mentioned. If a checksum does not match on a block, ZFS will look at the other disk in the mirror to see if a good block can be found. If so, the good block is passed to the application, then ZFS will fix the bad block in the mirror, so that it also passes the SHA-256 checksum. As a result, the application will always get good data, and your pool will always be in a good, clean, consistent state.

Image showing the three steps ZFS takes to deliver good data blocks to the application by self-healing the data. Image courtesy of root.cz.
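
If you want to see self healing in action without risking real data, here is a sketch using a throwaway pool built on sparse files (the paths, sizes and pool name are hypothetical, and the dd step deliberately corrupts one side of the mirror, so never point it at devices backing a real pool):

# truncate -s 256M /tmp/vdev1 /tmp/vdev2
# zpool create test mirror /tmp/vdev1 /tmp/vdev2
# cp -a /etc/. /test/
# dd if=/dev/urandom of=/tmp/vdev1 bs=1M seek=16 count=64 conv=notrunc
# zpool scrub test
# zpool status test
# zpool destroy test

After the scrub, the status output should show non-zero CKSUM counts against /tmp/vdev1 and a repaired figure in the scan line, while everything read back from /test is still good data.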

Resilvering Data

Resilvering data is the same concept as rebuilding or resyncing data onto a new disk in the array. However, with Linux software RAID, hardware RAID controllers, and other RAID implementations, there is no distinction between which blocks are actually live, and which aren't. So, the rebuild starts at the beginning of the disk, and does not stop until it reaches the end of the disk. Because ZFS knows about the RAID structure and the filesystem metadata, we can be smart about rebuilding the data. Rather than wasting our time on free disk space, where no live blocks are stored, we can concern ourselves with ONLY the live blocks. This can provide significant time savings if your storage pool is only partially filled. If the pool is only 10% filled, then only 10% of the data needs to be rebuilt. Win. Thus, with ZFS we need a different term than "rebuilding", "resyncing" or "reconstructing". In this case, we refer to the process of rebuilding data as "resilvering".

Unfortunately, disks will die and need to be replaced. Provided you have redundancy in your storage pool, and can afford some failures, you can still send data to and receive data from applications, even though the pool will be in "DEGRADED" mode. If you have the luxury of hot swapping disks while the system is live, you can replace the disk without downtime (lucky you). If not, you will still need to identify the dead disk and replace it. This can be a chore if you have many disks in your pool, say 24. However, most GNU/Linux operating system vendors, such as Debian or Ubuntu, provide a utility called "hdparm" that allows you to discover the serial number of all the disks in your pool. This assumes, of course, that the disk controllers are presenting that information to the Linux kernel, which they typically do. So, you could run something like:

# for i in a b c d e f g; do echo -n "/dev/sd$i: "; hdparm -I /dev/sd$i | awk '/Serial Number/ {print $3}'; done
/dev/sda: OCZ-9724MG8BII8G3255
/dev/sdb: OCZ-69ZO5475MT43KNTU
/dev/sdc: WD-WCAPD3307153
/dev/sdd: JP2940HD0K9RJC
/dev/sde: /dev/sde: No such file or directory
/dev/sdf: JP2940HD0SB8RC
/dev/sdg: S1D1C3WR

It appears that /dev/sde is my dead disk. I have the serial numbers for all the other disks in the system, but not this one. So, by process of elimination, I can go to the storage array, and find which serial number was not printed. This is my dead disk. In this case, I find serial number "JP2940HD01VLMC". I pull the disk, replace it with a new one, and see if /dev/sde is repopulated, and the others are still online. If so, I've found my disk, and can add it to the pool. This has actually happened to me twice already, on both of my personal hypervisors. It was a snap to replace, and I was online in under 10 minutes.
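
As an aside, on most modern GNU/Linux systems udev also exposes each drive's model and serial number as symlinks under /dev/disk/by-id/, which can save you the hdparm loop entirely (a general udev feature, not something specific to ZFS):

# ls -l /dev/disk/by-id/ | grep -v part

Any device node that no longer has a by-id symlink pointing at it is a likely candidate for the dead disk.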

To replace a dead disk in the pool with a new one, you use the "replace" subcommand. Suppose the new disk also identified itself as /dev/sde; then I would issue the following command:

# zpool replace tank sde sde
# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h2m, 16.43% done, 0h13m to go
config:

        NAME          STATE       READ WRITE CKSUM
        tank          DEGRADED       0     0     0
          mirror-0    DEGRADED       0     0     0
            replacing DEGRADED       0     0     0
            sde       ONLINE         0     0     0
            sdf       ONLINE         0     0     0
          mirror-1    ONLINE         0     0     0
            sdg       ONLINE         0     0     0
            sdh       ONLINE         0     0     0
          mirror-2    ONLINE         0     0     0
            sdi       ONLINE         0     0     0
            sdj       ONLINE         0     0     0

The resilver is analogous to a rebuild with Linux software RAID. It rebuilds the data blocks on the new disk until the mirror, in this case, is in a completely healthy state. Viewing the status of the resilver will help you get an idea of when it will complete.
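
If you want to keep an eye on a long resilver without retyping the command, one convenient (if entirely optional) approach is to let "watch" refresh the status output every minute:

# watch -n 60 zpool status tank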

Identifying Pool Problems

You can quickly determine whether everything is functioning as it should, without the full output of the "zpool status" command, by passing the "-x" switch. This is useful for scripts to parse without fancy logic, so they can alert you in the event of a failure:

# zpool status -x
all pools are healthy
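
Building on that, here is a minimal monitoring sketch you could drop into cron next to the scrub job (the use of "mail" and the recipient are assumptions; adapt it to whatever alerting you already have in place):

#!/bin/sh
# Mail the full pool status to root if any pool reports a problem.
if [ "$(/sbin/zpool status -x)" != "all pools are healthy" ]; then
    /sbin/zpool status | mail -s "ZFS pool problem on $(hostname)" root
fi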

The rows in the "zpool status" output give you vital information about the pool, most of which is self-explanatory. They are defined as follows:

  • pool- The name of the pool.
  • state- The current health of the pool. This information refers only to the ability of the pool to provide the necessary replication level.
  • status- A description of what is wrong with the pool. This field is omitted if no problems are found.
  • action- A recommended action for repairing the errors. This field is an abbreviated form directing the user to one of the following sections. This field is omitted if no problems are found.
  • see- A reference to a knowledge article containing detailed repair information. Online articles are updated more often than this guide can be updated, and should always be referenced for the most up-to-date repair procedures. This field is omitted if no problems are found.
  • scrub- Identifies the current status of a scrub operation, which might include the date and time that the last scrub was completed, a scrub in progress, or if no scrubbing was requested.
  • errors- Identifies known data errors or the absence of known data errors.
  • config- Describes the configuration layout of the devices comprising the pool, as well as their state and any errors generated from the devices. The state can be one of the following: ONLINE, FAULTED, DEGRADED, UNAVAILABLE, or OFFLINE. If the state is anything but ONLINE, the fault tolerance of the pool has been compromised.

The columns in the "config" section of the status output (NAME, STATE, READ, WRITE and CKSUM) are defined as follows:

  • NAME- The name of each VDEV in the pool, presented in a nested order.
  • STATE- The state of each VDEV in the pool. The state can be any of the states found in "config" above.
  • READ- I/O errors occurred while issuing a read request.
  • WRITE- I/O errors occurred while issuing a write request.
  • CKSUM- Checksum errors. The device returned corrupted data as the result of a read request.

Conclusion

Scrubbing your data at regular intervals will ensure that the blocks in the storage pool remain consistent. Even though a scrub can put strain on applications wishing to read or write data, it can save hours of headache in the future. Further, because you could have a "damaged device" at any time (see http://docs.oracle.com/cd/E19082-01/817-2271/gbbvf/index.html about damaged devices with ZFS), knowing how to fix the device, and what to expect when replacing one, is critical to storage administration. Of course, there is plenty more I could discuss about this topic, but this should at least introduce you to the concepts of scrubbing and resilvering data.

{ 12 } Comments

  1. Anca Emanuel using Google Chrome 23.0.1271.91 on GNU/Linux 64 bits | December 11, 2012 at 11:51 am | Permalink

    This is wrestling with the octopus ??
    Conclusion: the need for simple standard administration was ignored.

  2. Aaron Toponce using Google Chrome 22.0.1229.94 on GNU/Linux 64 bits | December 11, 2012 at 12:45 pm | Permalink

    Huh?

  3. eagle275 using Firefox 10.0.12 on Windows XP | February 18, 2013 at 8:36 am | Permalink

    I believe he meant :

    Put (self-adhesive) labels on each device that contain the serial number AND the device name Linux "discovered" the disk as - so when you see "oh, sde fails", you look at your array, find the attached label "sde", and you have the failing disk, without trial and error

  4. Graham Perrin using Safari 536.28.10 on Mac OS | February 28, 2013 at 12:58 am | Permalink

    zpool status -x

    "… will return “all pools are healthy” even if one device is failed in a RAIDZ pool. In the other words, your data is healthy doesn’t mean all devices in your pool are healthy. So go with “zpool status” at any time. …"

    https://diigo.com/0x79w for highlights from http://icesquare.com/wordpress/how-to-improve-zfs-performance/

  5. Ryan using Firefox 17.0 on Mac OS | July 25, 2013 at 2:24 pm | Permalink

    Can you elaborate on what "then ZFS will fix the bad block in the mirror" entails? (Self Healing Data section)

    Does this mean that ZFS relocates the data from the bad block to another block on that device? At the same time, does the SMART function of the drive mark that sector as bad and relocate it to a spare, so it isn't written to again?

  6. Aaron Toponce using Google Chrome 28.0.1500.95 on GNU/Linux 64 bits | August 7, 2013 at 10:11 am | Permalink

    First off, ZFS doesn't have any built in SMART functionality. If you want SMART, you need to install the smartmontools package for your operating system. Second, when self healing the data, due to the COW nature of the filesystem, the healed block will be in a physically different location, with the metadata and uberblock updated. ZFS does have the capability of knowing where bad blocks exist on the filesystem, and not writing to them again in the future.

  7. Ryan using Firefox 17.0 on Mac OS | September 15, 2013 at 8:51 am | Permalink

    Wow... what does zfs not do?

    Thanks for writing this blog on zfs, by the way. It's well written and easy to follow. It convinced me to switch!

  8. ianjo using Firefox 24.0 on Ubuntu 64 bits | September 29, 2013 at 10:59 am | Permalink

    You state that the default checksum algorithm is sha-256, but searching on the internet I believe that no zfs implementation uses sha-256 by default. I'm using zfs 0.6.2 on linux and the manpage states:
    Controls the checksum used to verify data integrity. The default value is on, which automatically selects an appropriate algorithm (currently, fletcher4, but this may change in future releases). The value off disables integrity checking on user data. Disabling checksums is NOT a recommended practice.

    To interested readers, this can be changed with zfs set checksum=sha256 -- I do it for all my newly-created filesystems.

  9. Torge Kummerow using Firefox 30.0 on Ubuntu 64 bits | July 10, 2014 at 1:08 pm | Permalink

    What is the behaviour, if the corruption happens in the SHA Hash?

  10. Aaron Toponce using Debian IceWeasel 30.0 on GNU/Linux 64 bits | July 11, 2014 at 9:37 am | Permalink

    Each node in the Merkle tree is also SHA256 checksummed, all the way up to the root node. 128 Merkle tree revisions are kept before reusing. As such, if a SHA256 checksum is corrupted, that node in the tree is bad, and can be fixed, provided redundancy, by looking at the checksum in the parent node of the tree and rebuilding based on the redundancy in the pool.

  11. Francois Scheurer using Firefox 31.0 on Ubuntu | October 6, 2014 at 3:19 am | Permalink

    "Unfortunately, Linux software RAID has no idea which is good or bad, and from the perspective of ext3 or ext4, it will get good data if read from the disk containing the good block, and corrupted data from the disk containing the bad block, without any control over which disk to pull the data from, and fixing the corruption. These errors are known as "silent data errors", and there is really nothing you can do about it with the standard GNU/Linux filesystem stack."

    I think this is not completely correct:
    Linux and ext3/4 do not store block checksums (unfortunately..), but hard disk controllers write ECC codes along with the data. When read errors occur, they will be detected and, if possible, corrected.
    If correction with ECC is not possible, then the Linux kernel will see an unrecoverable read error, and with software RAID (mdadm) it will then re-read the block from another RAID replica (another disk).
    Then it will overwrite the bad block with the correct data, like with ZFS scrubbing, but with the advantage of doing it on demand instead of having to scrub the whole huge disks.
    If the overwrite fails (write error), then it will automatically put the disk offline and send an email alert.

    We are currently looking for a similar behavior on ZFS pools, because right now we are seeing read errors with a ZFS pool on FreeBSD, but unfortunately the bad disks stay online until some sysadmin puts them manually offline...

    "However, with Linux software RAID, hardware RAID controllers, and other RAID implementations, there is no distinction between which blocks are actually live, and which aren't. So, the rebuild starts at the beginning of the disk, and does not stop until it reaches the end of the disk. Because ZFS knows about the the RAID structure and the filesystem metadata, we can be smart about rebuilding the data. Rather than wasting our time on free disk, where live blocks are not stored, we can concern ourselves with ONLY those live blocks."

    Yes Linux mdadm will know nothing about fs usage and free blocks/inodes.
    But a cool feature of mdadm is the bitmap that allows a resync (after a power loss for example) to only resync the modified/unsynchronized blocks of the raid.

  12. JeanDo using Google Chrome 41.0.2272.101 on Windows 7 | March 24, 2015 at 11:52 am | Permalink

    Better raidz3 or spares ?

    I have a brand new server with 8 disks. Same model/date/size. I have three options (among plenty of others):

    1. Make a raidz3 5+3 pool.
    2. Make a raid 0+1 pool: "mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh"
    3. Make a raid 0+z1 : "raidz1 sda sdb sdc raidz1 sdd sde sdf spare sdg sdh"

    The question is: which is safer in the long run?

    I feel that with schemes 1 or 2, all disks will wear out at the same speed, and might tend to crash at about the same date.
    While with scheme 3, the two spare disks will be on vacation until the first replacement, and will have a fresh history then...

    Do you have anything in your experience pro/against that ? Any advice ?

