Table of Contents
|Zpool Administration||ZFS Administration||Appendices|
|0. Install ZFS on Debian GNU/Linux||9. Copy-on-write||A. Visualizing The ZFS Intent Log (ZIL)|
|1. VDEVs||10. Creating Filesystems||B. Using USB Drives|
|2. RAIDZ||11. Compression and Deduplication||C. Why You Should Use ECC RAM|
|3. The ZFS Intent Log (ZIL)||12. Snapshots and Clones||D. The True Cost Of Deduplication|
|4. The Adjustable Replacement Cache (ARC)||13. Sending and Receiving Filesystems|
|5. Exporting and Importing Storage Pools||14. ZVOLs|
|6. Scrub and Resilver||15. iSCSI, NFS and Samba|
|7. Getting and Setting Properties||16. Getting and Setting Properties|
|8. Best Practices and Caveats||17. Best Practices and Caveats|
With the proliferation of ZFS into FreeBSD, Linux, FreeNAS, Illumos, and many other operating systems, and with the introduction of OpenZFS to unify all the projects under one collective whole, more and more people are beginning to tinker with ZFS in many different situations. Some install it on their main production servers, others install it on large back-end storage arrays, and even yet, some install it on their workstations or laptops. As ZFS grows in popularity, you'll see more and more ZFS installations on commodity hardware, rather than enterprise hardware. As such, you'll see more and more installations of ZFS on hardware that does not support ECC RAM.
The question I pose here: Is this a bad idea? If you spend some time searching the Web, you'll find posts over and over on why you should choose ECC RAM for your ZFS install, with great arguments, and for good reason too. In this post, I wish to reiterate those points, and make the case for ECC RAM. Your chain is only as strong as your weakest link, and if that link is non-ECC RAM, you lose everything ZFS developers have worked so hard to achieve on keeping your data from corruption.
Good RAM vs Bad RAM vs ECC RAM
To begin, let's make a clear distinction between "Good RAM" and "Bad RAM" and how that compares to "ECC RAM":
- Good RAM- High quality RAM modules with a low failure rate.
- Bad RAM- Low quality RAM modules with a high failure rate.
- ECC RAM- RAM modules with error correcting capabilities.
"Bad RAM" isn't necessarily non-ECC RAM. I've deployed bad ECC RAM in the past, where even though they are error correcting, they fail frequently, and need to be replaced. Further, ECC RAM isn't necessarily "Good RAM". I've deployed non-ECC RAM that has been in production for years, and have yet to see a corrupted file due to not having error correction in the hardware. The point is, you can have exceptional non-ECC "Good RAM" that will never fail you, and you can have horrid ECC "Bad RAM" that still creates data corruption.
What you need to realize is the rate of failure. An ECC RAM module can fail just as frequently as a non-ECC module of the same build quality. Hopefully, the failure rate is such that ECC can fix the errors it detects, and still function without data corruption. But just to beat a dead horse dead, ECC RAM and hardware failure rates are disjointed. Just because it's ECC RAM does not mean that the hardware fails less frequently. All it means is that it detects the failures, and attempts to correct them.
Failure rates are hard to get a handle on. If you read the Wikipedia article on ECC RAM, it mentions a couple studies that have been attempted to get a handle on how often bit errors occur in DIMM modules:
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10^(−10) [to] 10^(−17) error/bit-h[ours], roughly one bit error, per hour, per gigabyte of memory to one bit error, per millennium, per gigabyte of memory. A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance’09 conference. The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5^(–7) x 10^(−11) error/bit-h[ours])(i.e. about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate), and more than 8% of DIMM memory modules affected by errors per year.
So roughly, from what Google was seeing in their datacenters, 5 bit errors in 8 GB of RAM per hour in 8% of their installed RAM. If you don't think this is significant, you're fooling yourself. Most of these bit errors are caused by background radiation affecting the installed DIMMs, due to neutrons from cosmic rays. But voltage fluctuations, bad circuitry, and just poor build quality can also come in as factors to "bit flips" in your RAM.
ECC RAM works by detecting this bad bit by using an extra parity bit per byte. In other words, for every 8 bits, there is a 9th parity bit which operates as the checksum for the previous 8. So, for a DIMM module registering itself as 64 GB to the system, there is actually 72 GB physically installed on the chip to give space for parity. However, it's important to note that ECC RAM can only correct 1 bit flip per byte (8 bits). If you have 2 bit flips per byte, ECC RAM will not be able to recover the data.
ZFS was designed to detect silent data errors that happen due to hardware and other factors. ZFS checksums your data from top to bottom to ensure that you do not have data corruption. If you've read this series from the beginning, you'll know how ZFS is architected, and how data integrity is first priority for ZFS. People who use ZFS use it because they cannot stand data corruption anywhere in their filesystem, at any time. However, if your RAM is not ECC RAM, then you do not have the guarantee that your file is not corrupt when stored to disk. If the file was corrupted in RAM, due to a frozen bit in RAM, then when stored to ZFS, it will get checksummed with this bad bit, as ZFS will assume the data it is receiving is good. As such, you'll have corrupted data in your ZFS dataset, and it will be checksummed corrupted, with no way to fix the error internally.
This is bad.
To drive the point home further about ECC RAM in ZFS, let's create a scenario. Let's suppose that you are not using ECC RAM. Maybe this is installed on your workstation or laptop, because you like the ZFS userspace tools, and you like the idea behind ZFS. So, you want to use it locally. However, let's assume that you have non-ECC "Bad RAM" as defined above. For whatever reason, you have a "frozen bit" in one of your modules. The DIMM is only storing a "0" or a "1" in a specific location. Let's say it always reports a "0" due to the hardware failure, no matter what should be written there. To keep things simple, we'll look at 8 bits, or 1 byte in our example. I'll show the bad bit with a red "0".
Your application wishes to write "11001011", but due to your Bad RAM, you end up with "11000011". As a result, "11000011" is sent to ZFS to be stored. ZFS adds a checksum to "11000011" and stores it in the pool. You have data corruption, and ZFS doesn't know any different. ZFS assumes that the data coming out of RAM is intentional, so parity and checksums are calculated based on that result.
But what happens when you read the data off disk and store it back in your faulty non-ECC RAM? Things get ugly at this point. So, you read back "11000011" to RAM. However, it's stored in _almost_ the same position before it was sent to disk. Assume it is stored only 4 bits later. Then, you get back "01000011". Not only was your file corrupt on disk, but you've made things worse by storing them back into RAM where the faulty hardware is. But, ZFS is designed to correct this, right? So, we can fix the bad bit back to "11000011", but the problem is that the data is still corrupted!
Things go downhill from here. Because this is a physical hardware failure, we actually can't set that first bit to "1". So, any attempt at doing so, will immediately revert it back to "0". So, while the data is stored in our faulty non-ECC RAM, the byte will remain as "01000011". Now, suppose we're ready to flush the data in RAM to disk, we've compounded our errors by storing "01000011" on platter. ZFS calculates a new checksum based on the newly corrupted data, again assuming our DIMM modules are telling us the truth, and we've further corrupted our data.
As you can see, the more we read and write data to and from non-ECC RAM, the more we have a chance of corrupting data on the filesystem. ZFS was designed to protect us against this, but our chain is only as strong as the weakest link, which in this case is non-ECC RAM corrupting our data.
You might think that backups can save you. If you have a non-corrupted file you can restore from, great. However, if your ZFS snapshot, or rsync(1) copied over the corrupted bit to your backups, then you're sunk. And ZFS scrubbing won't help you here either. As already mentioned, you stored corrupted data on disk correctly, meaning the checksum for the corrupted byte is correct. However, if you have additional bad bits in RAM, scrubbing will actually try to "fix" the bad bits. But, because it's a hardware failure, causing a frozen bit, the scrub will continue, and continue, thrashing your pool. So, scrubbing will actually bring performance to its knees, trying to fix that one bad bit in RAM. Further, scrubbing won't fix bad bits already stored in your pool that have been properly checksummed.
No matter how you slice and dice it, you trusted your non-ECC RAM, and your trusty RAM failed you, with no recourse to fall back on.
Due to the extra hardware on the DIMM, ECC RAM is certainly more costly than their non-ECC counterparts, but not by much. In fact, because ECC DIMMs have 9/8 additional more hardware, the price pretty closely reflects that. In my experience, 64 GB of ECC RAM is roughly 9/8 more costly than 64 GB of non-ECC RAM. Many general purpose motherboards will support unbuffered ECC RAM also, although you should choose a motherboard that supports active ECC scrubbing, to keep bit corruption minimized.
You can get high quality ECC DDR3 SDRAM off of Newegg for about $50 per 4 GB. Non-ECC DDR3 SDRAM retails for almost exactly the same price. To me, it just seems obvious. All you need is a motherboard supporting it, and Supermicro motherboards supporting ECC RAM can also be relatively inexpensive. I know this is subjective, but I recently built a two-node KVM hypervisor shared storage cluster with 32 GB of registered ECC RAM in each box with Tyan motherboards. Total for all 32 GB was certainly more costly than everything else in the system, but I was able to get them at ~ $150 per 16 GB, or $600 for all 64 GB total. The boards were ~$250 each, or $500 total for two boards. So, it total, for two very beefy servers, I spent ~$1100, minus CPU, disk, etc. To me, this is a small investment to ensure data integrity, and I would not have saved much going the non-ECC route.
The very small extra investment was very much worth it, to make sure I have data integrity from top-to-bottom.
ZFS was built from the ground up with parity, mirroring, checksums and other mechanisms to protect your data. If a checksum fails, ZFS can make attempts at loading good data based on redundancy in the pool, and fix the corrupted bit. But ZFS is assuming that a correct checksum means the bits were correct before the checksum was applied. This is where ECC RAM is so critical. ECC RAM can greatly reduce the risk that your bits are not correct before they get stored into the pool.
So, some lessons you should take away from this article:
- ZFS checksums assume the data is correct from RAM.
- Regular ZFS scrubs will greatly reduce the risk of corrupted bits, but can be your worst enemy with non-ECC RAM hardware failures.
- Backups are only as good as the data they store. If the backup is corrupted, it's not a backup.
- ZFS parity data, checksums, and physical data all need to match. When they don't, repairs start taking place. If it is corrupted out the gate, due to non-ECC RAM, why are you using ZFS again?
Thanks to "cyberjock" on the FreeBSD forums for inspiring this post.