Comments on: ZFS Administration, Appendix C- Why You Should Use ECC RAM https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/ Linux. GNU. Freedom. Tue, 31 Oct 2017 18:00:46 +0000
By: Klaus https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-271767 Mon, 28 Aug 2017 16:21:18 +0000 https://pthree.org/?p=3352#comment-271767 @Daryl: The first DDR4 modules on the market had ECC. Non-ECC-DDR4-RAM appeared later on the market. That probably explains the (false) rumor that "DDR4 has better error handling than DDR3". Plus, there are numerous articles on the web which "prove" the increased reliability of DDR4-RAM (with ECC) by comparing it to DDR3-RAM... without ECC. Yep. Very funny.

I do not yet know how DDR4 compares to DDR3 regarding reliability. However, we do know that DDR3 was more reliable than DDR2-RAM. The Google report to which the article refers showed high error rates in DDR2-RAM. Note that at this time Google also did not replace RAM which began to show correctable errors - no wonder you see higher error rates when you decide to keep your failing RAM in use. Also note that Google used non-standard memory modules which were, according to the specs, incompatible with the mainboards (they worked in real life, of course, but possibly less reliably than standard modules).

Back to DDR4: DDR4-RAM can *optionally* have a "Write CRC" feature which can detect errors occurring on the bus when data is written to the RAM (the host could then retry the data transmission). However, this optional feature will, AFAIK, not be present on non-ECC-DDR4-RAM.
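
To make the "detect on the bus, then retry" idea concrete, here is a minimal sketch in Python. It is not the DDR4 mechanism itself: the polynomial (0x07), the frame layout, and the names `crc8`, `noisy_bus` and `write_with_crc` are all assumptions picked for illustration. It only shows that a CRC appended to the data lets the receiver reject a frame that was damaged in transit, so that the host can send it again.

```python
from typing import List, Optional, Tuple

POLY = 0x07  # x^8 + x^2 + x + 1; an illustrative choice, not taken from a spec


def crc8(data: bytes) -> int:
    """Bitwise CRC-8 (initial value 0, no reflection, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def noisy_bus(frame: bytes, flip_bit: Optional[int]) -> bytes:
    """Hypothetical bus: flips one bit of the frame in transit, or none."""
    if flip_bit is None:
        return frame
    out = bytearray(frame)
    out[flip_bit // 8] ^= 1 << (flip_bit % 8)
    return bytes(out)


def write_with_crc(data: bytes, faults: List[Optional[int]]) -> Tuple[bytes, int]:
    """Send data plus CRC; retry until the receiver accepts a frame.

    faults[i] is the bit flipped on attempt i (None means a clean transfer).
    Returns the accepted payload and the number of attempts used.
    """
    frame = data + bytes([crc8(data)])
    for attempt, fault in enumerate(faults, start=1):
        received = noisy_bus(frame, fault)
        payload, check = received[:-1], received[-1]
        if crc8(payload) == check:  # receiver side: accept only matching frames
            return payload, attempt
    raise IOError("every attempt failed the CRC check")


if __name__ == "__main__":
    data = bytes.fromhex("deadbeef00112233")
    # First transfer has bit 5 flipped, second one is clean.
    print(write_with_crc(data, [5, None]))
```

Note that this only protects the transfer. Once the data sits in the memory cells, a CRC on the bus does nothing for it, which is where ECC comes in.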

By: Daryl https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-270616 Fri, 16 Jun 2017 17:01:17 +0000 https://pthree.org/?p=3352#comment-270616 DDR4 supposedly improves error handling, with CRC checks and on-chip parity detection, over DDR3. How does this stack up in comparison with ECC?

By: Michael https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-261784 Mon, 29 Feb 2016 17:10:00 +0000 https://pthree.org/?p=3352#comment-261784 Here is an interesting article explaining that ZFS does not corrupt your data on disk even if your RAM goes south. Thus, bad non-ECC DIMMs will not corrupt your data during a scrub:
http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/

"...Say you only corrupt one block in 5,000 this way. That would still be hellacious. So let’s examine the more reasonable idea of corrupting some data due to bad RAM during a scrub. And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:

First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?

Next, you read a copy of the same block – this copy might be a redundant copy, or it might be reconstructed from parity, depending on your topology. The redundant copy is easy to visualize – you literally stored another copy of the block on another disk. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk. No data has been corrupted. A later scrub will attempt to read all copies of that block and validate them just as though the error had never happened, and if this time either copy passes, the error will be cleared and the block will be marked valid again (with any copies that don’t pass validation being overwritten from the one that did)..."
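
The scrub scenario quoted above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ZFS code: two copies of one block, one stored checksum, and "evil RAM" that flips a bit in whichever copy it is told to as the copy is read. The names `scrub_block`, `copy_a` and `copy_b` are made up for this sketch.

```python
import hashlib
from typing import Dict


def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


def flip_bit(block: bytes, bit: int = 0) -> bytes:
    """Return a copy of block with one bit flipped (the 'evil RAM')."""
    out = bytearray(block)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def scrub_block(disk: Dict[str, bytes], stored_sum: bytes,
                ram_faults: Dict[str, bool]) -> str:
    """Scrub one block that has two copies on disk.

    disk maps a copy name to the bytes on disk; ram_faults says whether
    RAM corrupts that copy while it is being read. The disk is only
    written when a copy has been verified against the checksum.
    """
    def read(name: str) -> bytes:
        data = disk[name]
        return flip_bit(data) if ram_faults.get(name) else data

    first = read("copy_a")
    if checksum(first) == stored_sum:
        return "ok"

    second = read("copy_b")
    if checksum(second) == stored_sum:
        disk["copy_a"] = second  # "repair" with data that passed validation
        return "repaired"

    # Neither copy validated: log the error, leave both copies untouched.
    return "unrecoverable"


if __name__ == "__main__":
    good = b"perfectly good data"
    disk = {"copy_a": good, "copy_b": good}
    print(scrub_block(disk, checksum(good), {"copy_a": True}))
    print(scrub_block(disk, checksum(good), {"copy_a": True, "copy_b": True}))
    print(disk["copy_a"] == good and disk["copy_b"] == good)
```

In both faulty cases the bytes on disk end up identical to what was there before, which is the point the article is making. The model deliberately ignores the case of corruption that happens to produce a matching checksum.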

Also, Matt Ahrens (one of the ZFS architects) explains that ZFS does not need ECC RAM any more than other filesystems do. He also says that ZFS can checksum the data while it sits in RAM, which helps catch memory errors on non-ECC DIMMs:
http://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271
"...There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.

I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS..."
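
To illustrate what "checksum the data while at rest in memory, and verify it before writing to disk" means, here is a small Python sketch. It is only a model of the idea in the quote, not how ZFS implements the flag: `CheckedBuffer` and `flush` are hypothetical names, and CRC-32 is used just because it ships with the standard library.

```python
import zlib
from typing import List


class CheckedBuffer:
    """A buffer that remembers a checksum taken when the data arrived."""

    def __init__(self, data: bytes) -> None:
        self.data = bytearray(data)
        self._sum = zlib.crc32(self.data)

    def verify(self) -> bool:
        """True if the data still matches the checksum taken on arrival."""
        return zlib.crc32(self.data) == self._sum


def flush(buf: CheckedBuffer, disk: List[bytes]) -> bool:
    """Write the buffer to 'disk' only if it survived its time in memory."""
    if not buf.verify():
        return False  # refuse to persist data that changed while at rest
    disk.append(bytes(buf.data))
    return True


if __name__ == "__main__":
    disk: List[bytes] = []
    clean = CheckedBuffer(b"payload")
    print(flush(clean, disk))

    damaged = CheckedBuffer(b"payload")
    damaged.data[0] ^= 0x01  # simulate a bit flip while the data sits in RAM
    print(flush(damaged, disk))
```

As the quote says, this narrows the window of vulnerability rather than closing it: a flip that happens before the first checksum is taken, or in the checksum logic itself, goes unnoticed.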

By: Yatti420 https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-261581 Sun, 21 Feb 2016 10:05:18 +0000 https://pthree.org/?p=3352#comment-261581 Look it's 2016! Just get the ECC RAM!! It's not like it's costing an arm and a leg for consumer builds..

By: Scott S. https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-260954 Sat, 06 Feb 2016 01:06:56 +0000 https://pthree.org/?p=3352#comment-260954 For what it is worth, you may as well buy and use ECC memory. The interesting thing to note is that with FMA (Fault Management Architecture) on Solaris, if a bit keeps flipping in the same memory cell, the affected page is disabled. So you can use ECC memory to even greater effect there than on any other OS, to my knowledge (I've not seen this implemented in any version of Linux or the BSDs yet). Over time more cells may (or will) fail, so this can buy you more time before you must get a replacement, or let you find out that one manufacturer's memory modules have more faults than another's. Just my little add-on here.

By: wondra https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-232851 Sat, 09 May 2015 07:17:10 +0000 https://pthree.org/?p=3352#comment-232851 I agree with Ivar. It is impossible to correct errors with only parity. Also, the configuration 8+1 bits was last used before memory modules were introduced, when there were 64 megabit x1 chips installed in single sockets on the mainboard. Nowadays, the memory lines are much longer and that governs the width of the ECC code that must be used.
en.m.wikipedia.org/wiki/Hamming_distance
en.m.wikipedia.org/wiki/ECC_memory
en.m.wikipedia.org/wiki/Hamming_code

By: Aaron Toponce https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-227908 Tue, 17 Feb 2015 20:48:48 +0000 https://pthree.org/?p=3352#comment-227908

> Thank you for this excellent ZFS series!

No problem. Glad you're enjoying it!

> This particular post had me baffled and alarmed with the huge DIMM error rate reported ("5 bit errors per 8GB per hour"; wow!).

> And I couldn't get the meaning of "more than 8% of DIMM memory modules affected by errors per year".

> So I went through the Google/Sigmetrics09 article.
> Worth noting in this article:

> [ch. 3.2] "the [...] number of correctable errors per year is highly variable [perhaps] because the majority of the DIMMs see zero errors, while those affected see a large number of them".

> [ch. 5.4] "correctable error rates starts to increase quickly as the [DIMM] population ages beyond 10 months up until around 20 months, [after which] the correctable error incidence remains constant. [...] this may indicate that older DIMMs that did not have correctable errors in the past, possibly will not develop them later on"

> [ch.7] "over 8% of DIMMs [...] saw at least one correctable error per year"

> [ch.7] "error rates are unlikely to be dominated by soft errors" (as opposed to hard errors; hard errors = hardware defect = reproducible errors)

> Those are no excuses for not using ECC for mission-critical applications.

> But it can certainly help (and ease the alarm of) those who cannot go ECC.

Agreed. ZFS isn't unique with ECC RAM. ECC RAM should be deployed whenever fiscally possible, ZFS or not. But when deploying ZFS with non-ECC RAM, you lose the guarantee that ZFS will keep your data intact and correct. That's the case with any filesystem, though. Again, ZFS isn't unique here.

By: Cédric Dufour https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-226611 Fri, 30 Jan 2015 17:13:47 +0000 https://pthree.org/?p=3352#comment-226611 Thank you for this excellent ZFS series!

This particular post had me baffled and alarmed with the huge DIMM error rate reported ("5 bit errors per 8GB per hour"; wow!).

And I couldn't get the meaning of "more than 8% of DIMM memory modules affected by errors per year".

So I went through the Google/Sigmetrics09 article.
Worth noting in this article:

[ch. 3.2] "the [...] number of correctable errors per year is highly variable [perhaps] because the majority of the DIMMs see zero errors, while those affected see a large number of them".

[ch. 5.4] "correctable error rates starts to increase quickly as the [DIMM] population ages beyond 10 months up until around 20 months, [after which] the correctable error incidence remains constant. [...] this may indicate that older DIMMs that did not have correctable errors in the past, possibly will not develop them later on"

[ch.7] "over 8% of DIMMs [...] saw at least one correctable error per year"

[ch.7] "error rates are unlikely to be dominated by soft errors" (as opposed to hard errors; hard errors = hardware defect = reproducible errors)

Those are no excuses for not using ECC for mission-critical applications.

But it can certainly help (and ease the alarm of) those who cannot go ECC.

By: Philip Robar https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-137642 Wed, 11 Jun 2014 07:00:53 +0000 https://pthree.org/?p=3352#comment-137642 > AMD has broad support for ECC in a lot of their chips, but for Intel, this means the Xeons only.

This is not true. Many Intel desktop CPUs of recent generations support ECC memory: 11 of 30 non-legacy Celerons, 21 of 38 non-legacy Pentiums and all 19 3rd and 4th generation (Ivy Bridge and Haswell respectively) Core i3s support ECC memory. Also there are Xeons that do not support ECC memory and all Atom Processors for Storage and Servers do support ECC memory. (All other Atoms don't.)

Intel's Ark site has very extensive filtering capabilities that let you find just the right processor for your needs: http://ark.intel.com

By: Ivar https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-133189 Sun, 27 Apr 2014 13:13:18 +0000 https://pthree.org/?p=3352#comment-133189 Thanks for the informative post. I just have a nitpick on the description of ECC memory:

"ECC RAM works by detecting this bad bit by using an extra parity bit per byte. In other words, for every 8 bits, there is a 9th parity bit which operates as the checksum for the previous 8.[...] However, it's important to note that ECC RAM can only correct 1 bit flip per byte (8 bits). If you have 2 bit flips per byte, ECC RAM will not be able to recover the data."

To me this makes it sound like it is possible to correct a single-bit error in a byte using only a single parity bit. This is of course impossible. ECC RAM generally uses 8 bits per 64 bits, and can then correct a single bit of error in those 64 bits, or detect (but not correct) two bits of error.
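
A small Python sketch may help show why one parity bit is not enough and what the extra check bits buy you. It uses a toy extended Hamming code with 4 data bits and 4 check bits instead of the 64 + 8 layout mentioned above. The principle (single error correction, double error detection) is the same, but the bit layout and the names `encode` and `decode` are assumptions made for this example, not what any particular memory controller does.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Indices 1..7 hold a Hamming(7,4) code with parity bits at 1, 2 and 4;
    index 0 holds an overall parity bit covering the other seven.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error is uncorrectable."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall_bad = sum(c) % 2 == 1

    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flips, assumed to be one: the syndrome is its index
        # (0 means the overall parity bit itself was hit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Syndrome is non-zero but overall parity holds: two bits flipped.
        return "double-error detected", None

    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[6] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

With a single parity bit you only get the equivalent of `overall_bad`: you learn that something flipped, but not which bit, so there is nothing to correct. The additional check bits provide the syndrome that points at the position.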

By: Aaron Toponce https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-131135 Wed, 11 Dec 2013 17:19:18 +0000 https://pthree.org/?p=3352#comment-131135 ECC must be supported by the CPU. AMD has broad support for ECC in a lot of their chips, but for Intel, this means the Xeons only. So, your CPU choice will limit your motherboard choice. Some motherboard BIOS settings allow you to enable active ECC scrubbing, while others allow you to set the scrub frequency. However, some motherboards don't have any such ECC scrub support in the BIOS.

I was always under the impression that ECC scrubbing support in the BIOS required buffered, or registered ECC RAM. However, after doing a bit of research, that appears to not be the case. As such, I've updated the post.

By: Chris https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-131130 Wed, 11 Dec 2013 11:44:54 +0000 https://pthree.org/?p=3352#comment-131130 First of all, thanks for this great series of posts!
I have a question regarding unbuffered vs. registered DIMMs: "although you should choose a motherboard that supports active ECC scrubbing, to keep bit corruption minimized, which would require registered ECC DIMMs."
Are you referring to patrol vs. demand scrubbing as described here http://en.wikipedia.org/wiki/Memory_scrubbing#Scrubbing_Types ? Only registered ECC would then allow for patrol scrubbing. Or am I missing something?

By: Anonymous https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/#comment-131119 Wed, 11 Dec 2013 02:46:16 +0000 https://pthree.org/?p=3352#comment-131119 While the price for ECC RAM is reasonable, Intel forces you into server-class motherboards and Xeon chips to be able to use ECC RAM. Fortunately, an AMD Phenom combined with an Asus motherboard will work with ECC RAM. Yes, it is slower and less energy efficient, but the price-to-performance ratio is much better. This is what I am using for my home ZFS on Linux file server.
