
ZFS Administration, Appendix A- Visualizing The ZFS Intent Log (ZIL)

Table of Contents

Zpool Administration                          ZFS Administration                       Appendices
0. Install ZFS on Debian GNU/Linux            9. Copy-on-write                         A. Visualizing The ZFS Intent Log (ZIL)
1. VDEVs                                      10. Creating Filesystems                 B. Using USB Drives
2. RAIDZ                                      11. Compression and Deduplication        C. Why You Should Use ECC RAM
3. The ZFS Intent Log (ZIL)                   12. Snapshots and Clones                 D. The True Cost Of Deduplication
4. The Adjustable Replacement Cache (ARC)     13. Sending and Receiving Filesystems
5. Exporting and Importing Storage Pools      14. ZVOLs
6. Scrub and Resilver                         15. iSCSI, NFS and Samba
7. Getting and Setting Properties             16. Getting and Setting Properties
8. Best Practices and Caveats                 17. Best Practices and Caveats

Background

While I was taking a walk around the city with the rest of the system administration team at work today (we have our daily "admin walk"), a discussion came up about asynchronous writes and the contents of the ZFS Intent Log. Previously, as shown in the Table of Contents, I blogged about the ZIL at great length. However, I didn't really discuss what the contents of the ZIL were, and to be honest, I didn't fully understand it myself. Thanks to Andrew Kuhnhausen, this was clarified. So, based on the discussion we had during our walk, as well as some pretty graphs on the whiteboard, I'll give you the breakdown here.

Let's start at the beginning. ZFS behaves more like an ACID compliant RDBMS than a traditional filesystem. Its writes are transactions, meaning there are no partial writes, and they are fully atomic, meaning you get all or nothing. This is true whether the write is synchronous or asynchronous. So, best case is you have all of your data. Worst case is you missed the last transactional write, and your data is 5 seconds old (by default). So, let's look at those two cases- the synchronous write and the asynchronous write. With synchronous, we'll consider the write both with and without a separate logging device (SLOG).
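
If you're curious where that 5 second figure comes from, on ZFS on Linux the transaction group commit interval is exposed as a module parameter. This is only a hedged sketch- it assumes a ZFS on Linux build that exposes zfs_txg_timeout:

$ cat /sys/module/zfs/parameters/zfs_txg_timeout

This prints 5 by default. Raising it as root (for example, echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout) widens the window of "old but consistent" data you could be left with after a crash.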

The ZIL Function

The primary, and only, function of the ZIL is to replay lost transactions in the event of a failure. When a power outage, crash, or other catastrophic failure occurs, pending transactions in RAM may not have been committed to slow platter disk. So, when the system recovers, ZFS will notice the missing transactions. At this point, the ZIL is read to replay those transactions, and commit the data to stable storage. While the system is up and running, the ZIL is never read. It is only written to. You can verify this by doing the following (assuming you have a SLOG in your system). Pull up two terminals. In one terminal, run an IOZone benchmark. Do something like the following:

$ iozone -ao

This will run a whole series of tests to see how your disks perform. While this benchmark is running, in the other terminal, as root, run the following command:

# zpool iostat -v 1

This will clearly show you that when the ZIL resides on a SLOG, the SLOG devices are only written to. You never see any numbers in the read columns. This is because the ZIL is never read unless transactions need to be replayed after a crash. Here is one of those one-second samples illustrating the writes:

                                                            capacity     operations    bandwidth
pool                                                     alloc   free   read  write   read  write
-------------------------------------------------------  -----  -----  -----  -----  -----  -----
pool                                                     87.7G   126G      0    155      0   601K
  mirror                                                 87.7G   126G      0    138      0   397K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9421741-part5          -      -      0     69      0   727K
    scsi-SATA_WDC_WD2500AAKX-_WD-WCAYU9755779-part5          -      -      0     68      0   727K
logs                                                         -      -      -      -      -      -
  mirror                                                 2.43M   478M      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part1      -      -      0      8      0   108K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part1      -      -      0      8      0   108K
  mirror                                                 2.57M   477M      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part1      -      -      0      7      0  95.9K
    scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part1      -      -      0      7      0  95.9K
cache                                                        -      -      -      -      -      -
  scsi-SATA_OCZ-REVODRIVE_XOCZ-6G9S9B5XDR534931-part5    26.6G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-THM0SU3H89T5XGR1-part5    26.5G  56.8G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-V402GS0LRN721LK5-part5    26.7G  56.7G      0      0      0      0
  scsi-SATA_OCZ-REVODRIVE_XOCZ-WI4ZOY2555CH3239-part5    26.7G  56.7G      0      0      0      0
-------------------------------------------------------  -----  -----  -----  -----  -----  -----

The ZIL should always be on non-volatile stable storage! You want your data to remain consistent across power outages. Putting your ZIL on a SLOG that is built from TMPFS, RAMFS, or RAM drives that are not battery backed means you will lose any pending transactions. This doesn't mean you'll have corrupted data. It only means you'll have old data. With the ZIL on volatile storage, you'll never be able to get the new data that was pending a write to stable storage. Depending on how busy your servers are, this could be a Big Deal. SSDs, such as from Intel or OCZ, are good cheap ways to have a fast, low latency SLOG that is reliable when power is cut.
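
For reference, attaching a mirrored SSD-backed SLOG to an existing pool is a one-liner. This is only a sketch- the pool name "tank" and the device paths are placeholders for your own /dev/disk/by-id/ names:

# zpool add tank log mirror /dev/disk/by-id/ata-SSD_SERIAL_1-part1 /dev/disk/by-id/ata-SSD_SERIAL_2-part1

Mirroring the log means a single SSD failure won't cost you the pending synchronous transactions, and the new log vdev will show up under the "logs" section of zpool status and zpool iostat -v, just as in the output above.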

Synchronous Writes without a SLOG

When you do not have a SLOG, the application only interfaces with RAM and slow platter disk. As previously discussed, the ZFS Intent Log (ZIL) can be thought of as a file that resides on the slow platter disk. When the application needs to make a synchronous write, the contents of that write are sent to RAM, where the application is currently living, as well as sent to the ZIL. So, the data blocks of your synchronous write at this exact moment in time have two homes- RAM and the ZIL. Once the data has been written to the ZIL, the platter disk sends an acknowledgement back to the application letting it know that it has the data, at which point the data is flushed from RAM to slow platter disk.

This isn't ALWAYS the case, however. In the case of slow platter disk, ZFS can actually store the transaction group (TXG) on platter immediately, with pointers in the ZIL to the locations on platter. When the disk ACKs back that the ZIL contains the pointers to the data, then the write TXG is closed in RAM, and the space in the ZIL is opened up for future transactions. So, in essence, you could think of the synchronous TXG write commit happening in one of three ways (the knobs that influence which path ZFS takes are sketched just after this list):

  1. All data blocks are synchronously written to both the RAM ARC and the ZIL.
  2. All data blocks are synchronously written to both the RAM ARC and the VDEV, with pointers to the blocks written in the ZIL.
  3. All data blocks are synchronously written to disk, where the ZIL is completely ignored.
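
Which of these paths ZFS takes isn't something you normally control directly, but two knobs influence it. This is a hedged sketch, assuming a ZFS on Linux build that exposes the zfs_immediate_write_sz module parameter, with "tank/data" as a placeholder dataset:

$ zfs get logbias tank/data

# zfs set logbias=throughput tank/data

$ cat /sys/module/zfs/parameters/zfs_immediate_write_sz

With logbias=latency (the default), small synchronous writes tend to land in the ZIL itself (case one), while writes larger than the immediate-write threshold are committed to the pool with only pointers kept in the ZIL (case two). Setting logbias=throughput pushes the dataset toward the pointer behavior, trading commit latency for less double-writing. Treat the exact cutoffs as implementation details that vary by release.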

In the image below, I tried to capture a simplified view of the first process. The pink arrows, labeled as number one, show the application committing its data to both RAM and the ZIL. Technically, the application is running in RAM already, but I took it out to make the image a bit cleaner. After the blocks have been committed to RAM, the platter ACKs the write to the ZIL, noted by the green arrow labeled as number two. Finally, ZFS flushes the data blocks out of RAM to disk, as noted by the gray arrow labeled as number three.

[Image: a synchronous write with ZFS without a SLOG- the ZIL resides on platter disk]

Synchronous Writes with a SLOG

The advantage of a SLOG, as previously outlined, is the ability to use low latency, fast disk to send the ACK back to the application. Notice that the ZIL now resides on the SLOG, and no longer resides on platter. The SLOG will catch all synchronous writes (well, those issued with O_SYNC or flushed with fsync(2), at least). Just as with platter disk, the ZIL will contain the data blocks the application is trying to commit to stable storage. However, the SLOG, being a fast SSD or NVRAM drive, ACKs the write to the ZIL, at which point ZFS flushes the data out of RAM to slow platter.
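
If you want to watch the SLOG absorb synchronous writes without running a full IOZone pass, here is a quick hedged sketch (the dataset path /tank/test is a placeholder). Force synchronous writes with dd in one terminal:

$ dd if=/dev/zero of=/tank/test/syncfile bs=4K count=10000 oflag=sync

while watching the pool in another terminal as root:

# zpool iostat -v tank 1

The oflag=sync flag makes dd write each block with O_SYNC, so every write takes the path described above, and you should see activity in the write columns of the log vdevs while their read columns stay at zero.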

Notice that ZFS is not flushing the data out of the ZIL to platter. This is what confused me at first. The data is flushed from RAM to platter. Just like the log of an ACID compliant RDBMS, the ZIL is only there to replay the transactions should a failure occur and the data in RAM be lost. Otherwise, the data is never read from the ZIL. So really, the write operation doesn't change at all; only the location of the ZIL changes.

As shown in the image, again the pink arrows labeled number one show the application committing its data to both the RAM and the ZIL on the SLOG. The SLOG ACKs the write, as identified by the green arrow labeled number two, then ZFS flushes the data out of RAM to platter as identified by the gray arrow labeled number three.

[Image: a synchronous write with ZFS with a SLOG- the ZIL resides on the SLOG]

Asynchronous Writes

Asynchronous writes have a history of being "unstable". You have been taught that you should avoid asynchronous writes, and if you decide to go down that path, you should prepare for corrupted data in the event of a failure. For most filesystems, there is good counsel there. However, with ZFS, it's nothing to be afraid of. Because of the architectural design of ZFS, all data is committed to disk in transaction groups. Further, the transactions are atomic, meaning you get it all, or you get none. You never get partial writes. This is true even with asynchronous writes. So, your data is ALWAYS consistent on disk- even with asynchronous writes.

So, if that's the case, then what exactly is going on? Well, there actually resides a ZIL in RAM when you enable "sync=disabled" on your dataset. As is standard with the previous synchronous architectures, the data blocks of the application are sent to a ZIL located in RAM. As soon as the data is in the ZIL, RAM acknowledges the write, and then flushes the data to disk, as would be standard with synchronous data.

I know what you're thinking: "Now wait a minute! There are no acknowledgements with asynchronous writes!" Not always true. With ZFS, there is most certainly an acknowledgement, it's just one coming from very, very fast and extremely low latency volatile storage. The ACK is near instantaneous. Should there be a crash or some other failure that causes RAM to lose power, and the write was not saved to non-volatile storage, then the write is lost. However, all this means is you lost new data, and you're stuck with old but consistent data. Remember, with ZFS, data is committed in atomic transactions.
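
For completeness, the sync behavior is a per-dataset property. A hedged sketch, with "tank/data" as a placeholder dataset:

$ zfs get sync tank/data

# zfs set sync=disabled tank/data

# zfs set sync=standard tank/data

The property takes three values: standard (honor O_SYNC and fsync(2) as described above), always (treat every write as synchronous), and disabled (acknowledge from RAM as described in this section). Setting it back to sync=standard restores the default behavior.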

The image below illustrates an asynchronous write. Again, the pink number one arrow shows the application data blocks being initially written to the ZIL in RAM. RAM ACKs back with the green number two arrow. ZFS then flushes the data to disk, as in every previous illustration, as noted by the gray number three arrow. Notice in this image that even if you have a SLOG, asynchronous writes bypass it; it is never used.

[Image: an asynchronous write with ZFS- the ZIL resides in RAM and the SLOG, if present, is bypassed]

Disclaimer

This is how I and my coworkers understand the ZIL. This is after reading loads of documentation, understanding a bit of computer science theory, and understanding how an ACID compliant RDBMS works, which is architected in a similar manner. If you think this is not correct, please let me know in the comments, and we can have a discussion about the architecture.

There are certainly some details I am glossing over, such as how much data the ZIL will hold before it's no longer utilized, timing of the transaction group writes, and other things. However, it should also be noted that aside from some obscure documentation, there don't seem to be any solid examples of exactly how the ZIL functions. So, I thought it would be best to illustrate that here, so others aren't left confused like I was. For me, images always make things clearer to understand.

{ 3 } Comments

  1. udc | April 22, 2013 at 6:30 pm

    Hi, first of all thanks for the whole ZFS series, you write in an easily understandable way and your practical experiences and opinions gives an additional value to it.
    (Though, not trying to be a pedant or anything but sometimes you present a personal opinion as a solid unchangeable fact, like in the deduplication section where you ended up in conclusion that you need x multiplies RAM of what seems really necessary because of some hard-written ratio. I didn't check the source code whether there is really such a ratio but even if so this is an open source software so nothing is set in stone and thus nothing is stopping you or anyone else to simply go and change it, thus saying that to effectively manage deduplication for a 12 TB storage you need not 60 GB, letting the actual reasoning behind this first number itself aside, but actually 240 GB RAM, that's kind of cheap. Don't take it in a wrong way.)

    What provoked me to a reaction is this appendix article. It seems to me that the images are drawn in a little bit unfortunate way. Let's take the first image. You made it look like the application when writing to ZFS without a separate ZIL writes its data first to the ZIL and also to the RAM, then it gets confirmation of write done and then the data is written from RAM to the disk. That's exactly what your picture and arrows show. But the application is already in the RAM and all its data is thus already in the RAM, also the ZIL is already the disk, so what you are actually indicating here is that the data is pointlessly duplicated first in the memory while being simultaneously written to the disk and then it's again pointlessly duplicated, only this time in the disk! I didn't check the source code but honestly I very much doubt that the ZFS developers would be that stupid. I would accept that the data is duplicated in the RAM, or more precisely put it's copied to the ZFS part of memory, but I will certainly not accept that the data is written to the slow disk twice.
    Also, you kind of make the ZIL look like some special entity that resides in the disk but is somewhat not part of the disk and you are somewhat hinting that the data is written to the ZIL (on the disk) and then it's duplicated or moved from the ZIL to the disk, effectively writing the same data to the same disk twice. A non-IT analogy to this idea would be a secretary in a company who sits at her front desk and when a messenger from the outside world comes in and brings a package she receives it, logs (records) this event and then stands up and goes back inside the company to deliver the package to the actual recipient. Surely that's not what is happening. If that idea was true then all journaling file systems would have twice as slow writing speed as the non-journaling file systems. ZIL is nothing special, it's an ordinary journal. The journaling is not about the middle man through whom the data is passed to the disk (thus if the journal is actually residing on the disk the data is written to the disk twice), it's about the way how the data is written to the disk, i.e. in a documented way without making premature assumption that the write will be finished successfully and thus without "touching" any live data or file system structures that's already on the disk, all that to make sure that in case of a failure resulting in an unfinished write there would be no, or very little, inconsistency.
    And lastly, your remark that after the write done is acknowledged to the application the data is flushed to the disk (letting alone the fact that the data was on the disk already safe and sound) goes against the whole principle of a synchronous write. The point of synchronous write is that the writing application is not notified about the write done until the data is really written to the target. You of course do know that, it's just a little bit unfortunate wording you have chosen.

    So I would suggest you to consider the following adjustment to the first picture. Make the application box a little part of the RAM rectangle, pretty much in a similar way as the little ZIL rectangle is a part of the cylinder. Then draw another little box next to the application box within the RAM rectangle and call it "ZFS". Then the whole process would be: a very short horizontal arrow "1" between application and ZFS boxes within the RAM rectangle, then vertical arrow "2" between ZFS box in the RAM and the ZIL box in the disk, and lastly again a short horizontal arrow "3" back between ZFS and application boxes within the RAM. Plain and simple.

    The next picture would be similarly (1) application to ZFS, (2) ZFS to ZIL (in SLOG), (3) ZFS back to application, (4) ZFS to disk. As a side effect this would also remove any confusion you noticed you have had as for whether there is write from ZIL to the disk in (4). Of course not, of course the write goes from the ZFS box that's (as any other process) obviously in the memory.

    The last picture would be (1) application to ZFS, (2) ZFS to ZIL (in RAM, thus there would be 3 little boxes as part of the RAM rectangle), (3) ZFS to application, (4) ZFS to disk. Although thinking about it maybe even more precise would be to draw the ZIL box in the RAM as a part (a sub-box, if you will) of the ZFS box, thus the arrows would be: (1) application to ZFS (containing ZIL inside), (2) ZFS to application, (3) ZFS to disk. It's simplified but that's what pictures are for, to show clearly and simply the process.

    Just my 2 cents.

  2. Aaron Toponce | April 23, 2013 at 7:33 am

    But the application is already in the RAM and all its data is thus already in the RAM,

    I mentioned that in the post. I said:

    When the application needs to make a synchronous write, the contents of that write are sent to RAM, where the application is currently living, as well as sent to the ZIL.

    Then in the next paragraph, I said:

    Technically, the application is running in RAM already, but I took it out to make the image a bit more clean.

    With regards to the ZIL, you should read the zil.c source code. This is exactly the behavior. The data is indeed written twice, but as I mentioned in the article, I am glossing over some details, such as when the ZIL is used and when it isn't with respect to data size. In some cases, the ZIL is completely bypassed, even for fully synchronous writes. In some cases, it stores the pointers to the TXG on slower platter disk, rather than the data itself. And then in some cases, which my blog addresses, it stores the actual data blocks. It's highly variable depending on the environment, and the intense details are not relevant to this post. You should read https://blogs.oracle.com/realneel/entry/the_zfs_intent_log if you want to get a better understanding of when it is used, and when it isn't. I'll update the post to make mention of the pointers on slower platter disk though. Thanks for helping me realize I missed that.

    ZIL is nothing special, it’s an ordinary journal.

    Heh. No, it's very special. The function of the ZIL is to replay the last transaction in the event of a catastrophe. Without a fsck(8). This is not the function of a journal. With a filesystem journal, the journal is opened before the write, the write occurs for both the inodes and then the data blocks, then the journal is closed. For every one application write, 4 disk writes are committed, one after the other. If a power outage occurs while committing data to the inode, but before committing the blocks, you have data corruption. A fsck(8) will identify the opened inode, and read whatever data is in that physical location on disk, but it may or may not be the data you're looking for, the latter likely the case.

    With the ZIL, the data is committed to the ZIL by ZFS, which could be on slow platter, as a single write. Then when the disk ACKs the write, a second write is committed to disk with the same transaction. This transaction group (TXG) includes the checksums, inodes, data blocks, etc. All in one write. So, if there is a power outage, either you got the write, or you didn't (part of the atomic nature of the transaction). So, you either have new data, or old data. But not corrupted data. Also, we only have 2 writes to disk, instead of 4, as you would with a journal. You really should read how transactional databases work, because ZFS is very, very similar in function.

    And lastly, your remark that after the write done is acknowledged to the application the data is flushed to the disk (letting alone the fact that the data was on the disk already safe and sound) goes against the whole principle of a synchronous write. The point of synchronous write is that the writing application is not notified about the write done until the data is really written to the target. You of course do know that, it’s just a little bit unfortunate wording you have chosen.

    This is EXACTLY the function of the ZIL. When the ZIL has the data, it's on stable storage. Thus, we can ACK back to the application that the data has been committed to stable storage. When the ZIL is on a fast SSD or RAM disk, the ACK is substantially improved. The application can move on to other functions faster as a result. This is documented all over the Web. I appreciate your concern for my knowledge on the subject, but you REALLY should read the docs. Your lack of understanding how transactional writes work is showing your ignorance.

    So I would suggest you to consider the following adjustment to the first picture. Make the application box a little part of the RAM rectangle, pretty much in a similar way as the little ZIL rectangle is a part of the cylinder. Then draw another little box next to the application box within the RAM rectangle and call it “ZFS”. Then the whole process would be: a very short horizontal arrow “1″ between application and ZFS boxes within the RAM rectangle, then vertical arrow “2″ between ZFS box in the RAM and the ZIL box in the disk, and lastly again a short horizontal arrow “3″ back between ZFS and application boxes within the RAM. Plain and simple.

    I chose the design I did, because it's easy to understand. And, if you read the post, you'll see where I made clarifications, such as the application already residing in RAM. Making those adjustments would complicate the picture and add a lot of unnecessary noise. If you read the post, you understand what the images are communicating.

    Thanks for stopping by. Please also read the following documentation about the ZIL, transaction groups, and just ZFS in general:

    * Official docs: http://docs.oracle.com/cd/E19253-01/819-5461/
    * Async writes: http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-slogzil-and-data-fragmentation/
    * FAQ about the ZIL and SSDs: http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs#benefit. Also read http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs#spacezil about the space and what it contains.
    * And probably the best post on the subject, which confirms everything in my post: http://nex7.blogspot.com/2013/04/zfs-intent-log.html

    So, to summarize so far, you've got a ZFS Intent Log that is very similar to the log a SQL database uses, write once and forget (unless something bad happens), and you've got an in-RAM write cache and transaction group commits that handle actually writing to the data vdevs (and by the by, the txg commits are sequential, so all your random write traffic that came in between commits is sequential when it hits disk). The write cache is volatile as it's in RAM, so the ZIL is in place to store the synchronous writes on stable media to restore from if things go south.

    I think I understand the ZIL. Please read the docs.

  3. patrick domack | April 27, 2013 at 12:54 pm

    yes, sync writes are always written twice.
    once to the intent log (journal) and once to the pool.

    if you have no slog (separate log/journal), then the journal exists within your pool.
    and if you're paranoid, ext4 will work this same way by adjusting its options.

    i didn't see any talk about the performance issues with using a slog though.

    all writes to an slog happen one at a time (queue depth=1 always). so the latency of your writes matters more than all other performance metrics on your slog device.

    if you have multiple slog devs you can increase performance, cause each will be used for a different writer, but only with multiple writers, cause a sync write blocks the current one.

    while zfs is waiting for the slog, it will group other writes together to create a larger transaction though, so all isn't lost, but you're generally going to have to have a lot of writers to gain performance here.

    other things to muse over:
    zfs never verifies writes to the slog are correct till it needs them for recovery.
    zfs depends on the slog not to lie about flushing data to disk, or you might as well use async writes instead.

