Comments on: ZFS Administration, Appendix A - Visualizing The ZFS Intent Log (ZIL) https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/

By: Roger Qiu https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-138605 Tue, 24 Jun 2014 08:34:20 +0000 http://pthree.org/?p=3073#comment-138605 Asynchronous writes seem to have higher performance than synchronous writes.

However, this article http://milek.blogspot.com.au/2010/05/zfs-synchronous-vs-asynchronous-io.html
says that asynchronous writes could result in inconsistency from the application's point of view. This especially affects databases, since they may implement their own transactions.

In what situations would asynchronous writes be preferable to synchronous writes?
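For reference, ZFS exposes this trade-off per dataset through the sync property, so it doesn't have to be an all-or-nothing choice. A minimal sketch, with hypothetical pool and dataset names:

    # Default: honor the application's own sync requests (fsync, O_SYNC)
    zfs set sync=standard tank/home

    # Treat every write as synchronous -- safest for databases that
    # implement their own transactions
    zfs set sync=always tank/db

    # Ignore sync requests entirely -- fastest, but after a power loss the
    # application's view of its data may be inconsistent; only reasonable
    # for data you can regenerate
    zfs set sync=disabled tank/scratch

    # Verify the current settings
    zfs get sync tank/home tank/db tank/scratch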

By: Richard Elling https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-134415 Sat, 10 May 2014 18:06:08 +0000 http://pthree.org/?p=3073#comment-134415 Each dataset has a separate ZIL and there is some concurrency related to multithreaded sync writes for interesting workloads (think databases or NFS). Thus the "ZIL" is often "ZILs" and the workload is only single-threaded in the degenerate case. Expecting queue depth =1 is a bad assumption.
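In practice that means sync-heavy workloads can be isolated on their own datasets so their ZIL streams don't serialize behind one another; a rough sketch with hypothetical names:

    # Each dataset gets its own ZIL, so an NFS export and a database
    # don't share a single log stream
    zfs create tank/nfs
    zfs create tank/postgres
    zfs set sync=always tank/postgres   # force sync semantics for the database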

By: Aaron Toponce : ZFS Administration, Part XIII- Sending and Receiving Filesystems https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-127235 Tue, 02 Jul 2013 13:25:09 +0000 http://pthree.org/?p=3073#comment-127235 […] Visualizing The ZFS Intent Log (ZIL) […]

By: patrick domack https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-125471 Sat, 27 Apr 2013 18:54:19 +0000 http://pthree.org/?p=3073#comment-125471 Yes, sync writes are always written twice:
once to the intent log (journal), and once to the pool.

If you have no slog (separate log/journal device), then the journal exists within your pool.
And if you're paranoid, ext4 can work the same way by adjusting its mount options.
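A minimal sketch of both ideas, with hypothetical device names: giving ZFS a dedicated log device, and telling ext4 to journal data blocks as well as metadata:

    # ZFS: add a separate log (slog) device; without one, the ZIL
    # lives on the pool's regular vdevs
    zpool add tank log /dev/nvme0n1p1

    # ext4: the "paranoid" equivalent -- journal the data, not just the metadata
    mount -o data=journal /dev/sdb1 /mnt/data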

I didn't see any talk about the performance issues with using a slog, though.

All writes to a slog happen one at a time (queue depth = 1, always), so the latency of your writes matters more than any other performance metric of your slog device.

If you have multiple slog devices you can increase performance, because each will be used by a different writer; but this only helps with multiple writers, because a sync write blocks behind the one currently in flight.

While ZFS is waiting on the slog, it will group other writes together to create a larger transaction, so all isn't lost, but you generally have to have a lot of writers to gain performance here.
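A hedged sketch of both points, with hypothetical devices; fio is used here only because it makes it easy to generate many concurrent synchronous writers:

    # Two separate log vdevs -- sync writes are spread across them
    zpool add tank log /dev/nvme0n1p1 /dev/nvme1n1p1

    # Eight concurrent sync writers against the pool
    fio --name=synctest --directory=/tank/test --rw=randwrite \
        --bs=4k --size=256m --numjobs=8 --sync=1 --group_reporting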

Other things to muse over:
ZFS never verifies that writes to the slog are correct until it needs them for recovery.
ZFS depends on the slog not lying about having flushed data to disk, or you might as well use async writes instead.
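One practical consequence, sketched with a hypothetical device: if the slog has a volatile write cache without power-loss protection, either disable the cache or use a device with supercapacitors, otherwise the "stable storage" guarantee is fiction:

    # Disable the drive's volatile write cache if it isn't power-protected
    hdparm -W 0 /dev/sdc

    # Confirm the log vdev is present and healthy
    zpool status tank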

By: Aaron Toponce https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-125164 Tue, 23 Apr 2013 13:33:31 +0000 http://pthree.org/?p=3073#comment-125164

But the application is already in the RAM and all its data is thus already in the RAM,

I mentioned that in the post. I said:

When the application needs to make a synchronous write, the contents of that write are sent to RAM, where the application is currently living, as well as sent to the ZIL.

Then in the next paragraph, I said:

Technically, the application is running in RAM already, but I took it out to make the image a bit more clean.

With regard to the ZIL, you should read the zil.c source code. This is exactly the behavior. The data is indeed written twice, but as I mentioned in the article, I am glossing over some details, such as when the ZIL is used and when it isn't with respect to data size. In some cases, the ZIL is completely bypassed, even for fully synchronous writes. In some cases, it stores pointers to the TXG on slower platter disk, rather than the data itself. And in some cases, which my blog addresses, it stores the actual data blocks. It's highly variable depending on the environment, and the finer details are not relevant to this post. You should read https://blogs.oracle.com/realneel/entry/the_zfs_intent_log if you want a better understanding of when it is used and when it isn't. I'll update the post to mention the pointers on slower platter disk, though. Thanks for helping me realize I missed that.
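Two of the knobs behind that variability, as a hedged sketch (the module parameter shown is from ZFS on Linux; verify the name and default on your own system):

    # Per-dataset: bias toward low-latency ZIL commits (use the slog),
    # or toward throughput (write data to the pool and log only pointers)
    zfs set logbias=latency tank/db
    zfs set logbias=throughput tank/bulk

    # Writes larger than this threshold are not copied into the ZIL;
    # only a pointer to the pool block is logged
    cat /sys/module/zfs/parameters/zfs_immediate_write_sz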

ZIL is nothing special, it’s an ordinary journal.

Heh. No, it's very special. The function of the ZIL is to replay the last transaction in the event of a catastrophe. Without a fsck(8). This is not the function of a journal. With a filesystem journal, the journal is opened before the write, the write occurs for both the inodes and then the data blocks, then the journal is closed. For every one application write, 4 disk writes are committed, one after the other. If a power outage occurs while committing data to the inode, but before committing the blocks, you have data corruption. A fsck(8) will identify the opened inode and read whatever data is in that physical location on disk, but it may or may not be the data you're looking for, likely the latter.

With the ZIL, the data is committed to the ZIL by ZFS, which could be on slow platter, as a single write. Then, when the disk ACKs the write, a second write is committed to disk with the same transaction. This transaction group (TXG) includes the checksums, inodes, data blocks, etc. All in one write. So, if there is a power outage, either you got the write or you didn't (part of the atomic nature of the transaction). So, you either have new data or old data, but not corrupted data. Also, we only have 2 writes to disk, instead of 4 as you would with a journal. You really should read how transactional databases work, because ZFS is very, very similar in function.
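If you want to watch those transaction groups commit, ZFS on Linux exposes a per-pool TXG history through kstats; a quick sketch, assuming a pool named tank (the path may differ on other platforms):

    # One line per recent transaction group: state, timings, bytes written
    cat /proc/spl/kstat/zfs/tank/txgs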

And lastly, your remark that after the write done is acknowledged to the application the data is flushed to the disk (letting alone the fact that the data was on the disk already safe and sound) goes against the whole principle of a synchronous write. The point of synchronous write is that the writing application is not notified about the write done until the data is really written to the target. You of course do know that, it’s just a little bit unfortunate wording you have chosen.

This is EXACTLY the function of the ZIL. When the ZIL has the data, it's on stable storage. Thus, we can ACK back to the application that the data has been committed to stable storage. When the ZIL is on a fast SSD or RAM disk, the ACK comes back substantially faster. The application can move on to other work sooner as a result. This is documented all over the Web. I appreciate your concern for my knowledge on the subject, but you REALLY should read the docs. Your lack of understanding of how transactional writes work is showing your ignorance.
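You can see the effect with a crude latency test; a sketch assuming a test directory on the pool. With oflag=dsync, dd issues every block as a synchronous write, so the reported throughput is dominated by ZIL commit latency:

    # Run once with the ZIL on the pool's platter disks, then again
    # after adding an SSD slog, and compare the results
    dd if=/dev/zero of=/tank/test/syncfile bs=4k count=10000 oflag=dsync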

So I would suggest you to consider the following adjustment to the first picture. Make the application box a little part of the RAM rectangle, pretty much in a similar way as the little ZIL rectangle is a part of the cylinder. Then draw another little box next to the application box within the RAM rectangle and call it "ZFS". Then the whole process would be: a very short horizontal arrow "1" between application and ZFS boxes within the RAM rectangle, then vertical arrow "2" between ZFS box in the RAM and the ZIL box in the disk, and lastly again a short horizontal arrow "3" back between ZFS and application boxes within the RAM. Plain and simple.

I chose the design I did, because it's easy to understand. And, if you read the post, you'll see where I made clarifications, such as the application already residing in RAM. Making those adjustments would complicate the picture and add a lot of unnecessary noise. If you read the post, you understand what the images are communicating.

Thanks for stopping by. Please also read the following documentation about the ZIL, transaction groups, and just ZFS in general:

* Official docs: http://docs.oracle.com/cd/E19253-01/819-5461/
* Async writes: http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-slogzil-and-data-fragmentation/
* FAQ about the ZIL and SSDs: http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs#benefit. Also read http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs#spacezil about the space and what it contains.
* And probably the best post on the subject, which confirms everything in my post: http://nex7.blogspot.com/2013/04/zfs-intent-log.html

So, to summarize so far, you've got a ZFS Intent Log that is very similar to the log a SQL database uses, write once and forget (unless something bad happens), and you've got an in-RAM write cache and transaction group commits that handle actually writing to the data vdevs (and by the by, the txg commits are sequential, so all your random write traffic that came in between commits is sequential when it hits disk). The write cache is volatile as it's in RAM, so the ZIL is in place to store the synchronous writes on stable media to restore from if things go south.

I think I understand the ZIL. Please read the docs.

By: udc https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/#comment-125117 Tue, 23 Apr 2013 00:30:43 +0000 http://pthree.org/?p=3073#comment-125117 Hi, first of all thanks for the whole ZFS series; you write in an easily understandable way, and your practical experiences and opinions give additional value to it.
(Though, not trying to be a pedant or anything, but sometimes you present a personal opinion as a solid, unchangeable fact, like in the deduplication section, where you concluded that you need several times the RAM that seems really necessary because of some hard-coded ratio. I didn't check the source code to see whether there really is such a ratio, but even if so, this is open source software, so nothing is set in stone and nothing stops you or anyone else from simply changing it. Thus saying that to effectively manage deduplication for 12 TB of storage you need not 60 GB but actually 240 GB of RAM, leaving aside the reasoning behind the first number itself, is a bit of a cheap argument. Don't take it the wrong way.)

What provoked me to a reaction is this appendix article. It seems to me that the images are drawn in a little bit unfortunate way. Let's take the first image. You made it look like the application, when writing to ZFS without a separate ZIL, writes its data first to the ZIL and also to RAM, then gets confirmation that the write is done, and then the data is written from RAM to the disk. That's exactly what your picture and arrows show. But the application is already in the RAM and all its data is thus already in the RAM, and the ZIL is already on the disk, so what you are actually indicating here is that the data is pointlessly duplicated first in the memory while being simultaneously written to the disk, and then it's pointlessly duplicated again, only this time on the disk! I didn't check the source code, but honestly I very much doubt that the ZFS developers would be that stupid. I would accept that the data is duplicated in the RAM, or more precisely that it's copied to the ZFS part of memory, but I will certainly not accept that the data is written to the slow disk twice.
Also, you kind of make the ZIL look like some special entity that resides on the disk but is somehow not part of the disk, and you are somewhat hinting that the data is written to the ZIL (on the disk) and then duplicated or moved from the ZIL to the disk, effectively writing the same data to the same disk twice. A non-IT analogy to this idea would be a secretary in a company who sits at her front desk, and when a messenger from the outside world comes in and brings a package, she receives it, logs (records) this event, and then stands up and goes back inside the company to deliver the package to the actual recipient. Surely that's not what is happening. If that idea were true, then all journaling file systems would have half the write speed of non-journaling file systems. ZIL is nothing special, it's an ordinary journal. Journaling is not about a middle man through whom the data is passed to the disk (so that if the journal actually resides on the disk, the data is written to the disk twice); it's about the way the data is written to the disk, i.e. in a documented way, without making the premature assumption that the write will finish successfully and thus without "touching" any live data or file system structures already on the disk, all to make sure that in case of a failure resulting in an unfinished write there would be no, or very little, inconsistency.
And lastly, your remark that after the write done is acknowledged to the application the data is flushed to the disk (letting alone the fact that the data was on the disk already safe and sound) goes against the whole principle of a synchronous write. The point of synchronous write is that the writing application is not notified about the write done until the data is really written to the target. You of course do know that, it's just a little bit unfortunate wording you have chosen.

So I would suggest you to consider the following adjustment to the first picture. Make the application box a little part of the RAM rectangle, pretty much in a similar way as the little ZIL rectangle is a part of the cylinder. Then draw another little box next to the application box within the RAM rectangle and call it "ZFS". Then the whole process would be: a very short horizontal arrow "1" between application and ZFS boxes within the RAM rectangle, then vertical arrow "2" between ZFS box in the RAM and the ZIL box in the disk, and lastly again a short horizontal arrow "3" back between ZFS and application boxes within the RAM. Plain and simple.

The next picture would be similarly: (1) application to ZFS, (2) ZFS to ZIL (in SLOG), (3) ZFS back to application, (4) ZFS to disk. As a side effect this would also remove any confusion you noticed you had about whether there is a write from the ZIL to the disk in (4). Of course not; of course the write goes from the ZFS box, which (like any other process) is obviously in memory.

The last picture would be (1) application to ZFS, (2) ZFS to ZIL (in RAM, thus there would be 3 little boxes as part of the RAM rectangle), (3) ZFS to application, (4) ZFS to disk. Although, thinking about it, maybe even more precise would be to draw the ZIL box in the RAM as a part (a sub-box, if you will) of the ZFS box; thus the arrows would be: (1) application to ZFS (containing the ZIL inside), (2) ZFS to application, (3) ZFS to disk. It's simplified, but that's what pictures are for: to show the process clearly and simply.

Just my 2 cents.
