
[RFC,v2] ext4: Don't send extra barrier during fsync if there are no dirty pages.

Message ID 20100629205102.GM15515@tux1.beaverton.ibm.com
State Superseded, archived

Commit Message

Darrick J. Wong June 29, 2010, 8:51 p.m. UTC
Hmm.  A while ago I was complaining that an evil program that calls fsync() in
a loop will send a continuous stream of write barriers to the hard disk.  Ted
theorized that it might be possible to set a flag in ext4_writepage and clear
it in ext4_sync_file; if we happen to enter ext4_sync_file and the flag isn't
set (meaning that nothing has been dirtied since the last fsync()) then we
could skip issuing the barrier.
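
For illustration, a minimal userspace reproducer of that pattern (a
hypothetical sketch, not part of the original report) might look like:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return 1;
        write(fd, "x", 1);
        for (;;)
                fsync(fd);      /* nothing new is dirtied, yet each call still sends a barrier */
        return 0;
}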

Here's an experimental patch to do something sort of like that.  From a quick
run with blktrace, it seems to skip the redundant barriers and improve the ffsb
mail server scores.  However, I haven't done extensive power failure testing to
see how much data it can destroy.  For that matter I'm not even 100% sure it's
correct at what it aims to do.

This second version of the patch uses the inode state flags and (suboptimally)
also catches directio writes.  It might be a better idea to try to coordinate
all the barrier requests across the whole filesystem, though that's a bit more
difficult.

Comments

Theodore Ts'o Aug. 5, 2010, 4:40 p.m. UTC | #1
On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
> 
> This second version of the patch uses the inode state flags and
> (suboptimally) also catches directio writes.  It might be a better
> idea to try to coordinate all the barrier requests across the whole
> filesystem, though that's a bit more difficult.

Hi Darrick,

When I looked at this patch more closely, and thought about it hard,
the fact that this helps the FFSB mail server benchmark surprised me,
and then I realized it's because it doesn't really accurately emulate
a mail server at all.  Or at least, not an MTA.  In an MTA, only one CPU
will touch a queue file, so there should never be a case of a double
fsync to a single file.  This is why I was thinking about
coordinating barrier requests across the whole filesystem --- it helps
out in the case where you have all your CPU threads hammering
/var/spool/mqueue, or /var/spool/exim4/input, and where they are all
creating queue files, and calling fsync() in parallel.  This patch
won't help that case.

It will help the case of a MDA --- Mail Delivery Agent --- if you have
multiple e-mails all getting delivered at the same time into the same
/var/mail/<username> file, with an fsync() following after a mail
message is appended to the file.  This is a much rarer case, and I
can't think of any other workload where you will have multiple
processes racing against each other and fsync'ing the same inode.
Even in the MDA case, it's rare that you will have one mbox getting so
many deliveries that this case would be hit.

So while I was thinking about accepting this patch, I now find myself
hesitating.  There _is_ a minor race in the patch that I noticed,
which I'll point out below, but that's easily fixed.  The bigger issue
is it's not clear this patch will actually make a difference in the
real world.  I've been trying and failing to think of a real-life application
which is stupid enough to do back-to-back fsync commands, even if it's
because it has multiple threads all trying to write to the file and
fsync to it in an uncoordinated fashion.  It would be easy enough to
add instrumentation that would trigger a printk if the patch optimized
out a barrier --- and if someone can point out even one badly written
application --- whether it's mysql, postgresql, a GNOME or KDE
application, db2, Oracle, etc., I'd say sure.  But adding even a tiny
amount of extra complexity for something which is _only_ helpful for a
benchmark grates against my soul....
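
Purely as a sketch of the instrumentation I have in mind (untested, and
just reusing the flag from your patch), the fsync path could log whenever
it skips a barrier:

	} else if (journal->j_flags & JBD2_BARRIER) {
		if (ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
					   NULL, BLKDEV_IFL_WAIT);
			ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
		} else {
			/* flag clear: nothing dirtied since the last flush */
			printk(KERN_INFO
			       "ext4: fsync skipped barrier for inode %lu\n",
			       inode->i_ino);
		}
	}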

So if you can think of something, please point it out to me.  If it
would help ext4 users in real life, I'd be all for it.  But at this
point, I'm thinking that perhaps the real issue is that the mail
server benchmark isn't accurately reflecting a real life workload.

Am I missing something?

						- Ted

> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 592adf2..96625c3 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -130,8 +130,11 @@ int ext4_sync_file(struct file *file, int datasync)
>  			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
>  					NULL, BLKDEV_IFL_WAIT);
>  		ret = jbd2_log_wait_commit(journal, commit_tid);
> -	} else if (journal->j_flags & JBD2_BARRIER)
> +	} else if (journal->j_flags & JBD2_BARRIER &&
> +		   ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
>  		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
>  			BLKDEV_IFL_WAIT);
> +		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
> +	}
>  	return ret;

This is the minor race I was talking about; you should move the
ext4_clear_inode_state() call above blkdev_issue_flush().  If there is
a race, you want to fail safe, by accidentally issuing a second
barrier, instead of possibly skipping a barrier if a page gets dirtied
*after* the blkdev_issue_flush() has taken effect, but *before* we
have a chance to clear the EXT4_STATE_DIRTY_DATA flag.
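
In other words, something like this (untested, just to show the ordering):

	} else if (journal->j_flags & JBD2_BARRIER &&
		   ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
		/*
		 * Clear the flag before issuing the flush; a page dirtied
		 * while the flush is in flight simply sets it again, so
		 * the worst case is an extra barrier, not a missing one.
		 */
		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
				   BLKDEV_IFL_WAIT);
	}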

BTW, my apologies for not looking at this sooner, and giving you this
feedback earlier.  This summer has been crazy busy, and I didn't have
time until the merge window provided a forcing function to look at
outstanding patches.
Theodore Ts'o Aug. 5, 2010, 4:45 p.m. UTC | #2
P.S.  If it wasn't clear, I'm still in favor of trying to coordinate
barriers across the whole file system, since that is much more likely
to help use cases that arise in real life.

   	    	       	     	     - Ted
Darrick J. Wong Aug. 6, 2010, 7:04 a.m. UTC | #3
On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
> P.S.  If it wasn't clear, I'm still in favor of trying to coordinate
> barriers across the whole file system, since that is much more likely
> to help use cases that arise in real life.

Ok.  I have a rough sketch of a patch to do that, and I was going to send it
out today, but the test machine caught on fire while I was hammering it with
the fsync tests one last time and ... yeah.  I'm fairly sure the patch didn't
cause the fire, but I'll check anyway after I finish cleaning up.

"[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P

(The patch did seem to cut barrier request counts by about 20%, though the
impact on performance was pretty small.)

--D
Darrick J. Wong Aug. 6, 2010, 7:13 a.m. UTC | #4
On Thu, Aug 05, 2010 at 12:40:08PM -0400, Ted Ts'o wrote:
> On Tue, Jun 29, 2010 at 01:51:02PM -0700, Darrick J. Wong wrote:
> > 
> > This second version of the patch uses the inode state flags and
> > (suboptimally) also catches directio writes.  It might be a better
> > idea to try to coordinate all the barrier requests across the whole
> > filesystem, though that's a bit more difficult.
> 
> Hi Darrick,
> 
> When I looked at this patch more closely, and thought about it hard,
> the fact that this helps the FFSB mail server benchmark surprised me,
> and then I realized it's because it doesn't really accurately emulate
> a mail server at all.  Or at least, not an MTA.  In an MTA, only one CPU
> will touch a queue file, so there should never be a case of a double
> fsync to a single file.  This is why I was thinking about
> coordinating barrier requests across the whole filesystem --- it helps
> out in the case where you have all your CPU threads hammering
> /var/spool/mqueue, or /var/spool/exim4/input, and where they are all
> creating queue files, and calling fsync() in parallel.  This patch
> won't help that case.
> 
> It will help the case of a MDA --- Mail Delivery Agent --- if you have
> multiple e-mails all getting delivered at the same time into the same
> /var/mail/<username> file, with an fsync() following after a mail
> message is appended to the file.  This is a much rarer case, and I
> can't think of any other workload where you will have multiple
> processes racing against each other and fsync'ing the same inode.
> Even in the MDA case, it's rare that you will have one mbox getting so
> many deliveries that this case would be hit.
> 
> So while I was thinking about accepting this patch, I now find myself
> hesitating.  There _is_ a minor race in the patch that I noticed,
> which I'll point out below, but that's easily fixed.  The bigger issue
> is it's not clear this patch will actually make a difference in the
> real world.  I've been trying and failing to think of a real-life application
> which is stupid enough to do back-to-back fsync commands, even if it's
> because it has multiple threads all trying to write to the file and
> fsync to it in an uncoordinated fashion.  It would be easy enough to
> add instrumentation that would trigger a printk if the patch optimized
> out a barrier --- and if someone can point out even one badly written
> application --- whether it's mysql, postgresql, a GNOME or KDE
> application, db2, Oracle, etc., I'd say sure.  But adding even a tiny
> amount of extra complexity for something which is _only_ helpful for a
> benchmark grates against my soul....
> 
> So if you can think of something, please point it out to me.  If it
> would help ext4 users in real life, I'd be all for it.  But at this
> point, I'm thinking that perhaps the real issue is that the mail
> server benchmark isn't accurately reflecting a real life workload.

Yes, it's a proxy for something else.  One of our larger products would like to
use fsync() to flush dirty data out to disk (right now it looks like they use
O_SYNC), but they're concerned that the many threads they use can create an
fsync() storm.  So, they wanted to know how to mitigate the effects of those
storms.  Not calling fsync() except when they really need to guarantee a disk
write is a good start, but I'd like to get ahead of them to pick off more low
hanging fruit like the barrier coordination and not sending barriers when
there's no dirty data ... before they run into it. :)
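
Roughly, the change they are considering looks like this (hypothetical
sketch with placeholder path/buf/len, not their actual code):

/* today: every write is synchronous */
int fd = open(path, O_WRONLY | O_SYNC);
write(fd, buf, len);

/* proposed: buffered writes, paying for durability only at the points
 * where a disk write really must be guaranteed */
int fd2 = open(path, O_WRONLY);
write(fd2, buf, len);
/* ... many more writes, possibly from many threads ... */
fsync(fd2);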

--D
Ric Wheeler Aug. 6, 2010, 10:17 a.m. UTC | #5
On 08/06/2010 03:04 AM, Darrick J. Wong wrote:
> On Thu, Aug 05, 2010 at 12:45:04PM -0400, Ted Ts'o wrote:
>> P.S.  If it wasn't clear, I'm still in favor of trying to coordinate
>> barriers across the whole file system, since that is much more likely
>> to help use cases that arise in real life.
> Ok.  I have a rough sketch of a patch to do that, and I was going to send it
> out today, but the test machine caught on fire while I was hammering it with
> the fsync tests one last time and ... yeah.  I'm fairly sure the patch didn't
> cause the fire, but I'll check anyway after I finish cleaning up.
>
> "[PATCH] ext4: Don't set my machine ablaze with barrier requests" :P
>
> (The patch did seem to cut barrier requests counts by about 20% though the
> impact on performance was pretty small.)
>
> --D

Just a note, one thing that we have been doing is trying to get a reasonable 
regression test in place for testing data integrity. That might be useful to 
share as we float patches around barrier changes.

Basic test:

(1) Get a box with an external e-sata (or USB) connected drive

(2) Fire off some large load on that drive (Chris Mason had one; some of our QE 
engineers have been using fs_mark: fs_mark -d /your_fs/test_dir -S 0 -t 8 -F)

(3) Pull the power cable to that external box.

Of course, you can use any system and drop power, but the above setup will make 
sure that we kill the write cache on the device without letting the firmware 
destage the cache contents.

The test passes if you can now do the following:

(1) Mount the file system without error

(2) Unmount and force an fsck - that should run without reporting errors as well.

Note that the above does not use fsync in the testing.

Thanks!

Ric

Theodore Ts'o Aug. 6, 2010, 6:04 p.m. UTC | #6
On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> Yes, it's a proxy for something else.  One of our larger products would like to
> use fsync() to flush dirty data out to disk (right now it looks like they use
> O_SYNC), but they're concerned that the many threads they use can create an
> fsync() storm.  So, they wanted to know how to mitigate the effects of those
> storms.  Not calling fsync() except when they really need to guarantee a disk
> write is a good start, but I'd like to get ahead of them to pick off more low
> hanging fruit like the barrier coordination and not sending barriers when
> there's no dirty data ... before they run into it. :)

Do they need a barrier operation, or do they just want to initiate the
I/O?  One of the reasons I found it hard to believe you would have
multiple threads all fsync()'ing the same file is that keeping the
file consistent is very hard to do in such a scenario.  Maintaining
ACID-level consistency without a single thread which coordinates when
commit records get written is, I'm sure, theoretically possible, but in
practice, I wasn't sure any applications would actually be _written_
that way.

If the goal is just to make sure I/O is getting initiated, without
necessarily waiting for assurance that a specific file write has hit
the disk platters, it may be that the Linux-specific
sync_file_range(2) system call might be a far more efficient way of
achieving those ends.  Without more details about what this product is
doing, it's hard to say, of course.
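
For example (hypothetical usage, since I don't know what the product is
doing), something like this initiates writeback without waiting for
completion and without the cache flush that fsync() implies:

#define _GNU_SOURCE
#include <fcntl.h>

/* Queue writeback of the whole file (nbytes == 0 means "to end of file")
 * and return; no metadata commit, no barrier/cache flush. */
static int kick_writeback(int fd)
{
        return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}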

						- Ted
Darrick J. Wong Aug. 9, 2010, 7:36 p.m. UTC | #7
On Fri, Aug 06, 2010 at 02:04:54PM -0400, Ted Ts'o wrote:
> On Fri, Aug 06, 2010 at 12:13:56AM -0700, Darrick J. Wong wrote:
> > Yes, it's a proxy for something else.  One of our larger products would like to
> > use fsync() to flush dirty data out to disk (right now it looks like they use
> > O_SYNC), but they're concerned that the many threads they use can create an
> > fsync() storm.  So, they wanted to know how to mitigate the effects of those
> > storms.  Not calling fsync() except when they really need to guarantee a disk
> > write is a good start, but I'd like to get ahead of them to pick off more low
> > hanging fruit like the barrier coordination and not sending barriers when
> > there's no dirty data ... before they run into it. :)
> 
> Do they need a barrier operation, or do they just want to initiate the
> I/O?  One of the reasons I found it hard to believe you would have
> multiple threads all fsync()'ing the same file is that keeping the
> file consistent is very hard to do in such a scenario.  Maintaining
> ACID-level consistency without a single thread which coordinates when
> commit records get written is, I'm sure, theoretically possible, but in
> practice, I wasn't sure any applications would actually be _written_
> that way.

> If the goal is just to make sure I/O is getting initiated, without
> necessarily waiting for assurance that a specific file write has hit
> the disk platters, it may be that the Linux-specific
> sync_file_range(2) system call might be a far more efficient way of
> achieving those ends.  Without more details about what this product is
> doing, it's hard to say, of course.

I don't know for sure, though given what I've seen of the app behavior I
suspect they simply want the disk cache flushed, and don't need the full
ordering semantics.  That said, I do think they want to make sure that data
actually hits the disk platters.

--D

Patch


Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 fs/ext4/ext4.h  |    1 +
 fs/ext4/fsync.c |    5 ++++-
 fs/ext4/inode.c |    7 +++++++
 3 files changed, 12 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..d2e8e40 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1181,6 +1181,7 @@  enum {
 	EXT4_STATE_EXT_MIGRATE,		/* Inode is migrating */
 	EXT4_STATE_DIO_UNWRITTEN,	/* need convert on dio done*/
 	EXT4_STATE_NEWENTRY,		/* File just added to dir */
+	EXT4_STATE_DIRTY_DATA,		/* dirty data, need barrier */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field)					\
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..96625c3 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -130,8 +130,11 @@  int ext4_sync_file(struct file *file, int datasync)
 			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
 					NULL, BLKDEV_IFL_WAIT);
 		ret = jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
+	} else if (journal->j_flags & JBD2_BARRIER &&
+		   ext4_test_inode_state(inode, EXT4_STATE_DIRTY_DATA)) {
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
 			BLKDEV_IFL_WAIT);
+		ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+	}
 	return ret;
 }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..486d349 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2685,6 +2685,8 @@  static int ext4_writepage(struct page *page,
 	else
 		len = PAGE_CACHE_SIZE;
 
+	ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
 	if (page_has_buffers(page)) {
 		page_bufs = page_buffers(page);
 		if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
@@ -2948,6 +2950,8 @@  static int ext4_da_writepages(struct address_space *mapping,
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
+	ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
 	range_cyclic = wbc->range_cyclic;
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index;
@@ -3996,6 +4000,9 @@  static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
+	if (rw == WRITE)
+		ext4_set_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		return ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs);