Patchwork [RFC] ext4: Don't send extra barrier during fsync if there are no dirty pages.

login
register
mail settings
Submitter Jan Kara
Date July 21, 2010, 5:16 p.m.
Message ID <20100721171609.GC1215@atrey.karlin.mff.cuni.cz>
Download mbox | patch
Permalink /patch/59483/
State New
Headers show

Comments

Jan Kara - July 21, 2010, 5:16 p.m.
Hi,

> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > 
> > The problem with not issuing a cache flush when you have dirty meta
> > data or data is that it does not have any tie to the state of the
> > volatile write cache of the target storage device.
> 
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
> 
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
> 
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.
> 
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
> 
> 	fsync(control_fd);
> 	fdatasync(data_fd);
> 
> The first fsync() will have executed a cache flush operation.  So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
> 
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.
  Current implementation already avoids journal commit because of
fdatasync(data_fd). We remeber a transaction ID when inode metadata has
last been updated and do not force a transaction commit if it is already
committed. Thus the first fsync might force a transaction commit but second
fdatasync likely won't.
  We could actually improve the scheme to work for data as well. I wrote
a proof-of-concept patches (attached) and they nicely avoid second barrier
when doing:
echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1

  Ted, would you be interested in something like this?

								Honza
Darrick J. Wong - Aug. 3, 2010, 12:09 a.m.
On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
>   Hi,
> 
> > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > > 
> > > The problem with not issuing a cache flush when you have dirty meta
> > > data or data is that it does not have any tie to the state of the
> > > volatile write cache of the target storage device.
> > 
> > We track whether or not there is any metadata updates associated with
> > the inode already; if it does, we force a journal commit, and this
> > implies a barrier operation.
> > 
> > The case we're talking about here is one where either (a) there is no
> > journal, or (b) there have been no metadata updates (I'm simplifying a
> > little here; in fact we track whether there have been fdatasync()- vs
> > fsync()- worthy metadata updates), and so there hasn't been a journal
> > commit to do the cache flush.
> > 
> > In this case, we want to track when is the last time an fsync() has
> > been issued, versus when was the last time data blocks for a
> > particular inode have been pushed out to disk.
> > 
> > To use an example I used as motivation for why we might want an
> > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > of:
> > 
> > 	fsync(control_fd);
> > 	fdatasync(data_fd);
> > 
> > The first fsync() will have executed a cache flush operation.  So when
> > we do the fdatasync() (assuming that no metadata needs to be flushed
> > out to disk), there is no need for the cache flush operation.
> > 
> > If we had an enhanced fsync command, we would also be able to
> > eliminate a second journal commit in the case where data_fd also had
> > some metadata that needed to be flushed out to disk.
>   Current implementation already avoids journal commit because of
> fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> last been updated and do not force a transaction commit if it is already
> committed. Thus the first fsync might force a transaction commit but second
> fdatasync likely won't.
>   We could actually improve the scheme to work for data as well. I wrote
> a proof-of-concept patches (attached) and they nicely avoid second barrier
> when doing:
> echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
> 
>   Ted, would you be interested in something like this?

Well... on my fsync-happy workloads, this seems to cut the barrier count down
by about 20%, and speeds it up by about 20%.

I also have a patch to ext4_sync_files that batches the fsync requests together
for a further 20% decrease in barrier IOs, which makes it run another 20%
faster.  I'll send that one out shortly, though I've not safety-tested it at
all.

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig - Aug. 3, 2010, 9:01 a.m.
On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.

Care to share the test case for this?  I'd be especially interesting on
how it behaves with non-draining barriers / cache flushes in fsync.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Aug. 3, 2010, 1:21 p.m.
On Mon 02-08-10 17:09:39, Darrick J. Wong wrote:
> On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
> >   Hi,
> > 
> > > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > > > 
> > > > The problem with not issuing a cache flush when you have dirty meta
> > > > data or data is that it does not have any tie to the state of the
> > > > volatile write cache of the target storage device.
> > > 
> > > We track whether or not there is any metadata updates associated with
> > > the inode already; if it does, we force a journal commit, and this
> > > implies a barrier operation.
> > > 
> > > The case we're talking about here is one where either (a) there is no
> > > journal, or (b) there have been no metadata updates (I'm simplifying a
> > > little here; in fact we track whether there have been fdatasync()- vs
> > > fsync()- worthy metadata updates), and so there hasn't been a journal
> > > commit to do the cache flush.
> > > 
> > > In this case, we want to track when is the last time an fsync() has
> > > been issued, versus when was the last time data blocks for a
> > > particular inode have been pushed out to disk.
> > > 
> > > To use an example I used as motivation for why we might want an
> > > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > > of:
> > > 
> > > 	fsync(control_fd);
> > > 	fdatasync(data_fd);
> > > 
> > > The first fsync() will have executed a cache flush operation.  So when
> > > we do the fdatasync() (assuming that no metadata needs to be flushed
> > > out to disk), there is no need for the cache flush operation.
> > > 
> > > If we had an enhanced fsync command, we would also be able to
> > > eliminate a second journal commit in the case where data_fd also had
> > > some metadata that needed to be flushed out to disk.
> >   Current implementation already avoids journal commit because of
> > fdatasync(data_fd). We remeber a transaction ID when inode metadata has
> > last been updated and do not force a transaction commit if it is already
> > committed. Thus the first fsync might force a transaction commit but second
> > fdatasync likely won't.
> >   We could actually improve the scheme to work for data as well. I wrote
> > a proof-of-concept patches (attached) and they nicely avoid second barrier
> > when doing:
> > echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
> > 
> >   Ted, would you be interested in something like this?
> 
> Well... on my fsync-happy workloads, this seems to cut the barrier count down
> by about 20%, and speeds it up by about 20%.
  Nice, thanks for measurement.

								Honza
Darrick J. Wong - Aug. 4, 2010, 6:16 p.m.
On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> > Well... on my fsync-happy workloads, this seems to cut the barrier count down
> > by about 20%, and speeds it up by about 20%.
> 
> Care to share the test case for this?  I'd be especially interesting on
> how it behaves with non-draining barriers / cache flushes in fsync.

Sure.  When I run blktrace with the ffsb profile, I get these results:

barriers	transactions/sec
16212		206
15625		201
10442		269
10870		266
15658		201

Without Jan's patch:
barriers	transactions/sec
20855		177
20963		177
20340		174
20908		177

The two ~270 results are a little odd... if we ignore them, the net gain with
Jan's patch is about a 25% reduction in barriers issued and about a 15%
increase in tps.  (If we don't, it's ~30% and ~30%, respectively.)  That said,
I was running mkfs between runs, so it's possible that the disk layout could
have shifted a bit.  If I turn off the fsync parts of the ffsb profile, the
barrier counts drop to about a couple every second or so, which means that
Jan's patch doesn't have much of an effect.  But it does help if someone is
hammering on the filesystem with fsync.

The ffsb profile is attached below.

--D

-----------

time=300
alignio=1
directio=1

[filesystem0]
	location=/mnt/
	num_files=100000
	num_dirs=1000

reuse=1
	# File sizes range from 1kB to 1MB.
	size_weight 1KB 10
	size_weight 2KB 15
	size_weight 4KB 16
	size_weight 8KB 16
	size_weight 16KB 15
	size_weight 32KB 10
	size_weight 64KB 8
	size_weight 128KB 4
	size_weight 256KB 3
	size_weight 512KB 2
	size_weight 1MB 1

create_blocksize=1048576
[end0]

[threadgroup0]
	num_threads=64

	readall_weight=4
	create_fsync_weight=2
	delete_weight=1

	append_weight		= 1
	append_fsync_weight	= 1
	stat_weight		= 1
	create_weight		= 1
	writeall_weight		= 1
	writeall_fsync_weight	= 1
	open_close_weight	= 1


	write_size=64KB
	write_blocksize=512KB

	read_size=64KB
	read_blocksize=512KB

	[stats]
		enable_stats=1
		enable_range=1

		msec_range    0.00      0.01
		msec_range    0.01      0.02
		msec_range    0.02      0.05
		msec_range    0.05      0.10
		msec_range    0.10      0.20
		msec_range    0.20      0.50
		msec_range    0.50      1.00
		msec_range    1.00      2.00
		msec_range    2.00      5.00
		msec_range    5.00     10.00
		msec_range   10.00     20.00
		msec_range   20.00     50.00
		msec_range   50.00    100.00
		msec_range  100.00    200.00
		msec_range  200.00    500.00
		msec_range  500.00   1000.00
		msec_range 1000.00   2000.00
		msec_range 2000.00   5000.00
		msec_range 5000.00  10000.00
	[end]
[end0]
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

From 4fe836d5554dc8cb0732af1f4e5b317d7e59febf Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 21 Jul 2010 19:01:51 +0200
Subject: [PATCH 2/2] ext4: Send barriers on fsync only when needed

It isn't necessary to send a barrier to disk for fsync of file 'f'
when we already sent one after all the data of 'f' have been written.
Implement logic to detect this condition and avoid sending barrier
in this case.

We use counters of submitted and completed IO barriers for a block device.
When a page is written to the block device, we store current number of
barriers submitted in the inode. When we handle fsync, we check whether
the number of completed barriers is at least that large.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  |    4 +++-
 fs/ext4/fsync.c |   19 +++++++++++++++++--
 fs/ext4/inode.c |    4 ++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..cc67e72 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -832,10 +832,12 @@  struct ext4_inode_info {
 
 	/*
 	 * Transactions that contain inode's metadata needed to complete
-	 * fsync and fdatasync, respectively.
+	 * fsync and fdatasync, respectively and barrier id when we last
+	 * wrote data to this file.
 	 */
 	tid_t i_sync_tid;
 	tid_t i_datasync_tid;
+	unsigned i_data_bid;
 };
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..d8a6995 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -57,6 +57,21 @@  static void ext4_sync_parent(struct inode *inode)
 	}
 }
 
+static int ext4_need_issue_data_flush(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	int comp_bid, inode_bid = ei->i_data_bid;
+
+	if (!(journal->j_flags & JBD2_BARRIER))
+		return 0;
+	comp_bid = atomic_read(&inode->i_sb->s_bdev->bd_barriers_completed);
+	/* inode_bid < completed_bid safe against wrapping */
+	if (inode_bid - comp_bid < 0)
+		return 0;
+	return 1;
+}
+
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -126,11 +141,11 @@  int ext4_sync_file(struct file *file, int datasync)
 		 */
 		if (ext4_should_writeback_data(inode) &&
 		    (journal->j_fs_dev != journal->j_dev) &&
-		    (journal->j_flags & JBD2_BARRIER))
+		    ext4_need_issue_data_flush(inode))
 			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
 					NULL, BLKDEV_IFL_WAIT);
 		ret = jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
+	} else if (ext4_need_issue_data_flush(inode))
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
 			BLKDEV_IFL_WAIT);
 	return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..8d57aae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2758,6 +2758,10 @@  static int ext4_writepage(struct page *page,
 	} else
 		ret = block_write_full_page(page, noalloc_get_block_write,
 					    wbc);
+	/* Make sure we read current value of bd_barriers_sent */
+	smp_rmb();
+	EXT4_I(inode)->i_data_bid =
+		atomic_read(&inode->i_sb->s_bdev->bd_barriers_sent);
 
 	return ret;
 }
-- 
1.6.4.2