Message ID | 20100721171609.GC1215@atrey.karlin.mff.cuni.cz |
---|---|
State | New, archived |
Headers | show |
On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote: > Hi, > > > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote: > > > > > > The problem with not issuing a cache flush when you have dirty meta > > > data or data is that it does not have any tie to the state of the > > > volatile write cache of the target storage device. > > > > We track whether or not there is any metadata updates associated with > > the inode already; if it does, we force a journal commit, and this > > implies a barrier operation. > > > > The case we're talking about here is one where either (a) there is no > > journal, or (b) there have been no metadata updates (I'm simplifying a > > little here; in fact we track whether there have been fdatasync()- vs > > fsync()- worthy metadata updates), and so there hasn't been a journal > > commit to do the cache flush. > > > > In this case, we want to track when is the last time an fsync() has > > been issued, versus when was the last time data blocks for a > > particular inode have been pushed out to disk. > > > > To use an example I used as motivation for why we might want an > > fsync2(int fd[], int flags[], int num) syscall, consider the situation > > of: > > > > fsync(control_fd); > > fdatasync(data_fd); > > > > The first fsync() will have executed a cache flush operation. So when > > we do the fdatasync() (assuming that no metadata needs to be flushed > > out to disk), there is no need for the cache flush operation. > > > > If we had an enhanced fsync command, we would also be able to > > eliminate a second journal commit in the case where data_fd also had > > some metadata that needed to be flushed out to disk. > Current implementation already avoids journal commit because of > fdatasync(data_fd). We remeber a transaction ID when inode metadata has > last been updated and do not force a transaction commit if it is already > committed. Thus the first fsync might force a transaction commit but second > fdatasync likely won't. > We could actually improve the scheme to work for data as well. I wrote > a proof-of-concept patches (attached) and they nicely avoid second barrier > when doing: > echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1 > > Ted, would you be interested in something like this? Well... on my fsync-happy workloads, this seems to cut the barrier count down by about 20%, and speeds it up by about 20%. I also have a patch to ext4_sync_files that batches the fsync requests together for a further 20% decrease in barrier IOs, which makes it run another 20% faster. I'll send that one out shortly, though I've not safety-tested it at all. --D -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote: > Well... on my fsync-happy workloads, this seems to cut the barrier count down > by about 20%, and speeds it up by about 20%. Care to share the test case for this? I'd be especially interesting on how it behaves with non-draining barriers / cache flushes in fsync. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 02-08-10 17:09:39, Darrick J. Wong wrote: > On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote: > > Hi, > > > > > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote: > > > > > > > > The problem with not issuing a cache flush when you have dirty meta > > > > data or data is that it does not have any tie to the state of the > > > > volatile write cache of the target storage device. > > > > > > We track whether or not there is any metadata updates associated with > > > the inode already; if it does, we force a journal commit, and this > > > implies a barrier operation. > > > > > > The case we're talking about here is one where either (a) there is no > > > journal, or (b) there have been no metadata updates (I'm simplifying a > > > little here; in fact we track whether there have been fdatasync()- vs > > > fsync()- worthy metadata updates), and so there hasn't been a journal > > > commit to do the cache flush. > > > > > > In this case, we want to track when is the last time an fsync() has > > > been issued, versus when was the last time data blocks for a > > > particular inode have been pushed out to disk. > > > > > > To use an example I used as motivation for why we might want an > > > fsync2(int fd[], int flags[], int num) syscall, consider the situation > > > of: > > > > > > fsync(control_fd); > > > fdatasync(data_fd); > > > > > > The first fsync() will have executed a cache flush operation. So when > > > we do the fdatasync() (assuming that no metadata needs to be flushed > > > out to disk), there is no need for the cache flush operation. > > > > > > If we had an enhanced fsync command, we would also be able to > > > eliminate a second journal commit in the case where data_fd also had > > > some metadata that needed to be flushed out to disk. > > Current implementation already avoids journal commit because of > > fdatasync(data_fd). We remeber a transaction ID when inode metadata has > > last been updated and do not force a transaction commit if it is already > > committed. Thus the first fsync might force a transaction commit but second > > fdatasync likely won't. > > We could actually improve the scheme to work for data as well. I wrote > > a proof-of-concept patches (attached) and they nicely avoid second barrier > > when doing: > > echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1 > > > > Ted, would you be interested in something like this? > > Well... on my fsync-happy workloads, this seems to cut the barrier count down > by about 20%, and speeds it up by about 20%. Nice, thanks for measurement. Honza
On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote: > On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote: > > Well... on my fsync-happy workloads, this seems to cut the barrier count down > > by about 20%, and speeds it up by about 20%. > > Care to share the test case for this? I'd be especially interesting on > how it behaves with non-draining barriers / cache flushes in fsync. Sure. When I run blktrace with the ffsb profile, I get these results: barriers transactions/sec 16212 206 15625 201 10442 269 10870 266 15658 201 Without Jan's patch: barriers transactions/sec 20855 177 20963 177 20340 174 20908 177 The two ~270 results are a little odd... if we ignore them, the net gain with Jan's patch is about a 25% reduction in barriers issued and about a 15% increase in tps. (If we don't, it's ~30% and ~30%, respectively.) That said, I was running mkfs between runs, so it's possible that the disk layout could have shifted a bit. If I turn off the fsync parts of the ffsb profile, the barrier counts drop to about a couple every second or so, which means that Jan's patch doesn't have much of an effect. But it does help if someone is hammering on the filesystem with fsync. The ffsb profile is attached below. --D ----------- time=300 alignio=1 directio=1 [filesystem0] location=/mnt/ num_files=100000 num_dirs=1000 reuse=1 # File sizes range from 1kB to 1MB. size_weight 1KB 10 size_weight 2KB 15 size_weight 4KB 16 size_weight 8KB 16 size_weight 16KB 15 size_weight 32KB 10 size_weight 64KB 8 size_weight 128KB 4 size_weight 256KB 3 size_weight 512KB 2 size_weight 1MB 1 create_blocksize=1048576 [end0] [threadgroup0] num_threads=64 readall_weight=4 create_fsync_weight=2 delete_weight=1 append_weight = 1 append_fsync_weight = 1 stat_weight = 1 create_weight = 1 writeall_weight = 1 writeall_fsync_weight = 1 open_close_weight = 1 write_size=64KB write_blocksize=512KB read_size=64KB read_blocksize=512KB [stats] enable_stats=1 enable_range=1 msec_range 0.00 0.01 msec_range 0.01 0.02 msec_range 0.02 0.05 msec_range 0.05 0.10 msec_range 0.10 0.20 msec_range 0.20 0.50 msec_range 0.50 1.00 msec_range 1.00 2.00 msec_range 2.00 5.00 msec_range 5.00 10.00 msec_range 10.00 20.00 msec_range 20.00 50.00 msec_range 50.00 100.00 msec_range 100.00 200.00 msec_range 200.00 500.00 msec_range 500.00 1000.00 msec_range 1000.00 2000.00 msec_range 2000.00 5000.00 msec_range 5000.00 10000.00 [end] [end0] -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From 4fe836d5554dc8cb0732af1f4e5b317d7e59febf Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Wed, 21 Jul 2010 19:01:51 +0200 Subject: [PATCH 2/2] ext4: Send barriers on fsync only when needed It isn't necessary to send a barrier to disk for fsync of file 'f' when we already sent one after all the data of 'f' have been written. Implement logic to detect this condition and avoid sending barrier in this case. We use counters of submitted and completed IO barriers for a block device. When a page is written to the block device, we store current number of barriers submitted in the inode. When we handle fsync, we check whether the number of completed barriers is at least that large. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/ext4.h | 4 +++- fs/ext4/fsync.c | 19 +++++++++++++++++-- fs/ext4/inode.c | 4 ++++ 3 files changed, 24 insertions(+), 3 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 19a4de5..cc67e72 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -832,10 +832,12 @@ struct ext4_inode_info { /* * Transactions that contain inode's metadata needed to complete - * fsync and fdatasync, respectively. + * fsync and fdatasync, respectively and barrier id when we last + * wrote data to this file. */ tid_t i_sync_tid; tid_t i_datasync_tid; + unsigned i_data_bid; }; /* diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 592adf2..d8a6995 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -57,6 +57,21 @@ static void ext4_sync_parent(struct inode *inode) } } +static int ext4_need_issue_data_flush(struct inode *inode) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; + int comp_bid, inode_bid = ei->i_data_bid; + + if (!(journal->j_flags & JBD2_BARRIER)) + return 0; + comp_bid = atomic_read(&inode->i_sb->s_bdev->bd_barriers_completed); + /* inode_bid < completed_bid safe against wrapping */ + if (inode_bid - comp_bid < 0) + return 0; + return 1; +} + /* * akpm: A new design for ext4_sync_file(). * @@ -126,11 +141,11 @@ int ext4_sync_file(struct file *file, int datasync) */ if (ext4_should_writeback_data(inode) && (journal->j_fs_dev != journal->j_dev) && - (journal->j_flags & JBD2_BARRIER)) + ext4_need_issue_data_flush(inode)) blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT); ret = jbd2_log_wait_commit(journal, commit_tid); - } else if (journal->j_flags & JBD2_BARRIER) + } else if (ext4_need_issue_data_flush(inode)) blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT); return ret; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 42272d6..8d57aae 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2758,6 +2758,10 @@ static int ext4_writepage(struct page *page, } else ret = block_write_full_page(page, noalloc_get_block_write, wbc); + /* Make sure we read current value of bd_barriers_sent */ + smp_rmb(); + EXT4_I(inode)->i_data_bid = + atomic_read(&inode->i_sb->s_bdev->bd_barriers_sent); return ret; } -- 1.6.4.2