
[RFC,v3] ext4: Combine barrier requests coming from fsync

Message ID 20100809195324.GG2109@tux1.beaverton.ibm.com
State Superseded, archived

Commit Message

Darrick J. Wong Aug. 9, 2010, 7:53 p.m. UTC
This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate if (0) no barriers are ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) we're actually in-progress on a barrier.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for 500us to see if there are any other
threads that are close behind in ext4_sync_file.  After that wait, the state
transitions to 2 and the barrier is issued.  Once that's done, the state goes
back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled.  Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.
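
In rough pseudo-C, the coordination boils down to the following (just a sketch
of the logic -- the real change is in the patch below):

static int barrier_wait = 500;	/* usec; a module parameter in the patch */

static void ext4_coordinated_flush(struct ext4_sb_info *sbi,
				   struct block_device *bdev)
{
again:
	spin_lock(&sbi->barrier_flag_lock);
	switch (sbi->in_barrier) {
	case 2:		/* a flush is already on the wire: wait, then retry */
		spin_unlock(&sbi->barrier_flag_lock);
		wait_for_completion(&sbi->barrier_completion);
		goto again;
	case 1:		/* another thread is collecting waiters: piggyback */
		spin_unlock(&sbi->barrier_flag_lock);
		wait_for_completion(&sbi->barrier_completion);
		break;
	default:	/* 0: become the issuing thread */
		sbi->in_barrier = 1;
		INIT_COMPLETION(sbi->barrier_completion);
		spin_unlock(&sbi->barrier_flag_lock);

		udelay(barrier_wait);	/* gather close-behind fsync callers */

		spin_lock(&sbi->barrier_flag_lock);
		sbi->in_barrier = 2;
		spin_unlock(&sbi->barrier_flag_lock);

		blkdev_issue_flush(bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
		complete_all(&sbi->barrier_completion);

		spin_lock(&sbi->barrier_flag_lock);
		sbi->in_barrier = 0;
		spin_unlock(&sbi->barrier_flag_lock);
	}
}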

Without Jan's prior barrier generation patch, I can run my 5min fsync-happy
workload and see the barrier count drop from ~150000 to ~37000, and the
transaction rate increase from ~630 to ~680.

With Jan's patch, I see the barrier count drop from ~18000 to ~12000, and the
transaction rate jumps from ~720 to ~760 when using this patch.

There are some things to clean up -- the wait duration probably ought to be a
mount option and not a module parameter, but this is just a test patch.  I'm
also not sure that it won't totally eat the performance of a single thread that
calls fsync frequently, since that thread would pay the full delay on every
call, which seems like bad behavior.  If you set the delay time to 0 you get
the old behavior.

I'm wondering at this point whether this is useful.  Ted, is this the sort of
fsync coordination that you were looking for?

---

 fs/ext4/ext4.h  |    5 +++++
 fs/ext4/fsync.c |   47 +++++++++++++++++++++++++++++++++++++++++++----
 fs/ext4/super.c |    4 ++++
 3 files changed, 52 insertions(+), 4 deletions(-)


Comments

Christoph Hellwig Aug. 9, 2010, 9:07 p.m. UTC | #1
Can you try with the new barrier implementation in the

	[PATCH, RFC] relaxed barriers

By making cache flushes just that and not complicated drain barriers, it
should speed this case up a lot.

Andreas Dilger Aug. 9, 2010, 9:19 p.m. UTC | #2
On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> This patch attempts to coordinate barrier requests being sent in by fsync.  Instead of each fsync call initiating its own barrier, there's now a flag to indicate if (0) no barriers are ongoing, (1) we're delaying a short time to collect other fsync threads, or (2) we're actually in-progress on a barrier.
> 
> So, if someone calls ext4_sync_file and no barriers are in progress, the flag shifts from 0->1 and the thread delays for 500us to see if there are any other threads that are close behind in ext4_sync_file.  After that wait, the state transitions to 2 and the barrier is issued.  Once that's done, the state goes back to 0 and a completion is signalled.

You shouldn't use a fixed delay for the thread.  500us _seems_ reasonable, if you have a single HDD.  If you have an SSD, or an NVRAM-backed array, then 2000 IOPS is a serious limitation.

What is done in the JBD2 code is to scale the commit sleep interval based on the average commit time.  In fact, the ext4_force_commit->...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code to merge journal commits.  It looks like we are duplicating some of this machinery in ext4_sync_file() already.

It seems like a better idea to have a single piece of code to wait to merge the IOs.  For the non-journal ext4 filesystems it should implement the wait for merges explicitly, otherwise it should defer the wait to jbd2.
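
As a rough illustration of what I mean (names here are made up -- jbd2 keeps
the equivalent running average for transaction commit times):

#include <linux/ktime.h>
#include <linux/kernel.h>

static unsigned long avg_flush_ns;	/* updated after every cache flush */

/* wait at most a fraction of the typical flush time, capped at 500us */
static unsigned long fsync_gather_delay_us(void)
{
	unsigned long us = avg_flush_ns / (4 * NSEC_PER_USEC);

	return min(us, 500UL);
}

static void record_flush_time(ktime_t start, ktime_t end)
{
	unsigned long ns = ktime_to_ns(ktime_sub(end, start));

	/* exponentially weighted moving average: 7/8 old + 1/8 new */
	avg_flush_ns = (avg_flush_ns * 7 + ns) / 8;
}

That way the gather window shrinks automatically on fast devices instead of
pinning everyone to 2000 flushes per second.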

Cheers, Andreas

Darrick J. Wong Aug. 9, 2010, 11:38 p.m. UTC | #3
On Mon, Aug 09, 2010 at 05:19:22PM -0400, Andreas Dilger wrote:
> On 2010-08-09, at 15:53, Darrick J. Wong wrote:
> > This patch attempts to coordinate barrier requests being sent in by fsync.
> > Instead of each fsync call initiating its own barrier, there's now a flag
> > to indicate if (0) no barriers are ongoing, (1) we're delaying a short time
> > to collect other fsync threads, or (2) we're actually in-progress on a
> > barrier.
> > 
> > So, if someone calls ext4_sync_file and no barriers are in progress, the
> > flag shifts from 0->1 and the thread delays for 500us to see if there are
> > any other threads that are close behind in ext4_sync_file.  After that
> > wait, the state transitions to 2 and the barrier is issued.  Once that's
> > done, the state goes back to 0 and a completion is signalled.
> 
> You shouldn't use a fixed delay for the thread.  500us _seems_ reasonable, if
> you have a single HDD.  If you have an SSD, or an NVRAM-backed array, then
> 2000 IOPS is a serious limitation.

2000 fsyncs per second, anyway.  I wasn't explicitly trying to limit any other
types of IO.

> What is done in the JBD2 code is to scale the commit sleep interval based on
> the average commit time.  In fact, the ext4_force_commit->
> ...->jbd2_journal_force_commit() call will itself be waiting in the jbd2 code
> to merge journal commits.  It looks like we are duplicating some of this
> machinery in ext4_sync_file() already.

I actually picked 500us arbitrarily because it seemed to work, even for SSDs.
It was a convenient test vehicle, and not much more.  That said, I like your
recommendation much better.  I'll look into that.

> It seems like a better idea to have a single piece of code to wait to merge
> the IOs.  For the non-journal ext4 filesystems it should implement the wait
> for merges explicitly, otherwise it should defer the wait to jbd2.

I wondered if this would have been better off in the block layer than in ext4.
Though I suppose that would imply two kinds of flush: flush-immediately and
flush-shortly.  I intend to try those flush-drain elimination patches before I
think about this much more.
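
Purely as a hypothetical sketch (no such interface exists today -- the
fallback below only shows the intent):

#include <linux/blkdev.h>

enum blk_flush_kind {
	BLK_FLUSH_NOW,		/* drain and flush immediately */
	BLK_FLUSH_SHORTLY,	/* may be coalesced with other pending flushes */
};

static int blkdev_issue_flush_hinted(struct block_device *bdev, gfp_t gfp,
				     sector_t *error_sector,
				     enum blk_flush_kind kind)
{
	/*
	 * A real implementation would queue BLK_FLUSH_SHORTLY requests and
	 * merge them inside the block layer; this sketch just falls back to
	 * the existing synchronous flush for both kinds.
	 */
	return blkdev_issue_flush(bdev, gfp, error_sector, BLKDEV_IFL_WAIT);
}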

--D
Darrick J. Wong Aug. 16, 2010, 4:14 p.m. UTC | #4
On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> Can you try with the new barrier implementation in the
> 
> 	[PATCH, RFC] relaxed barriers
> 
> By making cache flushes just that and not complicated drain barriers, it
> should speed this case up a lot.

Indeed it does!  The barrier count increases to about 21000, but I also see
much higher throughput, about 830 transactions per second (versus 12000 and 760
respectively before Tejun's patch).

--D
Darrick J. Wong Aug. 19, 2010, 2:07 a.m. UTC | #5
On Mon, Aug 16, 2010 at 09:14:33AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 09, 2010 at 05:07:23PM -0400, Christoph Hellwig wrote:
> > Can you try with the new barrier implementation in the
> > 
> > 	[PATCH, RFC] relaxed barriers
> > 
> > By making cache flushes just that and not complicated drain barriers, it
> > should speed this case up a lot.
> 
> Indeed it does!  The barrier count increases to about 21000, but I also see
> much higher throughput, about 830 transactions per second (versus 12000 and 760
> respectively before Tejun's patch).

Oddly, I ran the entire suite of tests against a larger set of machines, and
with Tejun's RFC patchset I didn't see nearly as much of an improvement.  I
have been trying to put together a new tree based on "replace barrier with
sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
gotten lost in the weeds. :(

I also experienced some sort of crash with Tejun's relaxed barrier patch on one
of my systems.  I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

Rather than hold on to (aging) test results any further, I'll be posting a new
fsync coordination patch shortly that includes Andreas' suggestion to use the
average barrier time instead of a static 500us, along with a spreadsheet that
shows what happens with the various patches on a wider range of hardware.

--D
Christoph Hellwig Aug. 19, 2010, 8:53 a.m. UTC | #6
On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
> Oddly, I ran the entire suite of tests against a larger set of machines, and
> with Tejun's RFC patchset I didn't see nearly as much of an improvement.  I
> have been trying to put together a new tree based on "replace barrier with
> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
> gotten lost in the weeds. :(

Tejun's patches don't allow concurrent cache flushes to happen, while
my patch did.  Tejun said there are drivers that can't handle empty
flushes with a bio attached, making this necessary.

Tejun, any idea what drivers that would be?

> I also experienced some sort of crash with Tejun's relaxed barrier patch on one
> of my systems.  I was hitting the BUG_ON in drivers/scsi/scsi_lib.c, line 1115.

My kernel source doesn't have a BUG_ON line there, only one two lines
above.  A req->nr_phys_segments that's zero sounds a bit like empty
flush requests; I'll need to look into it again.

Tejun Heo Aug. 19, 2010, 9:17 a.m. UTC | #7
Hello,

On 08/19/2010 10:53 AM, Christoph Hellwig wrote:
> On Wed, Aug 18, 2010 at 07:07:34PM -0700, Darrick J. Wong wrote:
>> Oddly, I ran the entire suite of tests against a larger set of machines, and
>> with Tejun's RFC patchset I didn't see nearly as much of an improvement.  I
>> have been trying to put together a new tree based on "replace barrier with
>> sequenced flush" and Christoph's "explicitly flush/FUA" patch sets, though I've
>> gotten lost in the weeds. :(
> 
> Tejun's patches don't allow concurrent cache flushes to happen, while
> my patch did.  Tejun said there are drivers that can't handly empty
> flushes with a bio attached, making this nessecary.   
> 
> Tejun, any idea what drivers that would be?

What was the configuration?  If dm was involved, both md and dm can
only process a single flush request at a time.  Supporting multiple
flushes in flight wouldn't be too difficult.  It's just the way things
are implemented with the previous barrier implementation.  It can
definitely be improved.

>> I also experienced some sort of crash with Tejun's relaxed barrier
>> patch on one of my systems.  I was hitting the BUG_ON in
>> drivers/scsi/scsi_lib.c, line 1115.
> 
> My kernel source doesn't have a BUG_ON line there, but only one two
> lines above.  A req->nr_phys_segments that's zero sounds a bit like
> empty flush requests, I'll need to look into it again.

Yeah, it definitely sounds like a REQ_FS|REQ_FLUSH request causing the problem.
Can you post the kernel log?

Thanks.
Tejun Heo Aug. 19, 2010, 3:48 p.m. UTC | #8
Hello, again.

On 08/19/2010 11:17 AM, Tejun Heo wrote:
> What was the configuration?  If dm was involved, both md and dm can
> only process a single flush request at a time.  Supporting multiple
> flushes in flight wouldn't be too difficult.  It's just the way things
> are implemented with the previous barrier implementation.  It can
> definitely be improved.

Oh, I just realized that the current bio-based dm implementation
doesn't allow other bios to be processed while a flush is in progress.
So, it's not about not being able to handle multiple flushes.  The
queue just stalls while a flush is in progress, and because dm also
waits for flush completion by waiting for all in-flight commands to
finish, it basically ends up draining and stalling everything.
Shouldn't be too hard to make it better.

Thanks.

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..e51535a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,11 @@  struct ext4_sb_info {
 
 	/* workqueue for dio unwritten */
 	struct workqueue_struct *dio_unwritten_wq;
+
+	/* fsync barrier reduction */
+	spinlock_t barrier_flag_lock;
+	int in_barrier;
+	struct completion barrier_completion;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..c5afb45 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -57,6 +57,10 @@  static void ext4_sync_parent(struct inode *inode)
 	}
 }
 
+static int barrier_wait = 500;
+module_param(barrier_wait, int, 0644);
+MODULE_PARM_DESC(barrier_wait, "fsync barrier wait time (usec)");
+
 /*
  * akpm: A new design for ext4_sync_file().
  *
@@ -75,7 +79,8 @@  int ext4_sync_file(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+	journal_t *journal = sb->s_journal;
 	int ret;
 	tid_t commit_tid;
 
@@ -130,8 +135,42 @@  int ext4_sync_file(struct file *file, int datasync)
 			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
 					NULL, BLKDEV_IFL_WAIT);
 		ret = jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
-		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
-			BLKDEV_IFL_WAIT);
+	} else if (journal->j_flags & JBD2_BARRIER) {
+		if (!barrier_wait) {
+			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+				BLKDEV_IFL_WAIT);
+			goto fast_out;
+		}
+again:
+		spin_lock(&sb->barrier_flag_lock);
+		if (sb->in_barrier == 2) {
+			spin_unlock(&sb->barrier_flag_lock);
+			ret = wait_for_completion_interruptible(&sb->barrier_completion);
+			goto again;
+		} else if (sb->in_barrier) {
+			spin_unlock(&sb->barrier_flag_lock);
+			ret = wait_for_completion_interruptible(&sb->barrier_completion);
+		} else {
+			sb->in_barrier = 1;
+			INIT_COMPLETION(sb->barrier_completion);
+			spin_unlock(&sb->barrier_flag_lock);
+
+			udelay(barrier_wait);
+
+			spin_lock(&sb->barrier_flag_lock);
+			sb->in_barrier = 2;
+			spin_unlock(&sb->barrier_flag_lock);
+
+			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+			complete_all(&sb->barrier_completion);
+
+			spin_lock(&sb->barrier_flag_lock);
+			sb->in_barrier = 0;
+			spin_unlock(&sb->barrier_flag_lock);
+		}
+fast_out:
+		if (!ret)
+			ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_DATA);
+	}
 	return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..0dc96d2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3045,6 +3045,10 @@  no_journal:
 	ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
 		"Opts: %s", descr, orig_data);
 
+	EXT4_SB(sb)->in_barrier = 0;
+	spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+	init_completion(&EXT4_SB(sb)->barrier_completion);
+
 	lock_kernel();
 	kfree(orig_data);
 	return 0;