
[rfc,v2] ext3/4: enhance fsync performance when using cfq

Message ID x49ochv7yi3.fsf@segfault.boston.devel.redhat.com
State New, archived

Commit Message

Jeff Moyer April 7, 2010, 9:18 p.m. UTC
Hi again,

So, here's another stab at fixing this.  This patch is very much an RFC,
so do not pull it into anything bound for Linus.  ;-)  For those new to
this topic, here is the original posting:  http://lkml.org/lkml/2010/4/1/344

The basic problem is that, when running iozone on smallish files (up to
8MB in size) and including fsync in the timings, deadline outperforms
CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB
files.  From examining the blktrace data, it appears that iozone will
issue an fsync() call, and will have to wait until its CFQ timeslice
has expired before the journal thread can run to actually commit data to
disk.

The approach below puts an explicit call into the filesystem-specific
fsync code to yield the disk so that the jbd[2] process has a chance to
issue I/O.  This brings performance of CFQ in line with deadline.

There is one outstanding issue with the patch that Vivek pointed out.
Basically, this could starve out the sync-noidle workload if there is a
lot of fsync-ing going on.  I'll address that in a follow-on patch.  For
now, I wanted to get the idea out there for others to comment on.

Thanks a ton to Vivek for spotting the problem with the initial
approach, and for his continued review.

Cheers,
Jeff

# git diff --stat
 block/blk-core.c            |   15 +++++++++++++++
 block/cfq-iosched.c         |   38 +++++++++++++++++++++++++++++++++++++-
 block/elevator.c            |    8 ++++++++
 fs/ext3/fsync.c             |    5 ++++-
 fs/ext4/fsync.c             |    6 +++++-
 include/linux/backing-dev.h |   13 +++++++++++++
 include/linux/blkdev.h      |    1 +
 include/linux/elevator.h    |    3 +++
 8 files changed, 86 insertions(+), 3 deletions(-)
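[Editorial note: as a rough illustration of the plumbing the diffstat above implies, here is a userspace toy model. The fsync path asks the backing device to yield, the block layer forwards the hint through the queue to the elevator's yield hook, and only schedulers that implement the hook (CFQ here) act on it. All struct layouts and names below are simplified stand-ins for illustration, not the actual kernel interfaces from the patch.]

```c
#include <assert.h>
#include <stddef.h>

struct request_queue;

/* An elevator may optionally provide a yield hook; schedulers that do
 * not idle (deadline, noop) would simply leave it NULL. */
struct elevator_ops {
    void (*elevator_yield_fn)(struct request_queue *q);
};

struct request_queue {
    struct elevator_ops *elevator;
    int yields;                 /* test counter, not in the real patch */
};

struct backing_dev_info {
    struct request_queue *queue;
};

/* elv_yield(): forward the hint to the scheduler if it opted in. */
static void elv_yield(struct request_queue *q)
{
    if (q->elevator && q->elevator->elevator_yield_fn)
        q->elevator->elevator_yield_fn(q);
}

/* blk_yield(): what the filesystem fsync code would call via the bdi. */
static void blk_yield(struct backing_dev_info *bdi)
{
    if (bdi->queue)
        elv_yield(bdi->queue);
}

/* Stand-in for cfq_yield(): just record that we were asked to yield. */
static void toy_cfq_yield(struct request_queue *q)
{
    q->yields++;
}
```

The point of routing through a function pointer is that the hint stays advisory: schedulers with no idling logic pay nothing and need no changes.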


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Vivek Goyal April 7, 2010, 9:46 p.m. UTC | #1
On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
> [..]

Thanks Jeff. Conceptually this approach makes a lot of sense to me: higher
layers explicitly telling CFQ not to idle, and to yield the slice instead.

My firefox timing test is performing much better now.

real	0m15.957s
user	0m0.608s
sys	0m0.165s

real	0m12.984s
user	0m0.602s
sys	0m0.148s

real	0m13.057s
user	0m0.624s
sys	0m0.145s

So we now have two issues to take care of:

- Make it work with dm/md devices also. We shall somehow have to propagate
  this yield semantic down the stack.

- The issue of making sure we don't yield if we are servicing sync-noidle
  tree and there is other IO going on which relies on sync-noidle tree
  idling (as you have already mentioned).


[..]
> +static void cfq_yield(struct request_queue *q)
> +{
> +	struct cfq_data *cfqd = q->elevator->elevator_data;
> +	struct cfq_io_context *cic;
> +	struct cfq_queue *cfqq;
> +	unsigned long flags;
> +
> +	cic = cfq_cic_lookup(cfqd, current->io_context);
> +	if (!cic)
> +		return;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +
> +	/*
> +	 * This is primarily called to ensure that the long synchronous
> +	 * time slice does not prevent other I/O happenning (like journal
> +	 * commits) while we idle waiting for it.  Thus, check to see if the
> +	 * current cfqq is the sync cfqq for this process.
> +	 */
> +	cfqq = cic_to_cfqq(cic, 1);
> +	if (!cfqq)
> +		goto out_unlock;
> +
> +	if (cfqd->active_queue != cfqq)
> +		goto out_unlock;
> +
> +	cfq_log_cfqq(cfqd, cfqq, "yielding queue");
> +
> +	__cfq_slice_expired(cfqd, cfqq, 1);
> +	__blk_run_queue(cfqd->queue);

I think it would be good if we also checked whether cfqq is empty. If cfqq
is not empty, we don't want to give up the slice. But this is a minor point:
at least in the case of fsync, we seem to wait for all the iozone requests
to finish before calling yield, so cfqq should be empty at that point.

Vivek
Jens Axboe April 8, 2010, 11 a.m. UTC | #2
On Wed, Apr 07 2010, Jeff Moyer wrote:
> [..]

I like the concept, it's definitely useful (and your results amply
demonstrate that). I was wondering if there was a way in through the ioc
itself, rather than bdi -> queue like you are doing. But I can't
think of a nice way to do it, so this is probably as good as it gets.

> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index dee9d93..a76ccd1 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2065,7 +2065,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>  	cfqd->serving_group = cfqg;
>  
>  	/* Restore the workload type data */
> -	if (cfqg->saved_workload_slice) {
> +	if (cfqg && cfqg->saved_workload_slice) {
>  		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
>  		cfqd->serving_type = cfqg->saved_workload;
>  		cfqd->serving_prio = cfqg->saved_serving_prio;

Unrelated change?

> +static void cfq_yield(struct request_queue *q)
> +{
> +	struct cfq_data *cfqd = q->elevator->elevator_data;
> +	struct cfq_io_context *cic;
> +	struct cfq_queue *cfqq;
> +	unsigned long flags;
> +
> +	cic = cfq_cic_lookup(cfqd, current->io_context);
> +	if (!cic)
> +		return;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);

spin_lock_irq() is sufficient here.
Jens Axboe April 8, 2010, 11:04 a.m. UTC | #3
On Wed, Apr 07 2010, Vivek Goyal wrote:
> On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
> > [..]
> 
> So we got to take care of two issues now.
> 
> - Make it work with dm/md devices also. Somehow shall have to propogate
>   this yield semantic down the stack.

The way that Jeff set it up, it's completely parallel to e.g. congestion
or unplugging. So that should be easily doable.

> - The issue of making sure we don't yield if we are servicing sync-noidle
>   tree and there is other IO going on which relies on sync-noidle tree
>   idling (as you have already mentioned).

Agree.

> > +	/*
> > +	 * This is primarily called to ensure that the long synchronous
> > +	 * time slice does not prevent other I/O happenning (like journal
> > +	 * commits) while we idle waiting for it.  Thus, check to see if the
> > +	 * current cfqq is the sync cfqq for this process.
> > +	 */
> > +	cfqq = cic_to_cfqq(cic, 1);
> > +	if (!cfqq)
> > +		goto out_unlock;
> > +
> > +	if (cfqd->active_queue != cfqq)
> > +		goto out_unlock;
> > +
> > +	cfq_log_cfqq(cfqd, cfqq, "yielding queue");
> > +
> > +	__cfq_slice_expired(cfqd, cfqq, 1);
> > +	__blk_run_queue(cfqd->queue);
> 
> I think it would be good if we also check that cfqq is empty or not. If cfqq
> is not empty we don't want to give up slice? But this is a minor point. At
> least in case of fsync, we seem to be waiting for all the iozone request
> to finish and then calling yield. So cfqq should be empty at this point of
> time. 

That's a bit of a problem. Say the queue has one request left: it'll
service that and then start idling. We already got the hint that we
should yield, but since it arrived before we reached the idling state,
we'll not yield. Perhaps it would be better to mark the cfqq as yielding
if it isn't ready to yield just yet.
Vivek Goyal April 8, 2010, 1:59 p.m. UTC | #4
On Thu, Apr 08, 2010 at 01:00:45PM +0200, Jens Axboe wrote:
> On Wed, Apr 07 2010, Jeff Moyer wrote:
> > [..]
> 
> I like the concept, it's definitely useful (and your results amply
> demonstrate that). I was thinking if there was a way in through the ioc
> itself, rather than bdi -> queue and like you are doing. But I can't
> think of a nice way to do it, so this is probably as good as it gets.
> 

I think one issue with the ioc-based approach is that it will then call the
yield operation on all the devices in the system where this context has ever
done any IO. With the bdi-based approach, this call remains limited to
a smaller set of devices.

Thanks
Vivek
Jens Axboe April 8, 2010, 2:03 p.m. UTC | #5
On Thu, Apr 08 2010, Vivek Goyal wrote:
> On Thu, Apr 08, 2010 at 01:00:45PM +0200, Jens Axboe wrote:
> > [..]
> > I like the concept, it's definitely useful (and your results amply
> > demonstrate that). I was thinking if there was a way in through the ioc
> > itself, rather than bdi -> queue and like you are doing. But I can't
> > think of a nice way to do it, so this is probably as good as it gets.
> > 
> 
> I think, one issue with ioc based approach will be that it will then call
> yield operation on all the devices in the system where this context has ever
> done any IO. With bdi based approach this call will remain limited to
> a smaller set of devices.

Oh, you'd want the bdi as well. And as I said, I don't think it was
workable, just trying to think it over and consider potentially other
ways to accomplish this.

At one point I had a patch that did the equivalent of this yield on
being scheduled out on the CPU side, which is probably why I was in the
ioc mindset.
Jeff Moyer April 8, 2010, 2:03 p.m. UTC | #6
Vivek Goyal <vgoyal@redhat.com> writes:

> [..]
>
> I think, one issue with ioc based approach will be that it will then call
> yield operation on all the devices in the system where this context has ever
> done any IO. With bdi based approach this call will remain limited to
> a smaller set of devices.

Which actually brings up the question of whether this needs some
knowledge of whether the journal is on the same device as the file
system!  In such a case, we need not yield.  I think I'll stick my head
in the sand for this one.  ;-)

Cheers,
Jeff
Vivek Goyal April 8, 2010, 2:05 p.m. UTC | #7
On Thu, Apr 08, 2010 at 01:04:42PM +0200, Jens Axboe wrote:
> > [..]
> >
> > So we got to take care of two issues now.
> > 
> > - Make it work with dm/md devices also. Somehow shall have to propogate
> >   this yield semantic down the stack.
> 
> The way that Jeff set it up, it's completely parallel to eg congestion
> or unplugging. So that should be easily doable.
> 

Ok, so the various dm targets now need to define a "yield_fn" and propagate
the yield call to all the component devices.
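[Editorial note: a toy model of that propagation. A stacked (dm/md-style) driver's yield would simply fan the hint out to each underlying queue; the structures and names below are invented for illustration, not dm's real interfaces.]

```c
#include <assert.h>

#define MAX_COMPONENTS 8

/* Stand-in for a component device's request queue. */
struct component_queue {
    int yield_count;            /* how many times we were told to yield */
};

/* Stand-in for a stacked dm/md device built from component queues. */
struct stacked_device {
    struct component_queue *components[MAX_COMPONENTS];
    int nr_components;
};

/* Per-component yield, standing in for the component elevator's yield_fn. */
static void component_yield(struct component_queue *q)
{
    q->yield_count++;
}

/* What a dm target's yield_fn might do: propagate to every component. */
static void stacked_yield(struct stacked_device *dev)
{
    for (int i = 0; i < dev->nr_components; i++)
        component_yield(dev->components[i]);
}
```

Since yield is only a hint, blindly fanning it out is safe even when some components are not the ones idling.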

> > - The issue of making sure we don't yield if we are servicing sync-noidle
> >   tree and there is other IO going on which relies on sync-noidle tree
> >   idling (as you have already mentioned).
> 
> Agree.
> 
> > > [..]
> 
> > I think it would be good if we also check that cfqq is empty or not. If cfqq
> > is not empty we don't want to give up slice? But this is a minor point. At
> > least in case of fsync, we seem to be waiting for all the iozone request
> > to finish and then calling yield. So cfqq should be empty at this point of
> > time. 
> 
> That's a bit of a problem. Say the queue has one request left, it'll
> service that and the start idling. But we already got this hint that we
> should yield, but as it happened before we reached the idling state,
> we'll not yield. Perhaps it would be better to mark the cfqq as yielding
> if it isn't ready to yield just yet.

That makes sense. So another cfqq flag is in order: as soon as all the
requests from the cfqq are dispatched, yield the slice instead of idling,
and clear the flag when the slice expires.
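[Editorial note: a tiny state-machine model of that proposed flag, just to pin down the semantics being discussed. All names are hypothetical; this is not the patch's code.]

```c
#include <assert.h>

/* Toy model of the "yielding" flag on a cfqq: if the queue still has
 * requests when the yield hint arrives, remember it; once the last
 * request is dispatched, expire the slice instead of idling; clear the
 * flag whenever the slice expires. */
enum slice_state { SLICE_ACTIVE, SLICE_IDLING, SLICE_EXPIRED };

struct toy_cfqq {
    int nr_queued;               /* requests still sitting in the queue */
    int yield_pending;           /* the proposed per-cfqq flag */
    enum slice_state state;
};

static void slice_expire(struct toy_cfqq *q)
{
    q->state = SLICE_EXPIRED;
    q->yield_pending = 0;        /* cleared when the slice expires */
}

/* The yield hint from fsync: act now if drained, else set the flag. */
static void request_yield(struct toy_cfqq *q)
{
    if (q->nr_queued == 0)
        slice_expire(q);
    else
        q->yield_pending = 1;
}

/* Dispatch one request; on draining, yield instead of starting to idle. */
static void dispatch_one(struct toy_cfqq *q)
{
    if (q->nr_queued == 0)
        return;
    if (--q->nr_queued == 0) {
        if (q->yield_pending)
            slice_expire(q);
        else
            q->state = SLICE_IDLING;
    }
}
```

This captures the race being discussed: a hint that arrives while a request is still in flight is not lost, it just takes effect at the moment idling would otherwise begin.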

Vivek
Jens Axboe April 8, 2010, 2:06 p.m. UTC | #8
On Thu, Apr 08 2010, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> [..]
> > I think, one issue with ioc based approach will be that it will then call
> > yield operation on all the devices in the system where this context has ever
> > done any IO. With bdi based approach this call will remain limited to
> > a smaller set of devices.
> 
> Which actually brings up the question of whether this needs some
> knowledge of whether the journal is on the same device as the file
> system!  In such a case, we need not yield.  I think I'll stick my head
> in the sand for this one.  ;-)

Yes, that is true. But that should be relatively easy to check.
Jens Axboe April 8, 2010, 2:09 p.m. UTC | #9
On Thu, Apr 08 2010, Vivek Goyal wrote:
> On Thu, Apr 08, 2010 at 01:04:42PM +0200, Jens Axboe wrote:
> > [..]
> Ok, so various dm targets now need to define "yield_fn" and propogate the
> yield call to all the component devices. 

Exactly.

> [..]
> 
> That makes sense. So another cfqq flag in place. As soon as all the
> request from cfqq are dispatched, yield the slice instead of idling. And
> clear the flag when slice is expiring.

Precisely. The next question would be how to control the yielding. In
this particular case, you want to be yielding to a specific cfqq. IOW,
you essentially want to pass your slice on to that queue. The way the
above is implemented, you could easily just switch to another unrelated
queue. And if that is done, fairness is skewed without helping the
yielding process at all (which was the intention).
Vivek Goyal April 8, 2010, 2:10 p.m. UTC | #10
On Thu, Apr 08, 2010 at 10:03:24AM -0400, Jeff Moyer wrote:
> [..]
> 
> Which actually brings up the question of whether this needs some
> knowledge of whether the journal is on the same device as the file
> system!  In such a case, we need not yield.  I think I'll stick my head
> in the sand for this one.  ;-)

Jeff, even if the journal is not on the same device, what harm could yielding
do? Anyway, there is no IO on that queue and we are idling. The only side
effect is that the yielding process could lose a bit if it immediately submits
more IO after the fsync: because the process has yielded its slice, it is back
in the queue instead of doing more IO in the current slice immediately.

Vivek
Vivek Goyal April 8, 2010, 2:17 p.m. UTC | #11
On Thu, Apr 08, 2010 at 04:09:44PM +0200, Jens Axboe wrote:

[..]
> Precisely. The next question would be how to control the yielding. In
> this particular case, you want to be yielding to a specific cfqq. IOW,
> you essentially want to pass your slide on to that queue. The way the
> above is implemented, you could easily just switch to another unrelated
> queue. And if that is done, fairness is skewed without helping the
> yielding process at all (which was the intention).
> 

True. I guess this is a relatively simple yield implementation where we are
telling the IO scheduler that no more IO is coming in this context, so it
should not waste time idling. It is a separate matter that, after giving up
the slice, CFQ might schedule in another sequential reader instead of the
journalling thread.

Ideally it would be better to also be able to specify whom to transfer the
remaining slice to. I am not sure there is an easy way to pass this info to
CFQ.

So this implementation might be a good first step.

Vivek
Jeff Moyer April 8, 2010, 2:24 p.m. UTC | #12
Jens Axboe <jens.axboe@oracle.com> writes:

> Precisely. The next question would be how to control the yielding. In
> this particular case, you want to be yielding to a specific cfqq. IOW,
> you essentially want to pass your slide on to that queue. The way the
> above is implemented, you could easily just switch to another unrelated
> queue. And if that is done, fairness is skewed without helping the
> yielding process at all (which was the intention).

Well, that's true in part.  Prior to this patch, the process would idle,
keeping all other cfq_queues on the system from making progress.  With
this patch, at least *somebody* else makes progress, getting you closer
to running the journal thread that you're blocked on.  Ideally, you'd
want the thread you're waiting on to get disk time next, sure.  You
would have to pass the process information down to the I/O scheduler for
that, and I'm not sure that the file system code knows which process to
hand off to.  Does it?

Do we really want to go down this road at all?  I'm not convinced.

Cheers,
Jeff
Jeff Moyer April 8, 2010, 2:25 p.m. UTC | #13
Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Apr 08, 2010 at 10:03:24AM -0400, Jeff Moyer wrote:
>> Which actually brings up the question of whether this needs some
>> knowledge of whether the journal is on the same device as the file
>> system!  In such a case, we need not yield.  I think I'll stick my head
>> in the sand for this one.  ;-)
>
> Jeff, even if the journal is not on the same device, what harm could yielding do?
> Anyway there is no IO on that queue and we are idling. The only side effect is
> that the yielding process could lose a bit if, after fsync, it immediately submits
> more IO. Because this process has yielded its slice, it is back in the queue
> instead of doing more IO in the current slice immediately.

What happens if the journal is on a super fast device, and finishes up
very quickly allowing our process to initiate more I/O within the idle
window?

Cheers,
Jeff
Vivek Goyal April 8, 2010, 2:31 p.m. UTC | #14
On Thu, Apr 08, 2010 at 10:25:50AM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Thu, Apr 08, 2010 at 10:03:24AM -0400, Jeff Moyer wrote:
> >> Which actually brings up the question of whether this needs some
> >> knowledge of whether the journal is on the same device as the file
> >> system!  In such a case, we need not yield.  I think I'll stick my head
> >> in the sand for this one.  ;-)
> >
> > Jeff, even if the journal is not on the same device, what harm could yielding do?
> > Anyway there is no IO on that queue and we are idling. The only side effect is
> > that the yielding process could lose a bit if, after fsync, it immediately submits
> > more IO. Because this process has yielded its slice, it is back in the queue
> > instead of doing more IO in the current slice immediately.
> 
> What happens if the journal is on a super fast device, and finishes up
> very quickly allowing our process to initiate more I/O within the idle
> window?
> 

You lose. :-) But at the same time, if the journalling device is not fast
enough and you can't submit the next IO within the idle window, then you are
unnecessarily keeping the disk idle and preventing others from making
progress.

Vivek
Jeff Moyer April 8, 2010, 7:10 p.m. UTC | #15
Jens Axboe <jens.axboe@oracle.com> writes:

>> @@ -2065,7 +2065,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>>  	cfqd->serving_group = cfqg;
>>  
>>  	/* Restore the workload type data */
>> -	if (cfqg->saved_workload_slice) {
>> +	if (cfqg && cfqg->saved_workload_slice) {
>>  		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
>>  		cfqd->serving_type = cfqg->saved_workload;
>>  		cfqd->serving_prio = cfqg->saved_serving_prio;
>
> Unrelated change?

Probably not needed for this incarnation of the patch, though previous
iterations would Oops on boot w/o this.  If you look through all of the
code in this code path, cfqg == NULL seems to be handled, so it's
probably safe to take this.  I'll pull it out of this patch, though.

>> +static void cfq_yield(struct request_queue *q)
>> +{
>> +	struct cfq_data *cfqd = q->elevator->elevator_data;
>> +	struct cfq_io_context *cic;
>> +	struct cfq_queue *cfqq;
>> +	unsigned long flags;
>> +
>> +	cic = cfq_cic_lookup(cfqd, current->io_context);
>> +	if (!cic)
>> +		return;
>> +
>> +	spin_lock_irqsave(q->queue_lock, flags);
>
> spin_lock_irq() is sufficient here.

OK, thanks!

Jeff
Jens Axboe April 8, 2010, 7:23 p.m. UTC | #16
On Thu, Apr 08 2010, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > Precisely. The next question would be how to control the yielding. In
> > this particular case, you want to be yielding to a specific cfqq. IOW,
> > you essentially want to pass your slice on to that queue. The way the
> > above is implemented, you could easily just switch to another unrelated
> > queue. And if that is done, fairness is skewed without helping the
> > yielding process at all (which was the intention).
> 
> Well, that's true in part.  Prior to this patch, the process would idle,
> keeping all other cfq_queues on the system from making progress.  With
> this patch, at least *somebody* else makes progress, getting you closer
> to running the journal thread that you're blocked on.  Ideally, you'd
> want the thread you're waiting on to get disk time next, sure.  You
> would have to pass the process information down to the I/O scheduler for
> that, and I'm not sure that the file system code knows which process to
> hand off to.  Does it?
> 
> Do we really want to go down this road at all?  I'm not convinced.

Don't get me wrong, neither am I. I'm just thinking out loud and
pondering. As a general mechanism, yield to a specific cfqq is going to
be tricky and doing a generic yield to signal that _someone_ else must
make progress before we can is better than nothing.

Continuing that train of thought, I don't think we'll ever need full
'yield to X' functionality where 'X' is a really specific target. But
for this fsync case, we know at least what we are yielding to and it
seems like a shame to throw away that information. So you could include
a hint of what to yield to, which cfq could factor in.

Dunno, I need to think a bit about the cleanliness of such an approach.
We can definitely use your patch as a starting point.
Mike Snitzer April 21, 2010, 8:42 p.m. UTC | #17
On Thu, Apr 8, 2010 at 10:09 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Thu, Apr 08 2010, Vivek Goyal wrote:
>> On Thu, Apr 08, 2010 at 01:04:42PM +0200, Jens Axboe wrote:
>> > On Wed, Apr 07 2010, Vivek Goyal wrote:
>> > > On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
...
>> > > So we got to take care of two issues now.
>> > >
>> > > - Make it work with dm/md devices also. Somehow shall have to propagate
>> > >   this yield semantic down the stack.
>> >
>> > The way that Jeff set it up, it's completely parallel to eg congestion
>> > or unplugging. So that should be easily doable.
>> >
>>
>> Ok, so various dm targets now need to define "yield_fn" and propagate the
>> yield call to all the component devices.
>
> Exactly.

To do so doesn't DM (and MD) need a blk_queue_yield() setter to
establish its own yield_fn?  The established dm_yield_fn would call
blk_yield() for all real devices in a given DM target.  Something like
how blk_queue_merge_bvec() or blk_queue_make_request() allow DM to
provide functional extensions.

I'm not seeing such a yield_fn hook for stacking drivers to use. And
as is, jbd and jbd2 just call blk_yield() directly and there is no way
for the block layer to call into DM.

What am I missing?

Thanks,
Mike
Jeff Moyer April 21, 2010, 8:52 p.m. UTC | #18
Mike Snitzer <snitzer@redhat.com> writes:

> On Thu, Apr 8, 2010 at 10:09 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> On Thu, Apr 08 2010, Vivek Goyal wrote:
>>> On Thu, Apr 08, 2010 at 01:04:42PM +0200, Jens Axboe wrote:
>>> > On Wed, Apr 07 2010, Vivek Goyal wrote:
>>> > > On Wed, Apr 07, 2010 at 05:18:12PM -0400, Jeff Moyer wrote:
> ...
>>> > > So we got to take care of two issues now.
>>> > >
>>> > > - Make it work with dm/md devices also. Somehow shall have to propagate
>>> > >   this yield semantic down the stack.
>>> >
>>> > The way that Jeff set it up, it's completely parallel to eg congestion
>>> > or unplugging. So that should be easily doable.
>>> >
>>>
>>> Ok, so various dm targets now need to define "yield_fn" and propagate the
>>> yield call to all the component devices.
>>
>> Exactly.
>
> To do so doesn't DM (and MD) need a blk_queue_yield() setter to
> establish its own yield_fn?  The established dm_yield_fn would call
> blk_yield() for all real devices in a given DM target.  Something like
> how blk_queue_merge_bvec() or blk_queue_make_request() allow DM to
> provide functional extensions.
>
> I'm not seeing such a yield_fn hook for stacking drivers to use. And
> as is, jbd and jbd2 just call blk_yield() directly and there is no way
> for the block layer to call into DM.
>
> What am I missing?

Nothing, it is I who am missing something (extra code).  When I send out
the next version, I'll add the setter function and ensure that
queue->yield_fn is called from blk_yield.  Hopefully that's not viewed
as upside down.  We'll see.

Thanks for the review, Mike!

-Jeff

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..1be9413 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -323,6 +323,19 @@  void blk_unplug(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_unplug);
 
+void blk_backing_dev_yield(struct backing_dev_info *bdi)
+{
+	struct request_queue *q = bdi->yield_data;
+
+	blk_yield(q);
+}
+
+void blk_yield(struct request_queue *q)
+{
+	elv_yield(q);
+}
+EXPORT_SYMBOL(blk_yield);
+
 /**
  * blk_start_queue - restart a previously stopped queue
  * @q:    The &struct request_queue in question
@@ -498,6 +511,8 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
+	q->backing_dev_info.yield_fn = blk_backing_dev_yield;
+	q->backing_dev_info.yield_data = q;
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dee9d93..a76ccd1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2065,7 +2065,7 @@  static void cfq_choose_cfqg(struct cfq_data *cfqd)
 	cfqd->serving_group = cfqg;
 
 	/* Restore the workload type data */
-	if (cfqg->saved_workload_slice) {
+	if (cfqg && cfqg->saved_workload_slice) {
 		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
 		cfqd->serving_type = cfqg->saved_workload;
 		cfqd->serving_prio = cfqg->saved_serving_prio;
@@ -2163,6 +2163,41 @@  keep_queue:
 	return cfqq;
 }
 
+static void cfq_yield(struct request_queue *q)
+{
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_io_context *cic;
+	struct cfq_queue *cfqq;
+	unsigned long flags;
+
+	cic = cfq_cic_lookup(cfqd, current->io_context);
+	if (!cic)
+		return;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	/*
+	 * This is primarily called to ensure that the long synchronous
+	 * time slice does not prevent other I/O happening (like journal
+	 * commits) while we idle waiting for it.  Thus, check to see if the
+	 * current cfqq is the sync cfqq for this process.
+	 */
+	cfqq = cic_to_cfqq(cic, 1);
+	if (!cfqq)
+		goto out_unlock;
+
+	if (cfqd->active_queue != cfqq)
+		goto out_unlock;
+
+	cfq_log_cfqq(cfqd, cfqq, "yielding queue");
+
+	__cfq_slice_expired(cfqd, cfqq, 1);
+	__blk_run_queue(cfqd->queue);
+
+out_unlock:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
 {
 	int dispatched = 0;
@@ -3854,6 +3889,7 @@  static struct elevator_type iosched_cfq = {
 		.elevator_deactivate_req_fn =	cfq_deactivate_request,
 		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
+		.elevator_yield_fn =		cfq_yield,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_set_req_fn =		cfq_set_request,
diff --git a/block/elevator.c b/block/elevator.c
index df75676..2cf6f89 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -855,6 +855,14 @@  void elv_completed_request(struct request_queue *q, struct request *rq)
 	}
 }
 
+void elv_yield(struct request_queue *q)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e && e->ops->elevator_yield_fn)
+		e->ops->elevator_yield_fn(q);
+}
+
 #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
 
 static ssize_t
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 8209f26..44f9db0 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -50,6 +50,7 @@  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
 	int ret = 0;
 	tid_t commit_tid;
+	int new_commit;
 
 	if (inode->i_sb->s_flags & MS_RDONLY)
 		return 0;
@@ -80,7 +81,9 @@  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	else
 		commit_tid = atomic_read(&ei->i_sync_tid);
 
-	if (log_start_commit(journal, commit_tid)) {
+	new_commit = log_start_commit(journal, commit_tid);
+	blk_yield_backing_dev(inode->i_sb->s_bdi);
+	if (new_commit) {
 		log_wait_commit(journal, commit_tid);
 		goto out;
 	}
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..1630c68 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -55,6 +55,7 @@  int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret;
 	tid_t commit_tid;
+	int new_commit;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -88,7 +89,10 @@  int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
+	new_commit = jbd2_log_start_commit(journal, commit_tid);
+	blk_yield_backing_dev(inode->i_sb->s_bdi);
+
+	if (new_commit) {
 		/*
 		 * When the journal is on a different device than the
 		 * fs data disk, we need to issue the barrier in
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index fcbc26a..5bb6c3f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -67,6 +67,8 @@  struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+	void (*yield_fn)(struct backing_dev_info *);
+	void *yield_data;
 
 	char *name;
 
@@ -344,4 +346,15 @@  static inline void blk_run_address_space(struct address_space *mapping)
 		blk_run_backing_dev(mapping->backing_dev_info, NULL);
 }
 
+static inline void blk_yield_backing_dev(struct backing_dev_info *bdi)
+{
+	if (bdi && bdi->yield_fn)
+		bdi->yield_fn(bdi);
+}
+static inline void blk_yield_address_space(struct address_space *mapping)
+{
+	if (mapping)
+		blk_yield_backing_dev(mapping->backing_dev_info);
+}
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ebd22db..83c06d4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -832,6 +832,7 @@  extern int blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 extern void blk_unplug(struct request_queue *q);
+extern void blk_yield(struct request_queue *q);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..9b4e2e9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -20,6 +20,7 @@  typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_queue_empty_fn) (struct request_queue *);
 typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
+typedef void (elevator_yield_fn) (struct request_queue *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
 typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
@@ -44,6 +45,7 @@  struct elevator_ops
 
 	elevator_queue_empty_fn *elevator_queue_empty_fn;
 	elevator_completed_req_fn *elevator_completed_req_fn;
+	elevator_yield_fn *elevator_yield_fn;
 
 	elevator_request_list_fn *elevator_former_req_fn;
 	elevator_request_list_fn *elevator_latter_req_fn;
@@ -105,6 +107,7 @@  extern void elv_merge_requests(struct request_queue *, struct request *,
 extern void elv_merged_request(struct request_queue *, struct request *, int);
 extern void elv_requeue_request(struct request_queue *, struct request *);
 extern int elv_queue_empty(struct request_queue *);
+extern void elv_yield(struct request_queue *);
 extern struct request *elv_former_request(struct request_queue *, struct request *);
 extern struct request *elv_latter_request(struct request_queue *, struct request *);
 extern int elv_register_queue(struct request_queue *q);