Patchwork [1/2] block: optimize non-queueable flush request drive

Submitter Shaohua Li
Date April 19, 2011, 8:44 a.m.
Message ID <1303202686.3981.216.camel@sli10-conroe>
Permalink /patch/91937/
State Not Applicable
Delegated to: David Miller

Comments

Shaohua Li - April 19, 2011, 8:44 a.m.
In some drives, flush requests are non-queueable: while a flush request is
running, normal read/write requests are not. On such drives, when a running
flush request finishes, we can complete the pending flush requests as well.
Since no normal write requests were running in the meantime, the finished
flush has already flushed the required drive cache for them too. This reduces
the number of flush requests executed and improves performance.

This patch lets the block core make use of the optimization. The next patch
will enable it for SATA.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-flush.c      |   15 +++++++++++++--
 include/linux/blkdev.h |   11 +++++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo - April 22, 2011, 11:32 p.m.
Hello, Shaohua.

> +	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> +	/*
> +	 * If queue doesn't support queueable flush request, we can push the
> +	 * pending requests to the next stage too. For such queue, there are no
> +	 * normal requests running when flush request is running, so this still
> +	 * guarantees the correctness.
> +	 */
> +	if (!blk_queue_flush_queueable(q))
> +		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> +			&proceed_list);

I can't see how this is safe.  Request completion is decoupled from
issue.  What prevents the low level driver from taking in other requests
before control hits here?  And even if that holds for the current
implementation, that's hardly something which can be guaranteed from
!flush_queueable.  Am I missing something?

This kind of micro optimization is gonna bring very painful bugs which
are extremely difficult to reproduce and track down.  It scares the
hell out of me.  It's gonna silently skip flushes where it shouldn't.

If you wanna optimize this case, a much better way would be
implementing back-to-back flush optimization properly such that when
block layer detects two flushes back-to-back and _KNOWS_ that no
request has been issued inbetween, the second one is handled as noop.
Mark the queue clean on flush, dirty on any other request and if the
queue is clean all flushes can be completed immediately on issue which
would also allow us to avoid the whole queue at the front or back
issue without bothering low level drivers at all.
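
In sketch form, the clean/dirty idea would be something like the following
(a userspace model, not block-layer code; all names are made up):

```c
#include <stdbool.h>

/* Illustrative model: the queue is marked dirty by any non-flush
 * request and clean by a flush, so a flush issued against a clean
 * queue can be completed immediately on issue as a noop. */
struct model_queue {
	bool cache_dirty;	/* any write since the last flush? */
};

static void model_submit_write(struct model_queue *q)
{
	q->cache_dirty = true;	/* dirty on any other request */
}

/* Returns true if the flush may complete immediately as a noop. */
static bool model_submit_flush(struct model_queue *q)
{
	if (!q->cache_dirty)
		return true;	/* queue clean: nothing to flush */
	q->cache_dirty = false;	/* the flush cleans the queue */
	return false;		/* must actually go to the device */
}
```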

Thanks.
Shaohua Li - April 25, 2011, 1:33 a.m.
Hi,
On Sat, Apr 23, 2011 at 07:32:04AM +0800, Tejun Heo wrote: 
> > +	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> > +	/*
> > +	 * If queue doesn't support queueable flush request, we can push the
> > +	 * pending requests to the next stage too. For such queue, there are no
> > +	 * normal requests running when flush request is running, so this still
> > +	 * guarantees the correctness.
> > +	 */
> > +	if (!blk_queue_flush_queueable(q))
> > +		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> > +			&proceed_list);
> 
> I can't see how this is safe.  Request completion is decoupled from
> issue.  What prevents low level driver from take in other requests
> before control hits here?  And even if that holds for the current
> implementation, that's hardly something which can be guaranteed from
> !flush_queueable.  Am I missing something?
Say in one operation of the fs, we issue writes r1 and r2; after they finish,
we issue flush f1. In another operation, we issue writes r3 and r4; after
they finish, we issue flush f2.
operation 1: r1 r2  f1
operation 2:  r3 r4  f2
At the time f1 finishes and f2 is in the queue, we can be sure of two things:
1. r3 and r4 are already finished, otherwise f2 would not be queued.
2. r3 and r4 finished before f1 did. We can only have one request
outstanding for a non-queueable queue, so f1 is dispatched either after r3 and r4
are finished or before they are finished. Because of item 1, f1 is
dispatched after r3 and r4 are finished.
From these two items, when f1 finishes, we can let f2 finish too, because
f1 flushed the disk cache for all requests from r1 to r4.
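
The argument can be put in a small userspace model (illustrative names only,
not actual block-layer code): because a non-queueable flush excludes normal
I/O, no write can complete between f1's dispatch and f1's completion.

```c
#include <stdbool.h>
#include <assert.h>

struct flush_model {
	int writes_done;	/* writes completed so far */
	bool flush_in_flight;	/* non-queueable: device is exclusive */
	int covered_by_running;	/* writes the running flush covers */
};

static void fm_complete_write(struct flush_model *m)
{
	/* a non-queueable flush excludes normal I/O, so no write
	 * completes while a flush occupies the device */
	assert(!m->flush_in_flight);
	m->writes_done++;
}

static void fm_dispatch_flush(struct flush_model *m)
{
	m->flush_in_flight = true;
	m->covered_by_running = m->writes_done;
}

/* A flush queued while another runs: its writes must already be done
 * (needed_writes <= writes_done), and since writes_done cannot grow
 * while the flush is in flight, the running flush covers it. */
static bool fm_pending_flush_covered(struct flush_model *m, int needed_writes)
{
	assert(needed_writes <= m->writes_done);
	return needed_writes <= m->covered_by_running;
}
```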
 
> This kind of micro optimization is gonna bring very painful bugs which
> are extremely difficult to reproduce and track down.  It scares the
> hell out of me.  It's gonna silently skip flushes where it shouldn't.
> 
> If you wanna optimize this case, a much better way would be
> implementing back-to-back flush optimization properly such that when
> block layer detects two flushes back-to-back and _KNOWS_ that no
> request has been issued inbetween, the second one is handled as noop.
> Mark the queue clean on flush, dirty on any other request and if the
> queue is clean all flushes can be completed immediately on issue which
> would also allow us to avoid the whole queue at the front or back
> issue without bothering low level drivers at all.
If flush is queueable, I'm not sure we can do the optimization. For example,
say we dispatch 32 requests at the same time and the last request is a flush;
can the hardware guarantee the cache for the first 31 requests is flushed out?
On the other hand, my optimization works even when there are write requests
in between the back-to-back flushes.

Thanks,
Shaohua
Tejun Heo - April 25, 2011, 8:58 a.m.
Hello,

(cc'ing Darrick)

On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> Say in one operation of fs, we issue write r1 and r2, after they finishes,
> we issue flush f1. In another operation, we issue write r3 and r4, after
> they finishes, we issue flush f2.
> operation 1: r1 r2  f1
> operation 2:  r3 r4  f2
> At the time f1 finishes and f2 is in queue, we can make sure two things:
> 1. r3 and r4 is already finished, otherwise f2 will not be queued.
> 2. r3 and r4 should be finished before f1. We can only deliver one request
> out for non-queueable request, so either f1 is dispatched after r3 and r4
> are finished or before r3 and r4 are finished. Because of item1, f1 is
> dispatched after r3 and r4 are finished.
> From the two items, when f1 is finished, we can let f2 finished, because
> f1 should flush disk cache out for all requests from r1 to r4.

What I was saying is that request completion is decoupled from driver
fetching requests from block layer and that the order of completion
doesn't necessarily follow the order of execution.  IOW, nothing
guarantees that FLUSH completion code would run before the low level
driver fetches the next command and _completes_ it, in which case your
code would happily mark flush complete after write without actually
doing it.

And, in general, I feel uncomfortable with this type of approach, it's
extremely fragile and difficult to understand and verify, and doesn't
match at all with the rest of the code.  If you think you can exploit
certain ordering constraint, reflect it into the overall design.
Don't stuff the magic into five line out-of-place code.

> If flush is queueable, I'm not sure if we can do the
> optimization. For example, we dispatch 32 requests in the
> meantime. and the last request is flush, can the hardware guarantee
> the cache for the first 31 requests are flushed out?  On the other
> hand, my optimization works even there are write requests in between
> the back-to-back flush.

Eh, wasn't your optimization only applicable if flush is not
queueable?  IIUC, what your optimization achieves is merging
back-to-back flushes and you're achieving that in a _very_ non-obvious
round-about way.  Do it in straight-forward way even if that costs
more lines of code.

Darrick, do you see flush performance regression between rc1 and rc2?
You're testing on higher end, so maybe it's still okay for you?

Thanks.
Tejun Heo - April 25, 2011, 9:13 a.m.
Hello,

On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> Eh, wasn't your optimization only applicable if flush is not
> queueable?  IIUC, what your optimization achieves is merging
> back-to-back flushes and you're achieving that in a _very_ non-obvious
> round-about way.  Do it in straight-forward way even if that costs
> more lines of code.

To add a bit more, here, flush exclusivity gives you an extra ordering
constraint that while a flush is in progress no other request can proceed
and thus if there's another flush queued, it can be completed
together, right?  If so, teach block layer the whole thing - let block
layer hold further requests while flush is in progress and complete
back-to-back flushes together on completion and then resume normal
queue processing.
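
Roughly, in sketch form (userspace model, hypothetical names):

```c
#include <stdbool.h>

/* Model of "teach the block layer the whole thing": while a flush is
 * in flight on a flush-exclusive queue, hold further requests; flushes
 * that arrive meanwhile are batched and completed together with the
 * in-flight one when it finishes. */
struct batch_model {
	bool flush_in_flight;
	int held_flushes;	/* flushes queued during the flush */
};

static void bm_issue_flush(struct batch_model *b)
{
	b->flush_in_flight = true;
	b->held_flushes = 0;
}

static void bm_queue_flush(struct batch_model *b)
{
	if (b->flush_in_flight)
		b->held_flushes++;	/* held, not dispatched */
}

/* On completion the in-flight flush plus every held one complete
 * together; returns how many flushes finished in total. */
static int bm_complete_flush(struct batch_model *b)
{
	int done = 1 + b->held_flushes;
	b->flush_in_flight = false;
	b->held_flushes = 0;
	return done;
}
```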

Also, another interesting thing to investigate is why the two flushes
didn't get merged in the first place.  The two flushes apparently
didn't have any ordering requirement between them.  Why didn't they
get merged in the first place?  If the first flush were slightly
delayed, the second would have been issued together from the beginning
and we wouldn't have to think about merging them afterwards.  Maybe
what we really need is a better algorithm than C1/2/3 described in the
comment?

What did sysbench do in the workload which showed the regression?  A
lot of parallel fsyncs combined with writes?

Thanks.
Shaohua Li - April 26, 2011, 12:42 a.m.
On Mon, 2011-04-25 at 16:58 +0800, Tejun Heo wrote:
> Hello,
> 
> (cc'ing Darrick)
> 
> On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> > Say in one operation of fs, we issue write r1 and r2, after they finishes,
> > we issue flush f1. In another operation, we issue write r3 and r4, after
> > they finishes, we issue flush f2.
> > operation 1: r1 r2  f1
> > operation 2:  r3 r4  f2
> > At the time f1 finishes and f2 is in queue, we can make sure two things:
> > 1. r3 and r4 is already finished, otherwise f2 will not be queued.
> > 2. r3 and r4 should be finished before f1. We can only deliver one request
> > out for non-queueable request, so either f1 is dispatched after r3 and r4
> > are finished or before r3 and r4 are finished. Because of item1, f1 is
> > dispatched after r3 and r4 are finished.
> > From the two items, when f1 is finished, we can let f2 finished, because
> > f1 should flush disk cache out for all requests from r1 to r4.
> 
> What I was saying is that request completion is decoupled from driver
> fetching requests from block layer and that the order of completion
> doesn't necessarily follow the order of execution.  IOW, nothing
> guarantees that FLUSH completion code would run before the low level
> driver fetches the next command and _completes_ it, in which case your
> code would happily mark flush complete after write without actually
> doing it.
What I described assumes a non-queueable flush request. For a queueable
flush, it definitely isn't correct.

> And, in general, I feel uncomfortable with this type of approach, it's
> extremely fragile and difficult to understand and verify, and doesn't
> match at all with the rest of the code.  If you think you can exploit
> certain ordering constraint, reflect it into the overall design.
> Don't stuff the magic into five line out-of-place code.
> 
> > If flush is queueable, I'm not sure if we can do the
> > optimization. For example, we dispatch 32 requests in the
> > meantime. and the last request is flush, can the hardware guarantee
> > the cache for the first 31 requests are flushed out?  On the other
> > hand, my optimization works even there are write requests in between
> > the back-to-back flush.
> 
> Eh, wasn't your optimization only applicable if flush is not
> queueable?  IIUC, what your optimization achieves is merging
> back-to-back flushes and you're achieving that in a _very_ non-obvious
> round-about way.  Do it in straight-forward way even if that costs
> more lines of code.
This isn't a question of more code or less code. I thought my patch was
already quite simple.

The method you described only works for non-queueable flush too, and it
has the limitation that the requests between two back-to-back flushes must
not include writes. My patch works for non-queueable flush but has no such
limitation.

> Darrick, do you see flush performance regression between rc1 and rc2?
> You're testing on higher end, so maybe it's still okay for you?
Please ignore the regression; this patch isn't related to it, but that
problem motivated me to write the patch.
Actually I still need the RFC patch in another thread to recover the
regression. I hope you and Jens can seriously look at that issue too.

Thanks,
Shaohua

Shaohua Li - April 26, 2011, 12:46 a.m.
On Mon, 2011-04-25 at 17:13 +0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> > Eh, wasn't your optimization only applicable if flush is not
> > queueable?  IIUC, what your optimization achieves is merging
> > back-to-back flushes and you're achieving that in a _very_ non-obvious
> > round-about way.  Do it in straight-forward way even if that costs
> > more lines of code.
> 
> To add a bit more, here, flush exclusivity gives you an extra ordering
> constraint that while flush is in progress no other request can proceed
> and thus if there's another flush queued, it can be completed
> together, right?  If so, teach block layer the whole thing - let block
> layer hold further requests while flush is in progress and complete
> back-to-back flushes together on completion and then resume normal
> queue processing.
blk-flush is part of the block layer. If what you mean is the libata
part, the block layer doesn't know whether flush is queueable without
knowledge from the driver.

> Also, another interesting thing to investigate is why the two flushes
> didn't get merged in the first place.  The two flushes apparently
> didn't have any ordering requirement between them.  Why didn't they
> get merged in the first place?  If the first flush were slightly
> delayed, the second would have been issued together from the beginning
> and we wouldn't have to think about merging them afterwards.  Maybe
> what we really need is better algorithm than C1/2/3 described in the
> comment?
The sysbench fileio test does a 16-thread write-fsync workload, so it's
quite normal that one flush is running while another flush is added to
the pending list.

Thanks,
Shaohua

Tejun Heo - April 26, 2011, 10:40 a.m.
Hey,

On Tue, Apr 26, 2011 at 08:42:39AM +0800, Shaohua Li wrote:
> > What I was saying is that request completion is decoupled from driver
> > fetching requests from block layer and that the order of completion
> > doesn't necessarily follow the order of execution.  IOW, nothing
> > guarantees that FLUSH completion code would run before the low level
> > driver fetches the next command and _completes_ it, in which case your
> > code would happily mark flush complete after write without actually
> > doing it.
>
> What I described is in the background of non-queueable flush request.
> For queueable flush, this definitely isn't correct.

We're definitely having communication issues.  The above doesn't have
anything to do with queueability of flushes.  It's about the
asynchronous nature of block request completion and issue paths, so it
can happen whether flush is queueable or not, or am I still
misunderstanding you?

> > Eh, wasn't your optimization only applicable if flush is not
> > queueable?  IIUC, what your optimization achieves is merging
> > back-to-back flushes and you're achieving that in a _very_ non-obvious
> > round-about way.  Do it in straight-forward way even if that costs
> > more lines of code.
>
> This isn't a problem of more code or less code. I thought my patch is
> already quite simple.

Well, then, we'll have to agree to disagree there as it looks really
hackish to me and I don't think it's even correct as written above.

> The method your described only works for non-queueable flush too. And it
> has limitation that the requests between two back-to-back flushes must
> not be write. my patch works for non-queueable flush but has no such
> limitation.

No, I'm saying you can achieve about the same effect in cleaner and
safer way if you teach the issue and completion paths properly about
these back-to-back flushes at the cost of more code changes.  Your
patch doesn't work reliably whether flush is queueable or not.

> > Darrick, do you see flush performance regression between rc1 and rc2?
> > You're testing on higher end, so maybe it's still okay for you?
>
> please ignore the regression. the patch isn't related to the regression,
> but that problem motivates me to do the patch.
> Actually I still need the RFC patch in another thread to recover the
> regression. I hope you and Jens can seriously look at that issue too.

Ah, okay, it's a separate issue.  Sorry about confusing the two.
I'll continue on another reply.

Thanks.
Tejun Heo - April 26, 2011, 10:48 a.m.
Hey,

On Tue, Apr 26, 2011 at 08:46:30AM +0800, Shaohua Li wrote:
> the blk-flush is part of block layer. If what you mean is the libata
> part, block layer doesn't know if flush is queueable without knowledge
> from driver.

It seems my communication skill towards you sucks badly.  I was
talking about making both the issue and completion paths cooperate on
flush sequence handling instead of depending on queuability of flush
to make assumptions on completion order, which I still think is
incorrect.

> > Also, another interesting thing to investigate is why the two flushes
> > didn't get merged in the first place.  The two flushes apparently
> > didn't have any ordering requirement between them.  Why didn't they
> > get merged in the first place?  If the first flush were slightly
> > delayed, the second would have been issued together from the beginning
> > and we wouldn't have to think about merging them afterwards.  Maybe
> > what we really need is better algorithm than C1/2/3 described in the
> > comment?
>
> the sysbench fileio does a 16 threads write-fsync, so it's quite normal
> a flush is running and another flush is added into pending list.

I think you're optimizing in the wrong place.  These back-to-back
flushes better be merged on the issue side instead of completion.  The
current merging rules (basically how long to delay pending flushes)
are minimal and mechanical (timeout is the only heuristic parameter).

For the initial implementation, my goal was matching the numbers of
Darrick's original implementation at higher level and keeping things
simple, but the code is intentionally structured to allow easy tuning
of merging criteria, so I suggest looking there instead of mucking
with completion path.
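
As a sketch of what issue-side merging looks like (a simplified userspace
model in the spirit of the C1/2/3 rules, not the actual blk-flush.c
algorithm; names and the constant are made up):

```c
#include <stdbool.h>

/* Delay issuing a lone pending flush briefly in the hope that another
 * arrives and the two can be merged into one device flush, but never
 * hold pending flushes past a timeout. */
#define MODEL_FLUSH_PENDING_TIMEOUT 5	/* e.g. jiffies */

static bool model_should_issue_flush(bool flush_in_flight,
				     int pending_flushes,
				     long pending_age)
{
	if (flush_in_flight || pending_flushes == 0)
		return false;	/* only one flush in progress at a time */
	if (pending_flushes > 1)
		return true;	/* several waiters already: merge and go */
	/* a lone flush waits a bit, hoping a second one arrives */
	return pending_age >= MODEL_FLUSH_PENDING_TIMEOUT;
}
```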

Thank you.

Patch

Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c	2011-04-19 09:21:47.000000000 +0800
+++ linux/block/blk-flush.c	2011-04-19 16:38:22.000000000 +0800
@@ -193,18 +193,29 @@  static bool blk_flush_complete_seq(struc
 static void flush_end_io(struct request *flush_rq, int error)
 {
 	struct request_queue *q = flush_rq->q;
-	struct list_head *running = &q->flush_queue[q->flush_running_idx];
+	LIST_HEAD(proceed_list);
 	bool queued = false;
 	struct request *rq, *n;
 
 	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
 
+	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
+	/*
+	 * If queue doesn't support queueable flush request, we can push the
+	 * pending requests to the next stage too. For such queue, there are no
+	 * normal requests running when flush request is running, so this still
+	 * guarantees the correctness.
+	 */
+	if (!blk_queue_flush_queueable(q))
+		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
+			&proceed_list);
+
 	/* account completion of the flush request */
 	q->flush_running_idx ^= 1;
 	elv_completed_request(q, flush_rq);
 
 	/* and push the waiting requests to the next stage */
-	list_for_each_entry_safe(rq, n, running, flush.list) {
+	list_for_each_entry_safe(rq, n, &proceed_list, flush.list) {
 		unsigned int seq = blk_flush_cur_seq(rq);
 
 		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h	2011-04-19 09:15:15.000000000 +0800
+++ linux/include/linux/blkdev.h	2011-04-19 10:04:46.000000000 +0800
@@ -366,6 +366,7 @@  struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
+	unsigned int		flush_not_queueable:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
 	unsigned long		flush_pending_since;
@@ -552,6 +553,16 @@  static inline void blk_clear_queue_full(
 		queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
 }
 
+static inline void blk_set_queue_flush_queueable(struct request_queue *q,
+	bool queueable)
+{
+	q->flush_not_queueable = !queueable;
+}
+
+static inline bool blk_queue_flush_queueable(struct request_queue *q)
+{
+	return !q->flush_not_queueable;
+}
 
 /*
  * mergeable request must not have _NOMERGE or _BARRIER bit set, nor may