| Message ID | 1303202686.3981.216.camel@sli10-conroe |
|---|---|
| State | Not Applicable |
| Delegated to | David Miller |
Hello, Shaohua.

> +	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> +	/*
> +	 * If queue doesn't support queueable flush request, we can push the
> +	 * pending requests to the next stage too. For such queue, there are no
> +	 * normal requests running when flush request is running, so this still
> +	 * guarantees the correctness.
> +	 */
> +	if (!blk_queue_flush_queueable(q))
> +		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> +			&proceed_list);

I can't see how this is safe. Request completion is decoupled from
issue. What prevents the low level driver from taking in other requests
before control hits here? And even if that holds for the current
implementation, that's hardly something which can be guaranteed from
!flush_queueable. Am I missing something?

This kind of micro-optimization is gonna bring very painful bugs which
are extremely difficult to reproduce and track down. It scares the hell
out of me. It's gonna silently skip flushes where it shouldn't.

If you wanna optimize this case, a much better way would be
implementing back-to-back flush optimization properly, such that when
the block layer detects two flushes back-to-back and _KNOWS_ that no
request has been issued in between, the second one is handled as a
noop. Mark the queue clean on flush, dirty on any other request; if the
queue is clean, all flushes can be completed immediately on issue,
which would also allow us to avoid the whole queue-at-the-front-or-back
issue without bothering low level drivers at all.

Thanks.
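For illustration, the clean/dirty scheme described in that last paragraph might look like the sketch below. This is a minimal userspace sketch; `sketch_queue`, `cache_clean`, and the hook functions are all invented names, and the real blk-flush code carries no such state:

```c
/*
 * Minimal sketch of the clean/dirty flush-noop scheme described above.
 * All identifiers are invented; the real blk-flush code carries no
 * such state.
 */
#include <stdbool.h>

struct sketch_queue {
	bool cache_clean;	/* no write issued since the last completed flush */
};

/* hook: any non-flush request is issued to the driver */
static void sketch_on_write_issue(struct sketch_queue *q)
{
	q->cache_clean = false;	/* the device cache may now hold dirty data */
}

/* hook: a flush is about to be issued; true means it can be completed
 * immediately as a noop without reaching the device */
static bool sketch_flush_is_noop(const struct sketch_queue *q)
{
	return q->cache_clean;
}

/* hook: a flush completed on the device.  A real implementation must
 * also handle writes issued while the flush was in flight (e.g. with a
 * generation counter) before declaring the queue clean. */
static void sketch_on_flush_complete(struct sketch_queue *q)
{
	q->cache_clean = true;
}
```

The caveat in the last comment is the crux: clearing the flag on completion is only safe if no write was issued while the flush was in flight, which is exactly the ordering question debated in the rest of this thread.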
Hi,

On Sat, Apr 23, 2011 at 07:32:04AM +0800, Tejun Heo wrote:
> > +	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
> > +	/*
> > +	 * If queue doesn't support queueable flush request, we can push the
> > +	 * pending requests to the next stage too. For such queue, there are no
> > +	 * normal requests running when flush request is running, so this still
> > +	 * guarantees the correctness.
> > +	 */
> > +	if (!blk_queue_flush_queueable(q))
> > +		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
> > +			&proceed_list);
>
> I can't see how this is safe. Request completion is decoupled from
> issue. What prevents the low level driver from taking in other
> requests before control hits here? And even if that holds for the
> current implementation, that's hardly something which can be
> guaranteed from !flush_queueable. Am I missing something?

Say in one operation of the fs we issue writes r1 and r2, and after
they finish we issue flush f1. In another operation we issue writes r3
and r4, and after they finish we issue flush f2:

    operation 1: r1 r2 f1
    operation 2: r3 r4 f2

At the time f1 finishes and f2 is in the queue, we can be sure of two
things:
1. r3 and r4 are already finished, otherwise f2 would not have been
   queued.
2. r3 and r4 finished before f1. We can only have one request
   outstanding on a non-queueable queue, so f1 is dispatched either
   after r3 and r4 finish or before they finish. Because of item 1, f1
   is dispatched after r3 and r4 finish.

From these two items, when f1 finishes we can complete f2 as well,
because f1 flushes the disk cache for all requests from r1 to r4.

> This kind of micro-optimization is gonna bring very painful bugs
> which are extremely difficult to reproduce and track down. It scares
> the hell out of me. It's gonna silently skip flushes where it
> shouldn't.
>
> If you wanna optimize this case, a much better way would be
> implementing back-to-back flush optimization properly, such that when
> the block layer detects two flushes back-to-back and _KNOWS_ that no
> request has been issued in between, the second one is handled as a
> noop. Mark the queue clean on flush, dirty on any other request; if
> the queue is clean, all flushes can be completed immediately on
> issue, which would also allow us to avoid the whole
> queue-at-the-front-or-back issue without bothering low level drivers
> at all.

If flush is queueable, I'm not sure we can do that optimization. For
example, say we dispatch 32 requests at the same time and the last
request is a flush -- can the hardware guarantee that the cache for the
first 31 requests is flushed out? On the other hand, my optimization
works even if there are write requests in between the back-to-back
flushes.

Thanks,
Shaohua
Hello, (cc'ing Darrick)

On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> Say in one operation of the fs we issue writes r1 and r2, and after
> they finish we issue flush f1. In another operation we issue writes
> r3 and r4, and after they finish we issue flush f2:
>
>     operation 1: r1 r2 f1
>     operation 2: r3 r4 f2
>
> At the time f1 finishes and f2 is in the queue, we can be sure of two
> things:
> 1. r3 and r4 are already finished, otherwise f2 would not have been
>    queued.
> 2. r3 and r4 finished before f1. We can only have one request
>    outstanding on a non-queueable queue, so f1 is dispatched either
>    after r3 and r4 finish or before they finish. Because of item 1,
>    f1 is dispatched after r3 and r4 finish.
>
> From these two items, when f1 finishes we can complete f2 as well,
> because f1 flushes the disk cache for all requests from r1 to r4.

What I was saying is that request completion is decoupled from the
driver fetching requests from the block layer, and that the order of
completion doesn't necessarily follow the order of execution. IOW,
nothing guarantees that the FLUSH completion code would run before the
low level driver fetches the next command and _completes_ it, in which
case your code would happily mark the flush complete after the write
without actually having done it.

And, in general, I feel uncomfortable with this type of approach. It's
extremely fragile and difficult to understand and verify, and it
doesn't match at all with the rest of the code. If you think you can
exploit a certain ordering constraint, reflect it into the overall
design. Don't stuff the magic into five lines of out-of-place code.

> If flush is queueable, I'm not sure we can do that optimization. For
> example, say we dispatch 32 requests at the same time and the last
> request is a flush -- can the hardware guarantee that the cache for
> the first 31 requests is flushed out? On the other hand, my
> optimization works even if there are write requests in between the
> back-to-back flushes.

Eh, wasn't your optimization only applicable if flush is not queueable?
IIUC, what your optimization achieves is merging back-to-back flushes,
and you're achieving that in a _very_ non-obvious, round-about way. Do
it in a straightforward way even if that costs more lines of code.

Darrick, do you see a flush performance regression between rc1 and rc2?
You're testing on the higher end, so maybe it's still okay for you?

Thanks.
Hello,

On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> Eh, wasn't your optimization only applicable if flush is not
> queueable? IIUC, what your optimization achieves is merging
> back-to-back flushes, and you're achieving that in a _very_
> non-obvious, round-about way. Do it in a straightforward way even if
> that costs more lines of code.

To add a bit more: here, flush exclusivity gives you an extra ordering
constraint -- while a flush is in progress no other request can
proceed, and thus if there's another flush queued, it can be completed
together with the running one, right? If so, teach the block layer the
whole thing: let the block layer hold further requests while a flush is
in progress, complete back-to-back flushes together on completion, and
then resume normal queue processing.

Also, another interesting thing to investigate is why the two flushes
didn't get merged in the first place. The two flushes apparently didn't
have any ordering requirement between them, so why weren't they issued
as one? If the first flush were slightly delayed, the second would have
been issued together with it from the beginning and we wouldn't have to
think about merging them afterwards. Maybe what we really need is a
better algorithm than C1/2/3 described in the comment?

What did sysbench do in the workload which showed the regression? A lot
of parallel fsyncs combined with writes?

Thanks.
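Read literally, the hold-and-batch suggestion above could look like the following sketch. Everything here (the names, the array-based parking list) is invented for illustration and is not the blk-flush implementation:

```c
/*
 * Userspace sketch of "hold further flushes while one is in flight and
 * complete them together on completion".  All names are invented; this
 * is not the blk-flush implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PARKED 32

struct sketch_queue {
	bool flush_in_flight;
	int parked[MAX_PARKED];	/* flushes waiting behind the running one */
	int nr_parked;
};

static void issue_to_device(int id) { printf("issue  flush %d\n", id); }
static void complete_flush(int id)  { printf("finish flush %d\n", id); }

static void sketch_submit_flush(struct sketch_queue *q, int id)
{
	if (q->flush_in_flight) {
		/* flush is exclusive: no other request runs beside it,
		 * so the in-flight flush covers this one too */
		if (q->nr_parked < MAX_PARKED)
			q->parked[q->nr_parked++] = id;
		return;
	}
	q->flush_in_flight = true;
	issue_to_device(id);
}

static void sketch_flush_done(struct sketch_queue *q, int id)
{
	complete_flush(id);
	/* every parked flush was fully ordered behind the one that just
	 * ran, so they can all be completed together */
	for (int i = 0; i < q->nr_parked; i++)
		complete_flush(q->parked[i]);
	q->nr_parked = 0;
	q->flush_in_flight = false;
}

int main(void)
{
	struct sketch_queue q = { 0 };

	sketch_submit_flush(&q, 1);	/* issued to the device */
	sketch_submit_flush(&q, 2);	/* parked behind flush 1 */
	sketch_flush_done(&q, 1);	/* finishes flushes 1 and 2 together */
	return 0;
}
```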
On Mon, 2011-04-25 at 16:58 +0800, Tejun Heo wrote:
> Hello, (cc'ing Darrick)
>
> On Mon, Apr 25, 2011 at 09:33:28AM +0800, Shaohua Li wrote:
> > Say in one operation of the fs we issue writes r1 and r2, and after
> > they finish we issue flush f1. In another operation we issue writes
> > r3 and r4, and after they finish we issue flush f2:
> >
> >     operation 1: r1 r2 f1
> >     operation 2: r3 r4 f2
> >
> > At the time f1 finishes and f2 is in the queue, we can be sure of
> > two things:
> > 1. r3 and r4 are already finished, otherwise f2 would not have been
> >    queued.
> > 2. r3 and r4 finished before f1. We can only have one request
> >    outstanding on a non-queueable queue, so f1 is dispatched either
> >    after r3 and r4 finish or before they finish. Because of item 1,
> >    f1 is dispatched after r3 and r4 finish.
> >
> > From these two items, when f1 finishes we can complete f2 as well,
> > because f1 flushes the disk cache for all requests from r1 to r4.
>
> What I was saying is that request completion is decoupled from the
> driver fetching requests from the block layer, and that the order of
> completion doesn't necessarily follow the order of execution. IOW,
> nothing guarantees that the FLUSH completion code would run before
> the low level driver fetches the next command and _completes_ it, in
> which case your code would happily mark the flush complete after the
> write without actually having done it.

What I described assumes non-queueable flush requests. For queueable
flush, this definitely isn't correct.

> And, in general, I feel uncomfortable with this type of approach.
> It's extremely fragile and difficult to understand and verify, and it
> doesn't match at all with the rest of the code. If you think you can
> exploit a certain ordering constraint, reflect it into the overall
> design. Don't stuff the magic into five lines of out-of-place code.
>
> > If flush is queueable, I'm not sure we can do that optimization.
> > For example, say we dispatch 32 requests at the same time and the
> > last request is a flush -- can the hardware guarantee that the
> > cache for the first 31 requests is flushed out? On the other hand,
> > my optimization works even if there are write requests in between
> > the back-to-back flushes.
>
> Eh, wasn't your optimization only applicable if flush is not
> queueable? IIUC, what your optimization achieves is merging
> back-to-back flushes, and you're achieving that in a _very_
> non-obvious, round-about way. Do it in a straightforward way even if
> that costs more lines of code.

This isn't a problem of more code or less code; I thought my patch was
already quite simple. The method you described only works for
non-queueable flush too, and it has the limitation that the requests
between two back-to-back flushes must not be writes. My patch works for
non-queueable flush but has no such limitation.

> Darrick, do you see a flush performance regression between rc1 and
> rc2? You're testing on the higher end, so maybe it's still okay for
> you?

Please ignore the regression; this patch isn't related to it, but that
problem motivated me to write the patch. Actually, I still need the RFC
patch in the other thread to recover the regression. I hope you and
Jens can seriously look at that issue too.

Thanks,
Shaohua
On Mon, 2011-04-25 at 17:13 +0800, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 25, 2011 at 10:58:27AM +0200, Tejun Heo wrote:
> > Eh, wasn't your optimization only applicable if flush is not
> > queueable? IIUC, what your optimization achieves is merging
> > back-to-back flushes, and you're achieving that in a _very_
> > non-obvious, round-about way. Do it in a straightforward way even
> > if that costs more lines of code.
>
> To add a bit more: here, flush exclusivity gives you an extra
> ordering constraint -- while a flush is in progress no other request
> can proceed, and thus if there's another flush queued, it can be
> completed together with the running one, right? If so, teach the
> block layer the whole thing: let the block layer hold further
> requests while a flush is in progress, complete back-to-back flushes
> together on completion, and then resume normal queue processing.

blk-flush is part of the block layer. If what you mean is the libata
part, the block layer doesn't know whether flush is queueable without
knowledge from the driver.

> Also, another interesting thing to investigate is why the two flushes
> didn't get merged in the first place. The two flushes apparently
> didn't have any ordering requirement between them, so why weren't
> they issued as one? If the first flush were slightly delayed, the
> second would have been issued together with it from the beginning and
> we wouldn't have to think about merging them afterwards. Maybe what
> we really need is a better algorithm than C1/2/3 described in the
> comment?

The sysbench fileio test does a 16-thread write-fsync, so it's quite
normal for a flush to be running while another flush is added to the
pending list.

Thanks,
Shaohua
Hey,

On Tue, Apr 26, 2011 at 08:42:39AM +0800, Shaohua Li wrote:
> > What I was saying is that request completion is decoupled from the
> > driver fetching requests from the block layer, and that the order
> > of completion doesn't necessarily follow the order of execution.
> > IOW, nothing guarantees that the FLUSH completion code would run
> > before the low level driver fetches the next command and
> > _completes_ it, in which case your code would happily mark the
> > flush complete after the write without actually having done it.
>
> What I described assumes non-queueable flush requests. For queueable
> flush, this definitely isn't correct.

We're definitely having communication issues. The above doesn't have
anything to do with the queueability of flushes. It's about the
asynchronous nature of the block request completion and issue paths, so
it can happen whether flush is queueable or not -- or am I still
misunderstanding you?

> > Eh, wasn't your optimization only applicable if flush is not
> > queueable? IIUC, what your optimization achieves is merging
> > back-to-back flushes, and you're achieving that in a _very_
> > non-obvious, round-about way. Do it in a straightforward way even
> > if that costs more lines of code.
>
> This isn't a problem of more code or less code; I thought my patch
> was already quite simple.

Well, then, we'll have to agree to disagree there, as it looks really
hackish to me and I don't think it's even correct as written above.

> The method you described only works for non-queueable flush too, and
> it has the limitation that the requests between two back-to-back
> flushes must not be writes. My patch works for non-queueable flush
> but has no such limitation.

No, I'm saying you can achieve about the same effect in a cleaner and
safer way if you teach the issue and completion paths properly about
these back-to-back flushes, at the cost of more code changes. Your
patch doesn't work reliably whether flush is queueable or not.

> > Darrick, do you see a flush performance regression between rc1 and
> > rc2? You're testing on the higher end, so maybe it's still okay for
> > you?
>
> Please ignore the regression; this patch isn't related to it, but
> that problem motivated me to write the patch. Actually, I still need
> the RFC patch in the other thread to recover the regression. I hope
> you and Jens can seriously look at that issue too.

Ah, okay, it's a separate issue. Sorry about confusing the two. I'll
continue in another reply.

Thanks.
Hey,

On Tue, Apr 26, 2011 at 08:46:30AM +0800, Shaohua Li wrote:
> blk-flush is part of the block layer. If what you mean is the libata
> part, the block layer doesn't know whether flush is queueable without
> knowledge from the driver.

It seems my communication skill towards you sucks badly. I was talking
about making both the issue and completion paths cooperate on flush
sequence handling, instead of depending on the queueability of flush to
make assumptions about completion order, which I still think is
incorrect.

> > Also, another interesting thing to investigate is why the two
> > flushes didn't get merged in the first place. The two flushes
> > apparently didn't have any ordering requirement between them, so
> > why weren't they issued as one? If the first flush were slightly
> > delayed, the second would have been issued together with it from
> > the beginning and we wouldn't have to think about merging them
> > afterwards. Maybe what we really need is a better algorithm than
> > C1/2/3 described in the comment?
>
> The sysbench fileio test does a 16-thread write-fsync, so it's quite
> normal for a flush to be running while another flush is added to the
> pending list.

I think you're optimizing in the wrong place. These back-to-back
flushes had better be merged on the issue side instead of at
completion. The current merging rules (basically, how long to delay
pending flushes) are minimal and mechanical (the timeout is the only
heuristic parameter). For the initial implementation, my goal was
matching the numbers of Darrick's original implementation at a higher
level and keeping things simple, but the code is intentionally
structured to allow easy tuning of the merging criteria, so I suggest
looking there instead of mucking with the completion path.

Thank you.
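The issue-side knob Tejun points at can be pictured as follows. This sketch paraphrases the C1/C2/C3 conditions referenced above; the signature and constant are invented for illustration, not verbatim blk-flush.c code:

```c
/*
 * Paraphrase of the issue-side merging policy (the C1/C2/C3 conditions
 * referenced above).  The signature and constant are invented; this is
 * not verbatim blk-flush.c.
 */
#include <stdbool.h>

#define SKETCH_PENDING_TIMEOUT 5	/* heuristic: ticks to let flushes pile up */

static bool sketch_should_kick_flush(bool prev_flush_seq_done,
				     bool have_pending_flush,
				     bool data_requests_in_flight,
				     unsigned long now,
				     unsigned long pending_since)
{
	/* C1: only one flush sequence may be in flight at a time */
	if (!prev_flush_seq_done || !have_pending_flush)
		return false;

	/* C2/C3: while data requests are in flight, hold the pending
	 * flush back (up to a timeout) so back-to-back flushes merge at
	 * issue time; raising the timeout widens the merge window */
	if (data_requests_in_flight &&
	    now < pending_since + SKETCH_PENDING_TIMEOUT)
		return false;

	return true;
}
```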
```diff
Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c	2011-04-19 09:21:47.000000000 +0800
+++ linux/block/blk-flush.c	2011-04-19 16:38:22.000000000 +0800
@@ -193,18 +193,29 @@ static bool blk_flush_complete_seq(struc
 static void flush_end_io(struct request *flush_rq, int error)
 {
 	struct request_queue *q = flush_rq->q;
-	struct list_head *running = &q->flush_queue[q->flush_running_idx];
+	LIST_HEAD(proceed_list);
 	bool queued = false;
 	struct request *rq, *n;
 
 	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
 
+	list_splice_init(&q->flush_queue[q->flush_running_idx], &proceed_list);
+	/*
+	 * If queue doesn't support queueable flush request, we can push the
+	 * pending requests to the next stage too. For such queue, there are no
+	 * normal requests running when flush request is running, so this still
+	 * guarantees the correctness.
+	 */
+	if (!blk_queue_flush_queueable(q))
+		list_splice_tail_init(&q->flush_queue[q->flush_pending_idx],
+			&proceed_list);
+
 	/* account completion of the flush request */
 	q->flush_running_idx ^= 1;
 	elv_completed_request(q, flush_rq);
 
 	/* and push the waiting requests to the next stage */
-	list_for_each_entry_safe(rq, n, running, flush.list) {
+	list_for_each_entry_safe(rq, n, &proceed_list, flush.list) {
 		unsigned int seq = blk_flush_cur_seq(rq);
 
 		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h	2011-04-19 09:15:15.000000000 +0800
+++ linux/include/linux/blkdev.h	2011-04-19 10:04:46.000000000 +0800
@@ -366,6 +366,7 @@ struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
+	unsigned int		flush_not_queueable:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
 	unsigned long		flush_pending_since;
@@ -552,6 +553,16 @@ static inline void blk_clear_queue_full(
 	queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
 }
 
+static inline void blk_set_queue_flush_queueable(struct request_queue *q,
+	bool queueable)
+{
+	q->flush_not_queueable = !queueable;
+}
+
+static inline bool blk_queue_flush_queueable(struct request_queue *q)
+{
+	return !q->flush_not_queueable;
+}
 
 /*
  * mergeable request must not have _NOMERGE or _BARRIER bit set, nor may
```
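For reference, a low level driver would opt out of queueable flush with the setter added in the blkdev.h hunk above. The announced SATA follow-up patch is not shown in this thread, so the wiring below is only a guess at how it might use the API:

```c
/*
 * Hypothetical driver-side use of the API added above.  The follow-up
 * SATA patch is not shown in this thread, so where and how it calls
 * the setter is an assumption.
 */
#include <linux/blkdev.h>

static void example_init_queue(struct request_queue *q)
{
	/* ATA FLUSH CACHE is a non-NCQ command and cannot be queued
	 * alongside other commands, so mark flushes non-queueable */
	blk_set_queue_flush_queueable(q, false);
}
```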
In some drives, flush requests are non-queueable: while a flush request
is running, normal read/write requests are not. On such drives, when a
running flush request finishes, we can complete pending flush requests
too. Since no normal write request was running in the meantime, the
completed flush has already flushed out the drive cache that the
pending flushes require. This cuts down the number of flush requests
executed and improves performance.

This patch lets the block core use the optimization. The next patch
enables it for SATA.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-flush.c      |   15 +++++++++++++--
 include/linux/blkdev.h |   11 +++++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)