From patchwork Fri Jan 14 01:00:22 2011
X-Patchwork-Submitter: Shaohua Li
X-Patchwork-Id: 78856
Subject: Re: [PATCH v7.1] block: Coordinate flush requests
From: Shaohua Li
To: djwong@us.ibm.com
Cc: Jens Axboe, "Theodore Ts'o", Neil Brown, Andreas Dilger, Jan Kara,
 Mike Snitzer, linux-kernel, Keith Mannthey, Mingming Cao, Tejun Heo,
 linux-ext4@vger.kernel.org, Ric Wheeler, Christoph Hellwig, Josef Bacik
Date: Fri, 14 Jan 2011 09:00:22 +0800
In-Reply-To: <20110113074603.GC27381@tux1.beaverton.ibm.com>
References: <20110113025646.GB27381@tux1.beaverton.ibm.com>
 <20110113074603.GC27381@tux1.beaverton.ibm.com>
X-Mailing-List: linux-ext4@vger.kernel.org

2011/1/13 Darrick J. Wong <djwong@us.ibm.com>:
> On Thu, Jan 13, 2011 at 01:38:55PM +0800, Shaohua Li wrote:
>> 2011/1/13 Darrick J. Wong <djwong@us.ibm.com>:
>> > On certain types of storage hardware, flushing the write cache takes
>> > a considerable amount of time.  Typically, these are simple storage
>> > systems with write cache enabled and no battery to save that cache
>> > during a power failure.  When we encounter a system with many I/O
>> > threads that try to flush the cache, performance is suboptimal
>> > because each of those threads issues its own flush command to the
>> > drive instead of trying to coordinate the flushes, thereby wasting
>> > execution time.
>> >
>> > Instead of each thread initiating its own flush, we now try to detect
>> > the situation where multiple threads are issuing flush requests.
>> > The first thread to enter blkdev_issue_flush becomes the owner of
>> > the flush, and all threads that enter blkdev_issue_flush before the
>> > flush finishes are queued up to wait for the next flush.  When that
>> > first flush finishes, one of those sleeping threads is woken up to
>> > perform the next flush and then wake up the other threads that are
>> > asleep waiting for the second flush to finish.
>> >
>> > In the single-threaded case, the thread will simply issue the flush
>> > and exit.
>> >
>> > To test the performance of this latest patch, I created a spreadsheet
>> > reflecting the performance numbers I obtained with the same ffsb
>> > fsync-happy workload that I've been running all along:
>> > http://tinyurl.com/6xqk5bs
>> >
>> > The second tab of the workbook provides easy comparisons of the
>> > performance before and after adding flush coordination to the block
>> > layer.  Variations in the runs were never more than about 5%, so the
>> > slight performance increases and decreases are negligible.  It is
>> > expected that devices with low flush times should not show much
>> > change, whether the low flush times are due to the lack of write
>> > cache or the controller having a battery and thereby ignoring the
>> > flush command.
>> >
>> > Notice that the elm3b231_ipr, elm3b231_bigfc, elm3b57, elm3c44_ssd,
>> > elm3c44_sata_wc, and elm3c71_scsi profiles showed large performance
>> > increases from flush coordination.  These 6 configurations all
>> > feature large write caches without battery backups, and fairly high
>> > (or at least non-zero) average flush times, as was discovered when I
>> > studied the v6 patch.
>> >
>> > Unfortunately, there is one very odd regression: elm3c44_sas.  This
>> > profile is a couple of battery-backed RAID cabinets striped together
>> > with raid0 on md.  I suspect that there is some sort of problematic
>> > interaction with md, because running ffsb on the individual hardware
>> > arrays produces numbers similar to elm3c71_extsas.  elm3c71_extsas
>> > uses the same type of hardware array as elm3c44_sas does, in fact.
>> >
>> > FYI, the flush coordination patch shows performance improvements both
>> > with and without Christoph's patch that issues pure flushes directly.
>> > The spreadsheet only captures the performance numbers collected
>> > without Christoph's patch.
>> Hi,
>> can you explain why there is an improvement with your patch? If there
>> are multiple flushes, blk_do_flush already has a queue for them (the
>> ->pending_flushes list).
>
> With the current code, if we have n threads trying to issue flushes, the
> block layer will issue n flushes one after the other.  I think the point
> of Christoph's pure-flush patch is to skip the serialization step and
> allow issuing of the n pure flushes in parallel.  The point of this
> patch is to optimize even more aggressively, such that as soon as the
> system becomes free to process a pure flush (at time t), all the
> requests for a pure flush that were created since the last time a pure
> flush was actually issued can be covered with a single flush issued at
> time t.  In other words, this patch tries to reduce the number of pure
> flushes issued.

Thanks, but why merge only pure flushes? We could merge normal flush
requests with a pure flush request too.

+	complete_all(&new_flush->ready);
+	spin_unlock(&disk->flush_flag_lock);

-	bio_put(bio);
+	complete_all(&flush->finish);

This doesn't seem to guarantee that the second waiter runs after the
first waiter; am I missing anything?
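For reference, the coordination scheme quoted above reduces to a fairly
small pattern: the first thread to arrive issues a flush, and threads that
arrive while it is in flight are batched together and satisfied by the
*next* flush. Below is a minimal userspace sketch of that pattern. It is
illustrative only: coordinated_flush, issue_flush_to_device, and the
pthread-based locking are all made up for the sketch and are not the
actual blkdev_issue_flush code.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int flush_running;		/* a flush is on the wire */
static unsigned long long flush_gen;	/* bumped when a flush completes */

static void issue_flush_to_device(void)
{
	usleep(1000);	/* stand-in for a slow cache-flush command */
}

void coordinated_flush(void)
{
	unsigned long long need;

	pthread_mutex_lock(&lock);
	/*
	 * A flush already in flight may have been issued before our writes
	 * reached the cache, so it cannot satisfy us; we need a flush that
	 * starts from now on.  That is the next generation if nothing is
	 * running, or the one after the in-flight flush otherwise.
	 */
	need = flush_gen + (flush_running ? 2 : 1);

	while (flush_gen < need) {
		if (!flush_running) {
			/* become the owner: one flush covers the batch */
			flush_running = 1;
			pthread_mutex_unlock(&lock);
			issue_flush_to_device();
			pthread_mutex_lock(&lock);
			flush_running = 0;
			flush_gen++;
			pthread_cond_broadcast(&cond);	/* release batch */
		} else {
			pthread_cond_wait(&cond, &lock);
		}
	}
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	(void)arg;
	coordinated_flush();
	return NULL;
}

int main(void)
{
	pthread_t t[8];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(t[i], NULL);
	/* typically fewer flushes than threads: the batching at work */
	printf("flushes issued: %llu\n", flush_gen);
	return 0;
}

The generation counter is what lets one flush, started after all of the
waiters' writes, release the whole batch at once.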
It appears we can easily implement this in blk_do_flush. I had something
on hand too; it passed testing, but I have no performance data yet.

Task1: ...FS.....C
Task2: ....F......S...C
Task3: ......F.........S..C

F means a flush is queued, S means a flush is dispatched, and C means the
flush is completed. In the diagram above, when Task2's flush completes, we
can complete Task3's flush as well, because Task2's flush was dispatched
after Task3's flush was queued. For the same reason, we can't merge
Task2's and Task3's flushes with Task1's.

---
 block/blk-core.c       |    1 +
 block/blk-flush.c      |   24 ++++++++++++++++++++++++
 include/linux/blkdev.h |    2 ++
 3 files changed, 27 insertions(+)

Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c	2011-01-13 21:02:24.000000000 +0800
+++ linux/block/blk-core.c	2011-01-13 21:43:03.000000000 +0800
@@ -128,6 +128,7 @@ void blk_rq_init(struct request_queue *q
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->next_batched_flush = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
Index: linux/block/blk-flush.c
===================================================================
--- linux.orig/block/blk-flush.c	2011-01-13 21:02:24.000000000 +0800
+++ linux/block/blk-flush.c	2011-01-14 08:22:56.000000000 +0800
@@ -42,8 +42,18 @@ static struct request *blk_flush_complet
 		/* not complete yet, queue the next flush sequence */
 		next_rq = queue_next_fseq(q);
 	} else {
+		struct request *tmp_rq = q->orig_flush_rq->next_batched_flush;
+		struct request *tmp_rq2;
+		sector_t error_sector = q->orig_flush_rq->bio->bi_sector;
+
 		/* complete this flush request */
 		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
+		while (tmp_rq) {
+			tmp_rq2 = tmp_rq->next_batched_flush;
+			tmp_rq->bio->bi_sector = error_sector;
+			__blk_end_request_all(tmp_rq, q->flush_err);
+			tmp_rq = tmp_rq2;
+		}
 		q->orig_flush_rq = NULL;
 		q->flush_seq = 0;
 
@@ -164,6 +174,20 @@ struct request *blk_do_flush(struct requ
 	 * processing.
 	 */
 	if (q->flush_seq) {
+		struct request *tmp_rq;
+
+		if ((rq->cmd_flags & REQ_FUA) || blk_rq_sectors(rq))
+			goto add_list;
+		list_for_each_entry(tmp_rq, &q->pending_flushes, queuelist) {
+			if (tmp_rq->cmd_flags & REQ_FLUSH) {
+				rq->next_batched_flush =
+					tmp_rq->next_batched_flush;
+				tmp_rq->next_batched_flush = rq;
+				list_del_init(&rq->queuelist);
+				return NULL;
+			}
+		}
+add_list:
 		list_move_tail(&rq->queuelist, &q->pending_flushes);
 		return NULL;
 	}
Index: linux/include/linux/blkdev.h
===================================================================
--- linux.orig/include/linux/blkdev.h	2011-01-13 21:02:24.000000000 +0800
+++ linux/include/linux/blkdev.h	2011-01-13 21:43:03.000000000 +0800
@@ -163,6 +163,8 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+	struct request *next_batched_flush;
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
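To make the batching in the patch above concrete, here is a standalone
sketch of the next_batched_flush list manipulation. It is illustrative
only: struct req, batch_flush, and complete_batch are simplified stand-ins
for struct request and the blk-flush code, not kernel API.

#include <stdio.h>

/* Simplified stand-in for struct request; all other fields omitted. */
struct req {
	int id;
	struct req *next_batched_flush;
};

/*
 * Splice @rq behind @pending, mirroring the insertion in blk_do_flush()
 * above: the new flush inherits the pending flush's existing batch and
 * becomes its first member.
 */
static void batch_flush(struct req *pending, struct req *rq)
{
	rq->next_batched_flush = pending->next_batched_flush;
	pending->next_batched_flush = rq;
}

/*
 * Walk and complete the whole batch, mirroring the completion loop added
 * to the flush-completion path above.
 */
static void complete_batch(struct req *head)
{
	struct req *rq = head->next_batched_flush, *next;

	printf("completing flush %d\n", head->id);
	while (rq) {
		next = rq->next_batched_flush;
		printf("completing batched flush %d\n", rq->id);
		rq = next;
	}
}

int main(void)
{
	struct req a = { 1, NULL }, b = { 2, NULL }, c = { 3, NULL };

	batch_flush(&a, &b);	/* flush 2 arrives while 1 is pending */
	batch_flush(&a, &c);	/* flush 3 likewise */
	complete_batch(&a);	/* one dispatched flush completes all three */
	return 0;
}

Each later pure flush is spliced behind the pending flush it merges with,
so a single dispatched flush completes the whole chain with one walk.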