Improve buffered streaming write ordering

Message ID: 1223565080.14090.28.camel@think.oraclecorp.com
State: Superseded, archived

Commit Message

Chris Mason Oct. 9, 2008, 3:11 p.m. UTC
On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > For a 4.5GB streaming buffered write, this printk inside
> > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > 
> > 
> > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > callbacks with lots of unallocated buffer_heads (blocks).
> 
> Quite frankly, a simple streaming buffered write should *never*
> trigger writeback from the LRU in memory reclaim. That indicates
> that some feedback loop has broken down and we are not cleaning
> pages fast enough or perhaps in the correct order. Page reclaim in
> this case should be reclaiming clean pages (those that have already
> been written back), not writing back random dirty pages.

Here are some go-faster stripes for XFS buffered writeback.  This
patch has a lot of debatable features, but the idea is to show
which knobs are slowing us down today.

The first change is to avoid calling balance_dirty_pages_ratelimited on
every page.  When we know we're doing a largeish write, it makes more
sense to balance things less often.  This might just mean our
ratelimit_pages magic value is too small.
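
For reference, the per-page call is just a one-page wrapper around the
batched variant (include/linux/writeback.h), so the patch below only
changes how often we pay for a balance, not what the balance does:

	static inline void
	balance_dirty_pages_ratelimited(struct address_space *mapping)
	{
		balance_dirty_pages_ratelimited_nr(mapping, 1);
	}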

The second change makes xfs bump wbc->nr_to_write (suggested by
Christoph), which probably makes delalloc go in bigger chunks.
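
The mechanism, as I read mm/page-writeback.c: write_cache_pages()
decrements nr_to_write once per page and ends the pass when it hits
zero, so scaling it up lets one pass push a bigger delalloc extent
(abridged):

	ret = (*writepage)(page, wbc, data);
	if (ret || (--(wbc->nr_to_write) <= 0))
		done = 1;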

On unpatched kernels, XFS does streaming writes to my 4 drive array at
around 205MB/s.  With the patch below, I come in at 326MB/s.  O_DIRECT
runs at 330MB/s, so that's pretty good.

With just the nr_to_write change, I get around 315MB/s.

With just the balance_dirty_pages_ratelimited_nr change, I get around 240MB/s.

-chris





Comments

Dave Chinner Oct. 10, 2008, 5:13 a.m. UTC | #1
On Thu, Oct 09, 2008 at 11:11:20AM -0400, Chris Mason wrote:
> On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > For a 4.5GB streaming buffered write, this printk inside
> > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > > 
> > > 
> > > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > > callbacks with lots of unallocated buffer_heads (blocks).
> > 
> > Quite frankly, a simple streaming buffered write should *never*
> > trigger writeback from the LRU in memory reclaim. That indicates
> > that some feedback loop has broken down and we are not cleaning
> > pages fast enough or perhaps in the correct order. Page reclaim in
> > this case should be reclaiming clean pages (those that have already
> > been written back), not writing back random dirty pages.
> 
> Here are some go-faster stripes for XFS buffered writeback.  This
> patch has a lot of debatable features, but the idea is to show
> which knobs are slowing us down today.
> 
> The first change is to avoid calling balance_dirty_pages_ratelimited on
> every page.  When we know we're doing a largeish write, it makes more
> sense to balance things less often.  This might just mean our
> ratelimit_pages magic value is too small.

Ok, so how about doing something like this to reduce the
number of balances on large writes, while still causing at least one
balance call for every write that occurs:

	int	nr = 0;
	.....
	while (...) {
		....
		if (!(nr % 256)) {
			/* do balance */
		}
		nr++;
		....
	}

That way you get a balance on the first page on every write,
but then hold off balancing on that write again for some
number of pages.
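
In userspace toy form (hypothetical names, just to show the cadence),
the balance calls land on the first page and then every 256th page:

	#include <stdio.h>

	/* stand-in for the real balance_dirty_pages_ratelimited_nr() */
	static void balance(unsigned long page)
	{
		printf("balance at page %lu\n", page);
	}

	int main(void)
	{
		unsigned long nr;

		/* a 1000 page write balances at pages 0, 256, 512 and 768 */
		for (nr = 0; nr < 1000; nr++) {
			if (!(nr % 256))
				balance(nr);
		}
		return 0;
	}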

> The second change makes xfs bump wbc->nr_to_write (suggested by
> Christoph), which probably makes delalloc go in bigger chunks.

Hmmmm.  Reasonable theory. We used to do gigantic delalloc extents -
we paid no attention to congestion and could allocate and write
several GB at a time. Latency was an issue, though, so it got
changed to be bound by nr_to_write.

I guess we need to be issuing larger allocations. Can you remove
your patches and see what effect the allocsize mount option has
on throughput? It changes the default delalloc EOF preallocation
size, and hence how many allocations we end up doing. The default
is 64k and it can go as high as 1GB, IIRC.
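
Something like the following, with the device and mount point replaced
by whatever your test array uses (allocsize takes power-of-two sizes):

	# mount -t xfs -o allocsize=1g /dev/md0 /mnt/scratch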

Cheers,

Dave.

Patch

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index a44d68e..c72bd54 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -944,6 +944,9 @@  xfs_page_state_convert(
 	int			trylock = 0;
 	int			all_bh = unmapped;
 
+
+	wbc->nr_to_write *= 4;
+
 	if (startio) {
 		if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
 			trylock |= BMAPI_TRYLOCK;
diff --git a/mm/filemap.c b/mm/filemap.c
index 876bc59..b6c26e3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2389,6 +2389,7 @@  static ssize_t generic_perform_write(struct file *file,
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
+	unsigned long nr = 0;
 
 	/*
 	 * Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2460,11 +2461,17 @@  again:
 		}
 		pos += copied;
 		written += copied;
-
-		balance_dirty_pages_ratelimited(mapping);
+		nr++;
+		if (nr > 256) {
+			balance_dirty_pages_ratelimited_nr(mapping, nr);
+			nr = 0;
+		}
 
 	} while (iov_iter_count(i));
 
+	if (nr)
+		balance_dirty_pages_ratelimited_nr(mapping, nr);
+
 	return written ? written : status;
 }