[(RESEND)] don't scan/accumulate more pages than mballoc will allocate

Message ID 4BB0C761.50204@redhat.com
State Superseded, archived

Commit Message

Eric Sandeen March 29, 2010, 3:29 p.m. UTC
(resend, email sent Friday seems lost)

A bug was reported on RHEL5: after a 10G dd on a 12G box, the
subsequent sync was very, very slow.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amount; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, when I talked with Aneesh, he realized that fsync upstream still
gets here with a very large nr_to_write, so we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.
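
To make the cost concrete, here is a toy model in plain C (not kernel
code) of the behaviour described above. It assumes each pass writes at
most alloc_max pages and that, without the cap, every pass visits every
page that is still dirty. count_page_visits() and its parameters are
hypothetical names made up for this sketch.

```c
#include <assert.h>

/*
 * Toy model, not kernel code: count how many times the per-page
 * callback runs while writing out total_pages dirty pages.
 *
 * Assumptions (labelled, not taken from ext4):
 *   - each pass writes at most alloc_max pages;
 *   - uncapped, a pass visits every page that is still dirty;
 *   - capped, a pass stops scanning after alloc_max pages.
 */
static unsigned long long count_page_visits(unsigned long total_pages,
					    unsigned long alloc_max,
					    int capped)
{
	unsigned long long visits = 0;
	unsigned long remaining = total_pages;

	assert(alloc_max > 0);	/* otherwise the loop never makes progress */

	while (remaining > 0) {
		unsigned long scanned = remaining;
		unsigned long written;

		/* With the cap, stop accumulating once one allocation is full. */
		if (capped && scanned > alloc_max)
			scanned = alloc_max;
		visits += scanned;

		/* Only one allocation's worth of pages is written per pass. */
		written = remaining < alloc_max ? remaining : alloc_max;
		remaining -= written;
	}
	return visits;
}
```

Under these assumptions, a 10G file with 4k pages (2,621,440 pages)
and alloc_max = 2048 costs 1,679,032,320 page visits uncapped versus
2,621,440 capped, roughly 640 times fewer.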

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.

It'd be better to not have a magic number of 2048 in here, so I'll
look for a cleaner way to get this info out of mballoc; I still need
to look at what Aneesh has in the patch queue, that might help.
This is something we could probably put in for now, though; the 2048
is already enshrined in a comment in inode.c, at least.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o April 5, 2010, 1:11 p.m. UTC | #1
On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
> This patch makes mpage_add_bh_to_extent stop the loop after we've
> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
> causes the write_cache_pages loop to break.
> 
> Repeating the test with a dirty_ratio of 80 (to leave something for
> fsync to do), I don't see huge IO performance gains, but the reduction
> in cpu usage is striking: 80% usage with stock, and 2% with the
> below patch.  Instrumenting the loop in write_cache_pages clearly
> shows that we are wasting time here.
> 
> It'd be better to not have a magic number of 2048 in here, so I'll
> look for a cleaner way to get this info out of mballoc; I still need
> to look at what Aneesh has in the patch queue, that might help.
> This is something we could probably put in for now, though; the 2048
> is already enshrined in a comment in inode.c, at least.

I wonder if a better way of fixing this is to change
mpage_da_map_pages() to call ext4_get_blocks() multiple times.  This
should be a lot easier after we integrate mpage_da_submit_io() into
mpage_da_map_pages().  That way we can be way more efficient; in a loop,
we accumulate the pages, call ext4_get_blocks(), then submit the IO
(as a single block I/O submission, instead of 4k at a time through
ext4_writepages()), and then call ext4_get_blocks() again, etc.
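
A rough sketch of that loop shape in plain C follows. It is a toy
model, not ext4 code: toy_get_blocks() is a hypothetical stand-in for
ext4_get_blocks(), struct wb_stats and toy_writeback() are invented for
illustration, one block per page is assumed, and the alloc_max limit is
passed in rather than read from mballoc.

```c
#include <assert.h>

/* Counters for the toy model; all names here are hypothetical. */
struct wb_stats {
	unsigned long page_visits;	/* pages examined while accumulating */
	unsigned long alloc_calls;	/* calls to the stand-in allocator */
	unsigned long io_submits;	/* batched I/O submissions */
};

/* Stand-in for ext4_get_blocks(): maps at most alloc_max blocks per call. */
static unsigned long toy_get_blocks(unsigned long want,
				    unsigned long alloc_max)
{
	return want < alloc_max ? want : alloc_max;
}

/*
 * Accumulate the dirty pages once, then allocate and submit in a loop
 * until the whole accumulated extent has been written, without going
 * back to rescan pages between allocations.
 */
static struct wb_stats toy_writeback(unsigned long dirty_pages,
				     unsigned long alloc_max)
{
	struct wb_stats st = { 0, 0, 0 };
	unsigned long remaining = dirty_pages;

	assert(alloc_max > 0);	/* otherwise no call would map anything */

	/* Single accumulation pass: every page is visited exactly once. */
	st.page_visits = dirty_pages;

	while (remaining > 0) {
		/* Ask the allocator for as much of the extent as it will map. */
		unsigned long mapped = toy_get_blocks(remaining, alloc_max);

		st.alloc_calls++;
		/* Submit everything just mapped as one I/O. */
		st.io_submits++;
		remaining -= mapped;
	}
	return st;
}
```

In this model, 5000 dirty pages with alloc_max = 2048 take three
allocator calls and three I/O submissions, and each page is visited
once.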

I'm willing to include this patch as an interim stopgap, but
eventually, I think we need to refactor and reorganize
mpage_da_map_pages() and mpage_da_submit_io(), and let them call
mballoc (via ext4_get_blocks) multiple times in a loop.

Thoughts, suggestions?

					- Ted

Eric Sandeen April 5, 2010, 2:42 p.m. UTC | #2
tytso@mit.edu wrote:
> On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
>> This patch makes mpage_add_bh_to_extent stop the loop after we've
>> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
>> causes the write_cache_pages loop to break.
>>
>> Repeating the test with a dirty_ratio of 80 (to leave something for
>> fsync to do), I don't see huge IO performance gains, but the reduction
>> in cpu usage is striking: 80% usage with stock, and 2% with the
>> below patch.  Instrumenting the loop in write_cache_pages clearly
>> shows that we are wasting time here.
>>
>> It'd be better to not have a magic number of 2048 in here, so I'll
>> look for a cleaner way to get this info out of mballoc; I still need
>> to look at what Aneesh has in the patch queue, that might help.
>> This is something we could probably put in for now, though; the 2048
>> is already enshrined in a comment in inode.c, at least.
> 
> I wonder if a better way of fixing this is to change
> mpage_da_map_pages() to call ext4_get_blocks() multiple times.  This

That sounds reasonable, I'll look into writing something up and testing
it a bit.

Up to you whether the initial patch goes in, I know it's kind of
stopgap/hacky.

thanks,
-Eric

> should be a lot easier after we integrate mpage_da_submit_io() into
> mpage_da_map_pages().  That way we can be way more efficient; in a loop,
> we accumulate the pages, call ext4_get_blocks(), then submit the IO
> (as a single block I/O submission, instead of 4k at a time through
> ext4_writepages()), and then call ext4_get_blocks() again, etc.



> I'm willing to include this patch as an interim stopgap, but
> eventually, I think we need to refactor and reorganize
> mpage_da_map_pages() and mpage_da_submit_io(), and let them call
> mballoc (via ext4_get_blocks) multiple times in a loop.
> 
> Thoughts, suggestions?
> 
> 					- Ted

Patch

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -2318,6 +2318,10 @@  static void mpage_add_bh_to_extent(struc
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/* Don't go larger than mballoc is willing to allocate */
+	if (nrblocks >= 2048)
+		goto flush_it;
+
 	/* check if thereserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {