diff mbox

ext4: fix 50% disk write performance regression

Message ID 4C7C7A72.3020001@redhat.com
State Superseded, archived
Headers show

Commit Message

Eric Sandeen Aug. 31, 2010, 3:43 a.m. UTC
Can you give this a shot?

The first hunk is, I think, the biggest problem.  Even if
we get the max number of pages we need, we keep scanning forward
until "done" without doing any more actual, useful work.

The 2nd hunk is an oddity, some places assign nr_to_write
to LONG_MAX, and we get here and multiply -that- by 8... giving
us "-8" for nr_to_write, that can't help things when we
do later comparisons on that number...

I also see us asking to find pages starting at "idx" and
the first dirty page we find is well ahead of that,
I'm not sure if that's indicative of a problem or not.

Anyway, want to give this a shot, in place of the patch you sent,
and see how it fares compared to stock and/or with your patch?

It's build-and-sanity tested but not really performance tested here.

Thanks,
-Eric



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Sandeen Aug. 31, 2010, 4:26 a.m. UTC | #1
Eric Sandeen wrote:
> Can you give this a shot?
> 
> The first hunk is, I think, the biggest problem.  Even if
> we get the max number of pages we need, we keep scanning forward
> until "done" without doing any more actual, useful work.
> 
> The 2nd hunk is an oddity, some places assign nr_to_write
> to LONG_MAX, and we get here and multiply -that- by 8... giving
> us "-8" for nr_to_write, that can't help things when we
> do later comparisons on that number...
> 
> I also see us asking to find pages starting at "idx" and
> the first dirty page we find is well ahead of that,
> I'm not sure if that's indicative of a problem or not.
> 
> Anyway, want to give this a shot, in place of the patch you sent,
> and see how it fares compared to stock and/or with your patch?
> 
> It's build-and-sanity tested but not really performance tested here.
> 
> Thanks,
> -Eric
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4b8debe..33c2167 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>  				break;
>  			idx++;
>  			num++;
> -			if (num >= max_pages)
> -				break;
> +			if (num >= max_pages) {
> +				pagevec_release(&pvec);
> +				return num;
> +			}
>  		}
>  		pagevec_release(&pvec);
>  	}
> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	 * sbi->max_writeback_mb_bump whichever is smaller.
>  	 */
>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> -	if (!range_cyclic && range_whole)
> +	if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
>  		desired_nr_to_write = wbc->nr_to_write * 8;

sorry no, this isn't right, we should just leave it at nr_to_write for the
LONG_MAX case, not go counting pages.  And something odd is going on where we
are looking for dirty pages starting at an index we've already written out.

Maybe:

        if (!range_cyclic && range_whole) {
                if (wbc->nr_to_write != LONG_MAX)
                        desired_nr_to_write = wbc->nr_to_write * 8;
                else
                        desired_nr_to_write = wbc->nr_to_write;
        } 

I'll have to look at this more when I'm not quite so sleepy, sorry.  :)

-Eric

>  	else
>  		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bill Fink Aug. 31, 2010, 4:53 a.m. UTC | #2
On Mon, 30 Aug 2010, Eric Sandeen wrote:

> Can you give this a shot?
> 
> The first hunk is, I think, the biggest problem.  Even if
> we get the max number of pages we need, we keep scanning forward
> until "done" without doing any more actual, useful work.
> 
> The 2nd hunk is an oddity, some places assign nr_to_write
> to LONG_MAX, and we get here and multiply -that- by 8... giving
> us "-8" for nr_to_write, that can't help things when we
> do later comparisons on that number...
> 
> I also see us asking to find pages starting at "idx" and
> the first dirty page we find is well ahead of that,
> I'm not sure if that's indicative of a problem or not.
> 
> Anyway, want to give this a shot, in place of the patch you sent,
> and see how it fares compared to stock and/or with your patch?
> 
> It's build-and-sanity tested but not really performance tested here.
> 
> Thanks,
> -Eric

Great!  It looks like that does the trick.

2.6.35 + your patch:

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s

That's the same performance as with my patch, and pretty darn
close to the original 2.6.31 performance.

						-Thanks a bunch

						-Bill



> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4b8debe..33c2167 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>  				break;
>  			idx++;
>  			num++;
> -			if (num >= max_pages)
> -				break;
> +			if (num >= max_pages) {
> +				pagevec_release(&pvec);
> +				return num;
> +			}
>  		}
>  		pagevec_release(&pvec);
>  	}
> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	 * sbi->max_writeback_mb_bump whichever is smaller.
>  	 */
>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> -	if (!range_cyclic && range_whole)
> +	if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
>  		desired_nr_to_write = wbc->nr_to_write * 8;
>  	else
>  		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Aug. 31, 2010, 5:05 a.m. UTC | #3
Bill Fink wrote:
> On Mon, 30 Aug 2010, Eric Sandeen wrote:
> 
>> Can you give this a shot?
>>
>> The first hunk is, I think, the biggest problem.  Even if
>> we get the max number of pages we need, we keep scanning forward
>> until "done" without doing any more actual, useful work.
>>
>> The 2nd hunk is an oddity, some places assign nr_to_write
>> to LONG_MAX, and we get here and multiply -that- by 8... giving
>> us "-8" for nr_to_write, that can't help things when we
>> do later comparisons on that number...
>>
>> I also see us asking to find pages starting at "idx" and
>> the first dirty page we find is well ahead of that,
>> I'm not sure if that's indicative of a problem or not.
>>
>> Anyway, want to give this a shot, in place of the patch you sent,
>> and see how it fares compared to stock and/or with your patch?
>>
>> It's build-and-sanity tested but not really performance tested here.
>>
>> Thanks,
>> -Eric
> 
> Great!  It looks like that does the trick.
> 
> 2.6.35 + your patch:
> 
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s
> 
> That's the same performance as with my patch, and pretty darn
> close to the original 2.6.31 performance.

hah, that's good esp. considering my followup email that found
what I think is a problem with my patch.  ;)

What happens if you change:

	if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
		desired_nr_to_write = wbc->nr_to_write * 8;
  	else
  		desired_nr_to_write = ext4_num_dirty_pages(inode, index,

to:

        if (!range_cyclic && range_whole) {
                if (wbc->nr_to_write != LONG_MAX)
                        desired_nr_to_write = wbc->nr_to_write * 8;
                else
                        desired_nr_to_write = wbc->nr_to_write;
        } else
  		desired_nr_to_write = ext4_num_dirty_pages(inode, index,

and see how that fares?  I think that makes a little more sense, if we
got there with LONG_MAX that means "write everything" and there's no need
to bump it up or to go counting pages.  It may not make any real difference.

But I'm seeing really weird behavior in writeback, it starts out nicely
writing 32768 pages at a time, and then goes all wonky, revisiting pages
it's already done and doing IO in little chunks.   This is going to take
some staring I think.

-Eric



> 						-Thanks a bunch
> 
> 						-Bill
> 
> 
> 
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 4b8debe..33c2167 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>>  				break;
>>  			idx++;
>>  			num++;
>> -			if (num >= max_pages)
>> -				break;
>> +			if (num >= max_pages) {
>> +				pagevec_release(&pvec);
>> +				return num;
>> +			}
>>  		}
>>  		pagevec_release(&pvec);
>>  	}
>> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>>  	 * sbi->max_writeback_mb_bump whichever is smaller.
>>  	 */
>>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
:

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bill Fink Aug. 31, 2010, 5:31 a.m. UTC | #4
On Tue, 31 Aug 2010, Eric Sandeen wrote:

> Bill Fink wrote:
> > On Mon, 30 Aug 2010, Eric Sandeen wrote:
> > 
> >> Can you give this a shot?
> >>
> >> The first hunk is, I think, the biggest problem.  Even if
> >> we get the max number of pages we need, we keep scanning forward
> >> until "done" without doing any more actual, useful work.
> >>
> >> The 2nd hunk is an oddity, some places assign nr_to_write
> >> to LONG_MAX, and we get here and multiply -that- by 8... giving
> >> us "-8" for nr_to_write, that can't help things when we
> >> do later comparisons on that number...
> >>
> >> I also see us asking to find pages starting at "idx" and
> >> the first dirty page we find is well ahead of that,
> >> I'm not sure if that's indicative of a problem or not.
> >>
> >> Anyway, want to give this a shot, in place of the patch you sent,
> >> and see how it fares compared to stock and/or with your patch?
> >>
> >> It's build-and-sanity tested but not really performance tested here.
> >>
> >> Thanks,
> >> -Eric
> > 
> > Great!  It looks like that does the trick.
> > 
> > 2.6.35 + your patch:
> > 
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s
> > 
> > That's the same performance as with my patch, and pretty darn
> > close to the original 2.6.31 performance.
> 
> hah, that's good esp. considering my followup email that found
> what I think is a problem with my patch.  ;)
> 
> What happens if you change:
> 
> 	if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
> 		desired_nr_to_write = wbc->nr_to_write * 8;
>   	else
>   		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> 
> to:
> 
>         if (!range_cyclic && range_whole) {
>                 if (wbc->nr_to_write != LONG_MAX)
>                         desired_nr_to_write = wbc->nr_to_write * 8;
>                 else
>                         desired_nr_to_write = wbc->nr_to_write;
>         } else
>   		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> 
> and see how that fares?  I think that makes a little more sense, if we
> got there with LONG_MAX that means "write everything" and there's no need
> to bump it up or to go counting pages.  It may not make any real difference.

That's also fine.

	-Bill



> But I'm seeing really weird behavior in writeback, it starts out nicely
> writing 32768 pages at a time, and then goes all wonky, revisiting pages
> it's already done and doing IO in little chunks.   This is going to take
> some staring I think.
> 
> -Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Taylor Sept. 9, 2010, 12:23 a.m. UTC | #5
Just wondering if this patch is adequate or there's more to come.

I want to put a fix into our 2.6.32 kernel.

Thanks.

> -----Original Message-----
> From: linux-ext4-owner@vger.kernel.org 
> [mailto:linux-ext4-owner@vger.kernel.org] On Behalf Of Eric Sandeen
> Sent: Monday, August 30, 2010 10:06 PM
> To: Bill Fink
> Cc: tytso@mit.edu; adilger@sun.com; 
> linux-ext4@vger.kernel.org; bill.fink@nasa.gov
> Subject: Re: [PATCH] ext4: fix 50% disk write performance regression
> 
> Bill Fink wrote:
> > On Mon, 30 Aug 2010, Eric Sandeen wrote:
> > 
> >> Can you give this a shot?
> >>
> >> The first hunk is, I think, the biggest problem.  Even if
> >> we get the max number of pages we need, we keep scanning forward
> >> until "done" without doing any more actual, useful work.
> >>
> >> The 2nd hunk is an oddity, some places assign nr_to_write
> >> to LONG_MAX, and we get here and multiply -that- by 8... giving
> >> us "-8" for nr_to_write, that can't help things when we
> >> do later comparisons on that number...
> >>
> >> I also see us asking to find pages starting at "idx" and
> >> the first dirty page we find is well ahead of that,
> >> I'm not sure if that's indicative of a problem or not.
> >>
> >> Anyway, want to give this a shot, in place of the patch you sent,
> >> and see how it fares compared to stock and/or with your patch?
> >>
> >> It's build-and-sanity tested but not really performance 
> tested here.
> >>
> >> Thanks,
> >> -Eric
> > 
> > Great!  It looks like that does the trick.
> > 
> > 2.6.35 + your patch:
> > 
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s
> > 
> > That's the same performance as with my patch, and pretty darn
> > close to the original 2.6.31 performance.
> 
> hah, that's good esp. considering my followup email that found
> what I think is a problem with my patch.  ;)
> 
> What happens if you change:
> 
> 	if (!range_cyclic && range_whole && wbc->nr_to_write != 
> LONG_MAX)
> 		desired_nr_to_write = wbc->nr_to_write * 8;
>   	else
>   		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> 
> to:
> 
>         if (!range_cyclic && range_whole) {
>                 if (wbc->nr_to_write != LONG_MAX)
>                         desired_nr_to_write = wbc->nr_to_write * 8;
>                 else
>                         desired_nr_to_write = wbc->nr_to_write;
>         } else
>   		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> 
> and see how that fares?  I think that makes a little more sense, if we
> got there with LONG_MAX that means "write everything" and 
> there's no need
> to bump it up or to go counting pages.  It may not make any 
> real difference.
> 
> But I'm seeing really weird behavior in writeback, it starts 
> out nicely
> writing 32768 pages at a time, and then goes all wonky, 
> revisiting pages
> it's already done and doing IO in little chunks.   This is 
> going to take
> some staring I think.
> 
> -Eric
> 
> 
> 
> > 						-Thanks a bunch
> > 
> > 						-Bill
> > 
> > 
> > 
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 4b8debe..33c2167 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -1207,8 +1207,10 @@ static pgoff_t 
> ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> >>  				break;
> >>  			idx++;
> >>  			num++;
> >> -			if (num >= max_pages)
> >> -				break;
> >> +			if (num >= max_pages) {
> >> +				pagevec_release(&pvec);
> >> +				return num;
> >> +			}
> >>  		}
> >>  		pagevec_release(&pvec);
> >>  	}
> >> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct 
> address_space *mapping,
> >>  	 * sbi->max_writeback_mb_bump whichever is smaller.
> >>  	 */
> >>  	max_pages = sbi->s_max_writeback_mb_bump << (20 - 
> PAGE_CACHE_SHIFT);
> :
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4b8debe..33c2167 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1207,8 +1207,10 @@  static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 				break;
 			idx++;
 			num++;
-			if (num >= max_pages)
-				break;
+			if (num >= max_pages) {
+				pagevec_release(&pvec);
+				return num;
+			}
 		}
 		pagevec_release(&pvec);
 	}
@@ -3002,7 +3004,7 @@  static int ext4_da_writepages(struct address_space *mapping,
 	 * sbi->max_writeback_mb_bump whichever is smaller.
 	 */
 	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
-	if (!range_cyclic && range_whole)
+	if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
 		desired_nr_to_write = wbc->nr_to_write * 8;
 	else
 		desired_nr_to_write = ext4_num_dirty_pages(inode, index,