diff mbox

ext4: don't load the block bitmap for block groups which have no space

Message ID 1344622885-14982-1-git-send-email-tytso@mit.edu
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o Aug. 10, 2012, 6:21 p.m. UTC
Add a short circuit check to ext4_mb_group_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available.  (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/mballoc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Eric Sandeen Aug. 13, 2012, 4:02 p.m. UTC | #1
On 8/10/12 1:21 PM, Theodore Ts'o wrote:
> Add a short circuit check to ext4_mb_group_group() so that we don't
> bother to load the block bitmap for a block group which does not have
> any space available.  (Or which does not have enough space until we
> are in desperation mode, i.e., when cr == 3.)
> 
> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
> Reported-by: mirek@me.com
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Looks ok to me; I think this just further optimizes what was done
in

8a57d9d61a6e361c7bb159dda797672c1df1a691
ext4: check for a good block group before loading buddy pages

correct?

-Eric

> ---
>  fs/ext4/mballoc.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8eae947..3a57975 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1862,6 +1862,12 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  
>  	BUG_ON(cr < 0 || cr >= 4);
>  
> +	free = grp->bb_free;
> +	if (free == 0)
> +		return 0;
> +	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
> +		return 0;
> +
>  	/* We only do this if the grp has never been initialized */
>  	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
>  		int ret = ext4_mb_init_group(ac->ac_sb, group);
> @@ -1869,10 +1875,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  			return 0;
>  	}
>  
> -	free = grp->bb_free;
>  	fragments = grp->bb_fragments;
> -	if (free == 0)
> -		return 0;
>  	if (fragments == 0)
>  		return 0;
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Aug. 13, 2012, 6:49 p.m. UTC | #2
On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
> 
> Looks ok to me; I think this just further optimizes what was done
> in
> 
> 8a57d9d61a6e361c7bb159dda797672c1df1a691
> ext4: check for a good block group before loading buddy pages
> 
> correct?

Yes, that's right; it's a further optimization.

I can think of an additional optimization where if we are reading the
block bitmap for block group N, and the block bitmap for block group
N+1 hasn't been read before (so we don't have buddy bitmap stats), and
the block bitmap for bg N+1 is adjacent for bg N, we should read both
at the same time.  (And this could be generalized for N+2, N+3, etc.)

I'm not entirely sure whether it's worth the effort, but I suspect for
very full file systems, it might be very well be.  This is a more
general case of the problem where most people only benchmark mostly
empty file systems, and my experience has been that above 70-80%
utilization, our performance starts to fall off.  And while disk space
is cheap, it's not _that_ cheap, and there are always customers who
insist on using file systems up to a utilization of 99%, and expect
the same performance as when the file system was freshly formated.  :-(

	    	 	     	     	 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Aug. 13, 2012, 6:51 p.m. UTC | #3
On 8/13/12 1:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>>
>> Looks ok to me; I think this just further optimizes what was done
>> in
>>
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>>
>> correct?
> 
> Yes, that's right; it's a further optimization.
> 
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent for bg N, we should read both
> at the same time.  (And this could be generalized for N+2, N+3, etc.)
> 
> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might be very well be.  This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off.  And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formated.  :-(

I did some tests w/ very large filesystems, fallocating 1T at a time until
full.  ext4 tended to fall down pretty badly towards the end.  Anything that
can reduce the time it takes to find free blocks as a very large filesystem
fills would probably be useful....

-eric

> 	    	 	     	     	 - Ted
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Aug. 13, 2012, 11:20 p.m. UTC | #4
On 2012-08-13, at 12:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>> 
>> Looks ok to me; I think this just further optimizes what was done
>> in
>> 
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>> 
>> correct?
> 
> Yes, that's right; it's a further optimization.
> 
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent for bg N, we should read both
> at the same time.  (And this could be generalized for N+2, N+3, etc.)

I was thinking the same thing.  Seems a shame that we have contiguous
bitmaps with flex_bg and don't load them all at once.  However, I ended
up deciding not to pursue the issue, because I suspect the block device
will already be doing some physical block/track readahead.  I guess it
couldn't hurt to submit explicit readahead requests, so long as we don't
wait for anything but the first bitmap to actually be loaded.

> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might be very well be.  This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off.  And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formated.  :-(

In my experience, there are so many factors that affect the performance
of a full filesystem that nothing can be done about it.

We've discussed changing statfs() reporting for Lustre to exclude the
"reserved" amount from the device size, so that people don't complain
"why can't I use the last 5% of the device" and/or "tune2fs -m 0" to
remove the reserved space, then complain when performance permanently
dives after hitting 100% full due to bad fragmentation of the last 5%
of files written that will not be deleted for many months.  Even with
SSDs, the fragmentation is going to be seen, due to erase block
fragmentation and more IO submission overhead for small chunks.

The other significant factor is the inner/outer track performance can
vary by a factor of 2x on some drives.  The ext4 allocator biases toward
outer tracks, which is good, but performance is down on the inner tracks
regardless of whether there is fragmentation or not.

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8eae947..3a57975 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1862,6 +1862,12 @@  static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	BUG_ON(cr < 0 || cr >= 4);
 
+	free = grp->bb_free;
+	if (free == 0)
+		return 0;
+	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+		return 0;
+
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
 		int ret = ext4_mb_init_group(ac->ac_sb, group);
@@ -1869,10 +1875,7 @@  static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 			return 0;
 	}
 
-	free = grp->bb_free;
 	fragments = grp->bb_fragments;
-	if (free == 0)
-		return 0;
 	if (fragments == 0)
 		return 0;