
Question on block group allocation

Message ID 20090423190817.GN3209@webber.adilger.int
State Accepted, archived

Commit Message

Andreas Dilger April 23, 2009, 7:08 p.m. UTC
On Apr 23, 2009  09:41 -0700, Curt Wohlgemuth wrote:
> I'm seeing a performance problem on ext4 vs ext2, and in trying to
> narrow it down, I've got a question about block allocation in ext4
> that I'm having trouble figuring out.
> 
> Using dd, I created (in this order) two 4GB files and a 10GB file in
> the mount directory.
> 
> The extent blocks are reasonably close together for the two 4GB files,
> but the extents for the 10GB file show a huge gap, which seems to hurt
> the random read performance pretty substantially.  Here's the output
> from debugfs:
> 
> BLOCKS:
> (IND):8396832, (0-106495):8282112-8388607,
> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
> (888832-1116159):23889920-24117247, (1116160-1277951):71665664-71827455,
> (1277952-1767423):78678016-79167487,
> (1767424-2125823):102402048-102760447,
> (2125824-2148351):102768672-102791199,
> (2148352-2621439):102793216-103266303
> TOTAL: 2621441
> 
> Note the gap between blocks 79167487 and 102402048.

Well, there are other even larger gaps for other chunks of the file.
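(For example, the jump from 24117247 to 71665664 is about 47.5 million
blocks, which works out to roughly 180 GiB of on-disk distance, assuming
the default 4kB block size.)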

> I was lucky enough to capture the mb_history from this 10GB create:
> 
> 29109 14       735/30720/32758@1114112 735/30720/2048@1114112
> 735/30720/2048@1114112  1     0     0  1568  M     0     0
> 29109 14       736/0/32758@1116160     736/0/2048@1116160
> 2187/2048/2048@1116160  1     1     0  1568        0     0
> 29109 14       2187/4096/32758@1118208 2187/4096/2048@1118208
> 2187/4096/2048@1118208  1     0     0  1568  M     2048  4096
> 
> I've been staring at ext4_mb_regular_allocator() trying to understand
> why an allocation with a goal block of 736 ends up with a best found
> extent group of 2187, and I'm stuck -- at least without a lot of
> printk messages.  It seems to me that we just cycle through the block
> groups starting with the goal group until we find a group that fits.
> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
> 32768 free blocks.  So why we end up with a best fit group of 2187 is
> a mystery to me.

This is likely the "uninit_bg" feature causing the allocations to skip
groups which are marked BLOCK_UNINIT.  In some sense the benefit of
skipping the block bitmap read during e2fsck probably does not outweigh
the cost of the extra seeking during IO.  As the filesystem gets more
full, the BLOCK_UNINIT flags would be cleared anyway, so we might as
well just keep the early allocations contiguous.
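
For context, the group scan in ext4_mb_regular_allocator() is roughly
the following (a simplified, from-memory C sketch, not the exact kernel
code; locking, the cr == 0 entry conditions, and buddy-cache details
are omitted):

	/* Each criterion pass (cr) walks every group starting from the
	 * goal group; ext4_mb_good_group() decides whether a group is
	 * worth scanning at all.  With uninit_bg, the cr == 0 pass
	 * rejects any group whose descriptor has EXT4_BG_BLOCK_UNINIT
	 * set, which is how a goal of group 736 can end up out at
	 * group 2187. */
	for (cr = 0; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
		for (i = 0; i < ngroups; i++) {
			group = (goal_group + i) % ngroups;

			if (!ext4_mb_good_group(ac, group, cr))
				continue;	/* e.g. BLOCK_UNINIT at cr == 0 */

			ext4_mb_load_buddy(ac->ac_sb, group, &e4b);
			/* simple scan at cr == 0, stripe/complex scan otherwise */
			ext4_mb_release_desc(&e4b);

			if (ac->ac_status != AC_STATUS_CONTINUE)
				break;
		}
	}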

A simple change to verify this would be something like the following,
but it hasn't actually been tested.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Comments

Curt Wohlgemuth April 23, 2009, 10:02 p.m. UTC | #1
Hi Andreas:

On Thu, Apr 23, 2009 at 12:08 PM, Andreas Dilger <adilger@sun.com> wrote:
> On Apr 23, 2009  09:41 -0700, Curt Wohlgemuth wrote:
>> I'm seeing a performance problem on ext4 vs ext2, and in trying to
>> narrow it down, I've got a question about block allocation in ext4
>> that I'm having trouble figuring out.
>>
>> Using dd, I created (in this order) two 4GB files and a 10GB file in
>> the mount directory.
>>
>> The extent blocks are reasonably close together for the two 4GB files,
>> but the extents for the 10GB file show a huge gap, which seems to hurt
>> the random read performance pretty substantially.  Here's the output
>> from debugfs:
>>
>> BLOCKS:
>> (IND):8396832, (0-106495):8282112-8388607,
>> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
>> (888832-1116159):23889920-24117247, (1116160-1277951):71665664-71827455,
>> (1277952-1767423):78678016-79167487,
>> (1767424-2125823):102402048-102760447,
>> (2125824-2148351):102768672-102791199,
>> (2148352-2621439):102793216-103266303
>> TOTAL: 2621441
>>
>> Note the gap between blocks 79167487 and 102402048.
>
> Well, there are other even larger gaps for other chunks of the file.

Really?  Not that it's important, but I'm not seeing them...

>> I was lucky enough to capture the mb_history from this 10GB create:
>>
>> 29109 14       735/30720/32758@1114112 735/30720/2048@1114112
>> 735/30720/2048@1114112  1     0     0  1568  M     0     0
>> 29109 14       736/0/32758@1116160     736/0/2048@1116160
>> 2187/2048/2048@1116160  1     1     0  1568        0     0
>> 29109 14       2187/4096/32758@1118208 2187/4096/2048@1118208
>> 2187/4096/2048@1118208  1     0     0  1568  M     2048  4096
>>
>> I've been staring at ext4_mb_regular_allocator() trying to understand
>> why an allocation with a goal block of 736 ends up with a best found
>> extent group of 2187, and I'm stuck -- at least without a lot of
>> printk messages.  It seems to me that we just cycle through the block
>> groups starting with the goal group until we find a group that fits.
>> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
>> 32768 free blocks.  So why we end up with a best fit group of 2187 is
>> a mystery to me.
>
> This is likely the "uninit_bg" feature causing the allocations to skip
> groups which are marked BLOCK_UNINIT.  In some sense the benefit of
> skipping the block bitmap read during e2fsck probably does not outweigh
> the cost of the extra seeking during IO.  As the filesystem gets more
> full, the BLOCK_UNINIT flags would be cleared anyway, so we might as
> well just keep the early allocations contiguous.

Ah, thanks!  That's what I was missing.  Yes, I sort of skipped over
the "is this a good group?" question.

> A simple change to verify this would be something like the following,
> but it hasn't actually been tested.

Tell you what:  I'll try this out and see if it helps out my test case.

Thanks,
Curt

>
> --- ./fs/ext4/mballoc.c.uninit    2009-04-08 19:13:13.000000000 -0600
> +++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
> @@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
>        switch (cr) {
>        case 0:
>                BUG_ON(ac->ac_2order == 0);
> -               /* If this group is uninitialized, skip it initially */
> -               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
> -               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> -                       return 0;
>
>                bits = ac->ac_sb->s_blocksize_bits + 1;
>                for (i = ac->ac_2order; i <= bits; i++)
> @@ -2039,9 +2035,7 @@ repeat:
>                        ac->ac_groups_scanned++;
>                        desc = ext4_get_group_desc(sb, group, NULL);
> -                       if (cr == 0 || (desc->bg_flags &
> -                               cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
> -                               ac->ac_2order != 0))
> +                       if (cr == 0)
>                                ext4_mb_simple_scan_group(ac, &e4b);
>                        else if (cr == 1 &&
>                                        ac->ac_g_ex.fe_len == sbi->s_stripe)
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
Theodore Ts'o April 27, 2009, 2:14 a.m. UTC | #2
On Thu, Apr 23, 2009 at 03:02:05PM -0700, Curt Wohlgemuth wrote:
> > This is likely the "uninit_bg" feature causing the allocations to skip
> > groups which are marked BLOCK_UNINIT.  In some sense the benefit of
> > skipping the block bitmap read during e2fsck probably does not outweigh
> > the cost of the extra seeking during IO.  As the filesystem gets more
> > full, the BLOCK_UNINIT flags would be cleared anyway, so we might as
> > well just keep the early allocations contiguous.

Well, I tried out Andreas' patch, by doing an rsync copy from my SSD
root partition to a 5400 rpm laptop drive, and then ran e2fsck and
dumpe2fs.  The results were interesting:

              Before Patch                       After Patch
             Time in seconds                   Time in seconds
           Real /  User/  Sys   MB/s      Real /  User/  Sys    MB/s
Pass 1      8.52 / 2.21 / 0.46  20.43      8.84 / 4.97 / 1.11   19.68
Pass 2     21.16 / 1.02 / 1.86  11.30      6.54 / 1.77 / 1.78   36.39
Pass 3      0.01 / 0.00 / 0.00 139.00      0.01 / 0.01 / 0.00  128.90
Pass 4      0.16 / 0.15 / 0.00   0.00      0.17 / 0.17 / 0.00    0.00
Pass 5      2.52 / 1.99 / 0.09   0.79      2.31 / 1.78 / 0.06    0.86
Total      32.40 / 5.11 / 2.49  12.81     17.99 / 8.75 / 2.98   23.01

The surprise is in the gross inspection of the dumpe2fs results:

                            Before Patch    After Patch
# of non-contig files           762             779
# of non-contig directories     571             570
# of BLOCK_UNINIT bg's          307             293
# of INODE_UNINIT bg's          503             503

So the interesting thing is that the patch only "broke open" an
additional 14 block groups (out of the 333 block groups in use when the
filesystem was created with the unpatched kernel).  However, this
allowed the pass 2 directory time to go *down* by over a factor of
three (from 21.2 seconds with the unpatched ext4 code to 6.5 seconds
with the patch).

I think what the patch did was to diminish allocation pressure on the
first block group in the flex_bg, so we weren't mixing directory and
regular file contents.  This eliminated seeks during pass 2 of e2fsck,
which was actually a Very Good Thing.

> > A simple change to verify this would be something like the following,
> > but it hasn't actually been tested.
> 
> Tell you what:  I'll try this out and see if it helps out my test case.

Let me know what this does for your test case.  Hopefully the patch
also makes things better, since this patch is looking very interesting
right now.

Andreas, can I get a Signed-off-by from you for this patch? 

Thanks,

						- Ted
Curt Wohlgemuth April 27, 2009, 5:29 a.m. UTC | #3
Hi Ted:

I don't have access to the actual data right now, because I created
the files and ran the benchmark just before leaving for a few days,
but...

On Sun, Apr 26, 2009 at 8:14 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Apr 23, 2009 at 03:02:05PM -0700, Curt Wohlgemuth wrote:
>> > This is likely the "uninit_bg" feature causing the allocations to skip
>> > groups which are marked BLOCK_UNINIT.  In some sense the benefit of
>> > skipping the block bitmap read during e2fsck probably does not outweigh
>> > the cost of the extra seeking during IO.  As the filesystem gets more
>> > full, the BLOCK_UNINIT flags would be cleared anyway, so we might as
>> > well just keep the early allocations contiguous.
>
> Well, I tried out Andreas' patch, by doing an rsync copy from my SSD
> root partition to a 5400 rpm laptop drive, and then ran e2fsck and
> dumpe2fs.  The results were interesting:
>
>               Before Patch                       After Patch
>              Time in seconds                   Time in seconds
>            Real /  User/  Sys   MB/s      Real /  User/  Sys    MB/s
> Pass 1      8.52 / 2.21 / 0.46  20.43      8.84 / 4.97 / 1.11   19.68
> Pass 2     21.16 / 1.02 / 1.86  11.30      6.54 / 1.77 / 1.78   36.39
> Pass 3      0.01 / 0.00 / 0.00 139.00      0.01 / 0.01 / 0.00  128.90
> Pass 4      0.16 / 0.15 / 0.00   0.00      0.17 / 0.17 / 0.00    0.00
> Pass 5      2.52 / 1.99 / 0.09   0.79      2.31 / 1.78 / 0.06    0.86
> Total      32.40 / 5.11 / 2.49  12.81     17.99 / 8.75 / 2.98   23.01
>
> The surprise is in the gross inspection of the dumpe2fs results:
>
>                             Before Patch    After Patch
> # of non-contig files           762             779
> # of non-contig directories     571             570
> # of BLOCK_UNINIT bg's          307             293
> # of INODE_UNINIT bg's          503             503
>
> So the interesting thing is that the patch only "broke open" an
> additional 14 block groups (out of the 333 block groups in use when the
> filesystem was created with the unpatched kernel).  However, this
> allowed the pass 2 directory time to go *down* by over a factor of
> three (from 21.2 seconds with the unpatched ext4 code to 6.5 seconds
> with the patch).
>
> I think what the patch did was to diminish allocation pressure on the
> first block group in the flex_bg, so we weren't mixing directory and
> regular file contents.  This eliminated seeks during pass 2 of e2fsck,
> which was actually a Very Good Thing.
>
>> > A simple change to verify this would be something like the following,
>> > but it hasn't actually been tested.
>>
>> Tell you what:  I'll try this out and see if it helps out my test case.
>
> Let me know what this does for your test case.  Hopefully the patch
> also makes things better, since this patch is looking very interesting
> right now.

The random read throughput on the 10GB file went from ~16 MB/s to ~22
MB/s after Andreas' patch; the total fragmentation of the file was
much lower than before his patch.

However, the number of extents went up by quite a bit (I don't have
the debugfs output in front of me at the moment, sorry).  It seemed
that no extent crossed a block group; I didn't have time to check
whether Andreas' patch disabled flex BGs or what else was going on.

I'll be able to send details out on Tuesday.

Curt

>
> Andreas, can I get a Signed-off-by from you for this patch?
>
> Thanks,
>
>                                                - Ted
>
Theodore Ts'o April 27, 2009, 10:42 a.m. UTC | #4
On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
> 
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
> 
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry).  It seemed
> that no extent crossed a block group; I didn't have time to check
> whether Andreas' patch disabled flex BGs or what else was going on.

Try running e2fsck with the "-E fragcheck" option, and then capture
e2fsck's stdout.  It will help with the grunt work of doing the
analysis, in terms of displaying the details of all of the files which
are discontiguous.

						- Ted
Theodore Ts'o April 27, 2009, 10:40 p.m. UTC | #5
On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
> 
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry).  

I'm curious what you meant by the combination of these two statements,
"the total fragmentation of the file was much lower than before his
patch", and "the number of extents went up by quite a bit".  Can you
send me the debugfs output when you have a chance?

Thanks,

						- Ted
Andreas Dilger April 27, 2009, 11:12 p.m. UTC | #6
On Apr 23, 2009  13:08 -0600, Andreas Dilger wrote:
> This is likely the "uninit_bg" feature causing the allocations to skip
> groups which are marked BLOCK_UNINIT.  In some sense the benefit of
> skipping the block bitmap read during e2fsck probably does not outweigh
> the cost of the extra seeking during IO.  As the filesystem gets more
> full, the BLOCK_UNINIT flags would be cleared anyway, so we might as
> well just keep the early allocations contiguous.
> 
> A simple change to verify this would be something like the following,
> but it hasn't actually been tested.
> 
> --- ./fs/ext4/mballoc.c.uninit    2009-04-08 19:13:13.000000000 -0600
> +++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
> @@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
>  	switch (cr) {
>  	case 0:
>  		BUG_ON(ac->ac_2order == 0);
> -		/* If this group is uninitialized, skip it initially */
> -		desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
> -		if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> -			return 0;
>  
>  		bits = ac->ac_sb->s_blocksize_bits + 1;
>  		for (i = ac->ac_2order; i <= bits; i++)
> @@ -2039,9 +2035,7 @@ repeat:
>  			ac->ac_groups_scanned++;
>  			desc = ext4_get_group_desc(sb, group, NULL);
> -			if (cr == 0 || (desc->bg_flags &
> -				cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
> -				ac->ac_2order != 0))
> +			if (cr == 0)
>  				ext4_mb_simple_scan_group(ac, &e4b);
>  			else if (cr == 1 &&
>  					ac->ac_g_ex.fe_len == sbi->s_stripe)

Because this is actually proving to be useful:

Signed-off-by: Andreas Dilger <adilger@sun.com>

As we discussed in the call, I suspect BLOCK_UNINIT was more useful in the
past when directories were spread over all groups evenly (pre-Orlov), and
before flex_bg, when seeking to read all of the bitmaps was a slow and
painful process.  For flex_bg it could be WORSE to skip bitmap reads,
because instead of one contiguous 64kB read (16 adjacent 4kB block
bitmaps with the default flex_bg factor of 16) it may now do: read 4kB,
seek, read 4kB, seek, etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Curt Wohlgemuth April 29, 2009, 6:38 p.m. UTC | #7
Hi Ted:

On Mon, Apr 27, 2009 at 3:40 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
>> The random read throughput on the 10GB file went from ~16 MB/s to ~22
>> MB/s after Andreas' patch; the total fragmentation of the file was
>> much lower than before his patch.
>>
>> However, the number of extents went up by quite a bit (I don't have
>> the debugfs output in front of me at the moment, sorry).
>
> I'm curious what you meant by the combination of these two statements,
> "the total fragmentation of the file was much lower than before his
> patch", and "the number of extents went up by quite a bit".  Can you
> send me the debugfs output when you have a chance?

Sorry it's been so long for me to reply.

Okay, my phrasing was not as precise as it could have been.  What I
meant by "total fragmentation" was simply that the range of physical
blocks for the 10GB file was much lower with Andreas' patch:

Before patch:  8282112 - 103266303
After patch: 271360 - 5074943
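(That is a spread of roughly 95 million blocks before the patch versus
about 4.8 million after, i.e. on the order of 360 GiB versus 18 GiB of
address range, assuming 4kB blocks.)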

The number of extents is much larger.  See the attached debugfs output.

Here's the output of "e2fsck -E fragcheck" on the block devices;
remember, though, that each one has only 3 files:

-rw-rw-r--    1 root     root     10737418240 Apr 23 15:33 10g
-rw-rw-r--    1 root     root     4294967296 Apr 23 15:30 4g
-rw-rw-r--    1 root     root     4294967296 Apr 23 15:30 4g-2
drwx------    2 root     root        16384 Apr 23 15:27 lost+found/

Before patch:
e2fsck 1.41.3 (12-Oct-2008)
/dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

After patch:
e2fsck 1.41.3 (12-Oct-2008)
/dev/hdo3: clean, 14/45760512 files, 7608258/183010471 blocks


Thanks,
Curt
Theodore Ts'o April 29, 2009, 7:16 p.m. UTC | #8
On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
> 
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry).  It seemed
> that no extent crossed a block group; I didn't have time to check
> whether Andreas' patch disabled flex BGs or what else was going on.
> 
> I'll be able to send details out on Tuesday.

Hi Curt,

When you have a chance, can you send out the details from your test run?

Many thanks,

						- Ted
Theodore Ts'o April 29, 2009, 7:37 p.m. UTC | #9
On Wed, Apr 29, 2009 at 03:16:47PM -0400, Theodore Tso wrote:
> 
> When you have a chance, can you send out the details from your test run?
> 

Oops, sorry, our two e-mails overlapped.  Sorry, I didn't see your new
e-mail when I sent my ping-o-gram.

On Wed, Apr 29, 2009 at 11:38:49AM -0700, Curt Wohlgemuth wrote:
> 
> Okay, my phrasing was not as precise as it could have been.  What I
> meant by "total fragmentation" was simply that the range of physical
> blocks for the 10GB file was much lower with Andreas' patch:
> 
> Before patch:  8282112 - 103266303
> After patch: 271360 - 5074943
> 
> The number of extents is much larger.  See the attached debugfs output.

Ah, OK.  You didn't attach the "e2fsck -E fragcheck" output, but I'm
going to guess that the blocks for 10g, 4g, and 4g-2 ended up getting
interleaved, possibly because they were written in parallel, and not
one after the other?  Each of the extents in the "after" debugfs output
was approximately 2k blocks (8 megabytes) in length, and they are
separated by a largish number of blocks.

Now, if my theory that the files were written in an interleaved
fashion is correct, and if it is also true that they will be read in an
interleaved pattern, the layout on disk might actually be the best
one.  If however they are going to be read sequentially, and you
really want them to be allocated contiguously, then if you know what
the final size of these files will be, probably the best thing to do
is to use the fallocate system call.
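
For example, a minimal sketch of that approach (hypothetical path and
size, error handling mostly trimmed; posix_fallocate() is backed by the
fallocate system call on ext4):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		off_t size = 10LL << 30;	/* the final 10GB size, up front */
		int fd = open("/mnt/test/10g", O_CREAT | O_WRONLY, 0644);
		int err;

		if (fd < 0)
			return 1;

		/* Tell the filesystem the final size so the blocks can be
		 * laid out contiguously before any data is written. */
		err = posix_fallocate(fd, 0, size);
		if (err)
			fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

		/* ... then do the dd-style writes into the preallocated space ... */
		close(fd);
		return 0;
	}

With the extents reserved up front, the later writes just fill in blocks
that are already laid out contiguously.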

Does that make sense?

					- Ted
Curt Wohlgemuth April 29, 2009, 8:21 p.m. UTC | #10
Hi Ted:

On Wed, Apr 29, 2009 at 12:37 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Wed, Apr 29, 2009 at 03:16:47PM -0400, Theodore Tso wrote:
>>
>> When you have a chance, can you send out the details from your test run?
>>
>
> Oops, sorry, our two e-mails overlapped.  Sorry, I didn't see your new
> e-mail when I sent my ping-o-gram.
>
> On Wed, Apr 29, 2009 at 11:38:49AM -0700, Curt Wohlgemuth wrote:
>>
>> Okay, my phrasing was not as precise as it could have been.  What I
>> meant by "total fragmentation" was simply that the range of physical
>> blocks for the 10GB file was much lower with Andreas' patch:
>>
>> Before patch:  8282112 - 103266303
>> After patch: 271360 - 5074943
>>
>> The number of extents is much larger.  See the attached debugfs output.
>
> Ah, OK.  You didn't attach the "e2fsck -E fragcheck" output, but I'm
> going to guess that the blocks for 10g, 4g, and 4g-2 ended up getting
> interleaved, possibly because they were written in parallel, and not
> one after the other?  Each of the extents in the "after" debugfs output
> was approximately 2k blocks (8 megabytes) in length, and they are
> separated by a largish number of blocks.

Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
I did: one simple line:

        /dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

And actually, I created the files sequentially:

dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10

> Now, if my theory that the files were written in an interleaved
> fashion is correct, and if it is also true that they will be read in an
> interleaved pattern, the layout on disk might actually be the best
> one.  If however they are going to be read sequentially, and you
> really want them to be allocated contiguously, then if you know what
> the final size of these files will be, probably the best thing to do
> is to use the fallocate system call.
>
> Does that make sense?

Sure, in this sense.

The test in question does something like this (a rough C sketch follows
the numbered steps):

1. Create 20 or so large files, sequentially.
2. Randomly choose a file.
3. Randomly choose an offset in this file.
4. Read from that file/offset a fixed buffer size (say 256k); the file
was opened with O_DIRECT
5. Go back to #2
6. Stop after some time period
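
Roughly, in C, the read loop is something like this (a sketch only; the
path, file count, file size, and run time are made-up stand-ins for the
real test's parameters, and error handling is omitted):

	#define _GNU_SOURCE			/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	#define NFILES    20			/* step 1 created these */
	#define BUF_SIZE  (256 * 1024)		/* fixed read size */

	int main(void)
	{
		char path[64];
		void *buf;
		int fd[NFILES], i;
		off_t fsize = 10LL << 30;	/* assume ~10GB per file */
		time_t end = time(NULL) + 300;	/* step 6: fixed run time */

		posix_memalign(&buf, 4096, BUF_SIZE);	/* O_DIRECT alignment */
		for (i = 0; i < NFILES; i++) {
			snprintf(path, sizeof(path), "/mnt/test/file-%02d", i);
			fd[i] = open(path, O_RDONLY | O_DIRECT);
		}

		while (time(NULL) < end) {
			int f = rand() % NFILES;		/* step 2 */
			off_t off = (off_t)(rand() %		/* step 3 */
					(fsize / BUF_SIZE)) * BUF_SIZE;

			pread(fd[f], buf, BUF_SIZE, off);	/* step 4 */
		}
		return 0;
	}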

This might not be the most realistic workload we want (the test
actually can be run by doing #1 above with multiple threads), but it's
certainly interesting.

The point that I'm interested in is why the physical block spread is
so different for the 10GB file between (a) the above 'dd' command
sequence; and (b) simply creating the "10g" file alone, without
creating the 4GB files first.

I just did (b) above on a kernel without Andreas' patch, on a freshly
formatted ext4 FS, and here's (most of) the debugfs output for it:

BLOCKS:
(IND):164865, (0-63487):34816-98303, (63488-126975):100352-163839,
(126976-190463):165888-229375, (190464-253951):231424-294911,
(253952-481279):296960-524287, (481280-544767):821248-884735,
(544768-706559):886784-1048575, (706560-1196031):1607680-2097151,
(1196032-1453067):2656256-2913291
TOTAL: 1453069

The total spread of the blocks is tiny compared to the total spread
from the 3 "dd" commands above.

I haven't yet really looked at the block allocation results using
Andreas' patch, except for the "10g" file after the three "dd"
commands above.  So I'm not sure what the effects are with, say,
larger numbers of files.  I'll be doing some more experimentation
soon.

Thanks,
Curt
Theodore Ts'o April 29, 2009, 9:20 p.m. UTC | #11
On Wed, Apr 29, 2009 at 01:21:09PM -0700, Curt Wohlgemuth wrote:
> 
> Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
> I did: one simple line:
> 
>         /dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

Sorry, I should have been more explicit.   You need to do

"e2fsck -f -E fragcheck", and you will get a *heck* of a lot more than
a single line.  :-)

> And actually, I created the files sequentially:
> 
> dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
> dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
> dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10

Really?  Hmm, I wouldn't have expected that.  So now I'd really love
to see the fragcheck results (both with and without the patch), and/or
the results of debugfs stat'ing all three files, both with and without
the patch.

Thanks,

						- Ted
Theodore Ts'o April 29, 2009, 9:50 p.m. UTC | #12
Oh --- one more question.  You did these tests on your 2.6.26-based
kernel with ext4 backports, right?  Not 2.6.30 mainline kernel?  Did
you backport the changes to the block and inode allocators?  i.e.,
this patch (plus a 1 or 2 subsequent bug fixes)?


commit a4912123b688e057084e6557cef8924f7ae5bbde
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Mar 12 12:18:34 2009 -0400

    ext4: New inode/block allocation algorithms for flex_bg filesystems
    
    The find_group_flex() inode allocator is now only used if the
    filesystem is mounted using the "oldalloc" mount option.  It is
    replaced with the original Orlov allocator that has been updated for
    flex_bg filesystems (it should behave the same way if flex_bg is
    disabled).  The inode allocator now functions by taking into account
    each flex_bg group, instead of each block group, when deciding whether
    or not it's time to allocate a new directory into a fresh flex_bg.
    
    The block allocator has also been changed so that the first block
    group in each flex_bg is preferred for use for storing directory
    blocks.  This keeps directory blocks close together, which is good for
    speeding up e2fsck since large directories are more likely to look
    like this:
    
    debugfs:  stat /home/tytso/Maildir/cur
    Inode: 1844562   Type: directory    Mode:  0700   Flags: 0x81000
    Generation: 1132745781    Version: 0x00000000:0000ad71
    User: 15806   Group: 15806   Size: 1060864
    File ACL: 0    Directory ACL: 0
    Links: 2   Blockcount: 2072
    Fragment:  Address: 0    Number: 0    Size: 0
     ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
     atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
     mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
    crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
    Size of extra inode fields: 28
    BLOCKS:
    (0):7348651, (1-258):7348654-7348911
    TOTAL: 259
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

						- Ted
Curt Wohlgemuth April 29, 2009, 10:29 p.m. UTC | #13
Hi Ted:

On Wed, Apr 29, 2009 at 2:50 PM, Theodore Tso <tytso@mit.edu> wrote:
> Oh --- one more question.  You did these tests on your 2.6.26-based
> kernel with ext4 backports, right?  Not 2.6.30 mainline kernel?  Did
> you backport the changes to the block and inode allocators?  i.e.,
> this patch (plus a 1 or 2 subsequent bug fixes)?
>
>
> commit a4912123b688e057084e6557cef8924f7ae5bbde
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Mar 12 12:18:34 2009 -0400
>
>    ext4: New inode/block allocation algorithms for flex_bg filesystems

Yes, we have this patch.  I'm not sure if we have the "1 or 2" bug
fixes you refer to above; do you have commits for these?

I'm regen'ing the e2fsck and debugfs output for the 3 "dd" sequence
above, for our stock kernel and for this + Andreas' patch.

Thanks,
Curt

>
>    The find_group_flex() inode allocator is now only used if the
>    filesystem is mounted using the "oldalloc" mount option.  It is
>    replaced with the original Orlov allocator that has been updated for
>    flex_bg filesystems (it should behave the same way if flex_bg is
>    disabled).  The inode allocator now functions by taking into account
>    each flex_bg group, instead of each block group, when deciding whether
>    or not it's time to allocate a new directory into a fresh flex_bg.
>
>    The block allocator has also been changed so that the first block
>    group in each flex_bg is preferred for use for storing directory
>    blocks.  This keeps directory blocks close together, which is good for
>    speeding up e2fsck since large directories are more likely to look
>    like this:
>
>    debugfs:  stat /home/tytso/Maildir/cur
>    Inode: 1844562   Type: directory    Mode:  0700   Flags: 0x81000
>    Generation: 1132745781    Version: 0x00000000:0000ad71
>    User: 15806   Group: 15806   Size: 1060864
>    File ACL: 0    Directory ACL: 0
>    Links: 2   Blockcount: 2072
>    Fragment:  Address: 0    Number: 0    Size: 0
>     ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
>     atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
>     mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
>    crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
>    Size of extra inode fields: 28
>    BLOCKS:
>    (0):7348651, (1-258):7348654-7348911
>    TOTAL: 259
>
>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
>                                                - Ted
>
Theodore Ts'o May 1, 2009, 4:39 a.m. UTC | #14
On Wed, Apr 29, 2009 at 03:29:55PM -0700, Curt Wohlgemuth wrote:
> Yes, we have this patch.  I'm not sure if we have the "1 or 2" bug
> fixes you refer to above; do you have commits for these?

b5451f7b  ext4: Fix potential inode allocation soft lockup in Orlov allocator
6b82f3cb  ext4: really print the find_group_flex fallback warning only once
7d39db14  ext4: Use struct flex_groups to calculate get_orlov_stats()
9f24e420  ext4: Use atomic_t's in struct flex_groups

						- Ted
Curt Wohlgemuth May 4, 2009, 3:52 p.m. UTC | #15
Hi Ted:

On Wed, Apr 29, 2009 at 2:20 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Wed, Apr 29, 2009 at 01:21:09PM -0700, Curt Wohlgemuth wrote:
>>
>> Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
>> I did: one simple line:
>>
>>         /dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks
>
> Sorry, I should have been more explicit.   You need to do
>
> "e2fsck -f -E fragcheck", and you will get a *heck* of a lot more than
> a single line.  :-)
>
>> And actually, I created the files sequentially:
>>
>> dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
>> dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
>> dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10
>
> Really?  Hmm, I wouldn't have expected that.  So now I'd really love
> to see the fragcheck results (both with and without the patch), and/or
> the results of debugfs stat'ing all three files, both with and without
> the patch.

Although it might seem like I've been ignoring this request, in fact
I'm having trouble recreating the problem now.  Both the fragmentation
from the "three dd" commands above and the performance problem I was
seeing in the original posting seem to have mysteriously disappeared.
I'll keep trying and let the list know what I find.

Thanks,
Curt

>
> Thanks,
>
>                                                - Ted
>

Patch

--- ./fs/ext4/mballoc.c.uninit    2009-04-08 19:13:13.000000000 -0600
+++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
@@ -1742,10 +1723,6 @@  static int ext4_mb_good_group(struct ext
 	switch (cr) {
 	case 0:
 		BUG_ON(ac->ac_2order == 0);
-		/* If this group is uninitialized, skip it initially */
-		desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
-		if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
-			return 0;
 
 		bits = ac->ac_sb->s_blocksize_bits + 1;
 		for (i = ac->ac_2order; i <= bits; i++)
@@ -2039,9 +2035,7 @@  repeat:
 			ac->ac_groups_scanned++;
 			desc = ext4_get_group_desc(sb, group, NULL);
-			if (cr == 0 || (desc->bg_flags &
-				cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
-				ac->ac_2order != 0))
+			if (cr == 0)
 				ext4_mb_simple_scan_group(ac, &e4b);
 			else if (cr == 1 &&
 					ac->ac_g_ex.fe_len == sbi->s_stripe)