
[RESEND,4/8] ext4: add the gdt block of meta_bg to system_zone

Message ID 1604764698-4269-4-git-send-email-brookxu@tencent.com
State Accepted
Series [RESEND,1/8] ext4: use ext4_assert() to replace J_ASSERT()

Commit Message

brookxu Nov. 7, 2020, 3:58 p.m. UTC
From: Chunguang Xu <brookxu@tencent.com>

In order to avoid poor search efficiency of system_zone, the
system only adds metadata of some sparse group to system_zone.
In the meta_bg scenario, the non-sparse group may contain gdt
blocks. Perhaps we should add these blocks to system_zone to
improve fault tolerance without significantly reducing system
performance.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 fs/ext4/block_validity.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

Comments

Theodore Ts'o Dec. 3, 2020, 3:08 p.m. UTC | #1
On Sat, Nov 07, 2020 at 11:58:14PM +0800, Chunguang Xu wrote:
> From: Chunguang Xu <brookxu@tencent.com>
> 
> In order to avoid poor search efficiency of system_zone, the
> system only adds metadata of some sparse group to system_zone.
> In the meta_bg scenario, the non-sparse group may contain gdt
> blocks. Perhaps we should add these blocks to system_zone to
> improve fault tolerance without significantly reducing system
> performance.

> @@ -226,13 +227,16 @@ int ext4_setup_system_zone(struct super_block *sb)
>  
>  	for (i=0; i < ngroups; i++) {
>  		cond_resched();
> -		if (ext4_bg_has_super(sb, i) &&
> -		    ((i < 5) || ((i % flex_size) == 0))) {
> -			ret = add_system_zone(system_blks,
> -					ext4_group_first_block_no(sb, i),
> -					ext4_bg_num_gdb(sb, i) + 1, 0);
> -			if (ret)
> -				goto err;
> +		if ((i < 5) || ((i % flex_size) == 0)) {

If we're going to do this, why not just drop the above conditional,
and just always do this logic for all block groups?

> +			gd_blks = ext4_bg_has_super(sb, i) +
> +				ext4_bg_num_gdb(sb, i);
> +			if (gd_blks) {
> +				ret = add_system_zone(system_blks,
> +						ext4_group_first_block_no(sb, i),
> +						gd_blks, 0);
> +				if (ret)
> +					goto err;
> +			}

						- Ted
brookxu Dec. 4, 2020, 1:26 a.m. UTC | #2
Theodore Y. Ts'o wrote on 2020/12/3 23:08:
> On Sat, Nov 07, 2020 at 11:58:14PM +0800, Chunguang Xu wrote:
>> From: Chunguang Xu <brookxu@tencent.com>
>>
>> In order to avoid poor search efficiency of system_zone, the
>> system only adds metadata of some sparse group to system_zone.
>> In the meta_bg scenario, the non-sparse group may contain gdt
>> blocks. Perhaps we should add these blocks to system_zone to
>> improve fault tolerance without significantly reducing system
>> performance.

Thanks. In the large-disk scenario, if we handle all groups, the
system_zone will be very large, which may reduce performance.
I think the previous method is good, but it needs to be changed
slightly, so that the fault tolerance in the meta_bg scenario
can be improved without the risk of performance degradation.

>> @@ -226,13 +227,16 @@ int ext4_setup_system_zone(struct super_block *sb)
>>  
>>  	for (i=0; i < ngroups; i++) {
>>  		cond_resched();
>> -		if (ext4_bg_has_super(sb, i) &&
>> -		    ((i < 5) || ((i % flex_size) == 0))) {
>> -			ret = add_system_zone(system_blks,
>> -					ext4_group_first_block_no(sb, i),
>> -					ext4_bg_num_gdb(sb, i) + 1, 0);
>> -			if (ret)
>> -				goto err;
>> +		if ((i < 5) || ((i % flex_size) == 0)) {
> 
> If we're going to do this, why not just drop the above conditional,
> and just always do this logic for all block groups?
> 
>> +			gd_blks = ext4_bg_has_super(sb, i) +
>> +				ext4_bg_num_gdb(sb, i);
>> +			if (gd_blks) {
>> +				ret = add_system_zone(system_blks,
>> +						ext4_group_first_block_no(sb, i),
>> +						gd_blks, 0);
>> +				if (ret)
>> +					goto err;
>> +			}
> 
> 						- Ted
>
brookxu Dec. 4, 2020, 1:29 a.m. UTC | #3
Theodore Y. Ts'o wrote on 2020/12/3 23:08:
> On Sat, Nov 07, 2020 at 11:58:14PM +0800, Chunguang Xu wrote:
>> From: Chunguang Xu <brookxu@tencent.com>
>>
>> In order to avoid poor search efficiency of system_zone, the
>> system only adds metadata of some sparse group to system_zone.
>> In the meta_bg scenario, the non-sparse group may contain gdt
>> blocks. Perhaps we should add these blocks to system_zone to
>> improve fault tolerance without significantly reducing system
>> performance.
> 
>> @@ -226,13 +227,16 @@ int ext4_setup_system_zone(struct super_block *sb)
>>  
>>  	for (i=0; i < ngroups; i++) {
>>  		cond_resched();
>> -		if (ext4_bg_has_super(sb, i) &&
>> -		    ((i < 5) || ((i % flex_size) == 0))) {
>> -			ret = add_system_zone(system_blks,
>> -					ext4_group_first_block_no(sb, i),
>> -					ext4_bg_num_gdb(sb, i) + 1, 0);
>> -			if (ret)
>> -				goto err;
>> +		if ((i < 5) || ((i % flex_size) == 0)) {
> 
> If we're going to do this, why not just drop the above conditional,
> and just always do this logic for all block groups?

Thanks, in the large disk scenario, if we deal with all groups, the
system_zone will be very large, which may reduce performance. I think
the previous method is good, but it needs to be changed slightly, so
that the fault tolerance in the meta_bg scenario can be improved
without the risk of performance degradation.

>> +			gd_blks = ext4_bg_has_super(sb, i) +
>> +				ext4_bg_num_gdb(sb, i);
>> +			if (gd_blks) {
>> +				ret = add_system_zone(system_blks,
>> +						ext4_group_first_block_no(sb, i),
>> +						gd_blks, 0);
>> +				if (ret)
>> +					goto err;
>> +			}
> 
> 						- Ted
>
Theodore Ts'o Dec. 9, 2020, 4:34 a.m. UTC | #4
On Fri, Dec 04, 2020 at 09:26:49AM +0800, brookxu wrote:
> 
> Theodore Y. Ts'o wrote on 2020/12/3 23:08:
> > On Sat, Nov 07, 2020 at 11:58:14PM +0800, Chunguang Xu wrote:
> >> From: Chunguang Xu <brookxu@tencent.com>
> >>
> >> In order to avoid poor search efficiency of system_zone, the
> >> system only adds metadata of some sparse group to system_zone.
> >> In the meta_bg scenario, the non-sparse group may contain gdt
> >> blocks. Perhaps we should add these blocks to system_zone to
> >> improve fault tolerance without significantly reducing system
> >> performance.
> 
> Thanks, in the large-disk scenario, if we deal with all groups,
> the system_zone will be very large, which may reduce performance.
> I think the previous method is good, but it needs to be changed
> slightly, so that the fault tolerance in the meta_bg scenario
> can be improved without the risk of performance degradation.

OK, I see.   But this is not actually reliable:

> >> +		if ((i < 5) || ((i % flex_size) == 0)) {

This only works if the flex_size is less than or equal to 64 (assuming
a 4k blocksize).  That's because on 64-bit file systems, we can fit 64
block group descriptors in a 4k block group descriptor block, so
that's the size of the meta_bg.  The default flex_bg size is 16, but
it's quite possible to create a file system via "mke2fs -t ext4 -G
256".  In that case, the flex_size will be 256, and we would not be
including all of the meta_bg groups.  So i % flex_size needs to be
replaced by "i % meta_bg_size", where meta_bg_size would be
initialized to EXT4_DESC_PER_BLOCK(sb).

Does that make sense?

						- Ted
brookxu Dec. 9, 2020, 11:48 a.m. UTC | #5
Theodore Y. Ts'o wrote on 2020/12/9 12:34:
> On Fri, Dec 04, 2020 at 09:26:49AM +0800, brookxu wrote:
>>
>> Theodore Y. Ts'o wrote on 2020/12/3 23:08:
>>> On Sat, Nov 07, 2020 at 11:58:14PM +0800, Chunguang Xu wrote:
>>>> From: Chunguang Xu <brookxu@tencent.com>
>>>>
>>>> In order to avoid poor search efficiency of system_zone, the
>>>> system only adds metadata of some sparse group to system_zone.
>>>> In the meta_bg scenario, the non-sparse group may contain gdt
>>>> blocks. Perhaps we should add these blocks to system_zone to
>>>> improve fault tolerance without significantly reducing system
>>>> performance.
>>
>> Thanks, in the large-disk scenario, if we deal with all groups,
>> the system_zone will be very large, which may reduce performance.
>> I think the previous method is good, but it needs to be changed
>> slightly, so that the fault tolerance in the meta_bg scenario
>> can be improved without the risk of performance degradation.
> 
> OK, I see.   But this is not actually reliable:
> 
>>>> +		if ((i < 5) || ((i % flex_size) == 0)) {
> 
> This only works if the flex_size is less than or equal to 64 (assuming
> a 4k blocksize).  That's because on 64-bit file systems, we can fit 64
> block group descriptors in a 4k block group descriptor block, so
> that's the size of the meta_bg.  The default flex_bg size is 16, but
> it's quite possible to create a file system via "mke2fs -t ext4 -G
> 256".  In that case, the flex_size will be 256, and we would not be
> including all of the meta_bg groups.  So i % flex_size needs to be
> replaced by "i % meta_bg_size", where meta_bg_size would be
> initialized to EXT4_DESC_PER_BLOCK(sb).
> 
> Does that make sense?
Maybe I missed something. If i % meta_bg_size is used instead and
flex_size < 64, then we will miss some flex_bg groups. There seems
to be a contradiction here: when only flex_bg is enabled, it may
not be appropriate to use meta_bg_size, and when only meta_bg is
enabled, it may not be appropriate to use flex_size.

As you said before, it may be better to remove

	if ((i < 5) || ((i % flex_size) == 0))

and do it for all groups.

In this way we won't miss some flex_bg, meta_bg, and sparse_bg.
I tested it on an 80T disk and found that the performance loss
was small:

 unpatched kernel:
 ext4_setup_system_zone() takes 524ms, 
 mount-3137    [006] ....    89.548026: ext4_setup_system_zone: (ext4_setup_system_zone+0x0/0x3f0)
 mount-3137    [006] d...    90.072895: ext4_setup_system_zone_1: (ext4_fill_super+0x2057/0x39b0 <- ext4_setup_system_zone)

 patched kernel:
 ext4_setup_system_zone() takes 552ms, 
 mount-4425    [006] ....   402.555793: ext4_setup_system_zone: (ext4_setup_system_zone+0x0/0x3d0)
 mount-4425    [006] d...   403.107307: ext4_setup_system_zone_1: (ext4_fill_super+0x2057/0x39b0 <- ext4_setup_system_zone)
> 
> 						- Ted
>
Theodore Ts'o Dec. 9, 2020, 7:39 p.m. UTC | #6
On Wed, Dec 09, 2020 at 07:48:09PM +0800, brookxu wrote:
> 
> Maybe I missed something. If i% meta_bg_size is used instead, if
> flex_size <64, then we will miss some flex_bg. There seems to be
> a contradiction here. In the scenario where only flex_bg is
> enabled, it may not be appropriate to use meta_bg_size. In the
> scenario where only meta_bg is enabled, it may not be appropriate
> to use flex_size.
> 
> As you said before, it maybe better to remove
> 
> 	if ((i <5) || ((i% flex_size) == 0))
> 
> and do it for all groups.

I don't think the original (i % flex_size) made any sense in the first
place.

What flex_bg does is that it collects the allocation bitmaps and inode
tables for each block group and locates them within the first block
group in a flex_bg.  It doesn't have anything to do with whether or
not a particular block group has a backup copy of the superblock and
block group descriptor table --- in non-meta_bg file systems and the
meta_bg file systems where the block group is less than
s_first_meta_bg * EXT4_DESC_PER_BLOCK(sb).  And the condition in
question is only about whether or not to add the backup superblock and
backup block group descriptors.  So checking for i % flex_size made no
sense, and I'm not sure that check was there in the first place.

> In this way we won't miss some flex_bg, meta_bg, and sparse_bg.
> I tested it on an 80T disk and found that the performance loss
> was small:
> 
>  unpatched kernel:
>  ext4_setup_system_zone() takes 524ms, 
> 
>  patched kernel:
>  ext4_setup_system_zone() takes 552ms, 

I don't really care that much about the time it takes to execute
ext4_setup_system_zone().

The really interesting question is how large is the rb_tree
constructed by that function, and what is the percentage increase of
time that the ext4_inode_block_valid() function takes.  (e.g., how
much additional memory is the system_blks tree taking, and how deep is
that tree, since ext4_inode_block_valid() gets called every time we
allocate or free a block, and every time we need to validate an extent
tree node.)

Cheers,

						- Ted
brookxu Dec. 10, 2020, 11 a.m. UTC | #7
Theodore Y. Ts'o wrote on 2020/12/10 3:39:
> On Wed, Dec 09, 2020 at 07:48:09PM +0800, brookxu wrote:
>>
>> Maybe I missed something. If i% meta_bg_size is used instead, if
>> flex_size <64, then we will miss some flex_bg. There seems to be
>> a contradiction here. In the scenario where only flex_bg is
>> enabled, it may not be appropriate to use meta_bg_size. In the
>> scenario where only meta_bg is enabled, it may not be appropriate
>> to use flex_size.
>>
>> As you said before, it maybe better to remove
>>
>> 	if ((i <5) || ((i% flex_size) == 0))
>>
>> and do it for all groups.
> 
> I don't think the original (i % flex_size) made any sense in the first
> place.
> 
> What flex_bg does is that it collects the allocation bitmaps and inode
> tables for each block group and locates them within the first block
> group in a flex_bg.  It doesn't have anything to do with whether or
> not a particular block group has a backup copy of the superblock and
> block group descriptor table --- in non-meta_bg file systems and the
> meta_bg file systems where the block group is less than
> s_first_meta_bg * EXT4_DESC_PER_BLOCK(sb).  And the condition in
> question is only about whether or not to add the backup superblock and
> backup block group descriptors.  So checking for i % flex_size made no
> sense, and I'm not sure that check was there in the first place.

I think we should add backup sb and gdt to system_zone, because
these blocks should not be used by applications. In fact, I
think we may have done some work.

>> In this way we won't miss some flex_bg, meta_bg, and sparse_bg.
>> I tested it on an 80T disk and found that the performance loss
>> was small:
>>
>>  unpatched kernel:
>>  ext4_setup_system_zone() takes 524ms, 
>>
>>  patched kernel:
>>  ext4_setup_system_zone() takes 552ms, 
> 
> I don't really care that much about the time it takes to execute
> ext4_setup_system_zone().
> 
> The really interesting question is how large is the rb_tree
> constructed by that function, and what is the percentage increase of
> time that the ext4_inode_block_valid() function takes.  (e.g., how
> much additional memory is the system_blks tree taking, and how deep is
> that tree, since ext4_inode_block_valid() gets called every time we
> allocate or free a block, and every time we need to validate an extent
> tree node.

During detailed analysis, I found that when the current code
calls ext4_setup_system_zone(), s_log_groups_per_flex has not
yet been initialized, so flex_size is always 1, which seems to
be a mistake. Therefore

if (ext4_bg_has_super(sb, i) &&
                    ((i < 5) || ((i % flex_size) == 0)))

degenerates to

if (ext4_bg_has_super(sb, i))

So the existing implementation just adds the backup sb and gdt
of the sparse groups to system_zone. Due to this mistake, the
behavior of the system in the flex_bg scenario happens to be
correct?

I tested it in three scenarios: only meta_bg, only flex_bg,
both flex_bg and meta_bg were enabled. The test results are as
follows:

Meta_bg only
 unpatched kernel:
 ext4_setup_system_zone time 866ms count 1309087 (number of nodes in the rbtree)
 
 patched kernel:
 ext4_setup_system_zone time 841ms count 1309087 (number of nodes in the rbtree)

Since the backup gdt of meta_bg and BB are connected, they can
be merged, so no additional nodes are added.

Flex_bg only
 unpatched kernel:
 ext4_setup_system_zone time 529ms count 41016 (number of nodes in the rbtree)

 patched kernel:
 ext4_setup_system_zone time 553ms count 41016 (number of nodes in the rbtree)

The system behavior has not changed. All sparse_group backup sb
and gdt are still added, so no additional nodes are added.

Meta_bg & Flex_bg
 unpatched kernel:
 ext4_setup_system_zone time 535ms count 41016 (number of nodes in the rbtree)
 
 patched kernel:
 ext4_setup_system_zone time 571ms count 61508 (number of nodes in the rbtree)

In addition to the sparse groups, the backup gdt blocks of meta_bg
need to be added to system_zone. Let

	N = max(flex_bg_size / meta_bg_size, 1)

Then one gdt block in every N meta_bgs can be merged into the node
of the corresponding flex_bg. For example, if flex_bg_size <
meta_bg_size, the number of new nodes is 2 * nr_meta_bg. The
maximum depth of an rbtree with n nodes is 2 * log2(n + 1). By
this calculation, the depth of the rbtree does not increase in
this test case on the 80T disk, so there is no major performance
overhead.

Maybe we can deal with it in the same way as discussed before?

> Cheers,
> 
> 						- Ted
>
brookxu Dec. 15, 2020, 1:14 a.m. UTC | #8
Hi Ted, what do you think of this? Should we go ahead? Thanks.

Theodore Y. Ts'o wrote on 2020/12/10 3:39:
> On Wed, Dec 09, 2020 at 07:48:09PM +0800, brookxu wrote:
>>
>> Maybe I missed something. If i% meta_bg_size is used instead, if
>> flex_size <64, then we will miss some flex_bg. There seems to be
>> a contradiction here. In the scenario where only flex_bg is
>> enabled, it may not be appropriate to use meta_bg_size. In the
>> scenario where only meta_bg is enabled, it may not be appropriate
>> to use flex_size.
>>
>> As you said before, it maybe better to remove
>>
>> 	if ((i <5) || ((i% flex_size) == 0))
>>
>> and do it for all groups.
> 
> I don't think the original (i % flex_size) made any sense in the first
> place.
> 
> What flex_bg does is that it collects the allocation bitmaps and inode
> tables for each block group and locates them within the first block
> group in a flex_bg.  It doesn't have anything to do with whether or
> not a particular block group has a backup copy of the superblock and
> block group descriptor table --- in non-meta_bg file systems and the
> meta_bg file systems where the block group is less than
> s_first_meta_bg * EXT4_DESC_PER_BLOCK(sb).  And the condition in
> question is only about whether or not to add the backup superblock and
> backup block group descriptors.  So checking for i % flex_size made no
> sense, and I'm not sure that check was there in the first place.

I think we should add backup sb and gdt to system_zone, because
these blocks should not be used by applications. In fact, I
think we may have done some work.

>> In this way we won't miss some flex_bg, meta_bg, and sparse_bg.
>> I tested it on an 80T disk and found that the performance loss
>> was small:
>>
>>  unpatched kernel:
>>  ext4_setup_system_zone() takes 524ms, 
>>
>>  patched kernel:
>>  ext4_setup_system_zone() takes 552ms, 
> 
> I don't really care that much about the time it takes to execute
> ext4_setup_system_zone().
> 
> The really interesting question is how large is the rb_tree
> constructed by that function, and what is the percentage increase of
> time that the ext4_inode_block_valid() function takes.  (e.g., how
> much additional memory is the system_blks tree taking, and how deep is
> that tree, since ext4_inode_block_valid() gets called every time we
> allocate or free a block, and every time we need to validate an extent
> tree node.

During detailed analysis, I found that when the current code
calls ext4_setup_system_zone(), s_log_groups_per_flex has not
yet been initialized, so flex_size is always 1, which seems to
be a mistake. Therefore

if (ext4_bg_has_super(sb, i) &&
                    ((i < 5) || ((i % flex_size) == 0)))

degenerates to

if (ext4_bg_has_super(sb, i))

So the existing implementation just adds the backup sb and gdt
of the sparse groups to system_zone. Due to this mistake, the
behavior of the system in the flex_bg scenario happens to be correct?

I tested it in three scenarios: only meta_bg, only flex_bg,
both flex_bg and meta_bg were enabled. The test results are as
follows:

Meta_bg only
 unpatched kernel:
 ext4_setup_system_zone time 866ms count 1309087 (number of nodes in the rbtree)
 
 patched kernel:
 ext4_setup_system_zone time 841ms count 1309087 (number of nodes in the rbtree)

Since the backup gdt of meta_bg and BB are connected, they can
be merged, so no additional nodes are added.

Flex_bg only
 unpatched kernel:
 ext4_setup_system_zone time 529ms count 41016 (number of nodes in the rbtree)

 patched kernel:
 ext4_setup_system_zone time 553ms count 41016 (number of nodes in the rbtree)

The system behavior has not changed. All sparse_group backup sb
and gdt are still added, so no additional nodes are added.

Meta_bg & Flex_bg
 unpatched kernel:
 ext4_setup_system_zone time 535ms count 41016 (number of nodes in the rbtree)
 
 patched kernel:
 ext4_setup_system_zone time 571ms count 61508 (number of nodes in the rbtree)

In addition to the sparse groups, the backup gdt blocks of meta_bg
need to be added to system_zone. Let

	N = max(flex_bg_size / meta_bg_size, 1)

Then one gdt block in every N meta_bgs can be merged into the node
of the corresponding flex_bg. For example, if flex_bg_size <
meta_bg_size, the number of new nodes is 2 * nr_meta_bg. The
maximum depth of an rbtree with n nodes is 2 * log2(n + 1). By
this calculation, the depth of the rbtree does not increase in
this test case on the 80T disk, so there is no major performance
overhead.

Maybe we can deal with it in the same way as discussed before?

> Cheers,
> 
> 						- Ted
>
Theodore Ts'o Dec. 15, 2020, 8:13 p.m. UTC | #9
You did your test on an 80T file system, but that's not where someone
would be using meta_bg.  Meta_bg gets used for much larger file systems
than that!  With meta_bg, we have 3 block group descriptor blocks every
64 block groups.  Each block group describes 128M of space.  So that
means we are going to have 3 entries in the system zone tree for every
8GB of file system space, or 393,216 entries for every PB.  Given that
each entry is 40 bytes, that means that the block_validity entries
will consume 15 megabytes per PB.

Now, one third of these entries overlap with the flex_bg entries
(meta_bg group descriptors are in the first, second, and last block
group of each meta_bg, which is 64 block groups in 4k file systems),
and of course, the default flex_bg size of 16 block groups means that
there are 524,288 entries per PB.  So if we include all backup sb and
block group descriptors, there will be roughly 786,432 entries in a
1 PB file system.  (I'm ignoring the entries for the backup
superblocks, but that's only about 20 or so extra entries.)

So for a flex_bg 1PB file system, the amount of memory for a
block_validity data structure is roughly 20M, and including all backup
descriptors for meta_bg on a flex_bg + meta_bg setup is roughly 30M.

I agree with you that for a non-meta_bg file system, including all of
the backup superblock and block group descriptors is not going to be
large.  But while protecting the meta_bg group descriptors is
worthwhile, protecting the backup meta_bg's is not free, and will
increase the size of the tree by 33%.

I'm also wondering whether or not Lustre (where they do have some file
systems that are in the PB range) have run into overhead issues with
block_validity.

What do folks think?

						- Ted
Andreas Dilger Dec. 17, 2020, 4:01 p.m. UTC | #10
An extra 20-30MB of RAM for mounting a 1PB filesystem isn't
a huge deal. We already need 512MB for just the 8M group descriptors,
and we have a 1GB journal.

I haven't heard any specific performance issues with block_validity,
but it may be newer than the 3.10 kernels we are currently using on
our servers. 

Cheers, Andreas

> On Dec 15, 2020, at 13:13, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
> You did your test on an 80T file system, but that's not where someone
> would be using meta_bg.  Meta_bg gets used for much larger file systems
> than that!  With meta_bg, we have 3 block group descriptor blocks every
> 64 block groups.  Each block group describes 128M of space.  So that
> means we are going to have 3 entries in the system zone tree for every
> 8GB of file system space, or 393,216 entries for every PB.  Given that
> each entry is 40 bytes, that means that the block_validity entries
> will consume 15 megabytes per PB.
> 
> Now, one third of these entries overlap with the flex_bg entries
> (meta_bg groups are in the first, second, and last block group of each
> meta_bg, where are 64 block groups in 4k file systems), and of course,
> the default flex_bg size of 16 block groups means that there are
> 524,288 entries per PB.  So if we include all backup sb and block
> groups, in a 1 PB file system, there will be roughly 786,432 entries
> in a 1 PB file system.  (I'm ignoring the entries for the backup
> superblocks, but that's only about 20 or so extra entries.)
> 
> So for a flex_bg 1PB file system, the amount of memory for a
> block_validity data structure is roughly 20M, and including all backup
> descriptors for meta_bg on a flex_bg + meta_bg setup is roughly 30M.
> 
> I agree with you that for a non-meta_bg file system, including all of
> the backup superblock and block group descriptors is not going to be
> large.  But while protecting the meta_bg group descriptors is
> worthwhile, protecting the backup meta_bg's is not free, and will
> increase the size of the tree by 33%.
> 
> I'm also wondering whether or not Lustre (where they do have some file
> systems that are in the PB range) have run into overhead issues with
> block_validity.
> 
> What do folks think?
> 
>                       - Ted

Patch

diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 8e6ca23..37025e3 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -218,6 +218,7 @@  int ext4_setup_system_zone(struct super_block *sb)
 	struct ext4_group_desc *gdp;
 	ext4_group_t i;
 	int flex_size = ext4_flex_bg_size(sbi);
+	int gd_blks;
 	int ret;
 
 	system_blks = kzalloc(sizeof(*system_blks), GFP_KERNEL);
@@ -226,13 +227,16 @@  int ext4_setup_system_zone(struct super_block *sb)
 
 	for (i=0; i < ngroups; i++) {
 		cond_resched();
-		if (ext4_bg_has_super(sb, i) &&
-		    ((i < 5) || ((i % flex_size) == 0))) {
-			ret = add_system_zone(system_blks,
-					ext4_group_first_block_no(sb, i),
-					ext4_bg_num_gdb(sb, i) + 1, 0);
-			if (ret)
-				goto err;
+		if ((i < 5) || ((i % flex_size) == 0)) {
+			gd_blks = ext4_bg_has_super(sb, i) +
+				ext4_bg_num_gdb(sb, i);
+			if (gd_blks) {
+				ret = add_system_zone(system_blks,
+						ext4_group_first_block_no(sb, i),
+						gd_blks, 0);
+				if (ret)
+					goto err;
+			}
 		}
 		gdp = ext4_get_group_desc(sb, i, NULL);
 		ret = add_system_zone(system_blks,