diff mbox series

[1/2] ext4: fix bug_on ext4_mb_use_inode_pa

Message ID 20220521134217.312071-2-libaokun1@huawei.com
State Superseded
Headers show
Series ext4: fix two bugs in ext4_mb_normalize_request | expand

Commit Message

Baokun Li May 21, 2022, 1:42 p.m. UTC
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
 ext4_mb_new_blocks+0x9df/0x5d30
 ext4_ext_map_blocks+0x1803/0x4d80
 ext4_map_blocks+0x3a4/0x1a10
 ext4_writepages+0x126d/0x2c30
 do_writepages+0x7f/0x1b0
 __filemap_fdatawrite_range+0x285/0x3b0
 file_write_and_wait_range+0xb1/0x140
 ext4_sync_file+0x1aa/0xca0
 vfs_fsync_range+0xfb/0x260
 do_fsync+0x48/0xa0
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
           ext4_mb_mark_diskspace_used
            >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

we can easily reproduce this problem with the following commands:
	`fallocate -l100M disk`
	`mkfs.ext4 -b 1024 -g 256 disk`
	`mount disk /mnt`
	`fsstress -d /mnt -l 0 -n 1000 -p 1`

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.

Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Jan Kara May 23, 2022, 9:29 a.m. UTC | #1
On Sat 21-05-22 21:42:16, Baokun Li wrote:
> Hulk Robot reported a BUG_ON:
> ==================================================================
> kernel BUG at fs/ext4/mballoc.c:3211!
> [...]
> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
> [...]
> Call Trace:
>  ext4_mb_new_blocks+0x9df/0x5d30
>  ext4_ext_map_blocks+0x1803/0x4d80
>  ext4_map_blocks+0x3a4/0x1a10
>  ext4_writepages+0x126d/0x2c30
>  do_writepages+0x7f/0x1b0
>  __filemap_fdatawrite_range+0x285/0x3b0
>  file_write_and_wait_range+0xb1/0x140
>  ext4_sync_file+0x1aa/0xca0
>  vfs_fsync_range+0xfb/0x260
>  do_fsync+0x48/0xa0
> [...]
> ==================================================================
> 
> Above issue may happen as follows:
> -------------------------------------
> do_fsync
>  vfs_fsync_range
>   ext4_sync_file
>    file_write_and_wait_range
>     __filemap_fdatawrite_range
>      do_writepages
>       ext4_writepages
>        mpage_map_and_submit_extent
>         mpage_map_one_extent
>          ext4_map_blocks
>           ext4_mb_new_blocks
>            ext4_mb_normalize_request
>             >>> start + size <= ac->ac_o_ex.fe_logical
>            ext4_mb_regular_allocator
>             ext4_mb_simple_scan_group
>              ext4_mb_use_best_found
>               ext4_mb_new_preallocation
>                ext4_mb_new_inode_pa
>                 ext4_mb_use_inode_pa
>                  >>> set ac->ac_b_ex.fe_len <= 0
>            ext4_mb_mark_diskspace_used
>             >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
> 
> we can easily reproduce this problem with the following commands:
> 	`fallocate -l100M disk`
> 	`mkfs.ext4 -b 1024 -g 256 disk`
> 	`mount disk /mnt`
> 	`fsstress -d /mnt -l 0 -n 1000 -p 1`
> 
> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
> when the size is truncated. So start should be the start position of
> the group where ac_o_ex.fe_logical is located after alignment.
> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
> is very large, the value calculated by start_off is more accurate.
> 
> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. I'd just phrase the comment below a bit differently:

> +	/*
> +	 * Because size must be less than or equal to
> +	 * EXT4_BLOCKS_PER_GROUP, start should be the start position of
> +	 * the group where ac_o_ex.fe_logical is located after alignment.
> +	 * In addition, when the value of fe_logical or
> +	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
> +	 * by start_off is more accurate.
> +	 */
> +	start = max(start, round_down(ac->ac_o_ex.fe_logical,
> +			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
> +

Can we make the comment like:

	/*
	 * For tiny groups (smaller than 8MB) the chosen allocation
	 * alignment may be larger than group size. Make sure the alignment
	 * does not move allocation to a different group which makes mballoc
	 * fail assertions later.
	 */

With that feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
Lukas Czerner May 23, 2022, 9:58 a.m. UTC | #2
On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote:
> Hulk Robot reported a BUG_ON:
> ==================================================================
> kernel BUG at fs/ext4/mballoc.c:3211!
> [...]
> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
> [...]
> Call Trace:
>  ext4_mb_new_blocks+0x9df/0x5d30
>  ext4_ext_map_blocks+0x1803/0x4d80
>  ext4_map_blocks+0x3a4/0x1a10
>  ext4_writepages+0x126d/0x2c30
>  do_writepages+0x7f/0x1b0
>  __filemap_fdatawrite_range+0x285/0x3b0
>  file_write_and_wait_range+0xb1/0x140
>  ext4_sync_file+0x1aa/0xca0
>  vfs_fsync_range+0xfb/0x260
>  do_fsync+0x48/0xa0
> [...]
> ==================================================================
> 
> Above issue may happen as follows:
> -------------------------------------
> do_fsync
>  vfs_fsync_range
>   ext4_sync_file
>    file_write_and_wait_range
>     __filemap_fdatawrite_range
>      do_writepages
>       ext4_writepages
>        mpage_map_and_submit_extent
>         mpage_map_one_extent
>          ext4_map_blocks
>           ext4_mb_new_blocks
>            ext4_mb_normalize_request
>             >>> start + size <= ac->ac_o_ex.fe_logical
>            ext4_mb_regular_allocator
>             ext4_mb_simple_scan_group
>              ext4_mb_use_best_found
>               ext4_mb_new_preallocation
>                ext4_mb_new_inode_pa
>                 ext4_mb_use_inode_pa
>                  >>> set ac->ac_b_ex.fe_len <= 0
>            ext4_mb_mark_diskspace_used
>             >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
> 
> we can easily reproduce this problem with the following commands:
> 	`fallocate -l100M disk`
> 	`mkfs.ext4 -b 1024 -g 256 disk`
> 	`mount disk /mnt`
> 	`fsstress -d /mnt -l 0 -n 1000 -p 1`
> 
> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
> when the size is truncated. So start should be the start position of
> the group where ac_o_ex.fe_logical is located after alignment.
> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
> is very large, the value calculated by start_off is more accurate.
> 
> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index ea653d19f9ec..32410b79b664 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	size = size >> bsbits;
>  	start = start_off >> bsbits;
>  
> +	/*
> +	 * Because size must be less than or equal to
> +	 * EXT4_BLOCKS_PER_GROUP, start should be the start position of
> +	 * the group where ac_o_ex.fe_logical is located after alignment.
> +	 * In addition, when the value of fe_logical or
> +	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
> +	 * by start_off is more accurate.
> +	 */
> +	start = max(start, round_down(ac->ac_o_ex.fe_logical,
> +			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));

This does not look right. The second argument in round_down() must be a
power of two, but there is no such restriction on blocks per group.

Also I am not quite sure why do we adjust the start in this way at all?
If we found what seems to be a preallocated extent which we can use and
we're actually going to use 0 lenght extent it seems like the problem is
somewhere else? Can you desribe the problem a bit more in detail?

Maybe I need to look at the ext4_mb_normalize_request() some more.

-Lukas

> +
>  	/* don't cover already allocated blocks in selected range */
>  	if (ar->pleft && start <= ar->lleft) {
>  		size -= ar->lleft + 1 - start;
> -- 
> 2.31.1
>
Ritesh Harjani (IBM) May 23, 2022, 7:51 p.m. UTC | #3
On 22/05/21 09:42PM, Baokun Li wrote:
> Hulk Robot reported a BUG_ON:
> ==================================================================
> kernel BUG at fs/ext4/mballoc.c:3211!
> [...]
> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
> [...]
> Call Trace:
>  ext4_mb_new_blocks+0x9df/0x5d30
>  ext4_ext_map_blocks+0x1803/0x4d80
>  ext4_map_blocks+0x3a4/0x1a10
>  ext4_writepages+0x126d/0x2c30
>  do_writepages+0x7f/0x1b0
>  __filemap_fdatawrite_range+0x285/0x3b0
>  file_write_and_wait_range+0xb1/0x140
>  ext4_sync_file+0x1aa/0xca0
>  vfs_fsync_range+0xfb/0x260
>  do_fsync+0x48/0xa0
> [...]
> ==================================================================
>
> Above issue may happen as follows:
> -------------------------------------
> do_fsync
>  vfs_fsync_range
>   ext4_sync_file
>    file_write_and_wait_range
>     __filemap_fdatawrite_range
>      do_writepages
>       ext4_writepages
>        mpage_map_and_submit_extent
>         mpage_map_one_extent
>          ext4_map_blocks
>           ext4_mb_new_blocks
>            ext4_mb_normalize_request
>             >>> start + size <= ac->ac_o_ex.fe_logical
>            ext4_mb_regular_allocator
>             ext4_mb_simple_scan_group
>              ext4_mb_use_best_found
>               ext4_mb_new_preallocation
>                ext4_mb_new_inode_pa
>                 ext4_mb_use_inode_pa
>                  >>> set ac->ac_b_ex.fe_len <= 0
>            ext4_mb_mark_diskspace_used
>             >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>
> we can easily reproduce this problem with the following commands:
> 	`fallocate -l100M disk`
> 	`mkfs.ext4 -b 1024 -g 256 disk`
> 	`mount disk /mnt`
> 	`fsstress -d /mnt -l 0 -n 1000 -p 1`

Thanks for sharing the reproducer. Yes, this could easily trigger the bug_on.
We don't even need "-b 1024".


>
> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
> when the size is truncated. So start should be the start position of
> the group where ac_o_ex.fe_logical is located after alignment.
> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
> is very large, the value calculated by start_off is more accurate.
>
> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")

Commit message does say that it can result into allocation request bigger then
the size of the block group. So then, what happens in case of flex_bg feature is
enabled (which by default nowadays).
Shouldn't we consider flex_bg_groups * EXT4_BLOCKS_PER_GROUP as the allocation size request?

>
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index ea653d19f9ec..32410b79b664 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	size = size >> bsbits;
>  	start = start_off >> bsbits;
>
> +	/*
> +	 * Because size must be less than or equal to
> +	 * EXT4_BLOCKS_PER_GROUP,

We should confirm whether in case of flex_bg groups, this assumption
holds correct?

> start should be the start position of
> +	 * the group where ac_o_ex.fe_logical is located after alignment.
> +	 * In addition, when the value of fe_logical or
> +	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
> +	 * by start_off is more accurate.
> +	 */
> +	start = max(start, round_down(ac->ac_o_ex.fe_logical,
> +			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
> +

This does looks like it could solve the problem at hand.
This is because we try to allocate based on the start offset upto size.
As of now we do allocate more space then requested (due to normalization),
but due to the start offset of our preallocation, our allocation request
of ac->ac_o_ex.fe_logical + fe_len doesn't lie in the allocated PA. This happens since
we trim the size of the allocation request to only till blocks_per_group.

So with that if we make the above changes (your patch), it does looks like, that we
will then be allocating the PA such that our allocation request will fall within
the allocated PA (in ext4_mb_use_inode_pa()) and hence we won't hit the bug_on in
ext4_mb_mark_diskspace_used().


One more suggestion - I think we should also add a WARN_ON()/BUG_ON() here.
This makes the start of the problem more visible, rather then waiting till
ext4_mb_mark_diskspace_used() call to hit the bug_on().
Thoughts?

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 252c168454c7..e91d5aeb8efd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4301,6 +4301,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
        BUG_ON(start < pa->pa_pstart);
        BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len));
        BUG_ON(pa->pa_free < len);
+       WARN_ON(len <= 0);
        pa->pa_free -= len;

        mb_debug(ac->ac_sb, "use %llu/%d from inode pa %p\n", start, len, pa);


-ritesh
Lukas Czerner May 24, 2022, 12:11 p.m. UTC | #4
On Mon, May 23, 2022 at 08:19:03PM +0800, libaokun (A) wrote:
> 在 2022/5/23 17:58, Lukas Czerner 写道:
> > On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote:
> > > Hulk Robot reported a BUG_ON:
> > > ==================================================================
> > > kernel BUG at fs/ext4/mballoc.c:3211!
> > > [...]
> > > RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
> > > [...]
> > > Call Trace:
> > >   ext4_mb_new_blocks+0x9df/0x5d30
> > >   ext4_ext_map_blocks+0x1803/0x4d80
> > >   ext4_map_blocks+0x3a4/0x1a10
> > >   ext4_writepages+0x126d/0x2c30
> > >   do_writepages+0x7f/0x1b0
> > >   __filemap_fdatawrite_range+0x285/0x3b0
> > >   file_write_and_wait_range+0xb1/0x140
> > >   ext4_sync_file+0x1aa/0xca0
> > >   vfs_fsync_range+0xfb/0x260
> > >   do_fsync+0x48/0xa0
> > > [...]
> > > ==================================================================
> > > 
> > > Above issue may happen as follows:
> > > -------------------------------------
> > > do_fsync
> > >   vfs_fsync_range
> > >    ext4_sync_file
> > >     file_write_and_wait_range
> > >      __filemap_fdatawrite_range
> > >       do_writepages
> > >        ext4_writepages
> > >         mpage_map_and_submit_extent
> > >          mpage_map_one_extent
> > >           ext4_map_blocks
> > >            ext4_mb_new_blocks
> > >             ext4_mb_normalize_request
> > >              >>> start + size <= ac->ac_o_ex.fe_logical
> > >             ext4_mb_regular_allocator
> > >              ext4_mb_simple_scan_group
> > >               ext4_mb_use_best_found
> > >                ext4_mb_new_preallocation
> > >                 ext4_mb_new_inode_pa
> > >                  ext4_mb_use_inode_pa
> > >                   >>> set ac->ac_b_ex.fe_len <= 0
> > >             ext4_mb_mark_diskspace_used
> > >              >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
> > > 
> > > we can easily reproduce this problem with the following commands:
> > > 	`fallocate -l100M disk`
> > > 	`mkfs.ext4 -b 1024 -g 256 disk`
> > > 	`mount disk /mnt`
> > > 	`fsstress -d /mnt -l 0 -n 1000 -p 1`
> > > 
> > > The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> > > Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
> > > when the size is truncated. So start should be the start position of
> > > the group where ac_o_ex.fe_logical is located after alignment.
> > > In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
> > > is very large, the value calculated by start_off is more accurate.
> > > 
> > > Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
> > > Reported-by: Hulk Robot<hulkci@huawei.com>
> > > Signed-off-by: Baokun Li<libaokun1@huawei.com>
> > > ---
> > >   fs/ext4/mballoc.c | 11 +++++++++++
> > >   1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index ea653d19f9ec..32410b79b664 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> > >   	size = size >> bsbits;
> > >   	start = start_off >> bsbits;
> > > +	/*
> > > +	 * Because size must be less than or equal to
> > > +	 * EXT4_BLOCKS_PER_GROUP, start should be the start position of
> > > +	 * the group where ac_o_ex.fe_logical is located after alignment.
> > > +	 * In addition, when the value of fe_logical or
> > > +	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
> > > +	 * by start_off is more accurate.
> > > +	 */
> > > +	start = max(start, round_down(ac->ac_o_ex.fe_logical,
> > > +			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
> > This does not look right. The second argument in round_down() must be a
> > power of two, but there is no such restriction on blocks per group.
> 
> Indeed, block peer group size should be a multiple of 8. I forgot.
> 
> Thank you very much for your correction.
> 
> > Also I am not quite sure why do we adjust the start in this way at all?
> > If we found what seems to be a preallocated extent which we can use and
> > we're actually going to use 0 lenght extent it seems like the problem is
> > somewhere else? Can you desribe the problem a bit more in detail?
> > 
> > Maybe I need to look at the ext4_mb_normalize_request() some more.
> > 
> > -Lukas
> The logical block map reached before the problem stack was 1011.
> 
> The estimated location of the size logical block of the inde plus the
> required allocation length 7, the size is 1018.
> 
> But the i_size of inode is 1299, so the size is 1299, the aligned size is
> 2048, and the end is 2048.
> 
> Because of the restriction of ar -> pleft, start==648.
> 
> EXT4_BLOCKS_PER_GROUP (ac- > ac_sb) is 256, so the size is 256 and the end
> is 904.
> 
> It is not normal to truncate here, the end is less than 1299 of the target
> logical block,
> that is, the allocated range does not contain the target logical block.
> 
> Then this new scope conflicts with the previous PA, as follows:
> 
>           pa_start-506           pa_end-759
>  |____________P________V_________P__________V_____________l________|
>  0                       start-648                   end-904 logical-1299 
> 2048
> 
> In this case, start is changed to pa_end, that is, 759.
> In this case, a bug_ON is reported in ext4_mb_mark_diskspace_used.
> 
> The problem is caused by the truncation introduced in the
> cd648b8a8fd5 ("ext4: trim allocation requests to group size").
> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
> However, the truncation method is incorrect. The group where the logical is
> located should be used for allocation. If the value of EXT4_BLOCKS_PER_GROUP
> is 256, size 2048 can be divided into eight groups. If the value of logical
> is 1299,
> the value of logical must be in the sixth group, that is,
> start=1299/256*256=5*256=1280, end=size+1280=1536.
> Then, the value range can be further narrowed down based on other
> restrictions.
> 
>                                    1024     1280     1536
> |________|________|________|________|________|__l______|________|________|
> 0 group1 group2 group3 group4 group5 group6 group7 group8 2048

Ok, thanks for the explanation it makes sense now, although should not
we just adjust the start only when the size is being truncated to the
EXT4_BLOCKS_PER_GROUP?

-Lukas

> 
> > > +
> > >   	/* don't cover already allocated blocks in selected range */
> > >   	if (ar->pleft && start <= ar->lleft) {
> > >   		size -= ar->lleft + 1 - start;
> > > -- 
> > > 2.31.1
> > > 
> > .
> 
> -- 
> With Best Regards,
> Baokun Li
Baokun Li May 24, 2022, 12:42 p.m. UTC | #5
在 2022/5/24 20:11, Lukas Czerner 写道:
> On Mon, May 23, 2022 at 08:19:03PM +0800, libaokun (A) wrote:
>> 在 2022/5/23 17:58, Lukas Czerner 写道:
>>> On Sat, May 21, 2022 at 09:42:16PM +0800, Baokun Li wrote:
>>>> Hulk Robot reported a BUG_ON:
>>>> ==================================================================
>>>> kernel BUG at fs/ext4/mballoc.c:3211!
>>>> [...]
>>>> RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
>>>> [...]
>>>> Call Trace:
>>>>    ext4_mb_new_blocks+0x9df/0x5d30
>>>>    ext4_ext_map_blocks+0x1803/0x4d80
>>>>    ext4_map_blocks+0x3a4/0x1a10
>>>>    ext4_writepages+0x126d/0x2c30
>>>>    do_writepages+0x7f/0x1b0
>>>>    __filemap_fdatawrite_range+0x285/0x3b0
>>>>    file_write_and_wait_range+0xb1/0x140
>>>>    ext4_sync_file+0x1aa/0xca0
>>>>    vfs_fsync_range+0xfb/0x260
>>>>    do_fsync+0x48/0xa0
>>>> [...]
>>>> ==================================================================
>>>>
>>>> Above issue may happen as follows:
>>>> -------------------------------------
>>>> do_fsync
>>>>    vfs_fsync_range
>>>>     ext4_sync_file
>>>>      file_write_and_wait_range
>>>>       __filemap_fdatawrite_range
>>>>        do_writepages
>>>>         ext4_writepages
>>>>          mpage_map_and_submit_extent
>>>>           mpage_map_one_extent
>>>>            ext4_map_blocks
>>>>             ext4_mb_new_blocks
>>>>              ext4_mb_normalize_request
>>>>               >>> start + size <= ac->ac_o_ex.fe_logical
>>>>              ext4_mb_regular_allocator
>>>>               ext4_mb_simple_scan_group
>>>>                ext4_mb_use_best_found
>>>>                 ext4_mb_new_preallocation
>>>>                  ext4_mb_new_inode_pa
>>>>                   ext4_mb_use_inode_pa
>>>>                    >>> set ac->ac_b_ex.fe_len <= 0
>>>>              ext4_mb_mark_diskspace_used
>>>>               >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>>>>
>>>> we can easily reproduce this problem with the following commands:
>>>> 	`fallocate -l100M disk`
>>>> 	`mkfs.ext4 -b 1024 -g 256 disk`
>>>> 	`mount disk /mnt`
>>>> 	`fsstress -d /mnt -l 0 -n 1000 -p 1`
>>>>
>>>> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
>>>> Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
>>>> when the size is truncated. So start should be the start position of
>>>> the group where ac_o_ex.fe_logical is located after alignment.
>>>> In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
>>>> is very large, the value calculated by start_off is more accurate.
>>>>
>>>> Fixes: cd648b8a8fd5 ("ext4: trim allocation requests to group size")
>>>> Reported-by: Hulk Robot<hulkci@huawei.com>
>>>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>>>> ---
>>>>    fs/ext4/mballoc.c | 11 +++++++++++
>>>>    1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index ea653d19f9ec..32410b79b664 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -4107,6 +4107,17 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>>>>    	size = size >> bsbits;
>>>>    	start = start_off >> bsbits;
>>>> +	/*
>>>> +	 * Because size must be less than or equal to
>>>> +	 * EXT4_BLOCKS_PER_GROUP, start should be the start position of
>>>> +	 * the group where ac_o_ex.fe_logical is located after alignment.
>>>> +	 * In addition, when the value of fe_logical or
>>>> +	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
>>>> +	 * by start_off is more accurate.
>>>> +	 */
>>>> +	start = max(start, round_down(ac->ac_o_ex.fe_logical,
>>>> +			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
>>> This does not look right. The second argument in round_down() must be a
>>> power of two, but there is no such restriction on blocks per group.
>> Indeed, block peer group size should be a multiple of 8. I forgot.
>>
>> Thank you very much for your correction.
>>
>>> Also I am not quite sure why do we adjust the start in this way at all?
>>> If we found what seems to be a preallocated extent which we can use and
>>> we're actually going to use 0 lenght extent it seems like the problem is
>>> somewhere else? Can you desribe the problem a bit more in detail?
>>>
>>> Maybe I need to look at the ext4_mb_normalize_request() some more.
>>>
>>> -Lukas
>> The logical block map reached before the problem stack was 1011.
>>
>> The estimated location of the size logical block of the inde plus the
>> required allocation length 7, the size is 1018.
>>
>> But the i_size of inode is 1299, so the size is 1299, the aligned size is
>> 2048, and the end is 2048.
>>
>> Because of the restriction of ar -> pleft, start==648.
>>
>> EXT4_BLOCKS_PER_GROUP (ac- > ac_sb) is 256, so the size is 256 and the end
>> is 904.
>>
>> It is not normal to truncate here, the end is less than 1299 of the target
>> logical block,
>> that is, the allocated range does not contain the target logical block.
>>
>> Then this new scope conflicts with the previous PA, as follows:
>>
>>            pa_start-506           pa_end-759
>>   |____________P________V_________P__________V_____________l________|
>>   0                       start-648                   end-904 logical-1299
>> 2048
>>
>> In this case, start is changed to pa_end, that is, 759.
>> In this case, a bug_ON is reported in ext4_mb_mark_diskspace_used.
>>
>> The problem is caused by the truncation introduced in the
>> cd648b8a8fd5 ("ext4: trim allocation requests to group size").
>> The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
>> However, the truncation method is incorrect. The group where the logical is
>> located should be used for allocation. If the value of EXT4_BLOCKS_PER_GROUP
>> is 256, size 2048 can be divided into eight groups. If the value of logical
>> is 1299,
>> the value of logical must be in the sixth group, that is,
>> start=1299/256*256=5*256=1280, end=size+1280=1536.
>> Then, the value range can be further narrowed down based on other
>> restrictions.
>>
>>                                     1024     1280     1536
>> |________|________|________|________|________|__l______|________|________|
>> 0 group1 group2 group3 group4 group5 group6 group7 group8 2048
> Ok, thanks for the explanation it makes sense now, although should not
> we just adjust the start only when the size is being truncated to the
> EXT4_BLOCKS_PER_GROUP?
>
> -Lukas
Yes, it is. Assume that the value of fe_logical is 1011,
and the value of EXT4_BLOCKS_PER_GROUP is 1024.
In this case, 1011 / 1024 = 0, and 0 x 1024 is still 0.
Therefore, the value of start is 0, which does not change.

>>>> +
>>>>    	/* don't cover already allocated blocks in selected range */
>>>>    	if (ar->pleft && start <= ar->lleft) {
>>>>    		size -= ar->lleft + 1 - start;
>>>> -- 
>>>> 2.31.1
>>>>
>>> .
>> -- 
>> With Best Regards,
>> Baokun Li
> .
diff mbox series

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ea653d19f9ec..32410b79b664 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4107,6 +4107,17 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	size = size >> bsbits;
 	start = start_off >> bsbits;
 
+	/*
+	 * Because size must be less than or equal to
+	 * EXT4_BLOCKS_PER_GROUP, start should be the start position of
+	 * the group where ac_o_ex.fe_logical is located after alignment.
+	 * In addition, when the value of fe_logical or
+	 * EXT4_BLOCKS_PER_GROUP is very large, the value calculated
+	 * by start_off is more accurate.
+	 */
+	start = max(start, round_down(ac->ac_o_ex.fe_logical,
+			EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
+
 	/* don't cover already allocated blocks in selected range */
 	if (ar->pleft && start <= ar->lleft) {
 		size -= ar->lleft + 1 - start;