diff mbox series

[2/2] ext4: correct the judgment of BUG in ext4_mb_normalize_request

Message ID 20220521134217.312071-3-libaokun1@huawei.com
State Superseded
Headers show
Series ext4: fix two bugs in ext4_mb_normalize_request | expand

Commit Message

Baokun Li May 21, 2022, 1:42 p.m. UTC
When either of the "start + size <= ac->ac_o_ex.fe_logical" or
"start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
that the fe_logical is not in the allocated range.
In this case, it should be bug_ON.

Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jan Kara May 23, 2022, 9:40 a.m. UTC | #1
On Sat 21-05-22 21:42:17, Baokun Li wrote:
> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> that the fe_logical is not in the allocated range.
> In this case, it should be bug_ON.
> 
> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

I think this is actually wrong. The original condition checks whether
start + size does not overflow the used integer type. Your condition is
much stronger and I don't think it always has to be true. E.g. allocation
goal block (start variable) can be pushed to larger values by existing
preallocation or so.

								Honza 


> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 32410b79b664..d0fb57970648 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	}
>  	rcu_read_unlock();
>  
> -	if (start + size <= ac->ac_o_ex.fe_logical &&
> +	if (start + size <= ac->ac_o_ex.fe_logical ||
>  			start > ac->ac_o_ex.fe_logical) {
>  		ext4_msg(ac->ac_sb, KERN_ERR,
>  			 "start %lu, size %lu, fe_logical %lu",
> -- 
> 2.31.1
>
Lukas Czerner May 23, 2022, 10:05 a.m. UTC | #2
On Sat, May 21, 2022 at 09:42:17PM +0800, Baokun Li wrote:
> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> that the fe_logical is not in the allocated range.
> In this case, it should be bug_ON.

This seems wrong, I think that this condition is testing overflow and
it's correct as it is. Or am I missing something?

-Lukas

> 
> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 32410b79b664..d0fb57970648 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	}
>  	rcu_read_unlock();
>  
> -	if (start + size <= ac->ac_o_ex.fe_logical &&
> +	if (start + size <= ac->ac_o_ex.fe_logical ||
>  			start > ac->ac_o_ex.fe_logical) {
>  		ext4_msg(ac->ac_sb, KERN_ERR,
>  			 "start %lu, size %lu, fe_logical %lu",
> -- 
> 2.31.1
>
Ritesh Harjani (IBM) May 23, 2022, 8:08 p.m. UTC | #3
On 22/05/21 09:42PM, Baokun Li wrote:
> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> that the fe_logical is not in the allocated range.

Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
within the preallocated range. So if our start or start + size doesn't include
fe_logical then it is a bug in the ext4_mb_normalize_request() logic.

But should we be so harsh to hit a bug_on() or make it warn_on()?

Also did you run any fs tests with this change. Since it looks like this
logic existed since mballoc was introduced.


> In this case, it should be bug_ON.
>
> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")

No, there is no issue with this patch. It correctly just removes the duplicate
logic.

> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 32410b79b664..d0fb57970648 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  	}
>  	rcu_read_unlock();
>
> -	if (start + size <= ac->ac_o_ex.fe_logical &&
> +	if (start + size <= ac->ac_o_ex.fe_logical ||
>  			start > ac->ac_o_ex.fe_logical) {
>  		ext4_msg(ac->ac_sb, KERN_ERR,
>  			 "start %lu, size %lu, fe_logical %lu",
> --
> 2.31.1
>
Jan Kara May 23, 2022, 9:08 p.m. UTC | #4
On Tue 24-05-22 01:38:44, Ritesh Harjani wrote:
> On 22/05/21 09:42PM, Baokun Li wrote:
> > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > that the fe_logical is not in the allocated range.
> 
> Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> within the preallocated range. So if our start or start + size doesn't include
> fe_logical then it is a bug in the ext4_mb_normalize_request() logic.

I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a
hard guarantee that we would allocate extent that includes that block. It
was always treated as a hint only. In particular if you look at the logic
in ext4_mb_normalize_request() it does shift the start of the allocation to
avoid preallocated ranges etc. so I don't see how we are guaranteed that
ext4_mb_normalize_request() will result in an allocation request that
includes ac->ac_o_ex.fe_logical.

								Honza
Baokun Li May 24, 2022, 6:09 a.m. UTC | #5
在 2022/5/24 4:08, Ritesh Harjani 写道:
> On 22/05/21 09:42PM, Baokun Li wrote:
>> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
>> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
>> that the fe_logical is not in the allocated range.
> Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> within the preallocated range. So if our start or start + size doesn't include
> fe_logical then it is a bug in the ext4_mb_normalize_request() logic.
Yes, exactly.
> But should we be so harsh to hit a bug_on() or make it warn_on()?
I don't think hit a bug_on() is a problem. BUG_ON is not triggered here 
and will be triggered later.
> Also did you run any fs tests with this change.
Yes, I ran xfstests on ext3 and ext4 and found no problems.
> Since it looks like this
> logic existed since mballoc was introduced.
>
Yes, on our coverage report, those lines of code never seem to get there.

>> In this case, it should be bug_ON.
>>
>> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
> No, there is no issue with this patch. It correctly just removes the duplicate
> logic.
Okay, I'm going to remove this tag.
>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>> ---
>>   fs/ext4/mballoc.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 32410b79b664..d0fb57970648 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -4190,7 +4190,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>>   	}
>>   	rcu_read_unlock();
>>
>> -	if (start + size <= ac->ac_o_ex.fe_logical &&
>> +	if (start + size <= ac->ac_o_ex.fe_logical ||
>>   			start > ac->ac_o_ex.fe_logical) {
>>   		ext4_msg(ac->ac_sb, KERN_ERR,
>>   			 "start %lu, size %lu, fe_logical %lu",
>> --
>> 2.31.1
>>
> .
Ritesh Harjani (IBM) May 24, 2022, 6:26 a.m. UTC | #6
On 22/05/23 11:08PM, Jan Kara wrote:
> On Tue 24-05-22 01:38:44, Ritesh Harjani wrote:
> > On 22/05/21 09:42PM, Baokun Li wrote:
> > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > that the fe_logical is not in the allocated range.
> >
> > Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> > within the preallocated range. So if our start or start + size doesn't include
> > fe_logical then it is a bug in the ext4_mb_normalize_request() logic.
>
> I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a
> hard guarantee that we would allocate extent that includes that block. It

Agree that the guarantee is not about the extent which finally gets allocated.
It is only within ext4_mb_normalize_request() that the "start" and "size"
variable calculations is done in such a way that the ac->ac_o_ex.fe_logical
block will always fall within the "start" & "end" boundaries after
normalization.

That is how it also updates the goal block at the end too. ac->ac_g_ex.

> was always treated as a hint only. In particular if you look at the logic
> in ext4_mb_normalize_request() it does shift the start of the allocation to
> avoid preallocated ranges etc.

Yes, I checked the logic of ext4_mb_normalize_request() again.
As I see it (I can be wrong, so please correct me), there is always an attempt
to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical
except just one change where we are trimming the size of the request to only
EXT4_BLOCKS_PER_GROUP.

For e.g. when it compares against preallocated ranges. It has a BUG() which
checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range.
Because then we should never have come here to do allocation of a new block.

4143                 /* PA must not overlap original request */
4144                 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
4145                         ac->ac_o_ex.fe_logical < pa->pa_lstart));
<...>
4152                 BUG_ON(pa->pa_lstart <= start && pa_end >= end);

Then after skipping the preallocated regions which doesn't fall in between
"start" and "end"...

4147                 /* skip PAs this normalized request doesn't overlap with */
4148                 if (pa->pa_lstart >= end || pa_end <= start) {
4149                         spin_unlock(&pa->pa_lock);
4150                         continue;
4151                 }

...it adjusts "start" and "end" boundary according to allocated PAs boundaries
such that fe_logical is always in between "start" and "end".

4154                 /* adjust start or end to be adjacent to this pa */
4155                 if (pa_end <= ac->ac_o_ex.fe_logical) {
4156                         BUG_ON(pa_end < start);
4157                         start = pa_end;
4158                 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
4159                         BUG_ON(pa->pa_lstart > end);
4160                         end = pa->pa_lstart;
4161                 }



> so I don't see how we are guaranteed that
> ext4_mb_normalize_request() will result in an allocation request that
> includes ac->ac_o_ex.fe_logical.

It could be I am wrong, but looks like ext4_mb_normalize_request() keeps
ac->ac_o_ex.fe_logical within "start" and "end" of allocation request.
And then updates the goal block.

4196         ac->ac_g_ex.fe_logical = start;
4197         ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

Thoughts?

-ritesh
Jan Kara May 24, 2022, 9:30 a.m. UTC | #7
On Mon 23-05-22 21:04:16, libaokun (A) wrote:
> 在 2022/5/23 17:40, Jan Kara 写道:
> > On Sat 21-05-22 21:42:17, Baokun Li wrote:
> > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > that the fe_logical is not in the allocated range.
> > > In this case, it should be bug_ON.
> > > 
> > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
> > > Signed-off-by: Baokun Li<libaokun1@huawei.com>
> > I think this is actually wrong. The original condition checks whether
> > start + size does not overflow the used integer type. Your condition is
> > much stronger and I don't think it always has to be true. E.g. allocation
> > goal block (start variable) can be pushed to larger values by existing
> > preallocation or so.
> > 
> > 								Honza
> > 
> I think there are two reasons for this:
> 
> First of all, the code here is as follows.
> ```
>         size = end - start;
>         [...]
> if (start + size <= ac->ac_o_ex.fe_logical &&
>                         start > ac->ac_o_ex.fe_logical) {
>                 ext4_msg(ac->ac_sb, KERN_ERR,
>                          "start %lu, size %lu, fe_logical %lu",
>                          (unsigned long) start, (unsigned long) size,
>                          (unsigned long) ac->ac_o_ex.fe_logical);
> BUG();
> }
>         BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
> ```
> First of all, there is no need to compare with ac_o_ex.fe_logical if it is
> to determine whether there is an overflow.
> Because the previous logic guarantees start < = ac_o_ex.fe_logical, and

How does it guarantee that? The logic:

        if (ar->pleft && start <= ar->lleft) {
                size -= ar->lleft + 1 - start;
                start = ar->lleft + 1;
        }

can move 'start' to further blocks...

> limits the scope of size in
> "BUG_ON (size < = 0 | | size > EXT4_BLOCKS_PER_GROUP (ac- > ac_sb))"
> immediately following.

OK, but what guarantees that ac_o_ex.fe_logical < UINT_MAX - size?

> Secondly, the following code flow also reflects this logic.
> 
>            ext4_mb_normalize_request
>             >>> start + size <= ac->ac_o_ex.fe_logical
>            ext4_mb_regular_allocator
>             ext4_mb_simple_scan_group
>              ext4_mb_use_best_found
>               ext4_mb_new_preallocation
>                ext4_mb_new_inode_pa
>                 ext4_mb_use_inode_pa
>                  >>> set ac->ac_b_ex.fe_len <= 0
>            ext4_mb_mark_diskspace_used
>             >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
> 
> In ext4_mb_use_inode_pa, you have the following code.
> ```
> start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
> end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi,
> ac->ac_o_ex.fe_len));
> len = EXT4_NUM_B2C(sbi, end - start);
> ac->ac_b_ex.fe_len = len;
> ```
> The starting position in ext4_mb_mark_diskspace_used will be assert.
> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>  
> When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of
> end - start must be greater than 0.
> However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this
> bug_ON may be triggered.
> When this bug_ON is triggered, that is,
> 
> ac->ac_b_ex.fe_len <= 0
> end - start <= 0
> end <= start
> pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart +
> (ac->ac_o_ex.fe_logical - pa->pa_lstart)
> pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart
> pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical
> start + size <= ac->ac_o_ex.fe_logical
> 
> So I think that "&&" here should be changed to "||".

Sorry, I still disagree. After some more code reading I agree that
ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks
to be placed in the inode so logical extent of allocated blocks should include
ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you
suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which
case we can also remove some other code from ext4_mb_normalize_request()).

									Honza
Jan Kara May 24, 2022, 9:39 a.m. UTC | #8
On Tue 24-05-22 11:56:55, Ritesh Harjani wrote:
> On 22/05/23 11:08PM, Jan Kara wrote:
> > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote:
> > > On 22/05/21 09:42PM, Baokun Li wrote:
> > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > > that the fe_logical is not in the allocated range.
> > >
> > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> > > within the preallocated range. So if our start or start + size doesn't include
> > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic.
> >
> > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a
> > hard guarantee that we would allocate extent that includes that block. It
> 
> Agree that the guarantee is not about the extent which finally gets allocated.
> It is only within ext4_mb_normalize_request() that the "start" and "size"
> variable calculations is done in such a way that the ac->ac_o_ex.fe_logical
> block will always fall within the "start" & "end" boundaries after
> normalization.
> 
> That is how it also updates the goal block at the end too. ac->ac_g_ex.
> 
> > was always treated as a hint only. In particular if you look at the logic
> > in ext4_mb_normalize_request() it does shift the start of the allocation to
> > avoid preallocated ranges etc.
> 
> Yes, I checked the logic of ext4_mb_normalize_request() again.
> As I see it (I can be wrong, so please correct me), there is always an attempt
> to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical
> except just one change where we are trimming the size of the request to only
> EXT4_BLOCKS_PER_GROUP.
> 
> For e.g. when it compares against preallocated ranges. It has a BUG() which
> checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range.
> Because then we should never have come here to do allocation of a new block.
> 
> 4143                 /* PA must not overlap original request */
> 4144                 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
> 4145                         ac->ac_o_ex.fe_logical < pa->pa_lstart));
> <...>
> 4152                 BUG_ON(pa->pa_lstart <= start && pa_end >= end);
> 
> Then after skipping the preallocated regions which doesn't fall in between
> "start" and "end"...
> 
> 4147                 /* skip PAs this normalized request doesn't overlap with */
> 4148                 if (pa->pa_lstart >= end || pa_end <= start) {
> 4149                         spin_unlock(&pa->pa_lock);
> 4150                         continue;
> 4151                 }
> 
> ...it adjusts "start" and "end" boundary according to allocated PAs boundaries
> such that fe_logical is always in between "start" and "end".
> 
> 4154                 /* adjust start or end to be adjacent to this pa */
> 4155                 if (pa_end <= ac->ac_o_ex.fe_logical) {
> 4156                         BUG_ON(pa_end < start);
> 4157                         start = pa_end;
> 4158                 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
> 4159                         BUG_ON(pa->pa_lstart > end);
> 4160                         end = pa->pa_lstart;
> 4161                 }
> 
> 
> 
> > so I don't see how we are guaranteed that
> > ext4_mb_normalize_request() will result in an allocation request that
> > includes ac->ac_o_ex.fe_logical.
> 
> It could be I am wrong, but looks like ext4_mb_normalize_request() keeps
> ac->ac_o_ex.fe_logical within "start" and "end" of allocation request.
> And then updates the goal block.
> 
> 4196         ac->ac_g_ex.fe_logical = start;
> 4197         ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
> 
> Thoughts?

Right, after some more inspection the only thing I'm concerned about is:

        /* don't cover already allocated blocks in selected range */
        if (ar->pleft && start <= ar->lleft) {
                size -= ar->lleft + 1 - start;
                start = ar->lleft + 1;
        }

which can shift start beyond ac->ac_o_ex.fe_logical if the block would be
already allocated. But I guess in that case we should not be calling
ext4_mb_normalize_request()? ... some more code digging .. Yes, that is
guaranteed in how lleft is initialized in ext4_ext_map_blocks(). So OK, I
withdraw my objection to the stronger check but the changelog really needs
to do a better job to explain why the stronger condition should be true.

								Honza
Baokun Li May 24, 2022, 1:44 p.m. UTC | #9
在 2022/5/24 17:30, Jan Kara 写道:
> On Mon 23-05-22 21:04:16, libaokun (A) wrote:
>> 在 2022/5/23 17:40, Jan Kara 写道:
>>> On Sat 21-05-22 21:42:17, Baokun Li wrote:
>>>> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
>>>> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
>>>> that the fe_logical is not in the allocated range.
>>>> In this case, it should be bug_ON.
>>>>
>>>> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
>>>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>>> I think this is actually wrong. The original condition checks whether
>>> start + size does not overflow the used integer type. Your condition is
>>> much stronger and I don't think it always has to be true. E.g. allocation
>>> goal block (start variable) can be pushed to larger values by existing
>>> preallocation or so.
>>>
>>> 								Honza
>>>
>> I think there are two reasons for this:
>>
>> First of all, the code here is as follows.
>> ```
>>          size = end - start;
>>          [...]
>> if (start + size <= ac->ac_o_ex.fe_logical &&
>>                          start > ac->ac_o_ex.fe_logical) {
>>                  ext4_msg(ac->ac_sb, KERN_ERR,
>>                           "start %lu, size %lu, fe_logical %lu",
>>                           (unsigned long) start, (unsigned long) size,
>>                           (unsigned long) ac->ac_o_ex.fe_logical);
>> BUG();
>> }
>>          BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
>> ```
>> First of all, there is no need to compare with ac_o_ex.fe_logical if it is
>> to determine whether there is an overflow.
>> Because the previous logic guarantees start < = ac_o_ex.fe_logical, and
> How does it guarantee that? The logic:
>
>          if (ar->pleft && start <= ar->lleft) {
>                  size -= ar->lleft + 1 - start;
>                  start = ar->lleft + 1;
>          }
>
> can move 'start' to further blocks...
This is not the case. According to the code of the preceding process,
ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks.
ar->pleft is the first allocated block found to the left by map->m_lblk 
(that is, fe_logical),
and ar->pright is the first allocated block found to the right.
ar->lleft and ar->lright are logical block numbers, so there must be 
"ar->lleft < ac_o_ex.fe_logical < ar->lright".
>
>> limits the scope of size in
>> "BUG_ON (size < = 0 | | size > EXT4_BLOCKS_PER_GROUP (ac- > ac_sb))"
>> immediately following.
> OK, but what guarantees that ac_o_ex.fe_logical < UINT_MAX - size?
When ac_o_ex.fe_logical is too large to overflow, predict filesize 
enters the last branch.
In this case, start = ac->ac_o_ex.fe_logical and size = ac->ac_o_ex.fe_len.
However, the overflow is checked in ext4_ext_check_overlap of 
ext4_ext_map_blocks.
The code is as follows:
```
1898         /* check for wrap through zero on extent logical start block*/
1899         if (b1 + len1 < b1) {
1900                 len1 = EXT_MAX_BLOCKS - b1;
1901                 newext->ee_len = cpu_to_le16(len1);
1902                 ret = 1;
1903         }
```
Therefore, no overflow occurs.
>
>> Secondly, the following code flow also reflects this logic.
>>
>>             ext4_mb_normalize_request
>>              >>> start + size <= ac->ac_o_ex.fe_logical
>>             ext4_mb_regular_allocator
>>              ext4_mb_simple_scan_group
>>               ext4_mb_use_best_found
>>                ext4_mb_new_preallocation
>>                 ext4_mb_new_inode_pa
>>                  ext4_mb_use_inode_pa
>>                   >>> set ac->ac_b_ex.fe_len <= 0
>>             ext4_mb_mark_diskspace_used
>>              >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>>
>> In ext4_mb_use_inode_pa, you have the following code.
>> ```
>> start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
>> end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi,
>> ac->ac_o_ex.fe_len));
>> len = EXT4_NUM_B2C(sbi, end - start);
>> ac->ac_b_ex.fe_len = len;
>> ```
>> The starting position in ext4_mb_mark_diskspace_used will be assert.
>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>>   
>> When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of
>> end - start must be greater than 0.
>> However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this
>> bug_ON may be triggered.
>> When this bug_ON is triggered, that is,
>>
>> ac->ac_b_ex.fe_len <= 0
>> end - start <= 0
>> end <= start
>> pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart +
>> (ac->ac_o_ex.fe_logical - pa->pa_lstart)
>> pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart
>> pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical
>> start + size <= ac->ac_o_ex.fe_logical
>>
>> So I think that "&&" here should be changed to "||".
> Sorry, I still disagree. After some more code reading I agree that
> ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks
> to be placed in the inode so logical extent of allocated blocks should include
> ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you
> suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which
> case we can also remove some other code from ext4_mb_normalize_request()).
>
> 									Honza
>
What codes are you referring to that can be deleted?
Ritesh Harjani (IBM) May 24, 2022, 5:31 p.m. UTC | #10
On 22/05/24 11:39AM, Jan Kara wrote:
> On Tue 24-05-22 11:56:55, Ritesh Harjani wrote:
> > On 22/05/23 11:08PM, Jan Kara wrote:
> > > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote:
> > > > On 22/05/21 09:42PM, Baokun Li wrote:
> > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > > > that the fe_logical is not in the allocated range.
> > > >
> > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> > > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> > > > within the preallocated range. So if our start or start + size doesn't include
> > > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic.
> > >
> > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a
> > > hard guarantee that we would allocate extent that includes that block. It
> >
> > Agree that the guarantee is not about the extent which finally gets allocated.
> > It is only within ext4_mb_normalize_request() that the "start" and "size"
> > variable calculations is done in such a way that the ac->ac_o_ex.fe_logical
> > block will always fall within the "start" & "end" boundaries after
> > normalization.
> >
> > That is how it also updates the goal block at the end too. ac->ac_g_ex.
> >
> > > was always treated as a hint only. In particular if you look at the logic
> > > in ext4_mb_normalize_request() it does shift the start of the allocation to
> > > avoid preallocated ranges etc.
> >
> > Yes, I checked the logic of ext4_mb_normalize_request() again.
> > As I see it (I can be wrong, so please correct me), there is always an attempt
> > to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical
> > except just one change where we are trimming the size of the request to only
> > EXT4_BLOCKS_PER_GROUP.
> >
> > For e.g. when it compares against preallocated ranges. It has a BUG() which
> > checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range.
> > Because then we should never have come here to do allocation of a new block.
> >
> > 4143                 /* PA must not overlap original request */
> > 4144                 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
> > 4145                         ac->ac_o_ex.fe_logical < pa->pa_lstart));
> > <...>
> > 4152                 BUG_ON(pa->pa_lstart <= start && pa_end >= end);
> >
> > Then after skipping the preallocated regions which doesn't fall in between
> > "start" and "end"...
> >
> > 4147                 /* skip PAs this normalized request doesn't overlap with */
> > 4148                 if (pa->pa_lstart >= end || pa_end <= start) {
> > 4149                         spin_unlock(&pa->pa_lock);
> > 4150                         continue;
> > 4151                 }
> >
> > ...it adjusts "start" and "end" boundary according to allocated PAs boundaries
> > such that fe_logical is always in between "start" and "end".
> >
> > 4154                 /* adjust start or end to be adjacent to this pa */
> > 4155                 if (pa_end <= ac->ac_o_ex.fe_logical) {
> > 4156                         BUG_ON(pa_end < start);
> > 4157                         start = pa_end;
> > 4158                 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
> > 4159                         BUG_ON(pa->pa_lstart > end);
> > 4160                         end = pa->pa_lstart;
> > 4161                 }
> >
> >
> >
> > > so I don't see how we are guaranteed that
> > > ext4_mb_normalize_request() will result in an allocation request that
> > > includes ac->ac_o_ex.fe_logical.
> >
> > It could be I am wrong, but looks like ext4_mb_normalize_request() keeps
> > ac->ac_o_ex.fe_logical within "start" and "end" of allocation request.
> > And then updates the goal block.
> >
> > 4196         ac->ac_g_ex.fe_logical = start;
> > 4197         ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
> >
> > Thoughts?
>
> Right, after some more inspection the only thing I'm concerned about is:
>
>         /* don't cover already allocated blocks in selected range */
>         if (ar->pleft && start <= ar->lleft) {
>                 size -= ar->lleft + 1 - start;
>                 start = ar->lleft + 1;
>         }
>
> which can shift start beyond ac->ac_o_ex.fe_logical if the block would be
> already allocated. But I guess in that case we should not be calling
> ext4_mb_normalize_request()? ... some more code digging .. Yes, that is
> guaranteed in how lleft is initialized in ext4_ext_map_blocks().

Yes.

> So OK, I withdraw my objection to the stronger check but the changelog really needs

Thanks Jan for confirming it.


> to do a better job to explain why the stronger condition should be true.

Agreed.


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 252c168454c7..9e7c145e9aa2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4179,7 +4179,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
        }
        rcu_read_unlock();

-       if (start + size <= ac->ac_o_ex.fe_logical &&
+       /*
+        * In this function "start" and "size" are normalized for better
+        * alignment and length such that we could preallocate more blocks.
+        * This normalization is done such that original request of
+        * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and
+        * "size" boundaries.

Does above comment look good to you?


+        * (Note fe_len can be relaxed since FS block allocation API does not
+        * provide gurantee on number of contiguous blocks allocation since that
+        * depends upon free space left, etc).
+        * In case of inode pa, later we use the allocated blocks
+        * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated
+        * range of goal/best blocks [start, size] to put it at the
+        * ac_o_ex.fe_logical extent of this inode.
+        * (See ext4_mb_use_inode_pa() for more details)
+        */

^^^ We can even put more info if needed (maybe in commit message?)

+       if (start + size <= ac->ac_o_ex.fe_logical ||
                        start > ac->ac_o_ex.fe_logical) {
                ext4_msg(ac->ac_sb, KERN_ERR,
                         "start %lu, size %lu, fe_logical %lu",

FYI - I ran the fsstress test (with -g 256) shared by Baokun with only above
change (&& -> ||) and not the original fix. And I see that we can hit this
mentioned BUG() now.

<logs>
========
[  599.619875] EXT4-fs (loop2): start 692, size 196, fe_logical 982
	...I think we should also add (orig_size >> bsbits) in above print msg ^^

[  599.621043] ------------[ cut here ]------------
[  599.625099] kernel BUG at fs/ext4/mballoc.c:4188!

-ritesh
Jan Kara May 25, 2022, 11:29 a.m. UTC | #11
On Tue 24-05-22 21:44:31, Baokun Li wrote:
> 在 2022/5/24 17:30, Jan Kara 写道:
> > On Mon 23-05-22 21:04:16, libaokun (A) wrote:
> > > 在 2022/5/23 17:40, Jan Kara 写道:
> > > > On Sat 21-05-22 21:42:17, Baokun Li wrote:
> > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > > > that the fe_logical is not in the allocated range.
> > > > > In this case, it should be bug_ON.
> > > > > 
> > > > > Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
> > > > > Signed-off-by: Baokun Li<libaokun1@huawei.com>
> > > > I think this is actually wrong. The original condition checks whether
> > > > start + size does not overflow the used integer type. Your condition is
> > > > much stronger and I don't think it always has to be true. E.g. allocation
> > > > goal block (start variable) can be pushed to larger values by existing
> > > > preallocation or so.
> > > > 
> > > > 								Honza
> > > > 
> > > I think there are two reasons for this:
> > > 
> > > First of all, the code here is as follows.
> > > ```
> > >          size = end - start;
> > >          [...]
> > > if (start + size <= ac->ac_o_ex.fe_logical &&
> > >                          start > ac->ac_o_ex.fe_logical) {
> > >                  ext4_msg(ac->ac_sb, KERN_ERR,
> > >                           "start %lu, size %lu, fe_logical %lu",
> > >                           (unsigned long) start, (unsigned long) size,
> > >                           (unsigned long) ac->ac_o_ex.fe_logical);
> > > BUG();
> > > }
> > >          BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
> > > ```
> > > First of all, there is no need to compare with ac_o_ex.fe_logical if it is
> > > to determine whether there is an overflow.
> > > Because the previous logic guarantees start < = ac_o_ex.fe_logical, and
> > How does it guarantee that? The logic:
> > 
> >          if (ar->pleft && start <= ar->lleft) {
> >                  size -= ar->lleft + 1 - start;
> >                  start = ar->lleft + 1;
> >          }
> > 
> > can move 'start' to further blocks...
> This is not the case. According to the code of the preceding process,
> ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks.
> ar->pleft is the first allocated block found to the left by map->m_lblk
> (that is, fe_logical),
> and ar->pright is the first allocated block found to the right.
> ar->lleft and ar->lright are logical block numbers, so there must be
> "ar->lleft < ac_o_ex.fe_logical < ar->lright".

Right, I've found that out after sending my previous email. Sorry for
confusion.

> > > Secondly, the following code flow also reflects this logic.
> > > 
> > >             ext4_mb_normalize_request
> > >              >>> start + size <= ac->ac_o_ex.fe_logical
> > >             ext4_mb_regular_allocator
> > >              ext4_mb_simple_scan_group
> > >               ext4_mb_use_best_found
> > >                ext4_mb_new_preallocation
> > >                 ext4_mb_new_inode_pa
> > >                  ext4_mb_use_inode_pa
> > >                   >>> set ac->ac_b_ex.fe_len <= 0
> > >             ext4_mb_mark_diskspace_used
> > >              >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
> > > 
> > > In ext4_mb_use_inode_pa, you have the following code.
> > > ```
> > > start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
> > > end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi,
> > > ac->ac_o_ex.fe_len));
> > > len = EXT4_NUM_B2C(sbi, end - start);
> > > ac->ac_b_ex.fe_len = len;
> > > ```
> > > The starting position in ext4_mb_mark_diskspace_used will be assert.
> > > BUG_ON(ac->ac_b_ex.fe_len <= 0);
> > > When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of
> > > end - start must be greater than 0.
> > > However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this
> > > bug_ON may be triggered.
> > > When this bug_ON is triggered, that is,
> > > 
> > > ac->ac_b_ex.fe_len <= 0
> > > end - start <= 0
> > > end <= start
> > > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart +
> > > (ac->ac_o_ex.fe_logical - pa->pa_lstart)
> > > pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart
> > > pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical
> > > start + size <= ac->ac_o_ex.fe_logical
> > > 
> > > So I think that "&&" here should be changed to "||".
> > Sorry, I still disagree. After some more code reading I agree that
> > ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks
> > to be placed in the inode so logical extent of allocated blocks should include
> > ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you
> > suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which
> > case we can also remove some other code from ext4_mb_normalize_request()).
> > 
> > 									Honza
> > 
> What codes are you referring to that can be deleted?

So I though the shifting of 'start' by lleft cannot happen but then I
realized that if 'start' got aligned down, it can now be lower than lleft
so the shifting is indeed needed. So all code is needed there.

								Honza
Jan Kara May 25, 2022, 12:12 p.m. UTC | #12
On Tue 24-05-22 23:01:35, Ritesh Harjani wrote:
> On 22/05/24 11:39AM, Jan Kara wrote:
> > On Tue 24-05-22 11:56:55, Ritesh Harjani wrote:
> > > On 22/05/23 11:08PM, Jan Kara wrote:
> > > > On Tue 24-05-22 01:38:44, Ritesh Harjani wrote:
> > > > > On 22/05/21 09:42PM, Baokun Li wrote:
> > > > > > When either of the "start + size <= ac->ac_o_ex.fe_logical" or
> > > > > > "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
> > > > > > that the fe_logical is not in the allocated range.
> > > > >
> > > > > Sounds about right to me based on the logic in ext4_mb_use_inode_pa().
> > > > > We try to allocate/preallocate such that ac->ac_o_ex.fe_logical should fall
> > > > > within the preallocated range. So if our start or start + size doesn't include
> > > > > fe_logical then it is a bug in the ext4_mb_normalize_request() logic.
> > > >
> > > > I agree ac->ac_o_ex.fe_logical is a goal block. But AFAIK it never was a
> > > > hard guarantee that we would allocate extent that includes that block. It
> > >
> > > Agree that the guarantee is not about the extent which finally gets allocated.
> > > It is only within ext4_mb_normalize_request() that the "start" and "size"
> > > variable calculations is done in such a way that the ac->ac_o_ex.fe_logical
> > > block will always fall within the "start" & "end" boundaries after
> > > normalization.
> > >
> > > That is how it also updates the goal block at the end too. ac->ac_g_ex.
> > >
> > > > was always treated as a hint only. In particular if you look at the logic
> > > > in ext4_mb_normalize_request() it does shift the start of the allocation to
> > > > avoid preallocated ranges etc.
> > >
> > > Yes, I checked the logic of ext4_mb_normalize_request() again.
> > > As I see it (I can be wrong, so please correct me), there is always an attempt
> > > to make "start" & "start + size" such that it covers ac->ac_o_ex.fe_logical
> > > except just one change where we are trimming the size of the request to only
> > > EXT4_BLOCKS_PER_GROUP.
> > >
> > > For e.g. when it compares against preallocated ranges. It has a BUG() which
> > > checks if the ac->ac_o_ex.fe_logical already lies in the preallocated range.
> > > Because then we should never have come here to do allocation of a new block.
> > >
> > > 4143                 /* PA must not overlap original request */
> > > 4144                 BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
> > > 4145                         ac->ac_o_ex.fe_logical < pa->pa_lstart));
> > > <...>
> > > 4152                 BUG_ON(pa->pa_lstart <= start && pa_end >= end);
> > >
> > > Then after skipping the preallocated regions which doesn't fall in between
> > > "start" and "end"...
> > >
> > > 4147                 /* skip PAs this normalized request doesn't overlap with */
> > > 4148                 if (pa->pa_lstart >= end || pa_end <= start) {
> > > 4149                         spin_unlock(&pa->pa_lock);
> > > 4150                         continue;
> > > 4151                 }
> > >
> > > ...it adjusts "start" and "end" boundary according to allocated PAs boundaries
> > > such that fe_logical is always in between "start" and "end".
> > >
> > > 4154                 /* adjust start or end to be adjacent to this pa */
> > > 4155                 if (pa_end <= ac->ac_o_ex.fe_logical) {
> > > 4156                         BUG_ON(pa_end < start);
> > > 4157                         start = pa_end;
> > > 4158                 } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
> > > 4159                         BUG_ON(pa->pa_lstart > end);
> > > 4160                         end = pa->pa_lstart;
> > > 4161                 }
> > >
> > >
> > >
> > > > so I don't see how we are guaranteed that
> > > > ext4_mb_normalize_request() will result in an allocation request that
> > > > includes ac->ac_o_ex.fe_logical.
> > >
> > > It could be I am wrong, but looks like ext4_mb_normalize_request() keeps
> > > ac->ac_o_ex.fe_logical within "start" and "end" of allocation request.
> > > And then updates the goal block.
> > >
> > > 4196         ac->ac_g_ex.fe_logical = start;
> > > 4197         ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
> > >
> > > Thoughts?
> >
> > Right, after some more inspection the only thing I'm concerned about is:
> >
> >         /* don't cover already allocated blocks in selected range */
> >         if (ar->pleft && start <= ar->lleft) {
> >                 size -= ar->lleft + 1 - start;
> >                 start = ar->lleft + 1;
> >         }
> >
> > which can shift start beyond ac->ac_o_ex.fe_logical if the block would be
> > already allocated. But I guess in that case we should not be calling
> > ext4_mb_normalize_request()? ... some more code digging .. Yes, that is
> > guaranteed in how lleft is initialized in ext4_ext_map_blocks().
> 
> Yes.
> 
> > So OK, I withdraw my objection to the stronger check but the changelog really needs
> 
> Thanks Jan for confirming it.
> 
> 
> > to do a better job to explain why the stronger condition should be true.
> 
> Agreed.
> 
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 252c168454c7..9e7c145e9aa2 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4179,7 +4179,22 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>         }
>         rcu_read_unlock();
> 
> -       if (start + size <= ac->ac_o_ex.fe_logical &&
> +       /*
> +        * In this function "start" and "size" are normalized for better
> +        * alignment and length such that we could preallocate more blocks.
> +        * This normalization is done such that original request of
> +        * ac->ac_o_ex.fe_logical & fe_len should always lie within "start" and
> +        * "size" boundaries.
> 
> Does above comment look good to you?

Yes, thanks!

> +        * (Note fe_len can be relaxed since FS block allocation API does not
> +        * provide gurantee on number of contiguous blocks allocation since that
> +        * depends upon free space left, etc).
> +        * In case of inode pa, later we use the allocated blocks
> +        * [pa_start + fe_logical - pa_lstart, fe_len/size] from the preallocated
> +        * range of goal/best blocks [start, size] to put it at the
> +        * ac_o_ex.fe_logical extent of this inode.
> +        * (See ext4_mb_use_inode_pa() for more details)
> +        */
> 
> ^^^ We can even put more info if needed (maybe in commit message?)

I'd just put into commit message a note like: ext4_mb_normalize_request()
can move logical start of allocated blocks to reduce fragmentation and
better utilize preallocation. However logical block requested as a start of
allocation (ac->ac_o_ex.fe_logical) should always be covered by allocated
blocks so we add assertion to check that.

> 
> +       if (start + size <= ac->ac_o_ex.fe_logical ||
>                         start > ac->ac_o_ex.fe_logical) {
>                 ext4_msg(ac->ac_sb, KERN_ERR,
>                          "start %lu, size %lu, fe_logical %lu",
> 
> FYI - I ran the fsstress test (with -g 256) shared by Baokun with only above
> change (&& -> ||) and not the original fix. And I see that we can hit this
> mentioned BUG() now.

Cool.

								Honza
Baokun Li May 26, 2022, 1:16 a.m. UTC | #13
在 2022/5/25 19:29, Jan Kara 写道:
> On Tue 24-05-22 21:44:31, Baokun Li wrote:
>> 在 2022/5/24 17:30, Jan Kara 写道:
>>> On Mon 23-05-22 21:04:16, libaokun (A) wrote:
>>>> 在 2022/5/23 17:40, Jan Kara 写道:
>>>>> On Sat 21-05-22 21:42:17, Baokun Li wrote:
>>>>>> When either of the "start + size <= ac->ac_o_ex.fe_logical" or
>>>>>> "start > ac->ac_o_ex.fe_logical" conditions is met, it indicates
>>>>>> that the fe_logical is not in the allocated range.
>>>>>> In this case, it should be bug_ON.
>>>>>>
>>>>>> Fixes: dfe076c106f6 ("ext4: get rid of code duplication")
>>>>>> Signed-off-by: Baokun Li<libaokun1@huawei.com>
>>>>> I think this is actually wrong. The original condition checks whether
>>>>> start + size does not overflow the used integer type. Your condition is
>>>>> much stronger and I don't think it always has to be true. E.g. allocation
>>>>> goal block (start variable) can be pushed to larger values by existing
>>>>> preallocation or so.
>>>>>
>>>>> 								Honza
>>>>>
>>>> I think there are two reasons for this:
>>>>
>>>> First of all, the code here is as follows.
>>>> ```
>>>>           size = end - start;
>>>>           [...]
>>>> if (start + size <= ac->ac_o_ex.fe_logical &&
>>>>                           start > ac->ac_o_ex.fe_logical) {
>>>>                   ext4_msg(ac->ac_sb, KERN_ERR,
>>>>                            "start %lu, size %lu, fe_logical %lu",
>>>>                            (unsigned long) start, (unsigned long) size,
>>>>                            (unsigned long) ac->ac_o_ex.fe_logical);
>>>> BUG();
>>>> }
>>>>           BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
>>>> ```
>>>> First of all, there is no need to compare with ac_o_ex.fe_logical if it is
>>>> to determine whether there is an overflow.
>>>> Because the previous logic guarantees start < = ac_o_ex.fe_logical, and
>>> How does it guarantee that? The logic:
>>>
>>>           if (ar->pleft && start <= ar->lleft) {
>>>                   size -= ar->lleft + 1 - start;
>>>                   start = ar->lleft + 1;
>>>           }
>>>
>>> can move 'start' to further blocks...
>> This is not the case. According to the code of the preceding process,
>> ar->pleft and ar->pright are assigned values in ext4_ext_map_blocks.
>> ar->pleft is the first allocated block found to the left by map->m_lblk
>> (that is, fe_logical),
>> and ar->pright is the first allocated block found to the right.
>> ar->lleft and ar->lright are logical block numbers, so there must be
>> "ar->lleft < ac_o_ex.fe_logical < ar->lright".
> Right, I've found that out after sending my previous email. Sorry for
> confusion.
Don't be sorry. Thank you very much for your advice. It has benefited me 
a lot.
>
>>>> Secondly, the following code flow also reflects this logic.
>>>>
>>>>              ext4_mb_normalize_request
>>>>               >>> start + size <= ac->ac_o_ex.fe_logical
>>>>              ext4_mb_regular_allocator
>>>>               ext4_mb_simple_scan_group
>>>>                ext4_mb_use_best_found
>>>>                 ext4_mb_new_preallocation
>>>>                  ext4_mb_new_inode_pa
>>>>                   ext4_mb_use_inode_pa
>>>>                    >>> set ac->ac_b_ex.fe_len <= 0
>>>>              ext4_mb_mark_diskspace_used
>>>>               >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>>>>
>>>> In ext4_mb_use_inode_pa, you have the following code.
>>>> ```
>>>> start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
>>>> end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), start + EXT4_C2B(sbi,
>>>> ac->ac_o_ex.fe_len));
>>>> len = EXT4_NUM_B2C(sbi, end - start);
>>>> ac->ac_b_ex.fe_len = len;
>>>> ```
>>>> The starting position in ext4_mb_mark_diskspace_used will be assert.
>>>> BUG_ON(ac->ac_b_ex.fe_len <= 0);
>>>> When end == start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) is used, the value of
>>>> end - start must be greater than 0.
>>>> However, when end == pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) occurs, this
>>>> bug_ON may be triggered.
>>>> When this bug_ON is triggered, that is,
>>>>
>>>> ac->ac_b_ex.fe_len <= 0
>>>> end - start <= 0
>>>> end <= start
>>>> pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) <= pa->pa_pstart +
>>>> (ac->ac_o_ex.fe_logical - pa->pa_lstart)
>>>> pa->pa_len <= ac->ac_o_ex.fe_logical - pa->pa_lstart
>>>> pa->pa_lstart + pa->pa_len <= ac->ac_o_ex.fe_logical
>>>> start + size <= ac->ac_o_ex.fe_logical
>>>>
>>>> So I think that "&&" here should be changed to "||".
>>> Sorry, I still disagree. After some more code reading I agree that
>>> ac->ac_o_ex.fe_logical is the logical block where we want allocated blocks
>>> to be placed in the inode so logical extent of allocated blocks should include
>>> ac->ac_o_ex.fe_logical. But I would be reluctant to make assertion you
>>> suggest unless we make sure ac->ac_o_ex.fe_logical in unallocated (in which
>>> case we can also remove some other code from ext4_mb_normalize_request()).
>>>
>>> 									Honza
>>>
>> What codes are you referring to that can be deleted?
> So I though the shifting of 'start' by lleft cannot happen but then I
> realized that if 'start' got aligned down, it can now be lower than lleft
> so the shifting is indeed needed. So all code is needed there.
>
> 								Honza

Okay, thanks again!
diff mbox series

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 32410b79b664..d0fb57970648 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4190,7 +4190,7 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	}
 	rcu_read_unlock();
 
-	if (start + size <= ac->ac_o_ex.fe_logical &&
+	if (start + size <= ac->ac_o_ex.fe_logical ||
 			start > ac->ac_o_ex.fe_logical) {
 		ext4_msg(ac->ac_sb, KERN_ERR,
 			 "start %lu, size %lu, fe_logical %lu",