diff mbox series

[2/4] ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow

Message ID 20230718131052.283350-3-libaokun1@huawei.com
State Superseded
Headers show
Series ext4: fix some ext4_lblk_t overflow issues | expand

Commit Message

Baokun Li July 18, 2023, 1:10 p.m. UTC
When we calculate the end position of ext4_free_extent, this position may
be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
the first case of adjusting the best extent, that is, new_bex_end > 0, the
following BUG_ON will be triggered:

=========================================================
kernel BUG at fs/ext4/mballoc.c:5116!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
Call Trace:
 <TASK>
 ext4_mb_use_best_found+0x203/0x2f0
 ext4_mb_try_best_found+0x163/0x240
 ext4_mb_regular_allocator+0x158/0x1550
 ext4_mb_new_blocks+0x86a/0xe10
 ext4_ext_map_blocks+0xb0c/0x13a0
 ext4_map_blocks+0x2cd/0x8f0
 ext4_iomap_begin+0x27b/0x400
 iomap_iter+0x222/0x3d0
 __iomap_dio_rw+0x243/0xcb0
 iomap_dio_rw+0x16/0x80
=========================================================

A simple reproducer demonstrating the problem:

	mkfs.ext4 -F /dev/sda -b 4096 100M
	mount /dev/sda /tmp/test
	fallocate -l1M /tmp/test/tmp
	fallocate -l10M /tmp/test/file
	fallocate -i -o 1M -l16777203M /tmp/test/file
	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
	sleep 10 && killall -9 fsstress
	rm -f /tmp/test/tmp
	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"

We declare new_bex_start and new_bex_end as correct types and use fex_end()
to avoid the problems caused by the ext4_lblk_t overflow above.

Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

Comments

Ritesh Harjani (IBM) July 20, 2023, 12:44 p.m. UTC | #1
Baokun Li <libaokun1@huawei.com> writes:

> When we calculate the end position of ext4_free_extent, this position may
> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
> the first case of adjusting the best extent, that is, new_bex_end > 0, the
> following BUG_ON will be triggered:

Nice spotting.

>
> =========================================================
> kernel BUG at fs/ext4/mballoc.c:5116!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
> Call Trace:
>  <TASK>
>  ext4_mb_use_best_found+0x203/0x2f0
>  ext4_mb_try_best_found+0x163/0x240
>  ext4_mb_regular_allocator+0x158/0x1550
>  ext4_mb_new_blocks+0x86a/0xe10
>  ext4_ext_map_blocks+0xb0c/0x13a0
>  ext4_map_blocks+0x2cd/0x8f0
>  ext4_iomap_begin+0x27b/0x400
>  iomap_iter+0x222/0x3d0
>  __iomap_dio_rw+0x243/0xcb0
>  iomap_dio_rw+0x16/0x80
> =========================================================
>
> A simple reproducer demonstrating the problem:
>
> 	mkfs.ext4 -F /dev/sda -b 4096 100M
> 	mount /dev/sda /tmp/test
> 	fallocate -l1M /tmp/test/tmp
> 	fallocate -l10M /tmp/test/file
> 	fallocate -i -o 1M -l16777203M /tmp/test/file
> 	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
> 	sleep 10 && killall -9 fsstress
> 	rm -f /tmp/test/tmp
> 	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"


Could you please also add it into xfstests?
I think it's a nice test which can check the boundary conditions for
start and end of data types used in mballoc. I think it should even work
if you don't do fsstress but instead just fallocate some remaining space
in filesystem, so that when you open and try to write it to 0th offset,
if automatically hits this error in ext4_mb_new_inode_pa().

>
> We declare new_bex_start and new_bex_end as correct types and use fex_end()
> to avoid the problems caused by the ext4_lblk_t overflow above.
>
> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index eb7f5d35ef96..2090e5e7ba58 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	pa = ac->ac_pa;
>  
>  	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
> -		int new_bex_start;
> -		int new_bex_end;
> +		ext4_lblk_t new_bex_start;
> +		loff_t new_bex_end;
>  
>  		/* we can't allocate as much as normalizer wants.
>  		 * so, found space must get proper lstart
> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  		 *    still cover original start
>  		 * 3. Else, keep the best ex at start of original request.
>  		 */
> -		new_bex_end = ac->ac_g_ex.fe_logical +
> -			EXT4_C2B(sbi, ac->ac_orig_goal_len);
> +		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
>  		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>  		if (ac->ac_o_ex.fe_logical >= new_bex_start)
>  			goto adjust_bex;
> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  
>  		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>  		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> -		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
> -				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));

Ok so the right hand becomes 0 (because then end can go upto 1<<32 which
will be 0 for unsigned int). And the left (new_bex_end) might be
negative due to some operations above as I see it. 
And comparing an int with unsigned int, it will promote new_bex_end to
unsigned int which will make it's value too large and will hit the
bug_on. 

I would like to carefully review all such paths. I will soon review and
get back.


> +		BUG_ON(new_bex_end >
> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));

I am not sure whether using fex_end or pa_end is any helpful. 
I think we can just typecast if needed and keep it simple rather
than adding helpers functions for addition operation.
(because of the fact that fex_end() can take a third parameter which
sometimes you pass as NULL. Hence it doesn't look clean, IMO)

>  	}
>  
>  	pa->pa_lstart = ac->ac_b_ex.fe_logical;
> -- 
> 2.31.1

-ritesh
Baokun Li July 20, 2023, 1:42 p.m. UTC | #2
On 2023/7/20 20:44, Ritesh Harjani (IBM) wrote:
> Baokun Li <libaokun1@huawei.com> writes:
>
>> When we calculate the end position of ext4_free_extent, this position may
>> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
>> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
>> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
>> the first case of adjusting the best extent, that is, new_bex_end > 0, the
>> following BUG_ON will be triggered:
> Nice spotting.
>
>> =========================================================
>> kernel BUG at fs/ext4/mballoc.c:5116!
>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
>> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
>> Call Trace:
>>   <TASK>
>>   ext4_mb_use_best_found+0x203/0x2f0
>>   ext4_mb_try_best_found+0x163/0x240
>>   ext4_mb_regular_allocator+0x158/0x1550
>>   ext4_mb_new_blocks+0x86a/0xe10
>>   ext4_ext_map_blocks+0xb0c/0x13a0
>>   ext4_map_blocks+0x2cd/0x8f0
>>   ext4_iomap_begin+0x27b/0x400
>>   iomap_iter+0x222/0x3d0
>>   __iomap_dio_rw+0x243/0xcb0
>>   iomap_dio_rw+0x16/0x80
>> =========================================================
>>
>> A simple reproducer demonstrating the problem:
>>
>> 	mkfs.ext4 -F /dev/sda -b 4096 100M
>> 	mount /dev/sda /tmp/test
>> 	fallocate -l1M /tmp/test/tmp
>> 	fallocate -l10M /tmp/test/file
>> 	fallocate -i -o 1M -l16777203M /tmp/test/file
>> 	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
>> 	sleep 10 && killall -9 fsstress
>> 	rm -f /tmp/test/tmp
>> 	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"
>
> Could you please also add it into xfstests?
Sure!I'll try to push this test case to xfstests.
> I think it's a nice test which can check the boundary conditions for
> start and end of data types used in mballoc. I think it should even work
> if you don't do fsstress but instead just fallocate some remaining space
> in filesystem, so that when you open and try to write it to 0th offset,
> if automatically hits this error in ext4_mb_new_inode_pa().
Yes, the fsstress here is just to fill up the remaining space on the disk.
>
>> We declare new_bex_start and new_bex_end as correct types and use fex_end()
>> to avoid the problems caused by the ext4_lblk_t overflow above.
>>
>> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/mballoc.c | 11 +++++------
>>   1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index eb7f5d35ef96..2090e5e7ba58 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   	pa = ac->ac_pa;
>>   
>>   	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
>> -		int new_bex_start;
>> -		int new_bex_end;
>> +		ext4_lblk_t new_bex_start;
>> +		loff_t new_bex_end;
>>   
>>   		/* we can't allocate as much as normalizer wants.
>>   		 * so, found space must get proper lstart
>> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   		 *    still cover original start
>>   		 * 3. Else, keep the best ex at start of original request.
>>   		 */
>> -		new_bex_end = ac->ac_g_ex.fe_logical +
>> -			EXT4_C2B(sbi, ac->ac_orig_goal_len);
>> +		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
>>   		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>>   		if (ac->ac_o_ex.fe_logical >= new_bex_start)
>>   			goto adjust_bex;
>> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   
>>   		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>>   		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
>> -		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
>> -				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
> Ok so the right hand becomes 0 (because then end can go upto 1<<32 which
> will be 0 for unsigned int). And the left (new_bex_end) might be
> negative due to some operations above as I see it.
> And comparing an int with unsigned int, it will promote new_bex_end to
> unsigned int which will make it's value too large and will hit the
> bug_on.
Exactly!
>
> I would like to carefully review all such paths. I will soon review and
> get back.
Okay, thank you very much for your careful review.
The 2nd and 3rd cases of adjusting the best extent are impossible to 
overflow,
so only the first case is converted here.
>
>
>> +		BUG_ON(new_bex_end >
>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
> I am not sure whether using fex_end or pa_end is any helpful.
> I think we can just typecast if needed and keep it simple rather
> than adding helpers functions for addition operation.
> (because of the fact that fex_end() can take a third parameter which
> sometimes you pass as NULL. Hence it doesn't look clean, IMO)
I added the helper functions here for two reasons:
1. restricting the type of the return value.
2. This avoids the ugly line breaks in most cases.

The fex_end() indeed doesn't look as clean as the pa_end(), because we 
might use
the start of the free extent plus some other length to get a new end, 
like right in
ext4_mb_new_inode_pa(), which makes me have to add another extra length
argument, but I think it's worth it, and even with the addition of a 
parameter
that will probably be unused, it still looks a lot shorter than the 
original code.
>>   	}
>>   
>>   	pa->pa_lstart = ac->ac_b_ex.fe_logical;
>> -- 
>> 2.31.1
> -ritesh


Thanks again!
Ritesh Harjani (IBM) July 20, 2023, 7:07 p.m. UTC | #3
Baokun Li <libaokun1@huawei.com> writes:

> When we calculate the end position of ext4_free_extent, this position may
> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
> the first case of adjusting the best extent, that is, new_bex_end > 0, the
> following BUG_ON will be triggered:
>
> =========================================================
> kernel BUG at fs/ext4/mballoc.c:5116!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
> Call Trace:
>  <TASK>
>  ext4_mb_use_best_found+0x203/0x2f0
>  ext4_mb_try_best_found+0x163/0x240
>  ext4_mb_regular_allocator+0x158/0x1550
>  ext4_mb_new_blocks+0x86a/0xe10
>  ext4_ext_map_blocks+0xb0c/0x13a0
>  ext4_map_blocks+0x2cd/0x8f0
>  ext4_iomap_begin+0x27b/0x400
>  iomap_iter+0x222/0x3d0
>  __iomap_dio_rw+0x243/0xcb0
>  iomap_dio_rw+0x16/0x80
> =========================================================
>
> A simple reproducer demonstrating the problem:
>
> 	mkfs.ext4 -F /dev/sda -b 4096 100M
> 	mount /dev/sda /tmp/test
> 	fallocate -l1M /tmp/test/tmp
> 	fallocate -l10M /tmp/test/file
> 	fallocate -i -o 1M -l16777203M /tmp/test/file
> 	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
> 	sleep 10 && killall -9 fsstress
> 	rm -f /tmp/test/tmp
> 	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"
>
> We declare new_bex_start and new_bex_end as correct types and use fex_end()
> to avoid the problems caused by the ext4_lblk_t overflow above.
>
> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index eb7f5d35ef96..2090e5e7ba58 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	pa = ac->ac_pa;
>  
>  	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
> -		int new_bex_start;
> -		int new_bex_end;
> +		ext4_lblk_t new_bex_start;
> +		loff_t new_bex_end;
>  
>  		/* we can't allocate as much as normalizer wants.
>  		 * so, found space must get proper lstart
> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  		 *    still cover original start
>  		 * 3. Else, keep the best ex at start of original request.
>  		 */
> -		new_bex_end = ac->ac_g_ex.fe_logical +
> -			EXT4_C2B(sbi, ac->ac_orig_goal_len);
> +		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
>  		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>  		if (ac->ac_o_ex.fe_logical >= new_bex_start)
>  			goto adjust_bex;

		new_bex_end = ac->ac_g_ex.fe_logical +
			EXT4_C2B(sbi, ac->ac_orig_goal_len);
		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
		if (ac->ac_o_ex.fe_logical >= new_bex_start)
			goto adjust_bex;

		new_bex_start = ac->ac_g_ex.fe_logical;
		new_bex_end =
			new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
		if (ac->ac_o_ex.fe_logical < new_bex_end)
			goto adjust_bex;

		new_bex_start = ac->ac_o_ex.fe_logical;
		new_bex_end =
			new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);

I think it will be safer if we typecast all above new_bex_end
calculations. I understand that it might never happen here that
fe_logical + best found extent length will make it overflow.
Simply because we only enter here when ac_b_ex.fe_len <
ac_orig_goal_len. But I think if we change any logic tomorrow or
refactor any part of code, it will be much safer if we simply keep the
typecast in place so that we avoid any future tripping caused by
arithmatic overflow here.

Also I am not finding fex_end() or pa_end() helpers very intuitive way
of doing it. I feel typecasting in place is much simpler and reads better.

-ritesh

> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  
>  		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>  		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> -		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
> -				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
> +		BUG_ON(new_bex_end >
> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>  	}
>  
>  	pa->pa_lstart = ac->ac_b_ex.fe_logical;
> -- 
> 2.31.1
Ritesh Harjani (IBM) July 20, 2023, 7:30 p.m. UTC | #4
Baokun Li <libaokun1@huawei.com> writes:

> On 2023/7/20 20:44, Ritesh Harjani (IBM) wrote:
>> Baokun Li <libaokun1@huawei.com> writes:
>>
>>> When we calculate the end position of ext4_free_extent, this position may
>>> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
>>> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
>>> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
>>> the first case of adjusting the best extent, that is, new_bex_end > 0, the
>>> following BUG_ON will be triggered:
>> Nice spotting.
>>
>>> =========================================================
>>> kernel BUG at fs/ext4/mballoc.c:5116!
>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
>>> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
>>> Call Trace:
>>>   <TASK>
>>>   ext4_mb_use_best_found+0x203/0x2f0
>>>   ext4_mb_try_best_found+0x163/0x240
>>>   ext4_mb_regular_allocator+0x158/0x1550
>>>   ext4_mb_new_blocks+0x86a/0xe10
>>>   ext4_ext_map_blocks+0xb0c/0x13a0
>>>   ext4_map_blocks+0x2cd/0x8f0
>>>   ext4_iomap_begin+0x27b/0x400
>>>   iomap_iter+0x222/0x3d0
>>>   __iomap_dio_rw+0x243/0xcb0
>>>   iomap_dio_rw+0x16/0x80
>>> =========================================================
>>>
>>> A simple reproducer demonstrating the problem:
>>>
>>> 	mkfs.ext4 -F /dev/sda -b 4096 100M
>>> 	mount /dev/sda /tmp/test
>>> 	fallocate -l1M /tmp/test/tmp
>>> 	fallocate -l10M /tmp/test/file
>>> 	fallocate -i -o 1M -l16777203M /tmp/test/file
>>> 	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
>>> 	sleep 10 && killall -9 fsstress
>>> 	rm -f /tmp/test/tmp
>>> 	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"
>>
>> Could you please also add it into xfstests?
> Sure!I'll try to push this test case to xfstests.

Thanks that would be great!

>> I think it's a nice test which can check the boundary conditions for
>> start and end of data types used in mballoc. I think it should even work
>> if you don't do fsstress but instead just fallocate some remaining space
>> in filesystem, so that when you open and try to write it to 0th offset,
>> if automatically hits this error in ext4_mb_new_inode_pa().
> Yes, the fsstress here is just to fill up the remaining space on the disk.
>>
>>> We declare new_bex_start and new_bex_end as correct types and use fex_end()
>>> to avoid the problems caused by the ext4_lblk_t overflow above.
>>>
>>> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ---
>>>   fs/ext4/mballoc.c | 11 +++++------
>>>   1 file changed, 5 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>> index eb7f5d35ef96..2090e5e7ba58 100644
>>> --- a/fs/ext4/mballoc.c
>>> +++ b/fs/ext4/mballoc.c
>>> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>>   	pa = ac->ac_pa;
>>>   
>>>   	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
>>> -		int new_bex_start;
>>> -		int new_bex_end;
>>> +		ext4_lblk_t new_bex_start;
>>> +		loff_t new_bex_end;
>>>   
>>>   		/* we can't allocate as much as normalizer wants.
>>>   		 * so, found space must get proper lstart
>>> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>>   		 *    still cover original start
>>>   		 * 3. Else, keep the best ex at start of original request.
>>>   		 */
>>> -		new_bex_end = ac->ac_g_ex.fe_logical +
>>> -			EXT4_C2B(sbi, ac->ac_orig_goal_len);
>>> +		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
>>>   		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>>>   		if (ac->ac_o_ex.fe_logical >= new_bex_start)
>>>   			goto adjust_bex;
>>> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>>   
>>>   		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>>>   		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
>>> -		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
>>> -				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
>> Ok so the right hand becomes 0 (because then end can go upto 1<<32 which
>> will be 0 for unsigned int). And the left (new_bex_end) might be
>> negative due to some operations above as I see it.
>> And comparing an int with unsigned int, it will promote new_bex_end to
>> unsigned int which will make it's value too large and will hit the
>> bug_on.
> Exactly!
>>
>> I would like to carefully review all such paths. I will soon review and
>> get back.
> Okay, thank you very much for your careful review.
> The 2nd and 3rd cases of adjusting the best extent are impossible to 
> overflow,
> so only the first case is converted here.

I noticed them too during review. I think it would be safe to make the
changes in other two places as well such that in future we never
trip over such overlooked overflow bugs.

>>
>>
>>> +		BUG_ON(new_bex_end >
>>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>> I am not sure whether using fex_end or pa_end is any helpful.
>> I think we can just typecast if needed and keep it simple rather
>> than adding helpers functions for addition operation.
>> (because of the fact that fex_end() can take a third parameter which
>> sometimes you pass as NULL. Hence it doesn't look clean, IMO)
> I added the helper functions here for two reasons:
> 1. restricting the type of the return value.
> 2. This avoids the ugly line breaks in most cases.
>
> The fex_end() indeed doesn't look as clean as the pa_end(), because we 
> might use
> the start of the free extent plus some other length to get a new end, 
> like right in
> ext4_mb_new_inode_pa(), which makes me have to add another extra length
> argument, but I think it's worth it, and even with the addition of a 
> parameter
> that will probably be unused, it still looks a lot shorter than the 
> original code.

IMO, we don't need pa_end() and fex_end() at all. In several places in
ext4 we always have taken care by directly typecasting to avoid
overflows. Also it reads much simpler rather to typecast in place than
having a helper function which is also not very elegant due to a third
parameter. Hence I think we should drop those helpers.

Thanks once again for catching the overflows and coming up with a
easy reproducer. I am surprised that this bug was never caught with LTP,
fstests, smatch static checker.
How did you find it? :)

-ritesh
Baokun Li July 21, 2023, 7:29 a.m. UTC | #5
On 2023/7/21 3:30, Ritesh Harjani (IBM) wrote:
>
>>> I would like to carefully review all such paths. I will soon review and
>>> get back.
>> Okay, thank you very much for your careful review.
>> The 2nd and 3rd cases of adjusting the best extent are impossible to
>> overflow,
>> so only the first case is converted here.
> I noticed them too during review. I think it would be safe to make the
> changes in other two places as well such that in future we never
> trip over such overlooked overflow bugs.
>
>>>
>>>> +		BUG_ON(new_bex_end >
>>>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>>> I am not sure whether using fex_end or pa_end is any helpful.
>>> I think we can just typecast if needed and keep it simple rather
>>> than adding helpers functions for addition operation.
>>> (because of the fact that fex_end() can take a third parameter which
>>> sometimes you pass as NULL. Hence it doesn't look clean, IMO)
>> I added the helper functions here for two reasons:
>> 1. restricting the type of the return value.
>> 2. This avoids the ugly line breaks in most cases.
>>
>> The fex_end() indeed doesn't look as clean as the pa_end(), because we
>> might use
>> the start of the free extent plus some other length to get a new end,
>> like right in
>> ext4_mb_new_inode_pa(), which makes me have to add another extra length
>> argument, but I think it's worth it, and even with the addition of a
>> parameter
>> that will probably be unused, it still looks a lot shorter than the
>> original code.
> IMO, we don't need pa_end() and fex_end() at all. In several places in
> ext4 we always have taken care by directly typecasting to avoid
> overflows. Also it reads much simpler rather to typecast in place than
> having a helper function which is also not very elegant due to a third
> parameter. Hence I think we should drop those helpers.
I still think helper is useful, but my previous thinking is problematic. 
I shouldn't
make fex_end() adapt to ext4_mb_new_inode_pa(), but should make
ext4_mb_new_inode_pa() adapt to fex_end(). After dropping the third argument
of fex_end(), modify the ext4_mb_new_inode_pa() function as follows:


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a2475b8c9fb5..7492ba9062f4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5072,8 +5072,11 @@ ext4_mb_new_inode_pa(struct 
ext4_allocation_context *ac)
         pa = ac->ac_pa;

         if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
-               int new_bex_start;
-               int new_bex_end;
+               struct ext4_free_extent ex = {
+                       .fe_logical = ac->ac_g_ex.fe_logical;
+                       .fe_len = ac->ac_orig_goal_len;
+               }
+               loff_t orig_goal_end = fex_end(sbi, &ex);

                 /* we can't allocate as much as normalizer wants.
                  * so, found space must get proper lstart
@@ -5092,29 +5095,23 @@ ext4_mb_new_inode_pa(struct 
ext4_allocation_context *ac)
                  *    still cover original start
                  * 3. Else, keep the best ex at start of original request.
                  */
-               new_bex_end = ac->ac_g_ex.fe_logical +
-                       EXT4_C2B(sbi, ac->ac_orig_goal_len);
-               new_bex_start = new_bex_end - EXT4_C2B(sbi, 
ac->ac_b_ex.fe_len);
-               if (ac->ac_o_ex.fe_logical >= new_bex_start)
-                       goto adjust_bex;
+               ex.fe_len = ac->ac_b_ex.fe_len;

-               new_bex_start = ac->ac_g_ex.fe_logical;
-               new_bex_end =
-                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
-               if (ac->ac_o_ex.fe_logical < new_bex_end)
+               ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
+               if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
                         goto adjust_bex;

-               new_bex_start = ac->ac_o_ex.fe_logical;
-               new_bex_end =
-                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
+               ex.fe_logical = ac->ac_g_ex.fe_logical;
+               if (ac->ac_o_ex.fe_logical < fex_end(sbi, &ex))
+                       goto adjust_bex;

+               ex.fe_logical = ac->ac_o_ex.fe_logical;
  adjust_bex:
-               ac->ac_b_ex.fe_logical = new_bex_start;
+               ac->ac_b_ex.fe_logical = ex.fe_logical;

                 BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
                 BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
-               BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
-                                     EXT4_C2B(sbi, ac->ac_orig_goal_len)));
+               BUG_ON(fex_end(sbi, &ex) > orig_goal_end);
         }

         pa->pa_lstart = ac->ac_b_ex.fe_logical;


What do you think of this modification?

> Thanks once again for catching the overflows and coming up with a
> easy reproducer. I am surprised that this bug was never caught with LTP,
> fstests, smatch static checker.
> How did you find it? :)
>
> -ritesh
This problem is found in the internal test.

Thank you for your review!
Baokun Li July 21, 2023, 8:01 a.m. UTC | #6
On 2023/7/21 3:07, Ritesh Harjani (IBM) wrote:
> Baokun Li <libaokun1@huawei.com> writes:
>
>> When we calculate the end position of ext4_free_extent, this position may
>> be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if
>> ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the
>> computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not
>> the first case of adjusting the best extent, that is, new_bex_end > 0, the
>> following BUG_ON will be triggered:
>>
>> =========================================================
>> kernel BUG at fs/ext4/mballoc.c:5116!
>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>> CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279
>> RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430
>> Call Trace:
>>   <TASK>
>>   ext4_mb_use_best_found+0x203/0x2f0
>>   ext4_mb_try_best_found+0x163/0x240
>>   ext4_mb_regular_allocator+0x158/0x1550
>>   ext4_mb_new_blocks+0x86a/0xe10
>>   ext4_ext_map_blocks+0xb0c/0x13a0
>>   ext4_map_blocks+0x2cd/0x8f0
>>   ext4_iomap_begin+0x27b/0x400
>>   iomap_iter+0x222/0x3d0
>>   __iomap_dio_rw+0x243/0xcb0
>>   iomap_dio_rw+0x16/0x80
>> =========================================================
>>
>> A simple reproducer demonstrating the problem:
>>
>> 	mkfs.ext4 -F /dev/sda -b 4096 100M
>> 	mount /dev/sda /tmp/test
>> 	fallocate -l1M /tmp/test/tmp
>> 	fallocate -l10M /tmp/test/file
>> 	fallocate -i -o 1M -l16777203M /tmp/test/file
>> 	fsstress -d /tmp/test -l 0 -n 100000 -p 8 &
>> 	sleep 10 && killall -9 fsstress
>> 	rm -f /tmp/test/tmp
>> 	xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192"
>>
>> We declare new_bex_start and new_bex_end as correct types and use fex_end()
>> to avoid the problems caused by the ext4_lblk_t overflow above.
>>
>> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/mballoc.c | 11 +++++------
>>   1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index eb7f5d35ef96..2090e5e7ba58 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -5076,8 +5076,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   	pa = ac->ac_pa;
>>   
>>   	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
>> -		int new_bex_start;
>> -		int new_bex_end;
>> +		ext4_lblk_t new_bex_start;
>> +		loff_t new_bex_end;
>>   
>>   		/* we can't allocate as much as normalizer wants.
>>   		 * so, found space must get proper lstart
>> @@ -5096,8 +5096,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   		 *    still cover original start
>>   		 * 3. Else, keep the best ex at start of original request.
>>   		 */
>> -		new_bex_end = ac->ac_g_ex.fe_logical +
>> -			EXT4_C2B(sbi, ac->ac_orig_goal_len);
>> +		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
>>   		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>>   		if (ac->ac_o_ex.fe_logical >= new_bex_start)
>>   			goto adjust_bex;
> 		new_bex_end = ac->ac_g_ex.fe_logical +
> 			EXT4_C2B(sbi, ac->ac_orig_goal_len);
> 		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> 		if (ac->ac_o_ex.fe_logical >= new_bex_start)
> 			goto adjust_bex;
>
> 		new_bex_start = ac->ac_g_ex.fe_logical;
> 		new_bex_end =
> 			new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> 		if (ac->ac_o_ex.fe_logical < new_bex_end)
> 			goto adjust_bex;
>
> 		new_bex_start = ac->ac_o_ex.fe_logical;
> 		new_bex_end =
> 			new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>
> I think it will be safer if we typecast all above new_bex_end
> calculations. I understand that it might never happen here that
> fe_logical + best found extent length will make it overflow.
> Simply because we only enter here when ac_b_ex.fe_len <
> ac_orig_goal_len. But I think if we change any logic tomorrow or
> refactor any part of code, it will be much safer if we simply keep the
> typecast in place so that we avoid any future tripping caused by
> arithmatic overflow here.
>
> Also I am not finding fex_end() or pa_end() helpers very intuitive way
> of doing it. I feel typecasting in place is much simpler and reads better.
>
> -ritesh
I remove the third parameter of fex_end() and adjust the code in
ext4_mb_new_inode_pa() as follows:

                 struct ext4_free_extent ex = {
                         .fe_logical = ac->ac_g_ex.fe_logical,
                         .fe_len = ac->ac_orig_goal_len,
                 };
                 loff_t orig_goal_end = fex_end(sbi, &ex);

                 ex.fe_len = ac->ac_b_ex.fe_len;

                 ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
                 if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
                         goto adjust_bex;

                 ex.fe_logical = ac->ac_g_ex.fe_logical;
                 if (ac->ac_o_ex.fe_logical < fex_end(sbi, &ex))
                         goto adjust_bex;

                 ex.fe_logical = ac->ac_o_ex.fe_logical;
adjust_bex:
                 ac->ac_b_ex.fe_logical = ex.fe_logical;

                 BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
                 BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
                 BUG_ON(fex_end(sbi, &ex) > orig_goal_end);
In this way, all new_bex_end calculations do not have overflow problems,
and the code looks much more neat than before. 😊
>> @@ -5117,8 +5116,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   
>>   		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>>   		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
>> -		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
>> -				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
>> +		BUG_ON(new_bex_end >
>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>>   	}
>>   
>>   	pa->pa_lstart = ac->ac_b_ex.fe_logical;
>> -- 
>> 2.31.1
Thanks!
Ritesh Harjani (IBM) July 21, 2023, 8:24 a.m. UTC | #7
Baokun Li <libaokun1@huawei.com> writes:

> On 2023/7/21 3:30, Ritesh Harjani (IBM) wrote:
>>
>>>> I would like to carefully review all such paths. I will soon review and
>>>> get back.
>>> Okay, thank you very much for your careful review.
>>> The 2nd and 3rd cases of adjusting the best extent are impossible to
>>> overflow,
>>> so only the first case is converted here.
>> I noticed them too during review. I think it would be safe to make the
>> changes in other two places as well such that in future we never
>> trip over such overlooked overflow bugs.
>>
>>>>
>>>>> +		BUG_ON(new_bex_end >
>>>>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>>>> I am not sure whether using fex_end or pa_end is any helpful.
>>>> I think we can just typecast if needed and keep it simple rather
>>>> than adding helpers functions for addition operation.
>>>> (because of the fact that fex_end() can take a third parameter which
>>>> sometimes you pass as NULL. Hence it doesn't look clean, IMO)
>>> I added the helper functions here for two reasons:
>>> 1. restricting the type of the return value.
>>> 2. This avoids the ugly line breaks in most cases.
>>>
>>> The fex_end() indeed doesn't look as clean as the pa_end(), because we
>>> might use
>>> the start of the free extent plus some other length to get a new end,
>>> like right in
>>> ext4_mb_new_inode_pa(), which makes me have to add another extra length
>>> argument, but I think it's worth it, and even with the addition of a
>>> parameter
>>> that will probably be unused, it still looks a lot shorter than the
>>> original code.
>> IMO, we don't need pa_end() and fex_end() at all. In several places in
>> ext4 we always have taken care by directly typecasting to avoid
>> overflows. Also it reads much simpler rather to typecast in place than
>> having a helper function which is also not very elegant due to a third
>> parameter. Hence I think we should drop those helpers.
> I still think helper is useful, but my previous thinking is problematic. 
> I shouldn't
> make fex_end() adapt to ext4_mb_new_inode_pa(), but should make
> ext4_mb_new_inode_pa() adapt to fex_end(). After dropping the third argument
> of fex_end(), modify the ext4_mb_new_inode_pa() function as follows:

No. I think we are overly complicating it by doing this. IMHO we don't need
fex_end and pa_end at all here. With fex_end, pa_end, we are passing pointers,
updating it's members before and after sending it to these functions,
dereferencing them within those helpers.

e.g. with your change it will look like 
<snip>
		struct ext4_free_extent ex = {
			.fe_logical = ac->ac_g_ex.fe_logical;
			.fe_len = ac->ac_orig_goal_len;
		}

        loff_t orig_goal_end = fex_end(sbi, &ex);
		ex.fe_len = ac->ac_b_ex.fe_len;
		ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
		if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
			goto adjust_bex;
</snip>

In above snip we introduced ex variable, updated it with
orig_goal_len, then called fex_end() to get orig_goal_end, then we again
updated ex.fe_len, but this time we didn't call fex_end instead we
needed it for getting ex.fe_logical. All of this is not needed at all. 

e.g. we should use just use (loff_t) wherever it was missed in the code.
<snip>
		ext4_lblk_t new_bex_start;
		loff_t new_bex_end;

		new_bex_end = (loff_t)ac->ac_g_ex.fe_logical +
			EXT4_C2B(sbi, ac->ac_orig_goal_len);
		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
		if (ac->ac_o_ex.fe_logical >= new_bex_start)
			goto adjust_bex;
</snip>


In this we just update (loff_t) and it reads better in my opinion then
using ex, fex_end() etc.

In my opinion we should simply drop the helpers. It should be obvious
that while calculating logical end block for an inode in ext4 by doing
lstart + len, we should use loff_t.
Since it can be 1 more than the last block which a u32 can hold.
So wherever such calculations are made we should ensure the data
type of left hand operand should be loff_t and we should typecast the
right hand side operands such that there should not be any overflow
happening. We do at several places in ext4 already (also while doing
left-shifting (loff_t)map.m_lblk).

Doing this with helpers, IMO is not useful as we also saw above.

>
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index a2475b8c9fb5..7492ba9062f4 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5072,8 +5072,11 @@ ext4_mb_new_inode_pa(struct 
> ext4_allocation_context *ac)
>          pa = ac->ac_pa;
>
>          if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
> -               int new_bex_start;
> -               int new_bex_end;
> +               struct ext4_free_extent ex = {
> +                       .fe_logical = ac->ac_g_ex.fe_logical;
> +                       .fe_len = ac->ac_orig_goal_len;
> +               }
> +               loff_t orig_goal_end = fex_end(sbi, &ex);
>
>                  /* we can't allocate as much as normalizer wants.
>                   * so, found space must get proper lstart
> @@ -5092,29 +5095,23 @@ ext4_mb_new_inode_pa(struct 
> ext4_allocation_context *ac)
>                   *    still cover original start
>                   * 3. Else, keep the best ex at start of original request.
>                   */
> -               new_bex_end = ac->ac_g_ex.fe_logical +
> -                       EXT4_C2B(sbi, ac->ac_orig_goal_len);
> -               new_bex_start = new_bex_end - EXT4_C2B(sbi, 
> ac->ac_b_ex.fe_len);
> -               if (ac->ac_o_ex.fe_logical >= new_bex_start)
> -                       goto adjust_bex;
> +               ex.fe_len = ac->ac_b_ex.fe_len;
>
> -               new_bex_start = ac->ac_g_ex.fe_logical;
> -               new_bex_end =
> -                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> -               if (ac->ac_o_ex.fe_logical < new_bex_end)
> +               ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
> +               if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
>                          goto adjust_bex;
>
> -               new_bex_start = ac->ac_o_ex.fe_logical;
> -               new_bex_end =
> -                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> +               ex.fe_logical = ac->ac_g_ex.fe_logical;
> +               if (ac->ac_o_ex.fe_logical < fex_end(sbi, &ex))
> +                       goto adjust_bex;
>
> +               ex.fe_logical = ac->ac_o_ex.fe_logical;
>   adjust_bex:
> -               ac->ac_b_ex.fe_logical = new_bex_start;
> +               ac->ac_b_ex.fe_logical = ex.fe_logical;
>
>                  BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>                  BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> -               BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
> -                                     EXT4_C2B(sbi, ac->ac_orig_goal_len)));
> +               BUG_ON(fex_end(sbi, &ex) > orig_goal_end);
>          }
>
>          pa->pa_lstart = ac->ac_b_ex.fe_logical;
>
>
> What do you think of this modification?
>
>> Thanks once again for catching the overflows and coming up with a
>> easy reproducer. I am surprised that this bug was never caught with LTP,
>> fstests, smatch static checker.
>> How did you find it? :)
>>
>> -ritesh
> This problem is found in the internal test.

Ok.

>
> Thank you for your review!

Thanks for working on the fix.

-ritesh
Baokun Li July 21, 2023, 9:13 a.m. UTC | #8
On 2023/7/21 16:24, Ritesh Harjani (IBM) wrote:
> Baokun Li <libaokun1@huawei.com> writes:
>
>> On 2023/7/21 3:30, Ritesh Harjani (IBM) wrote:
>>>>> I would like to carefully review all such paths. I will soon review and
>>>>> get back.
>>>> Okay, thank you very much for your careful review.
>>>> The 2nd and 3rd cases of adjusting the best extent are impossible to
>>>> overflow,
>>>> so only the first case is converted here.
>>> I noticed them too during review. I think it would be safe to make the
>>> changes in other two places as well such that in future we never
>>> trip over such overlooked overflow bugs.
>>>
>>>>>> +		BUG_ON(new_bex_end >
>>>>>> +			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
>>>>> I am not sure whether using fex_end or pa_end is any helpful.
>>>>> I think we can just typecast if needed and keep it simple rather
>>>>> than adding helpers functions for addition operation.
>>>>> (because of the fact that fex_end() can take a third parameter which
>>>>> sometimes you pass as NULL. Hence it doesn't look clean, IMO)
>>>> I added the helper functions here for two reasons:
>>>> 1. restricting the type of the return value.
>>>> 2. This avoids the ugly line breaks in most cases.
>>>>
>>>> The fex_end() indeed doesn't look as clean as the pa_end(), because we
>>>> might use
>>>> the start of the free extent plus some other length to get a new end,
>>>> like right in
>>>> ext4_mb_new_inode_pa(), which makes me have to add another extra length
>>>> argument, but I think it's worth it, and even with the addition of a
>>>> parameter
>>>> that will probably be unused, it still looks a lot shorter than the
>>>> original code.
>>> IMO, we don't need pa_end() and fex_end() at all. In several places in
>>> ext4 we always have taken care by directly typecasting to avoid
>>> overflows. Also it reads much simpler rather to typecast in place than
>>> having a helper function which is also not very elegant due to a third
>>> parameter. Hence I think we should drop those helpers.
>> I still think helper is useful, but my previous thinking is problematic.
>> I shouldn't
>> make fex_end() adapt to ext4_mb_new_inode_pa(), but should make
>> ext4_mb_new_inode_pa() adapt to fex_end(). After dropping the third argument
>> of fex_end(), modify the ext4_mb_new_inode_pa() function as follows:
> No. I think we are overly complicating it by doing this. IMHO we don't need
> fex_end and pa_end at all here. With fex_end, pa_end, we are passing pointers,
> updating it's members before and after sending it to these functions,
> dereferencing them within those helpers.
>
> e.g. with your change it will look like
> <snip>
> 		struct ext4_free_extent ex = {
> 			.fe_logical = ac->ac_g_ex.fe_logical;
> 			.fe_len = ac->ac_orig_goal_len;
> 		}
>
>          loff_t orig_goal_end = fex_end(sbi, &ex);
> 		ex.fe_len = ac->ac_b_ex.fe_len;
> 		ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
> 		if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
> 			goto adjust_bex;
> </snip>
>
> In above snip we introduced ex variable, updated it with
> orig_goal_len, then called fex_end() to get orig_goal_end, then we again
> updated ex.fe_len, but this time we didn't call fex_end instead we
> needed it for getting ex.fe_logical. All of this is not needed at all.
>
> e.g. we should use just use (loff_t) wherever it was missed in the code.
> <snip>
> 		ext4_lblk_t new_bex_start;
> 		loff_t new_bex_end;
>
> 		new_bex_end = (loff_t)ac->ac_g_ex.fe_logical +
> 			EXT4_C2B(sbi, ac->ac_orig_goal_len);
> 		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
> 		if (ac->ac_o_ex.fe_logical >= new_bex_start)
> 			goto adjust_bex;
> </snip>
>
>
> In this we just update (loff_t) and it reads better in my opinion then
> using ex, fex_end() etc.
>
> In my opinion we should simply drop the helpers. It should be obvious
> that while calculating logical end block for an inode in ext4 by doing
> lstart + len, we should use loff_t.
> Since it can be 1 more than the last block which a u32 can hold.
> So wherever such calculations are made we should ensure the data
> type of left hand operand should be loff_t and we should typecast the
> right hand side operands such that there should not be any overflow
> happening. We do at several places in ext4 already (also while doing
> left-shifting (loff_t)map.m_lblk).
>
> Doing this with helpers, IMO is not useful as we also saw above.

I still think it is necessary to add a helper to make the code more concise.


Ted, do you think the fex_end() helper function is needed here?

I think we might need your advice to end this discussion. 😅

>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index a2475b8c9fb5..7492ba9062f4 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -5072,8 +5072,11 @@ ext4_mb_new_inode_pa(struct
>> ext4_allocation_context *ac)
>>           pa = ac->ac_pa;
>>
>>           if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
>> -               int new_bex_start;
>> -               int new_bex_end;
>> +               struct ext4_free_extent ex = {
>> +                       .fe_logical = ac->ac_g_ex.fe_logical;
>> +                       .fe_len = ac->ac_orig_goal_len;
>> +               }
>> +               loff_t orig_goal_end = fex_end(sbi, &ex);
>>
>>                   /* we can't allocate as much as normalizer wants.
>>                    * so, found space must get proper lstart
>> @@ -5092,29 +5095,23 @@ ext4_mb_new_inode_pa(struct
>> ext4_allocation_context *ac)
>>                    *    still cover original start
>>                    * 3. Else, keep the best ex at start of original request.
>>                    */
>> -               new_bex_end = ac->ac_g_ex.fe_logical +
>> -                       EXT4_C2B(sbi, ac->ac_orig_goal_len);
>> -               new_bex_start = new_bex_end - EXT4_C2B(sbi,
>> ac->ac_b_ex.fe_len);
>> -               if (ac->ac_o_ex.fe_logical >= new_bex_start)
>> -                       goto adjust_bex;
>> +               ex.fe_len = ac->ac_b_ex.fe_len;
>>
>> -               new_bex_start = ac->ac_g_ex.fe_logical;
>> -               new_bex_end =
>> -                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>> -               if (ac->ac_o_ex.fe_logical < new_bex_end)
>> +               ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
>> +               if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
>>                           goto adjust_bex;
>>
>> -               new_bex_start = ac->ac_o_ex.fe_logical;
>> -               new_bex_end =
>> -                       new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
>> +               ex.fe_logical = ac->ac_g_ex.fe_logical;
>> +               if (ac->ac_o_ex.fe_logical < fex_end(sbi, &ex))
>> +                       goto adjust_bex;
>>
>> +               ex.fe_logical = ac->ac_o_ex.fe_logical;
>>    adjust_bex:
>> -               ac->ac_b_ex.fe_logical = new_bex_start;
>> +               ac->ac_b_ex.fe_logical = ex.fe_logical;
>>
>>                   BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>>                   BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
>> -               BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
>> -                                     EXT4_C2B(sbi, ac->ac_orig_goal_len)));
>> +               BUG_ON(fex_end(sbi, &ex) > orig_goal_end);
>>           }
>>
>>           pa->pa_lstart = ac->ac_b_ex.fe_logical;
Thanks!
Theodore Ts'o July 21, 2023, 5:15 p.m. UTC | #9
On Fri, Jul 21, 2023 at 05:13:13PM +0800, Baokun Li wrote:
> > Doing this with helpers, IMO is not useful as we also saw above.
> 
> I still think it is necessary to add a helper to make the code more concise.
> 
> Ted, do you think the fex_end() helper function is needed here?
> 
> I think we might need your advice to end this discussion. 😅

Having helper functions doesn't bother me all _that_ much --- so long
as they are named appropriately.  The readibility issues start with
the fact that the helper function uses loff_t as opposed to
ext4_lblk_t, and because someone looking at fex_end() would need
additional thinking to figure out what it did.  If renamed it to be
fex_logical_end() and made it take an ext4_lblk_t, I think it would be
better.

The bigger complaint that I have, although it's not the fault of your
patch, is the use of "ext4_free_extent" all over ext4/mballoc.c when
it's much more often used for allocating blocks.  So the name of the
structure and the "fex" in "fex_end" or "fex_logical_end" is also
confusing.

Hmm... how about naming helper function extent_logial_end()?

And at some point we might want to do a global search and replace
changing ext4_free_extent to something like ext4_mballoc_extent, and
changing the structure element names.  Perhaps:

       fe_logical  --->  ex_lblk
       fe_start    --->  ex_cluster_start
       fe_group    --->  ex_group
       fe_len      --->  ex_cluster_len

This is addressing problems where "start" can mean the starting
physical block, the starting logical block, or the starting cluster
relative to the beginning of the block group.

There is also documentation which is just wrong.  For example:

/**
 * ext4_trim_all_free -- function to trim all free space in alloc. group
 * @sb:			super block for file system
 * @group:		group to be trimmed
 * @start:		first group block to examine
 * @max:		last group block to examine

start and max should be "first group cluster to examine" and "last
group cluster to examine", respectively.

The bottom line is that there are much larger opportunities to make
the code more maintainable than worrying about two new helper
functions.  :-)

Cheers,

					- Ted
Baokun Li July 22, 2023, 3:16 a.m. UTC | #10
On 2023/7/22 1:15, Theodore Ts'o wrote:
> On Fri, Jul 21, 2023 at 05:13:13PM +0800, Baokun Li wrote:
>>> Doing this with helpers, IMO is not useful as we also saw above.
>> I still think it is necessary to add a helper to make the code more concise.
>>
>> Ted, do you think the fex_end() helper function is needed here?
>>
>> I think we might need your advice to end this discussion. 😅
> Having helper functions doesn't bother me all _that_ much --- so long
> as they are named appropriately.  The readibility issues start with
> the fact that the helper function uses loff_t as opposed to
> ext4_lblk_t, and because someone looking at fex_end() would need
> additional thinking to figure out what it did.  If renamed it to be
> fex_logical_end() and made it take an ext4_lblk_t, I think it would be
> better.
Yes, naming is one of the most difficult things.

The reason the helper function uses loff_t instead of ext4_lblk_t is because
when we compute the extent logical end or pa end, we may exceed the
maximum length of ext4_lblk_t and get an overflowed result, which can lead
to the four issues fixed in the patch set. Perhaps I should add some 
comments
to explain these.

In other words, add this helper function and limit the return value of 
the function
to loff_t is precisely to remind people that such overflow problems exist.
As mentioned by ritesh, ext4 has many places to directly perform type 
conversion
in place. However, when we modify the kernel or review code, we will 
still ignore
that the current code may have overflow problems, as in the commit fixed 
by this
patch.

If we have a helper function fex_end(), we can avoid potential overflow 
problems
by using it directly when we make changes or by questioning why we 
didn't use
a simpler helper function when reviewing the code, rather than just 
adding a (loff_t)
when we spot the problem and saying, "Oh, the problem is perfectly 
solved!" 😀

The current problem can be solved in any way, the key is how to prevent 
similar
problems in the future?
> The bigger complaint that I have, although it's not the fault of your
> patch, is the use of "ext4_free_extent" all over ext4/mballoc.c when
> it's much more often used for allocating blocks.  So the name of the
> structure and the "fex" in "fex_end" or "fex_logical_end" is also
> confusing.
>
> Hmm... how about naming helper function extent_logial_end()?
Great! Thank you for naming the helper function.
>
> And at some point we might want to do a global search and replace
> changing ext4_free_extent to something like ext4_mballoc_extent, and
> changing the structure element names.  Perhaps:
>
>         fe_logical  --->  ex_lblk
>         fe_start    --->  ex_cluster_start
>         fe_group    --->  ex_group
>         fe_len      --->  ex_cluster_len
>
> This is addressing problems where "start" can mean the starting
> physical block, the starting logical block, or the starting cluster
> relative to the beginning of the block group.
>
> There is also documentation which is just wrong.  For example:
>
> /**
>   * ext4_trim_all_free -- function to trim all free space in alloc. group
>   * @sb:			super block for file system
>   * @group:		group to be trimmed
>   * @start:		first group block to examine
>   * @max:		last group block to examine
>
> start and max should be "first group cluster to examine" and "last
> group cluster to examine", respectively.
Yes, it is also very important to distinguish between mballoc_extent and 
free_extent.
I will try to rename ext4_free_extent to ext4_mballoc_extent and fix 
some comment
errors in another patch set that does some performance optimizations using
"free extent".
>
> The bottom line is that there are much larger opportunities to make
> the code more maintainable than worrying about two new helper
> functions.  :-)
>
> Cheers,
>
> 					- Ted
Yes, making code more maintainable is always the goal.
Thank you for your patient explanation!

Cheers!
diff mbox series

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index eb7f5d35ef96..2090e5e7ba58 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5076,8 +5076,8 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	pa = ac->ac_pa;
 
 	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
-		int new_bex_start;
-		int new_bex_end;
+		ext4_lblk_t new_bex_start;
+		loff_t new_bex_end;
 
 		/* we can't allocate as much as normalizer wants.
 		 * so, found space must get proper lstart
@@ -5096,8 +5096,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		 *    still cover original start
 		 * 3. Else, keep the best ex at start of original request.
 		 */
-		new_bex_end = ac->ac_g_ex.fe_logical +
-			EXT4_C2B(sbi, ac->ac_orig_goal_len);
+		new_bex_end = fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len);
 		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
 		if (ac->ac_o_ex.fe_logical >= new_bex_start)
 			goto adjust_bex;
@@ -5117,8 +5116,8 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 
 		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
 		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
-		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
-				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
+		BUG_ON(new_bex_end >
+			fex_end(sbi, &ac->ac_g_ex, &ac->ac_orig_goal_len));
 	}
 
 	pa->pa_lstart = ac->ac_b_ex.fe_logical;