ext4: correct best extent lstart adjustment logic

Message ID 20240122123332.555370-1-libaokun1@huawei.com
State Superseded

Commit Message

Baokun Li Jan. 22, 2024, 12:33 p.m. UTC
When yangerkun reviewed commit 93cdf49f6eca ("ext4: Fix best extent lstart
adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
extent did not completely cover the original request after adjusting the
best extent lstart in ext4_mb_new_inode_pa() as follows:

  original request: 2/10(8)
  normalized request: 0/64(64)
  best extent: 0/9(9)

When we check whether the best extent can be kept at the start of the goal,
ac_o_ex.fe_logical is 2, which is less than the adjusted best extent logical
end of 9, so we consider the adjustment done. But 0/9(9) does not cover
2/10(8), so the check should instead test whether the original request
logical end is less than or equal to the adjusted best extent logical end.

Moreover, the best extent len is not modified during the adjustment
process, and it is already checked by the previous assertion, so replace
the check for fe_len with a check for the best extent logical end.

Cc: stable@kernel.org
Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
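
The example in the commit message can be replayed with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct range` and `pick_lstart()` are hypothetical names, everything is in block units (so EXT4_C2B() is treated as the identity), and step 1 is reconstructed from the code comment visible in the diff context rather than from the diff itself.

```c
#include <assert.h>

/* Simplified model of an extent: block units only (assumption). */
struct range {
	long long start;	/* first logical block */
	long long len;		/* length in blocks */
};

static long long range_end(struct range r)
{
	return r.start + r.len;
}

/*
 * Hypothetical helper modelling the three-step lstart choice.
 * cover_end == 0: step 2 as before the patch (must cover original start).
 * cover_end != 0: step 2 as in the patch (must cover original end).
 */
long long pick_lstart(struct range goal, struct range orig,
		      long long best_len, int cover_end)
{
	struct range ex = { .start = 0, .len = best_len };

	assert(best_len > 0 && best_len < goal.len);

	/* 1. Keep the best extent at the end of the goal if that still
	 *    covers the original start. */
	ex.start = range_end(goal) - best_len;
	if (orig.start >= ex.start)
		return ex.start;

	/* 2. Keep the best extent at the start of the goal. */
	ex.start = goal.start;
	if (cover_end ? range_end(orig) <= range_end(ex)
		      : orig.start < range_end(ex))
		return ex.start;

	/* 3. Otherwise start the best extent at the original request. */
	return orig.start;
}
```

With goal 0/64(64), original request 2/10(8) and a best extent length of 9, the unpatched step 2 keeps lstart at 0, giving 0/9(9), while the patched step 2 falls through to step 3 and returns 2, giving 2/11(9).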

Comments

Jan Kara Jan. 31, 2024, 12:46 p.m. UTC | #1
[Added Ojaswin to CC as an author of the discussed patch]

On Mon 22-01-24 20:33:32, Baokun Li wrote:
> When yangerkun reviewed commit 93cdf49f6eca ("ext4: Fix best extent lstart
> adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
> extent did not completely cover the original request after adjusting the
> best extent lstart in ext4_mb_new_inode_pa() as follows:
> 
>   original request: 2/10(8)
>   normalized request: 0/64(64)
>   best extent: 0/9(9)
> 
> When we check whether the best extent can be kept at the start of the goal,
> ac_o_ex.fe_logical is 2, which is less than the adjusted best extent logical
> end of 9, so we consider the adjustment done. But 0/9(9) does not cover
> 2/10(8), so the check should instead test whether the original request
> logical end is less than or equal to the adjusted best extent logical end.

I'm sorry for the slightly delayed reply. Why do you think it is a problem if the
resulting extent doesn't cover the full original range? We must always
cover the first block of the original extent so that the allocation makes
forward progress. But otherwise we choose to align to the start / end of
the goal range to reduce fragmentation even if we don't cover the whole
requested range - the rest of the range will be covered by the next
allocation. Also there is a problem with trying to cover the whole original
range described in [1]. Essentially the goal range does not need to cover
the whole original range and if we try to align the allocated range to
cover the whole original range, it may result in exceeding the goal range
and thus overlapping preallocations and triggering asserts in the prealloc
code.

So if we decided we want to handle the case you describe in a better way,
we'd need something making sure we don't exceed the goal range.

								Honza

[1] https://lore.kernel.org/all/Y+UzQJRIJEiAr4Z4@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com/

> 
> Moreover, the best extent len is not modified during the adjustment
> process, and it is already checked by the previous assertion, so replace
> the check for fe_len with a check for the best extent logical end.
> 
> Cc: stable@kernel.org
> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
> Signed-off-by: yangerkun <yangerkun@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/mballoc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f44f668e407f..fa5977fe8d72 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5146,6 +5146,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  			.fe_len = ac->ac_orig_goal_len,
>  		};
>  		loff_t orig_goal_end = extent_logical_end(sbi, &ex);
> +		loff_t o_ex_end = extent_logical_end(sbi, &ac->ac_o_ex);
>  
>  		/* we can't allocate as much as normalizer wants.
>  		 * so, found space must get proper lstart
> @@ -5161,7 +5162,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  		 * 1. Check if best ex can be kept at end of goal (before
>  		 *    cr_best_avail trimmed it) and still cover original start
>  		 * 2. Else, check if best ex can be kept at start of goal and
> -		 *    still cover original start
> +		 *    still cover original end
>  		 * 3. Else, keep the best ex at start of original request.
>  		 */
>  		ex.fe_len = ac->ac_b_ex.fe_len;
> @@ -5171,7 +5172,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  			goto adjust_bex;
>  
>  		ex.fe_logical = ac->ac_g_ex.fe_logical;
> -		if (ac->ac_o_ex.fe_logical < extent_logical_end(sbi, &ex))
> +		if (o_ex_end <= extent_logical_end(sbi, &ex))
>  			goto adjust_bex;
>  
>  		ex.fe_logical = ac->ac_o_ex.fe_logical;
> @@ -5179,7 +5180,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  		ac->ac_b_ex.fe_logical = ex.fe_logical;
>  
>  		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
> -		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> +		BUG_ON(o_ex_end > extent_logical_end(sbi, &ex));
>  		BUG_ON(extent_logical_end(sbi, &ex) > orig_goal_end);
>  	}
>  
> -- 
> 2.31.1
>
Baokun Li Feb. 1, 2024, 3:31 a.m. UTC | #2
On 2024/1/31 20:46, Jan Kara wrote:
> [Added Ojaswin to CC as an author of the discussed patch]
>
> On Mon 22-01-24 20:33:32, Baokun Li wrote:
>> When yangerkun reviewed commit 93cdf49f6eca ("ext4: Fix best extent lstart
>> adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best
>> extent did not completely cover the original request after adjusting the
>> best extent lstart in ext4_mb_new_inode_pa() as follows:
>>
>>    original request: 2/10(8)
>>    normalized request: 0/64(64)
>>    best extent: 0/9(9)
>>
>> When we check whether the best extent can be kept at the start of the goal,
>> ac_o_ex.fe_logical is 2, which is less than the adjusted best extent logical
>> end of 9, so we consider the adjustment done. But 0/9(9) does not cover
>> 2/10(8), so the check should instead test whether the original request
>> logical end is less than or equal to the adjusted best extent logical end.

Hello Jan,

Thanks for the detailed explanation! 😉

> I'm sorry for the slightly delayed reply. Why do you think it is a problem if the
> resulting extent doesn't cover the full original range?

We adjust lstart when ac_o_ex.fe_len < ac_b_ex.fe_len and
ac_b_ex.fe_len < ac->ac_orig_goal_len, in which case the length of
the allocation is greater than the length of the original request,
and we would normally assume that this allocation would satisfy
the request for the block allocation without the need for an
additional allocation.

      /* we can't allocate as much as normalizer wants.
       * so, found space must get proper lstart
       * to cover original request */

The comment in the code also states that we need to "cover original
request", but the code below it does not do so. This puzzled yangerkun,
who came up with the counterexample above, which is why we consider it
a problem.

> We must always
> cover the first block of the original extent so that the allocation makes
> forward progress. But otherwise we choose to align to the start / end of
> the goal range to reduce fragmentation even if we don't cover the whole
> requested range - the rest of the range will be covered by the next
> allocation.
Totally agree. For the example above, if we end up allocating a total of
64 blocks, the final extent distribution might look like this:

Before:  [0/9(9)], [9/64(55)]
Patched: [0/2(2)], [2/11(9)], [11/64(53)]

So the question is really whether we prefer fewer allocations now
or fewer fragments later.
> Also there is a problem with trying to cover the whole original
> range described in [1]. Essentially the goal range does not need to cover
> the whole original range and if we try to align the allocated range to
> cover the whole original range, it may result in exceeding the goal range
> and thus overlapping preallocations and triggering asserts in the prealloc
> code.
>
> So if we decided we want to handle the case you describe in a better way,
> we'd need something making sure we don't exceed the goal range.
>
> 								Honza
>
> [1] https://lore.kernel.org/all/Y+UzQJRIJEiAr4Z4@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com/
goal_start          B    original_start   A              goal_end
   |-----------------|----------*----------|-----------------|
      best_ex_len                              best_ex_len

The current logic guarantees that the goal range will not be exceeded.
If original_start + best_ex_len > goal_end, then in case 1 ex_end is
adjusted to align with goal_end, and if goal_end < original_end, another
block allocation will be triggered, which is fine. But in the other cases,
we can guarantee that the original request will be covered by the adjusted
best extent.

The problem is in case 2: once the start of ex is aligned with goal_start,
we stop adjusting as soon as ex contains original_start. It may not
contain original_end, which triggers an additional block allocation,
whereas falling through to case 3 would cover the entire original request.

In general, this patch will not cause the goal range to be exceeded.
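
The claim above can be checked by brute force on a simplified model. The following is a sketch only: `patched_lstart()` and `count_goal_violations()` are hypothetical helpers, all values are in block units, and the model assumes that the original start lies inside the goal range and that the best extent is shorter than the goal (the situation in which lstart is adjusted). The original length is deliberately allowed to overflow the goal.

```c
#include <assert.h>

/* Patched lstart choice in block units (model, not the kernel code). */
long long patched_lstart(long long goal_start, long long goal_len,
			 long long orig_start, long long orig_len,
			 long long best_len)
{
	long long goal_end = goal_start + goal_len;
	long long orig_end = orig_start + orig_len;

	/* Assumptions of this model. */
	assert(best_len > 0 && best_len < goal_len && orig_len > 0);
	assert(orig_start >= goal_start && orig_start < goal_end);

	if (orig_start >= goal_end - best_len)	/* case 1: align to goal end */
		return goal_end - best_len;
	if (orig_end <= goal_start + best_len)	/* case 2: align to goal start */
		return goal_start;
	return orig_start;			/* case 3: start at original */
}

/*
 * Count parameter combinations for which the adjusted extent leaves the
 * goal range or fails to contain the original start.
 */
int count_goal_violations(long long max_goal_len)
{
	int bad = 0;

	for (long long g = 2; g <= max_goal_len; g++)
		for (long long b = 1; b < g; b++)
			for (long long s = 0; s < g; s++)
				for (long long l = 1; l <= 2 * g; l++) {
					long long ls = patched_lstart(0, g, s, l, b);

					if (ls < 0 || ls + b > g)
						bad++;	/* exceeds goal range */
					else if (s < ls || s >= ls + b)
						bad++;	/* misses original start */
				}
	return bad;
}
```

In this model the count stays at zero: case 1 ends exactly at goal_end, case 2 ends at goal_start + best_len, which is below goal_end, and case 3 is only reached when original_start + best_len is below goal_end.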
>> Moreover, the best extent len is not modified during the adjustment
>> process, and it is already checked by the previous assertion, so replace
>> the check for fe_len with a check for the best extent logical end.
>>
>> Cc: stable@kernel.org
>> Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
>> Signed-off-by: yangerkun <yangerkun@huawei.com>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/mballoc.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index f44f668e407f..fa5977fe8d72 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -5146,6 +5146,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   			.fe_len = ac->ac_orig_goal_len,
>>   		};
>>   		loff_t orig_goal_end = extent_logical_end(sbi, &ex);
>> +		loff_t o_ex_end = extent_logical_end(sbi, &ac->ac_o_ex);
>>   
>>   		/* we can't allocate as much as normalizer wants.
>>   		 * so, found space must get proper lstart
>> @@ -5161,7 +5162,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   		 * 1. Check if best ex can be kept at end of goal (before
>>   		 *    cr_best_avail trimmed it) and still cover original start
>>   		 * 2. Else, check if best ex can be kept at start of goal and
>> -		 *    still cover original start
>> +		 *    still cover original end
>>   		 * 3. Else, keep the best ex at start of original request.
>>   		 */
>>   		ex.fe_len = ac->ac_b_ex.fe_len;
>> @@ -5171,7 +5172,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   			goto adjust_bex;
>>   
>>   		ex.fe_logical = ac->ac_g_ex.fe_logical;
>> -		if (ac->ac_o_ex.fe_logical < extent_logical_end(sbi, &ex))
>> +		if (o_ex_end <= extent_logical_end(sbi, &ex))
>>   			goto adjust_bex;
>>   
>>   		ex.fe_logical = ac->ac_o_ex.fe_logical;
>> @@ -5179,7 +5180,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>>   		ac->ac_b_ex.fe_logical = ex.fe_logical;
>>   
>>   		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
>> -		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
>> +		BUG_ON(o_ex_end > extent_logical_end(sbi, &ex));
>>   		BUG_ON(extent_logical_end(sbi, &ex) > orig_goal_end);
>>   	}
>>   
>> -- 
>> 2.31.1
>>
Cheers!
Ojaswin Mujoo Feb. 1, 2024, 11:08 a.m. UTC | #3
Hi Baokun, Jan

Thanks for the CC, I somehow missed this patch.

As described in the discussion Jan linked [1], there is a known bug in the
normalize code (which I should probably get back to now) where we sometimes
end up with a goal range which doesn't completely cover the original extent, and
this was causing issues when we tried to cover the complete original request in
the PA window adjustment logic. Because of that, and to minimize fragmentation,
we ended up going with the logic we have right now.

In short, I agree that in the example Baokun pointed out, it is not optimal to
have to make an allocation request twice when we can get it in one go.

I also think Baokun is correct that if keeping the best extent at the end doesn't 
cover the original start, then any other case should not lead to it overflowing out
of goal extent, including the case where original extent is overflowing goal extent.

So, as mentioned, it boils down to a trade-off between multiple allocations and
slightly increased fragmentation. IIUC, preallocations are dropped when the file
closes anyway, so I think it shouldn't hurt too much fragmentation-wise to
prioritize fewer allocations. What are your thoughts on this, Jan, Baokun?

Coming to the code, the only thing I think might cause an issue is the following line:

-		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
+		BUG_ON(o_ex_end > extent_logical_end(sbi, &ex));

As discussed towards the end of [1], we could have an ac_o_ex that
overflows the goal and hence ends beyond the best extent. I'll try
to look into the normalize logic to fix this, but until then I think
we should not have this BUG_ON, since it would crash the kernel if this
happens.
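
To make the concern concrete, here is a small user-space sketch of when the proposed assertion condition would be true. It is a model under assumptions, not the kernel code: `patched_lstart()`, `would_trip()` and `count_trips()` are hypothetical helpers, values are in block units, and the original length is assumed not to exceed the best extent length (which is what the removed fe_len assertion checked).

```c
#include <assert.h>

/* Patched lstart choice in block units (model, not the kernel code). */
static long long patched_lstart(long long goal_start, long long goal_len,
				long long orig_start, long long orig_len,
				long long best_len)
{
	long long goal_end = goal_start + goal_len;

	if (orig_start >= goal_end - best_len)			/* case 1 */
		return goal_end - best_len;
	if (orig_start + orig_len <= goal_start + best_len)	/* case 2 */
		return goal_start;
	return orig_start;					/* case 3 */
}

/* Returns 1 if the condition of BUG_ON(o_ex_end > ex_end) would be true. */
int would_trip(long long goal_start, long long goal_len,
	       long long orig_start, long long orig_len, long long best_len)
{
	long long ls;

	assert(orig_len > 0 && orig_len <= best_len && best_len < goal_len);
	ls = patched_lstart(goal_start, goal_len, orig_start, orig_len,
			    best_len);
	return orig_start + orig_len > ls + best_len;
}

/* Count trips over a small parameter space, with the goal starting at 0. */
int count_trips(long long max_goal_len, int allow_overflow)
{
	int trips = 0;

	for (long long g = 2; g <= max_goal_len; g++)
		for (long long b = 1; b < g; b++)
			for (long long s = 0; s < g; s++)
				for (long long l = 1; l <= b; l++) {
					if (!allow_overflow && s + l > g)
						continue; /* keep orig inside goal */
					trips += would_trip(0, g, s, l, b);
				}
	return trips;
}
```

In this model the condition never holds while the original request stays inside the goal, but it does hold once the request overflows the goal: with goal 0/8(8), original 6/10(4) and a best extent length of 4, case 1 places the extent at 4/8(4), which ends before the original end of 10.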

Rest of it looks good to me.

Regards,
Ojaswin

[1]
https://lore.kernel.org/all/Y+UzQJRIJEiAr4Z4@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com/
Jan Kara Feb. 1, 2024, 11:46 a.m. UTC | #4
Hi guys!

On Thu 01-02-24 16:38:33, Ojaswin Mujoo wrote:
> Thanks for the CC, I somehow missed this patch.
> 
> As described in the discussion Jan linked [1], there is a known bug in the
> normalize code (which I should probably get back to now) where we sometimes
> end up with a goal range which doesn't completely cover the original extent, and
> this was causing issues when we tried to cover the complete original request in
> the PA window adjustment logic. Because of that, and to minimize fragmentation,
> we ended up going with the logic we have right now.
> 
> In short, I agree that in the example Baokun pointed out, it is not
> optimal to have to make an allocation request twice when we can get it in
> one go.
> 
> I also think Baokun is correct that if keeping the best extent at the end
> doesn't cover the original start, then any other case should not lead to
> it overflowing out of goal extent, including the case where original
> extent is overflowing goal extent.

Right, it was not obvious to me yesterday, but now that I've reread how the
normalization shifts the goal window, it is clear.

> So, as mentioned, it boils down to a trade-off between multiple allocations and
> slightly increased fragmentation. IIUC, preallocations are dropped when the file
> closes anyway, so I think it shouldn't hurt too much fragmentation-wise to
> prioritize fewer allocations. What are your thoughts on this, Jan, Baokun?

OK, I'm fine with Baokun's change if we remove the problematic BUG_ON.

								Honza
Baokun Li Feb. 1, 2024, 12:34 p.m. UTC | #5
On 2024/2/1 19:08, Ojaswin Mujoo wrote:

Hi Ojaswin, Jan

> Hi Baokun, Jan
>
> Thanks for the CC, I somehow missed this patch.
>
> As described in the discussion Jan linked [1], there is a known bug in the
> normalize code (which I should probably get back to now) where we sometimes
> end up with a goal range which doesn't completely cover the original extent, and
> this was causing issues when we tried to cover the complete original request in
> the PA window adjustment logic. Because of that, and to minimize fragmentation,
> we ended up going with the logic we have right now.
>
> In short, I agree that in the example Baokun pointed out, it is not optimal to
> have to make an allocation request twice when we can get it in one go.
>
> I also think Baokun is correct that if keeping the best extent at the end doesn't
> cover the original start, then any other case should not lead to it overflowing out
> of goal extent, including the case where original extent is overflowing goal extent.
>
> So, as mentioned, it boils down to a trade-off between multiple allocations and
> slightly increased fragmentation. IIUC, preallocations are dropped when the file
> closes anyway, so I think it shouldn't hurt too much fragmentation-wise to
> prioritize fewer allocations. What are your thoughts on this, Jan, Baokun?
>
> Coming to the code, the only thing I think might cause an issue is the following line:
>
> -		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> +		BUG_ON(o_ex_end > extent_logical_end(sbi, &ex));
>
> As discussed towards the end of [1], we could have an ac_o_ex that
> overflows the goal and hence ends beyond the best extent. I'll try
> to look into the normalize logic to fix this, but until then I think
> we should not have this BUG_ON, since it would crash the kernel if this
> happens.
>
> Rest of it looks good to me.
>
> Regards,
> Ojaswin
>
> [1]
> https://lore.kernel.org/all/Y+UzQJRIJEiAr4Z4@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com/
I will remove the problematic BUG_ON and add some comments in
the next version.

Thanks to Ojaswin and Jan for the review!
Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f44f668e407f..fa5977fe8d72 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5146,6 +5146,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 			.fe_len = ac->ac_orig_goal_len,
 		};
 		loff_t orig_goal_end = extent_logical_end(sbi, &ex);
+		loff_t o_ex_end = extent_logical_end(sbi, &ac->ac_o_ex);
 
 		/* we can't allocate as much as normalizer wants.
 		 * so, found space must get proper lstart
@@ -5161,7 +5162,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		 * 1. Check if best ex can be kept at end of goal (before
 		 *    cr_best_avail trimmed it) and still cover original start
 		 * 2. Else, check if best ex can be kept at start of goal and
-		 *    still cover original start
+		 *    still cover original end
 		 * 3. Else, keep the best ex at start of original request.
 		 */
 		ex.fe_len = ac->ac_b_ex.fe_len;
@@ -5171,7 +5172,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 			goto adjust_bex;
 
 		ex.fe_logical = ac->ac_g_ex.fe_logical;
-		if (ac->ac_o_ex.fe_logical < extent_logical_end(sbi, &ex))
+		if (o_ex_end <= extent_logical_end(sbi, &ex))
 			goto adjust_bex;
 
 		ex.fe_logical = ac->ac_o_ex.fe_logical;
@@ -5179,7 +5180,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		ac->ac_b_ex.fe_logical = ex.fe_logical;
 
 		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
-		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
+		BUG_ON(o_ex_end > extent_logical_end(sbi, &ex));
 		BUG_ON(extent_logical_end(sbi, &ex) > orig_goal_end);
 	}