[3/5] ext4: call ext4_mb_mark_free_simple in mb_mark_used to clear bits

Message ID 20240326213823.528302-4-shikemeng@huaweicloud.com
State New
Series Minor improvements and cleanups to ext4 mballoc

Commit Message

Kemeng Shi March 26, 2024, 9:38 p.m. UTC
ext4_mb_mark_free_simple() can find the order for bit clearing at O(1)
cost, while mb_mark_used() searches for the order in O(distance from chunk
order to target order) steps and introduces unnecessary bit flips.

Consider 4 contiguous free bits, of which bits 0-2 are to be marked in use.
Initial state of the buddy bitmap:
order 2 |           0           |
order 1 |     1     |     1     |
order 0 |  1  |  1  |  1  |  1  |

mark whole chunk inuse
order 2 |           1           |
order 1 |     1     |     1     |
order 0 |  1  |  1  |  1  |  1  |

split chunk to order 1
order 2 |           1           |
order 1 |     0     |     0     |
order 0 |  1  |  1  |  1  |  1  |

set the first bit in order 1 to mark bit 0-1 inuse
set the second bit in order 1 for split
order 2 |           1           |
order 1 |     1     |     1     |
order 0 |  1  |  1  |  1  |  1  |

step 3: split the second bit in order 1 to order 0
order 2 |           1           |
order 1 |     1     |     1     |
order 0 |  1  |  1  |  0  |  0  |

step 4: set the third bit in order 0 to mark bit 2 inuse.
order 2 |           1           |
order 1 |     1     |     1     |
order 0 |  1  |  1  |  1  |  0  |
There are two unnecessary splits and three unnecessary bit flips.

With ext4_mb_mark_free_simple, we will clear the 4th bit in order 0
with O(1) search and no extra bit flip.

The cost estimated by test_mb_mark_used_cost is as follows:
Before (six runs of the test):
    # test_mb_mark_used_cost: costed jiffies 311
    # test_mb_mark_used_cost: costed jiffies 304
    # test_mb_mark_used_cost: costed jiffies 305
    # test_mb_mark_used_cost: costed jiffies 323
    # test_mb_mark_used_cost: costed jiffies 317
    # test_mb_mark_used_cost: costed jiffies 317
After (five runs of the test):
    # test_mb_mark_used_cost: costed jiffies 166
    # test_mb_mark_used_cost: costed jiffies 152
    # test_mb_mark_used_cost: costed jiffies 159
    # test_mb_mark_used_cost: costed jiffies 138
    # test_mb_mark_used_cost: costed jiffies 149

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 fs/ext4/mballoc.c | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

Comments

Jan Kara April 4, 2024, 2:16 p.m. UTC | #1
On Wed 27-03-24 05:38:21, Kemeng Shi wrote:
> Function ext4_mb_mark_free_simple could search order for bit clearing in
> O(1) cost while mb_mark_used will search order in O(distance from chunk
> order to target order) and introduce unnecessary bit flips.

Let me see if I understand you right. I agree that mb_mark_used() is
actually O(log(bitmap_size)^2) because each call to
mb_find_order_for_block() is O(log(bitmap_size)). Do I understand your
concern right?

> Consider we have 4 continuous free bits and going to mark bit 0-2 inuse.
> initial state of buddy bitmap:
> order 2 |           0           |
> order 1 |     1     |     1     |
> order 0 |  1  |  1  |  1  |  1  |
>
> mark whole chunk inuse
> order 2 |           1           |
> order 1 |     1     |     1     |
> order 0 |  1  |  1  |  1  |  1  |
> 
> split chunk to order 1
> order 2 |           1           |
> order 1 |     0     |     0     |
> order 0 |  1  |  1  |  1  |  1  |
> 
> set the first bit in order 1 to mark bit 0-1 inuse
> set the second bit in order 1 for split
> order 2 |           1           |
> order 1 |     1     |     1     |
> order 0 |  1  |  1  |  1  |  1  |
> 
> step 3: split the second bit in order 1 to order 0
> order 2 |           1           |
> order 1 |     1     |     1     |
> order 0 |  1  |  1  |  0  |  0  |
> 
> step 4: set the third bit in order 0 to mark bit 2 inuse.
> order 2 |           1           |
> order 1 |     1     |     1     |
> order 0 |  1  |  1  |  1  |  0  |
> There are two unnecessary splits and three unnecessary bit flips.
> 
> With ext4_mb_mark_free_simple, we will clear the 4th bit in order 0
> with O(1) search and no extra bit flip.

However this looks like a bit ugly way to speed it up, I'm not even sure
this would result in practical speedups and asymptotically, I think the
complexity is still O(log^2). Also the extra bit flips are not really a
concern I'd say as they are in the same cacheline anyway. The unnecessary
overhead (if at all measurable) comes from the O(log^2) behavior. And there
I agree we could do better by not starting the block order search from 1 in
all the cases - we know the found order will be first increasing for some
time and then decreasing again so with some effort we could amortize all
block order searches to O(log) time. But it makes the code more complex and
I'm not convinced this is all worth it. So if you want to go this direction,
then please provide (micro-)benchmarks from real hardware (not just
theoretical cost estimations) showing the benefit. Thanks.

								Honza
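
To make the O(log^2) remark concrete, here is a rough userspace model of an
order search that always starts from order 1, in the spirit of
mb_find_order_for_block(). It is only a sketch under assumptions: the 8-block
toy group and the names toy_buddy, toy_buddy_init and find_order_for_block
are hypothetical and not taken from fs/ext4/mballoc.c. The probes counter
shows that each search can cost up to one probe per order, so repeating the
search once per order visited gives the quadratic-in-log behavior.

```c
#include <assert.h>

#define TOY_MAX_ORDER 3   /* hypothetical: orders 1..3 over 8 blocks */
#define TOY_BLOCKS    8

/* bits[o][i] == 0 means chunk i of order o is free as a whole (o >= 1). */
struct toy_buddy {
	unsigned char bits[TOY_MAX_ORDER + 1][TOY_BLOCKS];
};

/* Start with no free chunk recorded at any order. */
static void toy_buddy_init(struct toy_buddy *b)
{
	for (int o = 0; o <= TOY_MAX_ORDER; o++)
		for (int i = 0; i < TOY_BLOCKS; i++)
			b->bits[o][i] = 1;
}

/*
 * Linear search from order 1 upward for the free chunk covering 'block'.
 * Returns the order found, or 0 if no chunk of order >= 1 covers it.
 * '*probes' is incremented once per order examined.
 */
static int find_order_for_block(const struct toy_buddy *b, int block,
				int *probes)
{
	assert(block >= 0 && block < TOY_BLOCKS);
	for (int order = 1; order <= TOY_MAX_ORDER; order++) {
		(*probes)++;
		if (!b->bits[order][block >> order])
			return order;
	}
	return 0;
}
```

With only the order-3 chunk free, looking up any block takes three probes;
with a free order-1 chunk covering the block, it takes one.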

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index a61fc52956b2..62d468379722 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2040,13 +2040,12 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>  	int ord;
>  	int mlen = 0;
>  	int max = 0;
> -	int cur;
>  	int start = ex->fe_start;
>  	int len = ex->fe_len;
>  	unsigned ret = 0;
>  	int len0 = len;
>  	void *buddy;
> -	bool split = false;
> +	int ord_start, ord_end;
>  
>  	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
>  	BUG_ON(e4b->bd_group != ex->fe_group);
> @@ -2071,16 +2070,12 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>  
>  	/* let's maintain buddy itself */
>  	while (len) {
> -		if (!split)
> -			ord = mb_find_order_for_block(e4b, start);
> +		ord = mb_find_order_for_block(e4b, start);
>  
>  		if (((start >> ord) << ord) == start && len >= (1 << ord)) {
>  			/* the whole chunk may be allocated at once! */
>  			mlen = 1 << ord;
> -			if (!split)
> -				buddy = mb_find_buddy(e4b, ord, &max);
> -			else
> -				split = false;
> +			buddy = mb_find_buddy(e4b, ord, &max);
>  			BUG_ON((start >> ord) >= max);
>  			mb_set_bit(start >> ord, buddy);
>  			e4b->bd_info->bb_counters[ord]--;
> @@ -2094,20 +2089,28 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>  		if (ret == 0)
>  			ret = len | (ord << 16);
>  
> -		/* we have to split large buddy */
>  		BUG_ON(ord <= 0);
>  		buddy = mb_find_buddy(e4b, ord, &max);
>  		mb_set_bit(start >> ord, buddy);
>  		e4b->bd_info->bb_counters[ord]--;
>  
> -		ord--;
> -		cur = (start >> ord) & ~1U;
> -		buddy = mb_find_buddy(e4b, ord, &max);
> -		mb_clear_bit(cur, buddy);
> -		mb_clear_bit(cur + 1, buddy);
> -		e4b->bd_info->bb_counters[ord]++;
> -		e4b->bd_info->bb_counters[ord]++;
> -		split = true;
> +		ord_start = (start >> ord) << ord;
> +		ord_end = ord_start + (1 << ord);
> +		if (start > ord_start)
> +			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
> +						 ord_start, start - ord_start,
> +						 e4b->bd_info);
> +
> +		if (start + len < ord_end) {
> +			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
> +						 start + len,
> +						 ord_end - (start + len),
> +						 e4b->bd_info);
> +			break;
> +		}
> +
> +		len = start + len - ord_end;
> +		start = ord_end;
>  	}
>  	mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>  
> -- 
> 2.30.0
>

Kemeng Shi April 7, 2024, 6:31 a.m. UTC | #2
on 4/4/2024 10:16 PM, Jan Kara wrote:
> On Wed 27-03-24 05:38:21, Kemeng Shi wrote:
>> Function ext4_mb_mark_free_simple could search order for bit clearing in
>> O(1) cost while mb_mark_used will search order in O(distance from chunk
>> order to target order) and introduce unnecessary bit flips.
> 
> Let me see if I understand you right. I agree that mb_mark_used() is
> actually O(log(bitmap_size)^2) because each call to
> mb_find_order_for_block() is O(log(bitmap_size)). Do I understand your
> concern right?

Sorry for the confusion. What I mean is the number of bit-clear rounds
needed after mb_find_order_for_block() to mark the unused part of a chunk
free.

In mb_mark_used(), we find free chunks and mark them in use. A chunk in the
middle of the passed range can simply be marked in use as a whole. For a
chunk at the end of the range, only part of the chunk must be marked in use.
To do so, we first mark the whole chunk in use and then run several rounds
of "mb_find_buddy(...); mb_clear_bit(...); ..." to mark the remaining part
of the chunk free. The number of "mb_find_buddy(); ..." rounds is
[order of free chunk] - [last order to free partial part of chunk], which is
what "distance from chunk order to target order" means in the changelog.

The repeated "mb_find_buddy(...); ..." rounds mark a contiguous range of
blocks free, which is exactly what ext4_mb_mark_free_simple() does, and it
does so more efficiently than the block-freeing code in mb_mark_used(). So
we can simply compute the range that needs to be freed in the chunk at the
end of the range and call ext4_mb_mark_free_simple(), reusing the existing,
efficient code for freeing a contiguous range of blocks.
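
As an illustration of freeing a contiguous range in one pass, the sketch
below splits a range into naturally aligned power-of-two chunks, one chunk
per loop iteration. It is a userspace model written for this explanation, not
the kernel function: split_range, struct chunk, lowest_set_bit and
highest_set_bit are hypothetical names, and max_order is an assumed cap on
the chunk order.

```c
#include <assert.h>

struct chunk {
	int order;   /* chunk covers 1 << order blocks */
	int index;   /* chunk number within that order */
};

/* 0-based index of the lowest set bit; x must be non-zero. */
static int lowest_set_bit(unsigned int x)
{
	int i = 0;

	while (!(x & 1u)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* 0-based index of the highest set bit; returns -1 for x == 0. */
static int highest_set_bit(unsigned int x)
{
	int i = -1;

	while (x) {
		x >>= 1;
		i++;
	}
	return i;
}

/*
 * Split [first, first + len) into aligned power-of-two chunks.
 * Each step picks the largest order allowed by both the alignment of
 * 'first' and the remaining length. Returns the number of chunks written.
 */
static int split_range(int first, int len, int max_order,
		       struct chunk *out, int cap)
{
	int n = 0;

	assert(first >= 0 && len >= 0);
	while (len > 0 && n < cap) {
		int align = lowest_set_bit((unsigned int)first | (1u << max_order));
		int fit = highest_set_bit((unsigned int)len);
		int order = align < fit ? align : fit;

		out[n].order = order;
		out[n].index = first >> order;
		n++;
		first += 1 << order;
		len -= 1 << order;
	}
	return n;
}
```

For the changelog example (bits 0-2 in use out of 4), the range to free is
first = 3, len = 1, which yields a single order-0 chunk at index 3, i.e. the
4th bit in order 0.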

> 
>> Consider we have 4 continuous free bits and going to mark bit 0-2 inuse.
>> initial state of buddy bitmap:
>> order 2 |           0           |
>> order 1 |     1     |     1     |
>> order 0 |  1  |  1  |  1  |  1  |
>>
>> mark whole chunk inuse
>> order 2 |           1           |
>> order 1 |     1     |     1     |
>> order 0 |  1  |  1  |  1  |  1  |
>>
>> split chunk to order 1
>> order 2 |           1           |
>> order 1 |     0     |     0     |
>> order 0 |  1  |  1  |  1  |  1  |
>>
>> set the first bit in order 1 to mark bit 0-1 inuse
>> set the second bit in order 1 for split
>> order 2 |           1           |
>> order 1 |     1     |     1     |
>> order 0 |  1  |  1  |  1  |  1  |
>>
>> step 3: split the second bit in order 1 to order 0
>> order 2 |           1           |
>> order 1 |     1     |     1     |
>> order 0 |  1  |  1  |  0  |  0  |
>>
>> step 4: set the third bit in order 0 to mark bit 2 inuse.
>> order 2 |           1           |
>> order 1 |     1     |     1     |
>> order 0 |  1  |  1  |  1  |  0  |
>> There are two unnecessary splits and three unnecessary bit flips.
>>
>> With ext4_mb_mark_free_simple, we will clear the 4th bit in order 0
>> with O(1) search and no extra bit flip.
> 
> However this looks like a bit ugly way to speed it up, I'm not even sure
> this would result in practical speedups and asymptotically, I think the
> complexity is still O(log^2). Also the extra bit flips are not really a
> concern I'd say as they are in the same cacheline anyway. The unnecessary
> overhead (if at all measurable) comes from the O(log^2) behavior. And there
> I agree we could do better by not starting the block order search from 1 in
> all the cases - we know the found order will be first increasing for some
> time and then decreasing again so with some effort we could amortize all
> block order searches to O(log) time. But it makes the code more complex and
> I'm not convinced this is all worth it. So if you want to go this direction,
> then please provide (micro-)benchmarks from real hardware (not just
> theoretical cost estimations) showing the benefit. Thanks.
> 
> 								Honza
> 
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index a61fc52956b2..62d468379722 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2040,13 +2040,12 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>  	int ord;
>>  	int mlen = 0;
>>  	int max = 0;
>> -	int cur;
>>  	int start = ex->fe_start;
>>  	int len = ex->fe_len;
>>  	unsigned ret = 0;
>>  	int len0 = len;
>>  	void *buddy;
>> -	bool split = false;
>> +	int ord_start, ord_end;
>>  
>>  	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
>>  	BUG_ON(e4b->bd_group != ex->fe_group);
>> @@ -2071,16 +2070,12 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>  
>>  	/* let's maintain buddy itself */
>>  	while (len) {
>> -		if (!split)
>> -			ord = mb_find_order_for_block(e4b, start);
>> +		ord = mb_find_order_for_block(e4b, start);
>>  
>>  		if (((start >> ord) << ord) == start && len >= (1 << ord)) {
>>  			/* the whole chunk may be allocated at once! */
>>  			mlen = 1 << ord;
>> -			if (!split)
>> -				buddy = mb_find_buddy(e4b, ord, &max);
>> -			else
>> -				split = false;
>> +			buddy = mb_find_buddy(e4b, ord, &max);
>>  			BUG_ON((start >> ord) >= max);
>>  			mb_set_bit(start >> ord, buddy);
>>  			e4b->bd_info->bb_counters[ord]--;
>> @@ -2094,20 +2089,28 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>  		if (ret == 0)
>>  			ret = len | (ord << 16);
>>  
>> -		/* we have to split large buddy */
>>  		BUG_ON(ord <= 0);
>>  		buddy = mb_find_buddy(e4b, ord, &max);
>>  		mb_set_bit(start >> ord, buddy);
>>  		e4b->bd_info->bb_counters[ord]--;
>>  
>> -		ord--;
>> -		cur = (start >> ord) & ~1U;
>> -		buddy = mb_find_buddy(e4b, ord, &max);
>> -		mb_clear_bit(cur, buddy);
>> -		mb_clear_bit(cur + 1, buddy);
>> -		e4b->bd_info->bb_counters[ord]++;
>> -		e4b->bd_info->bb_counters[ord]++;
>> -		split = true;
>> +		ord_start = (start >> ord) << ord;
>> +		ord_end = ord_start + (1 << ord);
>> +		if (start > ord_start)
>> +			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
>> +						 ord_start, start - ord_start,
>> +						 e4b->bd_info);
>> +
>> +		if (start + len < ord_end) {
>> +			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
>> +						 start + len,
>> +						 ord_end - (start + len),
>> +						 e4b->bd_info);
>> +			break;
>> +		}
>> +
>> +		len = start + len - ord_end;
>> +		start = ord_end;
>>  	}
>>  	mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>>  
>> -- 
>> 2.30.0
>>

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a61fc52956b2..62d468379722 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2040,13 +2040,12 @@  static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 	int ord;
 	int mlen = 0;
 	int max = 0;
-	int cur;
 	int start = ex->fe_start;
 	int len = ex->fe_len;
 	unsigned ret = 0;
 	int len0 = len;
 	void *buddy;
-	bool split = false;
+	int ord_start, ord_end;
 
 	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
 	BUG_ON(e4b->bd_group != ex->fe_group);
@@ -2071,16 +2070,12 @@  static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 
 	/* let's maintain buddy itself */
 	while (len) {
-		if (!split)
-			ord = mb_find_order_for_block(e4b, start);
+		ord = mb_find_order_for_block(e4b, start);
 
 		if (((start >> ord) << ord) == start && len >= (1 << ord)) {
 			/* the whole chunk may be allocated at once! */
 			mlen = 1 << ord;
-			if (!split)
-				buddy = mb_find_buddy(e4b, ord, &max);
-			else
-				split = false;
+			buddy = mb_find_buddy(e4b, ord, &max);
 			BUG_ON((start >> ord) >= max);
 			mb_set_bit(start >> ord, buddy);
 			e4b->bd_info->bb_counters[ord]--;
@@ -2094,20 +2089,28 @@  static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 		if (ret == 0)
 			ret = len | (ord << 16);
 
-		/* we have to split large buddy */
 		BUG_ON(ord <= 0);
 		buddy = mb_find_buddy(e4b, ord, &max);
 		mb_set_bit(start >> ord, buddy);
 		e4b->bd_info->bb_counters[ord]--;
 
-		ord--;
-		cur = (start >> ord) & ~1U;
-		buddy = mb_find_buddy(e4b, ord, &max);
-		mb_clear_bit(cur, buddy);
-		mb_clear_bit(cur + 1, buddy);
-		e4b->bd_info->bb_counters[ord]++;
-		e4b->bd_info->bb_counters[ord]++;
-		split = true;
+		ord_start = (start >> ord) << ord;
+		ord_end = ord_start + (1 << ord);
+		if (start > ord_start)
+			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
+						 ord_start, start - ord_start,
+						 e4b->bd_info);
+
+		if (start + len < ord_end) {
+			ext4_mb_mark_free_simple(e4b->bd_sb, e4b->bd_buddy,
+						 start + len,
+						 ord_end - (start + len),
+						 e4b->bd_info);
+			break;
+		}
+
+		len = start + len - ord_end;
+		start = ord_end;
 	}
 	mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
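
The range arithmetic added by the last hunk can be tried out in isolation.
The sketch below is a userspace model, assuming a chunk of order ord that
contains start has just been marked in use as a whole; it computes the head
and tail ranges that would then be handed back as free, plus the length that
spills into the next chunk. The names leftover_in_chunk and struct range are
hypothetical; only ord_start and ord_end mirror the patch.

```c
#include <assert.h>

struct range {
	int start;
	int len;     /* 0 means "nothing to free" */
};

/*
 * For an allocation [start, start + len) that begins inside the chunk of
 * order 'ord', compute:
 *   head: part of the chunk before 'start'
 *   tail: part of the chunk after the allocation ends
 * Returns the number of blocks still to allocate past the chunk end.
 */
static int leftover_in_chunk(int start, int len, int ord,
			     struct range *head, struct range *tail)
{
	int ord_start = (start >> ord) << ord;
	int ord_end = ord_start + (1 << ord);
	int end = start + len;

	assert(start >= 0 && len > 0 && ord >= 0);

	head->start = ord_start;
	head->len = start - ord_start;

	if (end < ord_end) {
		/* allocation ends inside this chunk: free the rest of it */
		tail->start = end;
		tail->len = ord_end - end;
		return 0;
	}

	/* allocation reaches or crosses the chunk end: no tail to free */
	tail->start = ord_end;
	tail->len = 0;
	return end - ord_end;
}
```

For start = 0, len = 3, ord = 2 this gives an empty head and a tail of one
block at offset 3, matching the changelog example.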