
[2/4] ext4: skip non-loaded groups at cr=0/1 when scanning for good groups

Message ID 20200731190805.181253-3-tytso@mit.edu
State Awaiting Upstream
Series V2 - ext4 block bitmap prefetch patches

Commit Message

Theodore Ts'o July 31, 2020, 7:08 p.m. UTC
From: Alex Zhuravlev <azhuravlev@whamcloud.com>

cr=0 is supposed to be an optimization to save CPU cycles, but if the
in-memory buddy data is not yet initialized, the optimization makes no
sense: we have to do synchronous IO, which costs far more cycles than
the scan saves.  Also, at cr=0 mballoc doesn't pick just any available
chunk, and cr=1 skips groups using a heuristic based on the average
fragment size.  It is more useful to skip such uninitialized groups
and fall through to cr=2, where groups are actually scanned for
available chunks.  However, we always read the first block group in a
flex_bg, so that metadata blocks can be allocated in the first block
group of the flex_bg if possible.

A 120TB filesystem was simulated using a sparse image on a dm-slow
virtual device; the image was then formatted and filled using debugfs
to mark ~85% of the available space as busy.  Without the patch the
mount process couldn't complete in half an hour (according to vmstat
it would have taken ~10-11 hours); with the patch applied the mount
took ~20 seconds.

Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 fs/ext4/mballoc.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
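
The condition added below combines three tests: the scan is a cheap
cr=0/1 pass, the group is not the first group of its flex_bg, and the
group is not flagged EXT4_BG_BLOCK_UNINIT (such a group's buddy data
can be built without reading the on-disk bitmap, so there is no
expensive IO to avoid).  The following sketch, using simplified
stand-in names and types rather than the kernel's structures, shows
the shape of that decision:

/*
 * Standalone illustration (not kernel code) of the skip decision this
 * patch adds to ext4_mb_good_group_nolock().  All names and types here
 * are simplified stand-ins for the kernel's structures.
 */
#include <stdbool.h>
#include <stdio.h>

struct group_state {
	bool buddy_initialized;	/* in-memory buddy data already loaded? */
	bool block_uninit;	/* EXT4_BG_BLOCK_UNINIT set in the descriptor */
};

/* Return true if a cheap cr=0/1 pass should skip this group. */
static bool skip_unloaded_group(int cr, unsigned int group,
				unsigned int log_groups_per_flex,
				const struct group_state *gs)
{
	bool first_in_flex_bg;

	if (gs->buddy_initialized)
		return false;		/* data is ready, nothing to avoid */
	if (cr >= 2)
		return false;		/* cr=2/3 must consider every group */

	/* The first group of each flex_bg holds metadata: never skip it. */
	first_in_flex_bg = log_groups_per_flex &&
		(group & ((1U << log_groups_per_flex) - 1)) == 0;
	if (first_in_flex_bg)
		return false;

	/* BLOCK_UNINIT groups need no disk read to build buddy data. */
	if (gs->block_uninit)
		return false;

	return true;			/* would require sync IO: skip */
}

int main(void)
{
	struct group_state gs = {
		.buddy_initialized = false,
		.block_uninit = false,
	};

	/* flex_bg of 16 groups (log_groups_per_flex = 4), cr=0 pass */
	printf("skip group 17 at cr=0: %d\n",
	       skip_unloaded_group(0, 17, 4, &gs));
	/* group 16 is the first group of its flex_bg, never skipped */
	printf("skip group 16 at cr=0: %d\n",
	       skip_unloaded_group(0, 16, 4, &gs));
	return 0;
}

Compiled and run, this prints 1 for group 17 (skipped) and 0 for
group 16 (the first group of its flex_bg), matching the condition the
patch adds below.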

Comments

Andreas Dilger Aug. 1, 2020, 8:49 a.m. UTC | #1
> On Jul 31, 2020, at 1:08 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> From: Alex Zhuravlev <azhuravlev@whamcloud.com>
> 
> cr=0 is supposed to be an optimization to save CPU cycles, but if
> buddy data (in memory) is not initialized then all this makes no sense
> as we have to do sync IO taking a lot of cycles.  Also, at cr=0
> mballoc doesn't choose any avaibale chunk.  cr=1 also skips groups

(typo) "available"

> using heuristic based on avg. fragment size.  It's more useful to skip
> such groups and switch to cr=2 where groups will be scanned for
> available chunks.  However, we always read the first block group in a
> flex_bg so metadata blocks will get read into the first flex_bg if
> possible.
> 
> Using sparse image and dm-slow virtual device of 120TB was
> simulated, then the image was formatted and filled using debugfs to
> mark ~85% of available space as busy.  mount process w/o the patch
> couldn't complete in half an hour (according to vmstat it would take
> ~10-11 hours).  With the patch applied mount took ~20 seconds.
> 
> Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
> Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Looks good, one trivial cleanup below could also be done before landing.

> ---
> fs/ext4/mballoc.c | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index fcc702f1ff15..b1ef35a9e9f1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2195,7 +2196,26 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
> 
> 	/* We only do this if the grp has never been initialized */
> 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> -		ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> +		struct ext4_group_desc *gdp =
> +			ext4_get_group_desc(sb, group, NULL);
> +		int ret;
> +
> +		/* cr=0/1 is a very optimistic search to find large
> +		 * good chunks almost for free.  If buddy data is not
> +		 * ready, then this optimization makes no sense.  But
> +		 * we never skip the first block group in a flex_bg,
> +		 * since this gets used for metadata block allocation,
> +		 * and we want to make sure we locate metadata blocks
> +		 * in the first block group in the flex_bg if
> +		 * possible.

(style) "possible" could fit on the previous line?

> +		 */
> +		if (cr < 2 &&
> +		    (!sbi->s_log_groups_per_flex ||
> +		     ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
> +		    !(ext4_has_group_desc_csum(sb) &&
> +		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))))
> +			return 0;
> +		ret = ext4_mb_init_group(sb, group, GFP_NOFS);
> 		if (ret)
> 			return ret;
> 	}
> --
> 2.24.1
> 


Cheers, Andreas

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index fcc702f1ff15..b1ef35a9e9f1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2177,6 +2177,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 {
 	struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;
 	ext4_grpblk_t free;
 	int ret = 0;
@@ -2195,7 +2196,26 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
 
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-		ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+		struct ext4_group_desc *gdp =
+			ext4_get_group_desc(sb, group, NULL);
+		int ret;
+
+		/* cr=0/1 is a very optimistic search to find large
+		 * good chunks almost for free.  If buddy data is not
+		 * ready, then this optimization makes no sense.  But
+		 * we never skip the first block group in a flex_bg,
+		 * since this gets used for metadata block allocation,
+		 * and we want to make sure we locate metadata blocks
+		 * in the first block group in the flex_bg if
+		 * possible.
+		 */
+		if (cr < 2 &&
+		    (!sbi->s_log_groups_per_flex ||
+		     ((group & ((1 << sbi->s_log_groups_per_flex) - 1)) != 0)) &&
+		    !(ext4_has_group_desc_csum(sb) &&
+		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))))
+			return 0;
+		ret = ext4_mb_init_group(sb, group, GFP_NOFS);
 		if (ret)
 			return ret;
 	}