diff mbox series

[1/2] ext4: optimize metadata allocation for hybrid LUNs

Message ID OS3P286MB056789DF4EBAA7363A4346B5AF06A@OS3P286MB0567.JPNP286.PROD.OUTLOOK.COM
State New
Headers show
Series [1/2] ext4: optimize metadata allocation for hybrid LUNs

Commit Message

Bobi Jam July 27, 2023, 11:45 p.m. UTC
With LVM it is possible to create an LV with SSD storage at the
beginning of the LV and HDD storage at the end of the LV, and use that
to separate ext4 metadata allocations (that need small random IOs)
from data allocations (that are better suited for large sequential
IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
the filesystem capacity would need to be high-IOPS storage in order to
hold all of the internal metadata.

This would improve performance for inode and other metadata access,
such as ls, find, e2fsck, and in general improve file access latency,
modification, truncate, unlink, transaction commit, etc.

This patch splits the largest free order group lists and average
fragment size lists into two additional lists for IOPS/fast storage
groups, and performs cr 0 / cr 1 group scanning for metadata block
allocation in the following order:

cr 0 on largest free order IOPS group list
cr 1 on average fragment size IOPS group list
cr 0 on largest free order non-IOPS group list
cr 1 on average fragment size non-IOPS group list
cr >= 2 perform the linear search as before

Non-metadata block allocation does not allocate from the IOPS groups.
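
The fallback from the IOPS lists (steps 1-2) to the non-IOPS lists
(steps 3-4) happens when no suitable IOPS group is found, by clearing
the metadata hint and restarting at cr 0, i.e. roughly:

  /* in ext4_mb_choose_next_group_cr1(), when no IOPS group fits */
  ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
  *new_cr = 0;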

Add for mke2fs an option to mark which blocks are in the IOPS region
of storage at format time:

  -E iops=0-1024G,4096-8192G

so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
group descriptors to decide in which groups to allocate dynamic filesystem
metadata.

Signed-off-by: Bobi Jam <bobijam@hotmail.com>
---
 fs/ext4/balloc.c  |   2 +-
 fs/ext4/ext4.h    |  12 +++++
 fs/ext4/mballoc.c | 154 ++++++++++++++++++++++++++++++++++++++++++------------
 3 files changed, 134 insertions(+), 34 deletions(-)

Comments

Andreas Dilger Aug. 2, 2023, 11:16 p.m. UTC | #1
On Jul 27, 2023, at 5:45 PM, Bobi Jam <bobijam@hotmail.com> wrote:
> 
> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
> 
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
> 
> This patch split largest free order group lists and average fragment
> size lists into other two lists for IOPS/fast storage groups, and
> cr 0 / cr 1 group scanning for metadata block allocation in following
> order:
> 
> cr 0 on largest free order IOPS group list
> cr 1 on average fragment size IOPS group list
> cr 0 on largest free order non-IOPS group list
> cr 1 on average fragment size non-IOPS group list
> cr >= 2 perform the linear search as before
> 
> Non-metadata block allocation does not allocate from the IOPS groups.
> 
> Add for mke2fs an option to mark which blocks are in the IOPS region
> of storage at format time:
> 
>  -E iops=0-1024G,4096-8192G
> 
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide which groups to allocate dynamic filesystem
> metadata.

Ted, Ritesh, Ojaswin,
I would appreciate your review and comments on these two patches.

They are a followup to the discussion about hybrid LVM devices a few
weeks ago in https://www.spinics.net/lists/linux-ext4/msg90237.html
"packed_meta_blocks=1 incompatible with resize2fs?".

The 2/2 patch adds an option to mke2fs to mark groups on the SSD/NVMe
storage with the "EXT4_BG_IOPS" flag.  Together with mke2fs using the
existing sparse_super2, flex_bg, and packed_meta_blocks, this would
allocate all static metadata to the start of the device.  The 1/2 patch
changes mballoc to keep two separate allocation lists/trees based on
the IOPS flag on each group.
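
For example, the format step might look something like this (purely
illustrative; the "iops=" extended option is the one proposed in the
2/2 patch, and the device name is made up):

  mke2fs -t ext4 -O sparse_super2,flex_bg \
         -E packed_meta_blocks=1,iops=0-1024G /dev/mapper/hybrid-lv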

This has the dual benefit that all filesystem metadata (typically 4KiB
fragmented allocation/read/write) is on fast storage, and these blocks
also do not interfere with (usually larger) data allocation/read/write
(both in terms of allocation fragmentation and contending IOPS/seeks).
This should help both normal IO usage, as well as e2fsck significantly
(virtually all e2fsck IO would go to the SSD/NVMe storage).


The implementation is relatively simple as you can see.  Currently it
has a flag in the superblock to indicate that IOPS groups are available
during block allocation, but even that is not strictly needed (it could
be detected at GDT reading time).  Metadata allocations prefer to use
the IOPS groups (if available), otherwise fall back to regular groups.

For our usage, this is a "soft" feature that does not affect compatibility.
It would be mostly harmless if the filesystem was mounted with an older
kernel.  At worst there would be some performance loss that would
disappear again over time, but I think this would happen rarely.


This doesn't *directly* address the filesystem resize that Roberto was
asking about, but having IOPS groups used only for metadata would make
it easier to resize later (if only adding HDD capacity).  Alternately,
because the individual groups are marked with the IOPS flag, adding a
second (third, fourth) flash region at the end of the current filesystem
to hold the new bitmaps and inode tables would be relatively
straightforward on top of this.  There might be some work needed for
mke2fs to honor the "resize" option with packed_meta_blocks, but maybe
not much more.

We basically never resize filesystems, so this is not of any interest to
implement at this point.

Cheers, Andreas

> Signed-off-by: Bobi Jam <bobijam@hotmail.com>
> ---
> fs/ext4/balloc.c  |   2 +-
> fs/ext4/ext4.h    |  12 +++++
> fs/ext4/mballoc.c | 154 ++++++++++++++++++++++++++++++++++++++++++------------
> 3 files changed, 134 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index c1edde8..7b1b3ec 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
> 	ar.inode = inode;
> 	ar.goal = goal;
> 	ar.len = count ? *count : 1;
> -	ar.flags = flags;
> +	ar.flags = flags | EXT4_MB_HINT_METADATA;
> 
> 	ret = ext4_mb_new_blocks(handle, &ar, errp);
> 	if (count)
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8104a21..3444b6e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -382,6 +382,7 @@ struct flex_groups {
> #define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
> #define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
> #define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
> +#define EXT4_BG_IOPS		0x0010 /* In IOPS/fast storage */
> 
> /*
>  * Macro-instructions used to manage group descriptors
> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
> #define EXT2_FLAGS_UNSIGNED_HASH	0x0002  /* Unsigned dirhash in use */
> #define EXT2_FLAGS_TEST_FILESYS		0x0004	/* to test development code */
> 
> +#define EXT2_FLAGS_HAS_IOPS		0x0080	/* has IOPS storage */
> +
> /*
>  * Mount flags set via mount options or defaults
>  */
> @@ -1514,8 +1517,12 @@ struct ext4_sb_info {
> 	atomic_t s_retry_alloc_pending;
> 	struct list_head *s_mb_avg_fragment_size;
> 	rwlock_t *s_mb_avg_fragment_size_locks;
> +	struct list_head *s_avg_fragment_size_list_iops;  /* avg_fragment_size for IOPS groups */
> +	rwlock_t *s_avg_fragment_size_locks_iops;
> 	struct list_head *s_mb_largest_free_orders;
> 	rwlock_t *s_mb_largest_free_orders_locks;
> +	struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
> +	rwlock_t *s_largest_free_orders_locks_iops;
> 
> 	/* tunables */
> 	unsigned long s_stripe;
> @@ -3366,6 +3373,7 @@ struct ext4_group_info {
> #define EXT4_GROUP_INFO_IBITMAP_CORRUPT		\
> 	(1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
> #define EXT4_GROUP_INFO_BBITMAP_READ_BIT	4
> +#define EXT4_GROUP_INFO_IOPS_BIT		5
> 
> #define EXT4_MB_GRP_NEED_INIT(grp)	\
> 	(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> @@ -3382,6 +3390,10 @@ struct ext4_group_info {
> 	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
> 	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_TEST_IOPS(grp)	\
> +	(test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_SET_IOPS(grp)	\
> +	(set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
> 
> #define EXT4_MAX_CONTENTION		8
> #define EXT4_CONTENTION_THRESHOLD	2
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 20f67a2..6d218af 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -828,6 +828,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
> mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
> {
> 	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	rwlock_t *afs_locks;
> +	struct list_head *afs_list;
> 	int new_order;
> 
> 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> @@ -838,20 +840,23 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
> 	if (new_order == grp->bb_avg_fragment_size_order)
> 		return;
> 
> +	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
> +		afs_locks = sbi->s_avg_fragment_size_locks_iops;
> +		afs_list = sbi->s_avg_fragment_size_list_iops;
> +	} else {
> +		afs_locks = sbi->s_mb_avg_fragment_size_locks;
> +		afs_list = sbi->s_mb_avg_fragment_size;
> +	}
> +
> 	if (grp->bb_avg_fragment_size_order != -1) {
> -		write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +		write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
> 		list_del(&grp->bb_avg_fragment_size_node);
> -		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +		write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
> 	}
> 	grp->bb_avg_fragment_size_order = new_order;
> -	write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> -	list_add_tail(&grp->bb_avg_fragment_size_node,
> -		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
> -	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +	write_lock(&afs_locks[new_order]);
> +	list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
> +	write_unlock(&afs_locks[new_order]);
> }
> 
> /*
> @@ -863,6 +868,10 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> {
> 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> 	struct ext4_group_info *iter, *grp;
> +	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
> +		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
> +	rwlock_t *lfo_locks;
> +	struct list_head *lfo_list;
> 	int i;
> 
> 	if (ac->ac_status == AC_STATUS_FOUND)
> @@ -871,17 +880,25 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> 	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
> 		atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
> 
> +	if (iops) {
> +		lfo_locks = sbi->s_largest_free_orders_locks_iops;
> +		lfo_list = sbi->s_largest_free_orders_list_iops;
> +	} else {
> +		lfo_locks = sbi->s_mb_largest_free_orders_locks;
> +		lfo_list = sbi->s_mb_largest_free_orders;
> +	}
> +
> 	grp = NULL;
> 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> -		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> +		if (list_empty(&lfo_list[i]))
> 			continue;
> -		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> -		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> -			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		read_lock(&lfo_locks[i]);
> +		if (list_empty(&lfo_list[i])) {
> +			read_unlock(&lfo_locks[i]);
> 			continue;
> 		}
> 		grp = NULL;
> -		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> +		list_for_each_entry(iter, &lfo_list[i],
> 				    bb_largest_free_order_node) {
> 			if (sbi->s_mb_stats)
> 				atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> @@ -890,7 +907,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> 				break;
> 			}
> 		}
> -		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		read_unlock(&lfo_locks[i]);
> 		if (grp)
> 			break;
> 	}
> @@ -913,6 +930,10 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> {
> 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> 	struct ext4_group_info *grp = NULL, *iter;
> +	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
> +		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
> +	rwlock_t *afs_locks;
> +	struct list_head *afs_list;
> 	int i;
> 
> 	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> @@ -920,16 +941,24 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> 			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
> 	}
> 
> +	if (iops) {
> +		afs_locks = sbi->s_avg_fragment_size_locks_iops;
> +		afs_list = sbi->s_avg_fragment_size_list_iops;
> +	} else {
> +		afs_locks = sbi->s_mb_avg_fragment_size_locks;
> +		afs_list = sbi->s_mb_avg_fragment_size;
> +	}
> +
> 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
> 	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> -		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
> +		if (list_empty(&afs_list[i]))
> 			continue;
> -		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
> -		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
> -			read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> +		read_lock(&afs_locks[i]);
> +		if (list_empty(&afs_list[i])) {
> +			read_unlock(&afs_locks[i]);
> 			continue;
> 		}
> -		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
> +		list_for_each_entry(iter, &afs_list[i],
> 				    bb_avg_fragment_size_node) {
> 			if (sbi->s_mb_stats)
> 				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> @@ -938,7 +967,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> 				break;
> 			}
> 		}
> -		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> +		read_unlock(&afs_locks[i]);
> 		if (grp)
> 			break;
> 	}
> @@ -947,7 +976,15 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
> 		*group = grp->bb_group;
> 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
> 	} else {
> -		*new_cr = 2;
> +		if (iops) {
> +			/* cannot find proper group in IOPS storage,
> +			 * fall back to cr0 for non-IOPS groups.
> +			 */
> +			ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
> +			*new_cr = 0;
> +		} else {
> +			*new_cr = 2;
> +		}
> 	}
> }
> 
> @@ -1030,6 +1067,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> 	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	rwlock_t *lfo_locks;
> +	struct list_head *lfo_list;
> 	int i;
> 
> 	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
> @@ -1042,21 +1081,24 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> 		return;
> 	}
> 
> +	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
> +		lfo_locks = sbi->s_largest_free_orders_locks_iops;
> +		lfo_list = sbi->s_largest_free_orders_list_iops;
> +	} else {
> +		lfo_locks = sbi->s_mb_largest_free_orders_locks;
> +		lfo_list = sbi->s_mb_largest_free_orders;
> +	}
> +
> 	if (grp->bb_largest_free_order >= 0) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_lock(&lfo_locks[grp->bb_largest_free_order]);
> 		list_del_init(&grp->bb_largest_free_order_node);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_unlock(&lfo_locks[grp->bb_largest_free_order]);
> 	}
> 	grp->bb_largest_free_order = i;
> 	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> -		list_add_tail(&grp->bb_largest_free_order_node,
> -		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_lock(&lfo_locks[i]);
> +		list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
> +		write_unlock(&lfo_locks[i]);
> 	}
> }
> 
> @@ -3150,6 +3192,8 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> 	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> 	init_rwsem(&meta_group_info[i]->alloc_sem);
> 	meta_group_info[i]->bb_free_root = RB_ROOT;
> +	if (desc->bg_flags & EXT4_BG_IOPS)
> +		EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
> 	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
> 	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
> 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
> @@ -3423,6 +3467,24 @@ int ext4_mb_init(struct super_block *sb)
> 		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
> 		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
> 	}
> +	sbi->s_avg_fragment_size_list_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> +			      GFP_KERNEL);
> +	if (!sbi->s_avg_fragment_size_list_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sbi->s_avg_fragment_size_locks_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> +			      GFP_KERNEL);
> +	if (!sbi->s_avg_fragment_size_locks_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> +		INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
> +		rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
> +	}
> 	sbi->s_mb_largest_free_orders =
> 		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> 			GFP_KERNEL);
> @@ -3441,6 +3503,24 @@ int ext4_mb_init(struct super_block *sb)
> 		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
> 		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
> 	}
> +	sbi->s_largest_free_orders_list_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> +			      GFP_KERNEL);
> +	if (!sbi->s_largest_free_orders_list_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sbi->s_largest_free_orders_locks_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> +			      GFP_KERNEL);
> +	if (!sbi->s_largest_free_orders_locks_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> +		INIT_LIST_HEAD(&sbi->s_largest_free_orders_list_iops[i]);
> +		rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
> +	}
> 
> 	spin_lock_init(&sbi->s_md_lock);
> 	sbi->s_mb_free_pending = 0;
> @@ -3512,8 +3592,12 @@ int ext4_mb_init(struct super_block *sb)
> out:
> 	kfree(sbi->s_mb_avg_fragment_size);
> 	kfree(sbi->s_mb_avg_fragment_size_locks);
> +	kfree(sbi->s_avg_fragment_size_list_iops);
> +	kfree(sbi->s_avg_fragment_size_locks_iops);
> 	kfree(sbi->s_mb_largest_free_orders);
> 	kfree(sbi->s_mb_largest_free_orders_locks);
> +	kfree(sbi->s_largest_free_orders_list_iops);
> +	kfree(sbi->s_largest_free_orders_locks_iops);
> 	kfree(sbi->s_mb_offsets);
> 	sbi->s_mb_offsets = NULL;
> 	kfree(sbi->s_mb_maxs);
> @@ -3582,8 +3666,12 @@ int ext4_mb_release(struct super_block *sb)
> 	}
> 	kfree(sbi->s_mb_avg_fragment_size);
> 	kfree(sbi->s_mb_avg_fragment_size_locks);
> +	kfree(sbi->s_avg_fragment_size_list_iops);
> +	kfree(sbi->s_avg_fragment_size_locks_iops);
> 	kfree(sbi->s_mb_largest_free_orders);
> 	kfree(sbi->s_mb_largest_free_orders_locks);
> +	kfree(sbi->s_largest_free_orders_list_iops);
> +	kfree(sbi->s_largest_free_orders_locks_iops);
> 	kfree(sbi->s_mb_offsets);
> 	kfree(sbi->s_mb_maxs);
> 	iput(sbi->s_buddy_cache);
> --
> 1.8.3.1
> 


Cheers, Andreas
Ritesh Harjani (IBM) Aug. 3, 2023, 12:10 p.m. UTC | #2
Bobi Jam <bobijam@hotmail.com> writes:

> With LVM it is possible to create an LV with SSD storage at the
> beginning of the LV and HDD storage at the end of the LV, and use that
> to separate ext4 metadata allocations (that need small random IOs)
> from data allocations (that are better suited for large sequential
> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
> the filesystem capacity would need to be high-IOPS storage in order to
> hold all of the internal metadata.
>
> This would improve performance for inode and other metadata access,
> such as ls, find, e2fsck, and in general improve file access latency,
> modification, truncate, unlink, transaction commit, etc.
>
> This patch split largest free order group lists and average fragment
> size lists into other two lists for IOPS/fast storage groups, and
> cr 0 / cr 1 group scanning for metadata block allocation in following
> order:
>
> cr 0 on largest free order IOPS group list
> cr 1 on average fragment size IOPS group list
> cr 0 on largest free order non-IOPS group list
> cr 1 on average fragment size non-IOPS group list
> cr >= 2 perform the linear search as before

Yes. The implementation looks straightforward to me.


>
> Non-metadata block allocation does not allocate from the IOPS groups.
>
> Add for mke2fs an option to mark which blocks are in the IOPS region
> of storage at format time:
>
>   -E iops=0-1024G,4096-8192G

However, a few things to discuss here are:

1. What happens when the hdd space for data gets fully exhausted? AFAICS, the
allocation for data blocks will still succeed, however we won't be able
to make use of optimized scanning any more. Because we search within
iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.

2. Similarly, what happens when the ssd space for metadata gets full?
In this case we keep falling back to cr2 for allocation and we don't
utilize optimize_scanning to find the block groups from hdd space to
allocate from.

3. So it seems after a period of time, these iops lists can have block
groups belonging to different ssds. Could this cause the metadata
allocation of related inodes to come from different ssds?
Will this be optimal? Checking on this...
     ...On checking further on this, we start with a goal group and we
at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
we frequently allocate metadata blocks from different SSDs.

4. Ok looking into this, do we even require the iops lists for metadata
allocations? Do we allocate more than 1 block for metadata? If not then
maintaining these iops lists for metadata allocation isn't really
helpful. On the other hand it does make sense to maintain it when we
allow data allocations from these ssds when the hdds get full.

5. Did we run any benchmarks with this yet? What kind of gains are we
looking for? Do we have any numbers for this?

6. I couldn't help but start to think of...
Should there also be a provision from the user to pass hot/cold data
types which we can use as a hint within the filesystem to allocate from
ssd v/s hdd? Does it even make sense to think in this direction?

-ritesh


>
> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> group descriptors to decide which groups to allocate dynamic filesystem
> metadata.
>
> Signed-off-by: Bobi Jam <bobijam@hotmail.com>
> ---
>  fs/ext4/balloc.c  |   2 +-
>  fs/ext4/ext4.h    |  12 +++++
>  fs/ext4/mballoc.c | 154 ++++++++++++++++++++++++++++++++++++++++++------------
>  3 files changed, 134 insertions(+), 34 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index c1edde8..7b1b3ec 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -739,7 +739,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
>  	ar.inode = inode;
>  	ar.goal = goal;
>  	ar.len = count ? *count : 1;
> -	ar.flags = flags;
> +	ar.flags = flags | EXT4_MB_HINT_METADATA;
>  
>  	ret = ext4_mb_new_blocks(handle, &ar, errp);
>  	if (count)
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8104a21..3444b6e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -382,6 +382,7 @@ struct flex_groups {
>  #define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
>  #define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
>  #define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
> +#define EXT4_BG_IOPS		0x0010 /* In IOPS/fast storage */
>  
>  /*
>   * Macro-instructions used to manage group descriptors
> @@ -1112,6 +1113,8 @@ struct ext4_inode_info {
>  #define EXT2_FLAGS_UNSIGNED_HASH	0x0002  /* Unsigned dirhash in use */
>  #define EXT2_FLAGS_TEST_FILESYS		0x0004	/* to test development code */
>  
> +#define EXT2_FLAGS_HAS_IOPS		0x0080	/* has IOPS storage */
> +
>  /*
>   * Mount flags set via mount options or defaults
>   */
> @@ -1514,8 +1517,12 @@ struct ext4_sb_info {
>  	atomic_t s_retry_alloc_pending;
>  	struct list_head *s_mb_avg_fragment_size;
>  	rwlock_t *s_mb_avg_fragment_size_locks;
> +	struct list_head *s_avg_fragment_size_list_iops;  /* avg_frament_size for IOPS groups */
> +	rwlock_t *s_avg_fragment_size_locks_iops;
>  	struct list_head *s_mb_largest_free_orders;
>  	rwlock_t *s_mb_largest_free_orders_locks;
> +	struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
> +	rwlock_t *s_largest_free_orders_locks_iops;
>  
>  	/* tunables */
>  	unsigned long s_stripe;
> @@ -3366,6 +3373,7 @@ struct ext4_group_info {
>  #define EXT4_GROUP_INFO_IBITMAP_CORRUPT		\
>  	(1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
>  #define EXT4_GROUP_INFO_BBITMAP_READ_BIT	4
> +#define EXT4_GROUP_INFO_IOPS_BIT		5
>  
>  #define EXT4_MB_GRP_NEED_INIT(grp)	\
>  	(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> @@ -3382,6 +3390,10 @@ struct ext4_group_info {
>  	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
>  #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
>  	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_TEST_IOPS(grp)	\
> +	(test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_SET_IOPS(grp)	\
> +	(set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
>  
>  #define EXT4_MAX_CONTENTION		8
>  #define EXT4_CONTENTION_THRESHOLD	2
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 20f67a2..6d218af 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -828,6 +828,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>  mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	rwlock_t *afs_locks;
> +	struct list_head *afs_list;
>  	int new_order;
>  
>  	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
> @@ -838,20 +840,23 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>  	if (new_order == grp->bb_avg_fragment_size_order)
>  		return;
>  
> +	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
> +		afs_locks = sbi->s_avg_fragment_size_locks_iops;
> +		afs_list = sbi->s_avg_fragment_size_list_iops;
> +	} else {
> +		afs_locks = sbi->s_mb_avg_fragment_size_locks;
> +		afs_list = sbi->s_mb_avg_fragment_size;
> +	}
> +
>  	if (grp->bb_avg_fragment_size_order != -1) {
> -		write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +		write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
>  		list_del(&grp->bb_avg_fragment_size_node);
> -		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +		write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
>  	}
>  	grp->bb_avg_fragment_size_order = new_order;
> -	write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> -	list_add_tail(&grp->bb_avg_fragment_size_node,
> -		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
> -	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +	write_lock(&afs_locks[new_order]);
> +	list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
> +	write_unlock(&afs_locks[new_order]);
>  }
>  
>  /*
> @@ -863,6 +868,10 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>  	struct ext4_group_info *iter, *grp;
> +	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
> +		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
> +	rwlock_t *lfo_locks;
> +	struct list_head *lfo_list;
>  	int i;
>  
>  	if (ac->ac_status == AC_STATUS_FOUND)
> @@ -871,17 +880,25 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>  	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
>  		atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
>  
> +	if (iops) {
> +		lfo_locks = sbi->s_largest_free_orders_locks_iops;
> +		lfo_list = sbi->s_largest_free_orders_list_iops;
> +	} else {
> +		lfo_locks = sbi->s_mb_largest_free_orders_locks;
> +		lfo_list = sbi->s_mb_largest_free_orders;
> +	}
> +
>  	grp = NULL;
>  	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> -		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> +		if (list_empty(&lfo_list[i]))
>  			continue;
> -		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> -		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> -			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		read_lock(&lfo_locks[i]);
> +		if (list_empty(&lfo_list[i])) {
> +			read_unlock(&lfo_locks[i]);
>  			continue;
>  		}
>  		grp = NULL;
> -		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> +		list_for_each_entry(iter, &lfo_list[i],
>  				    bb_largest_free_order_node) {
>  			if (sbi->s_mb_stats)
>  				atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
> @@ -890,7 +907,7 @@ static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>  				break;
>  			}
>  		}
> -		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		read_unlock(&lfo_locks[i]);
>  		if (grp)
>  			break;
>  	}
> @@ -913,6 +930,10 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>  	struct ext4_group_info *grp = NULL, *iter;
> +	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
> +		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
> +	rwlock_t *afs_locks;
> +	struct list_head *afs_list;
>  	int i;
>  
>  	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
> @@ -920,16 +941,24 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>  			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
>  	}
>  
> +	if (iops) {
> +		afs_locks = sbi->s_avg_fragment_size_locks_iops;
> +		afs_list = sbi->s_avg_fragment_size_list_iops;
> +	} else {
> +		afs_locks = sbi->s_mb_avg_fragment_size_locks;
> +		afs_list = sbi->s_mb_avg_fragment_size;
> +	}
> +
>  	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
>  	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> -		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
> +		if (list_empty(&afs_list[i]))
>  			continue;
> -		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
> -		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
> -			read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> +		read_lock(&afs_locks[i]);
> +		if (list_empty(&afs_list[i])) {
> +			read_unlock(&afs_locks[i]);
>  			continue;
>  		}
> -		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
> +		list_for_each_entry(iter, &afs_list[i],
>  				    bb_avg_fragment_size_node) {
>  			if (sbi->s_mb_stats)
>  				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
> @@ -938,7 +967,7 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>  				break;
>  			}
>  		}
> -		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
> +		read_unlock(&afs_locks[i]);
>  		if (grp)
>  			break;
>  	}
> @@ -947,7 +976,15 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>  		*group = grp->bb_group;
>  		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
>  	} else {
> -		*new_cr = 2;
> +		if (iops) {
> +			/* cannot find proper group in IOPS storage,
> +			 * fall back to cr0 for non-IOPS groups.
> +			 */
> +			ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
> +			*new_cr = 0;
> +		} else {
> +			*new_cr = 2;
> +		}
>  	}
>  }
>  
> @@ -1030,6 +1067,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>  mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	rwlock_t *lfo_locks;
> +	struct list_head *lfo_list;
>  	int i;
>  
>  	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
> @@ -1042,21 +1081,24 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>  		return;
>  	}
>  
> +	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
> +		lfo_locks = sbi->s_largest_free_orders_locks_iops;
> +		lfo_list = sbi->s_largest_free_orders_list_iops;
> +	} else {
> +		lfo_locks = sbi->s_mb_largest_free_orders_locks;
> +		lfo_list = sbi->s_mb_largest_free_orders;
> +	}
> +
>  	if (grp->bb_largest_free_order >= 0) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_lock(&lfo_locks[grp->bb_largest_free_order]);
>  		list_del_init(&grp->bb_largest_free_order_node);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_unlock(&lfo_locks[grp->bb_largest_free_order]);
>  	}
>  	grp->bb_largest_free_order = i;
>  	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> -		list_add_tail(&grp->bb_largest_free_order_node,
> -		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_lock(&lfo_locks[i]);
> +		list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
> +		write_unlock(&lfo_locks[i]);
>  	}
>  }
>  
> @@ -3150,6 +3192,8 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>  	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
>  	init_rwsem(&meta_group_info[i]->alloc_sem);
>  	meta_group_info[i]->bb_free_root = RB_ROOT;
> +	if (desc->bg_flags & EXT4_BG_IOPS)
> +		EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
>  	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
>  	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
>  	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
> @@ -3423,6 +3467,24 @@ int ext4_mb_init(struct super_block *sb)
>  		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
>  		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
>  	}
> +	sbi->s_avg_fragment_size_list_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> +			      GFP_KERNEL);
> +	if (!sbi->s_avg_fragment_size_list_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sbi->s_avg_fragment_size_locks_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> +			      GFP_KERNEL);
> +	if (!sbi->s_avg_fragment_size_locks_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> +		INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
> +		rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
> +	}
>  	sbi->s_mb_largest_free_orders =
>  		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
>  			GFP_KERNEL);
> @@ -3441,6 +3503,24 @@ int ext4_mb_init(struct super_block *sb)
>  		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
>  		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
>  	}
> +	sbi->s_largest_free_orders_list_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
> +			      GFP_KERNEL);
> +	if (!sbi->s_largest_free_orders_list_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	sbi->s_largest_free_orders_locks_iops =
> +		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
> +			      GFP_KERNEL);
> +	if (!sbi->s_largest_free_orders_locks_iops) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
> +		INIT_LIST_HEAD(&sbi->s_largest_free_orders_list_iops[i]);
> +		rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
> +	}
>  
>  	spin_lock_init(&sbi->s_md_lock);
>  	sbi->s_mb_free_pending = 0;
> @@ -3512,8 +3592,12 @@ int ext4_mb_init(struct super_block *sb)
>  out:
>  	kfree(sbi->s_mb_avg_fragment_size);
>  	kfree(sbi->s_mb_avg_fragment_size_locks);
> +	kfree(sbi->s_avg_fragment_size_list_iops);
> +	kfree(sbi->s_avg_fragment_size_locks_iops);
>  	kfree(sbi->s_mb_largest_free_orders);
>  	kfree(sbi->s_mb_largest_free_orders_locks);
> +	kfree(sbi->s_largest_free_orders_list_iops);
> +	kfree(sbi->s_largest_free_orders_locks_iops);
>  	kfree(sbi->s_mb_offsets);
>  	sbi->s_mb_offsets = NULL;
>  	kfree(sbi->s_mb_maxs);
> @@ -3582,8 +3666,12 @@ int ext4_mb_release(struct super_block *sb)
>  	}
>  	kfree(sbi->s_mb_avg_fragment_size);
>  	kfree(sbi->s_mb_avg_fragment_size_locks);
> +	kfree(sbi->s_avg_fragment_size_list_iops);
> +	kfree(sbi->s_avg_fragment_size_locks_iops);
>  	kfree(sbi->s_mb_largest_free_orders);
>  	kfree(sbi->s_mb_largest_free_orders_locks);
> +	kfree(sbi->s_largest_free_orders_list_iops);
> +	kfree(sbi->s_largest_free_orders_locks_iops);
>  	kfree(sbi->s_mb_offsets);
>  	kfree(sbi->s_mb_maxs);
>  	iput(sbi->s_buddy_cache);
> -- 
> 1.8.3.1
Andreas Dilger Aug. 15, 2023, 4:10 a.m. UTC | #3
On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> 
> Bobi Jam <bobijam@hotmail.com> writes:
> 
>> With LVM it is possible to create an LV with SSD storage at the
>> beginning of the LV and HDD storage at the end of the LV, and use that
>> to separate ext4 metadata allocations (that need small random IOs)
>> from data allocations (that are better suited for large sequential
>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>> the filesystem capacity would need to be high-IOPS storage in order to
>> hold all of the internal metadata.
>> 
>> This would improve performance for inode and other metadata access,
>> such as ls, find, e2fsck, and in general improve file access latency,
>> modification, truncate, unlink, transaction commit, etc.
>> 
>> This patch split largest free order group lists and average fragment
>> size lists into other two lists for IOPS/fast storage groups, and
>> cr 0 / cr 1 group scanning for metadata block allocation in following
>> order:
>> 
>> cr 0 on largest free order IOPS group list
>> cr 1 on average fragment size IOPS group list
>> cr 0 on largest free order non-IOPS group list
>> cr 1 on average fragment size non-IOPS group list
>> cr >= 2 perform the linear search as before

Hi Ritesh,
thanks for the review and the discussion about the patch.

> Yes. The implementation looks straight forward to me.
> 

>> Non-metadata block allocation does not allocate from the IOPS groups.
>> 
>> Add for mke2fs an option to mark which blocks are in the IOPS region
>> of storage at format time:
>> 
>>  -E iops=0-1024G,4096-8192G
> 

> However few things to discuss here are -

As Ted requested on the call, this should be done as two separate calls
to the allocator, rather than embedding the policy in mballoc group
selection itself.  Presumably this would be in ext4_mb_new_blocks()
calling ext4_mb_regular_allocator() twice with different allocation
flags (first with EXT4_MB_HINT_METADATA, then without, though I don't
actually see this was used anywhere in the code before this patch?)

Metadata allocations should try only IOPS groups on the first call,
but would go through all allocation phases.  If IOPS allocation fails,
then the allocator should do a full second pass to allocate from the
non-IOPS groups.  Non-metadata allocations would only allocate from
non-IOPS groups.
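
Roughly something like this in ext4_mb_new_blocks() (hand-wavy sketch
only; EXT4_MB_ALLOC_IOPS_ONLY and ext4_sb_has_iops() are names invented
here, not anything in the current patch):

  if ((ar->flags & EXT4_MB_HINT_METADATA) && ext4_sb_has_iops(sb)) {
          ac->ac_flags |= EXT4_MB_ALLOC_IOPS_ONLY;
          *errp = ext4_mb_regular_allocator(ac);
          if (*errp || ac->ac_status != AC_STATUS_FOUND) {
                  /* IOPS groups exhausted, retry in non-IOPS groups */
                  ac->ac_flags &= ~EXT4_MB_ALLOC_IOPS_ONLY;
                  *errp = ext4_mb_regular_allocator(ac);
          }
  } else {
          /* data (or no IOPS region at all): non-IOPS groups only */
          *errp = ext4_mb_regular_allocator(ac);
  }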

> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
> the allocation for data blocks will still succeed, however we won't be
> able to make use of optimized scanning any more. Because we search within
> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.

The intention for our usage is that data allocations should *only* come
from the HDD region of the device, and *not* from the IOPS (flash) region
of the device.  The IOPS region will be comparatively small (0.5-1.0% of
the total device size) so using or not using this space will be mostly
meaningless to the overall filesystem usage, especially with a 1-5%
reserved blocks percentage that is the default for new filesystems.

As you mentioned on the call, it seems this is a defect in the current
patch, that non-metadata allocations may eventually fall back to scan
all block groups for free space including IOPS groups.  They need to
explicitly skip groups that have the IOPS flag set.
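
Something along these lines in ext4_mb_good_group() should be enough
(untested sketch, using the EXT4_MB_GRP_TEST_IOPS() macro added by
this patch):

  struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);

  /* never hand an IOPS group to a non-metadata allocation */
  if (grp && EXT4_MB_GRP_TEST_IOPS(grp) &&
      !(ac->ac_flags & EXT4_MB_HINT_METADATA))
          return false;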

> 2. Similarly what happens when the ssd space for metadata gets full?
> In this case we keep falling back to cr2 for allocation and we don't
> utilize optimize_scanning to find the block groups from hdd space to
> allocate from.

In the case when the IOPS groups are full then the metadata allocations
should fall back to using non-IOPS groups.  That avoids ENOSPC when the
metadata space is accidentally formatted too small, or unexpected usage
such as large xattrs or many directories are consuming more IOPS space.

> 3. So it seems after a period of time, these iops lists can have block
> groups belonging to differnt ssds. Could this cause the metadata
> allocation of related inodes to come from different ssds.
> Will this be optimal? Checking on this...
>     ...On checking further on this, we start with a goal group and we
> at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
> we frequently allocate metadata blocks from different SSDs.

In our usage there will typically be only a single IOPS region at the start of
the device, but the ability to allow multiple IOPS regions was added for
completeness and flexibility in the future (e.g. resize of filesystem).
In our case, the IOPS region would itself be RAIDed, so "different SSDs"
is not really a concern.

> 4. Ok looking into this, do we even require the iops lists for metadata
> allocations? Do we allocate more than 1 blocks for metadata? If not then
> maintaining these iops lists for metadata allocation isn't really
> helpful. On the other hand it does make sense to maintain it when we
> allow data allocations from these ssds when hdds gets full.

I don't think we *need* to use the same mballoc code for IOPS allocation
in most cases, though large xattr inode allocations should also be using
the IOPS groups for allocating blocks, and these might be up to 64KB.
I don't think that is actually implemented properly in this patch yet.

Also, the mballoc list/array make it easy to find groups with free space
in a full filesystem instead of having to scan for them, even if we
don't need the full "allocate order-N" functionality.  Having one list
of free groups or order-N lists doesn't make it more expensive (and it
actually improves scalability to have multiple list heads).

One of the future enhancements might be to allow small files (of some
configurable size) to also be allocated from the IOPS groups, so it is
probably easier IMHO to just stick with the same allocator for both.

> 5. Did we run any benchmarks with this yet? What kind of gains we are
> looking for? Do we have any numbers for this?

We're working on that.  I just wanted to get the initial patches out for
review sooner rather than later, both to get feedback on implementation
(like this, thanks), and also to reserve the EXT4_BG_IOPS field so it
doesn't get used in a conflicting manner.

> 6. I couldn't stop but start to think of...
> Should there also be a provision from the user to pass hot/cold data
> types which we can use as a hint within the filesystem to allocate from
> ssd v/s hdd? Does it even make sense to think in this direction?

Yes, I also had the same idea, but then left it out of my email to avoid
getting distracted from the initial goal.  There are a number of possible
improvements that could be done with a mechanism like this:
- have fast/slow regions within a single HDD (i.e. last 20% of spindle is
  in "slow" region due to reduced linear velocity/bandwidth on inner tracks)
  to avoid using the slow region unless the fast region is (mostly) full
- have several regions across an HDD to *intentionally* allocate some
  extents in the "slow" groups to reduce *peak* bandwidth but keep
  *average* bandwidth higher as the disk becomes more full since there
  would still be free space in the faster groups.

Cheers, Andreas
Ritesh Harjani (IBM) Aug. 16, 2023, 10:09 a.m. UTC | #4
Andreas Dilger <adilger@dilger.ca> writes:

> On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>> 
>> Bobi Jam <bobijam@hotmail.com> writes:
>> 
>>> With LVM it is possible to create an LV with SSD storage at the
>>> beginning of the LV and HDD storage at the end of the LV, and use that
>>> to separate ext4 metadata allocations (that need small random IOs)
>>> from data allocations (that are better suited for large sequential
>>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>>> the filesystem capacity would need to be high-IOPS storage in order to
>>> hold all of the internal metadata.
>>> 
>>> This would improve performance for inode and other metadata access,
>>> such as ls, find, e2fsck, and in general improve file access latency,
>>> modification, truncate, unlink, transaction commit, etc.
>>> 
>>> This patch split largest free order group lists and average fragment
>>> size lists into other two lists for IOPS/fast storage groups, and
>>> cr 0 / cr 1 group scanning for metadata block allocation in following
>>> order:
>>> 
>>> cr 0 on largest free order IOPS group list
>>> cr 1 on average fragment size IOPS group list
>>> cr 0 on largest free order non-IOPS group list
>>> cr 1 on average fragment size non-IOPS group list
>>> cr >= 2 perform the linear search as before
>
> Hi Ritesh,
> thanks for the review and the discussion about the patch.
>
>> Yes. The implementation looks straight forward to me.
>> 
>
>>> Non-metadata block allocation does not allocate from the IOPS groups.
>>> 
>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>> of storage at format time:
>>> 
>>>  -E iops=0-1024G,4096-8192G
>> 
>
>> However few things to discuss here are -
>
> As Ted requested on the call, this should be done as two separate calls
> to the allocator, rather than embedding the policy in mballoc group
> selection itself.  Presumably this would be in ext4_mb_new_blocks()
> calling ext4_mb_regular_allocator() twice with different allocation
> flags (first with EXT4_MB_HINT_METADATA, then without, though I don't
> actually see this was used anywhere in the code before this patch?)
>
> Metadata allocations should try only IOPS groups on the first call,
> but would go through all allocation phases.  If IOPS allocation fails,
> then the allocator should do a full second pass to allocate from the
> non-IOPS groups. Non-metadata allocations would only allocate from
> non-IOPS groups.
>
>> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
>> the allocation for data blocks will still succeed, however we won't be
>> able to make use of optimized scanning any more. Because we search within
>> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.
>
> The intention for our usage is that data allocations should *only* come
> from the HDD region of the device, and *not* from the IOPS (flash) region
> of the device.  The IOPS region will be comparatively small (0.5-1.0% of
> the total device size) so using or not using this space will be mostly
> meaningless to the overall filesystem usage, especially with a 1-5%
> reserved blocks percentage that is the default for new filesystems.
>

Yes, but when we give this functionality to non-enterprise users,
everyone would like to take advantage of a faster-performing ext4 using
one ssd and a few hdds, or a smaller spare ssd and larger hdds. Then it
could be that the space of the iops region might not strictly be less
than 1-2% and could be anywhere between 10-50% ;)

Shouldn't we still support this class of usecase as well? ^^^
So if the HDD gets full then the allocation should fall back to the ssd
for data blocks, no?

Or we can have a policy knob, e.g. fallback_data_to_iops_region_thresh.
So if the free space %age in the iops region is above 1% (can be changed
by the user) then the data allocations can fall back to the iops region
if we are unable to allocate data blocks from the hdd region.

      echo %age_threshold > fallback_data_to_iops_region_thresh (default 1%)

        Fallback data allocations to iops region as long as we have free space
        %age of iops region above %age_threshold.

> As you mentioned on the call, it seems this is a defect in the current
> patch, that non-metadata allocations may eventually fall back to scan
> all block groups for free space including IOPS groups.  They need to
> explicitly skip groups that have the IOPS flags set.
>
>> 2. Similarly what happens when the ssd space for metadata gets full?
>> In this case we keep falling back to cr2 for allocation and we don't
>> utilize optimize_scanning to find the block groups from hdd space to
>> allocate from.
>
> In the case when the IOPS groups are full then the metadata allocations
> should fall back to using non-IOPS groups.  That avoids ENOSPC when the
> metadata space is accidentally formatted too small, or unexpected usage
> such as large xattrs or many directories are consuming more IOPS space.
>
>> 3. So it seems after a period of time, these iops lists can have block
>> groups belonging to differnt ssds. Could this cause the metadata
>> allocation of related inodes to come from different ssds.
>> Will this be optimal? Checking on this...
>>     ...On checking further on this, we start with a goal group and we
>> at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
>> we frequently allocate metadata blocks from different SSDs.
>
> In our usage will typically be only a single IOPS region at the start of
> the device, but the ability to allow multiple IOPS regions was added for
> completeness and flexibility in the future (e.g. resize of filesystem).

I am interested in knowing what you think the challenges will be in
supporting resize with hybrid devices. For example, if someone would
like to add an additional ssd and do a resize, do you think all later
metadata allocations can be fulfilled from this iops region?

And what happens when someone adds hdds to existing ssds?
I guess adding an hdd followed by a resize operation can still allocate
the bgdt, block/inode bitmaps, inode tables, etc. for these block groups
to sit on the resized hdd, right?
Are there any other challenges as well for such a usecase?


> In our case, the IOPS region would itself be RAIDed, so "different SSDs"
> is not really a concern.
>
>> 4. Ok looking into this, do we even require the iops lists for metadata
>> allocations? Do we allocate more than 1 blocks for metadata? If not then
>> maintaining these iops lists for metadata allocation isn't really
>> helpful. On the other hand it does make sense to maintain it when we
>> allow data allocations from these ssds when hdds gets full.
>
> I don't think we *need* to use the same mballoc code for IOPS allocation
> in most cases, though large xattr inode allocations should also be using
> the IOPS groups for allocating blocks, and these might be up to 64KB.
> I don't think that is actually implemented properly in this patch yet.
>
> Also, the mballoc list/array make it easy to find groups with free space
> in a full filesystem instead of having to scan for them, even if we
> don't need the full "allocate order-N" functionality.  Having one list
> of free groups or order-N lists doesn't make it more expensive (and it
> actually improves scalability to have multiple list heads).
>
> One of the future enhancements might be to allow small files (of some
> configurable size) to also be allocated from the IOPS groups, so it is
> probably easier IMHO to just stick with the same allocator for both.
>
>> 5. Did we run any benchmarks with this yet? What kind of gains we are
>> looking for? Do we have any numbers for this?
>
> We're working on that.  I just wanted to get the initial patches out for
> review sooner rather than later, both to get feedback on implementation
> (like this, thanks), and also to reserve the EXT4_BG_IOPS field so it
> doesn't get used in a conflicting manner.
>
>> 6. I couldn't stop but start to think of...
>> Should there also be a provision from the user to pass hot/cold data
>> types which we can use as a hint within the filesystem to allocate from
>> ssd v/s hdd? Does it even make sense to think in this direction?
>
> Yes, I also had the same idea, but then left it out of my email to avoid
> getting distracted from the initial goal.  There are a number of possible
> improvements that could be done with a mechanism like this:
> - have fast/slow regions within a single HDD (i.e. last 20% of spindle is
>   in "slow" region due to reduced linear velocity/bandwidth on inner tracks)
>   to avoid using the slow region unless the fast region is (mostly) full
> - have several regions across an HDD to *intentionally* allocate some
>   extents in the "slow" groups to reduce *peak* bandwidth but keep
>   *average* bandwidth higher as the disk becomes more full since there
>   would still be free space in the faster groups.
>

Interesting! 

> Cheers, Andreas
>

Thanks
-ritesh
Andreas Dilger Aug. 18, 2023, 7:53 p.m. UTC | #5
On Aug 16, 2023, at 4:09 AM, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> 
> Andreas Dilger <adilger@dilger.ca> writes:
> 
>> On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>>> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
>>> the allocation for data blocks will still succeed, however we won't be
>>> able to make use of optimized scanning any more. Because we search within
>>> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.
>> 
>> The intention for our usage is that data allocations should *only* come
>> from the HDD region of the device, and *not* from the IOPS (flash) region
>> of the device.  The IOPS region will be comparatively small (0.5-1.0% of
>> the total device size) so using or not using this space will be mostly
>> meaningless to the overall filesystem usage, especially with a 1-5%
>> reserved blocks percentage that is the default for new filesystems.
> 
> Yes, but when we give this functionality to non-enterprise users,
> everyone would like to take advantage of a faster performing ext4 using
> 1 ssd and few hdds or a smaller spare ssd and larger hdds. Then it could
> be that the space of iops region might not strictly be less than 1-2%
> and could be anywhere between 10-50% ;)
> 
> Shouldn't we still support this class of usecase as well? ^^^
> So if the HDD gets full then the allocation should fallback to ssd for
> data blocks no?

It's true that this is possible, and I've thought about optionally
allowing e.g. "small files" to be allocated in the IOPS space while
"large files" are allocated only in the HDD space.  This involves
"policy" which is always tricky to get right.  What is "small" and
what is "large"?  At what threshold do we *stop* putting small files
into the IOPS groups when they get too full, or increase the size of
"small" files if it isn't filling up quickly enough vs. the non-IOPS
groups? ...

I'd prefer to get the basic infrastructure working, and then we can
have the long discussions about how the policy should work for the
*next* patches, because those decisions do not have a permanent effect
on the on-disk format. :-)

> Or we can have a policy knob i.e. fallback_data_to_iops_region_thresh.
> So if the free space %age in the iops region is above 1% (can be changed
> by user) then the data allocations can fallback to iops region if it is
> unable to allocate data blocks from hdd region.
> 
>      echo %age_threshold > fallback_data_to_iops_region_thresh (default 1%)
> 
>        Fallback data allocations to iops region as long as we have free space
>        %age of iops region above %age_threshold.

IMHO, a simple "too full" threshold is sub-optimal, because it means
the IOPS groups would suddenly get filled up with regular file data,
and in the likely case that old files are deleted to free up space,
the IOPS groups will still be filled with the new files.

My preference would be to have a "base small file size" (e.g. 64KB)
and a "fullness ratio" (free IOPS blocks / free non-IOPS blocks) and
use the fullness ratio to scale the "small file size".  In case the
free IOPS blocks are very small (e.g. my initial proposal of 1% of
the total volume size, most of which would be filled with static
metadata) then only files < 64 KB * 0.5% = 0.32 KB (probably *no* files,
since this is less than one block) would go into the IOPS groups.

If the ratio is more like 50% then files under 32KB could be allocated
into the IOPS groups, and if the non-IOPS groups end up filling faster
and the free space ratio is equal to or even higher in the IOPS groups,
then 64KB or 128KB files can start to be allocated there.
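
To make that scaling concrete, here is a toy, standalone sketch of the
idea (nothing below is from the patch; the helper name, the 64KB base,
and the sample numbers are purely illustrative):

/*
 * Illustrative only: scale a "small file" size threshold by the
 * fullness ratio (free IOPS blocks / free non-IOPS blocks).
 */
#include <stdio.h>

static unsigned long long small_file_limit(unsigned long long free_iops,
					   unsigned long long free_other)
{
	const unsigned long long base = 64 * 1024;	/* 64KB base "small file" size */

	if (!free_other)
		return 0;
	return base * free_iops / free_other;
}

int main(void)
{
	/* ~0.5% ratio: IOPS region is 1% of the volume, half of it already
	 * consumed by static metadata -> limit is well below one block. */
	printf("limit = %llu bytes\n", small_file_limit(5, 1000));

	/* ~50% ratio: non-IOPS groups are filling faster -> 32KB limit. */
	printf("limit = %llu bytes\n", small_file_limit(500, 1000));

	return 0;
}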

> I am interested in knowing what do you think will be challenges in
> supporting resize with hybrid devices? Like if someone would like to
> add an additional ssd and do a resize, do you think all later metadata
> allocations can be fulfilled from this iops region?
> 
> And what happens when someone adds hdds to existing ssds.
> I guess adding an hdd followed by a resize operation can still allocate
> GDT, block/inode bitmaps, inode tables, etc. for these block groups
> to sit on the resized hdd, right?
> 
> Are there any other challenges as well for such a use case?

At a high level, my expectation would be that resize would "work"
regardless of whether the IOPS groups have space or not, but how
optimal this is depends on how much free space is in the IOPS groups.
If the IOPS groups are over-provisioned, then it should be fine to
allocate bitmaps and inode table blocks in that space (with flex_bg).

It should also be possible to add more IOPS groups at the end of the
filesystem to help the resize to keep all metadata in the fast storage.
Allowing disjoint regions of flash storage is one of the reasons why
EXT4_BG_IOPS is a per-group flag and not just a "threshold" boundary
within the filesystem.


I only realized yesterday that online resize is completely disabled
for filesystems with sparse_super2.  I think this is actually a mistake
because the problem looks like a bad interaction between sparse_super2
having only 2 backup groups, and the resize_inode feature expecting that
there are backup group descriptors in all of the usual places.

So I think it makes sense to change the "cannot do online resize" check
to only the case of sparse_super2 AND resize_inode being enabled.  This
should be uncommon since sparse_super2 is mostly useful for filesystems
over 16TB in size, and resize_inode currently doesn't work in that case.
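
In code terms, the intent would be roughly the following check in the
online-resize path (a sketch only; the message text and exact placement
are illustrative, not part of this patch):

	/* Sketch: refuse online resize only when *both* sparse_super2 and
	 * resize_inode are enabled, instead of rejecting sparse_super2
	 * unconditionally.
	 */
	if (ext4_has_feature_sparse_super2(sb) &&
	    ext4_has_feature_resize_inode(sb)) {
		ext4_warning(sb, "online resize with sparse_super2 and resize_inode is not supported");
		return -EOPNOTSUPP;
	}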

It does seem possible to fix resize_inode to work with sparse_super2 for
filesystems over 16TiB.  Originally, resize_inode was disallowed for
filesystems > 16TiB because of the 2^32 block number limit for non-extent
files, and because the increasing number of backup groups means a large
number of blocks need to be reserved.  However, when using sparse_super2
there are only 2 backup groups, and the fact that they can be located
within the first 16TiB (there is no reason that backup #2 has to be in
the last group) means resize_inode will have enough space in it to reserve
extra GDT blocks for the online resize to work smoothly, whether the IOPS
groups are in use or not.  However, that is a separate project...

Cheers, Andreas
diff mbox series

Patch

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c1edde8..7b1b3ec 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -739,7 +739,7 @@  ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
 	ar.inode = inode;
 	ar.goal = goal;
 	ar.len = count ? *count : 1;
-	ar.flags = flags;
+	ar.flags = flags | EXT4_MB_HINT_METADATA;
 
 	ret = ext4_mb_new_blocks(handle, &ar, errp);
 	if (count)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8104a21..3444b6e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -382,6 +382,7 @@  struct flex_groups {
 #define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
 #define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
 #define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
+#define EXT4_BG_IOPS		0x0010 /* In IOPS/fast storage */
 
 /*
  * Macro-instructions used to manage group descriptors
@@ -1112,6 +1113,8 @@  struct ext4_inode_info {
 #define EXT2_FLAGS_UNSIGNED_HASH	0x0002  /* Unsigned dirhash in use */
 #define EXT2_FLAGS_TEST_FILESYS		0x0004	/* to test development code */
 
+#define EXT2_FLAGS_HAS_IOPS		0x0080	/* has IOPS storage */
+
 /*
  * Mount flags set via mount options or defaults
  */
@@ -1514,8 +1517,12 @@  struct ext4_sb_info {
 	atomic_t s_retry_alloc_pending;
 	struct list_head *s_mb_avg_fragment_size;
 	rwlock_t *s_mb_avg_fragment_size_locks;
+	struct list_head *s_avg_fragment_size_list_iops;  /* avg_fragment_size for IOPS groups */
+	rwlock_t *s_avg_fragment_size_locks_iops;
 	struct list_head *s_mb_largest_free_orders;
 	rwlock_t *s_mb_largest_free_orders_locks;
+	struct list_head *s_largest_free_orders_list_iops; /* largest_free_orders for IOPS grps */
+	rwlock_t *s_largest_free_orders_locks_iops;
 
 	/* tunables */
 	unsigned long s_stripe;
@@ -3366,6 +3373,7 @@  struct ext4_group_info {
 #define EXT4_GROUP_INFO_IBITMAP_CORRUPT		\
 	(1 << EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT)
 #define EXT4_GROUP_INFO_BBITMAP_READ_BIT	4
+#define EXT4_GROUP_INFO_IOPS_BIT		5
 
 #define EXT4_MB_GRP_NEED_INIT(grp)	\
 	(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
@@ -3382,6 +3390,10 @@  struct ext4_group_info {
 	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
 #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
 	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_TEST_IOPS(grp)	\
+	(test_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_SET_IOPS(grp)	\
+	(set_bit(EXT4_GROUP_INFO_IOPS_BIT, &((grp)->bb_state)))
 
 #define EXT4_MAX_CONTENTION		8
 #define EXT4_CONTENTION_THRESHOLD	2
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 20f67a2..6d218af 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -828,6 +828,8 @@  static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	rwlock_t *afs_locks;
+	struct list_head *afs_list;
 	int new_order;
 
 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
@@ -838,20 +840,23 @@  static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
 	if (new_order == grp->bb_avg_fragment_size_order)
 		return;
 
+	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
+		afs_locks = sbi->s_avg_fragment_size_locks_iops;
+		afs_list = sbi->s_avg_fragment_size_list_iops;
+	} else {
+		afs_locks = sbi->s_mb_avg_fragment_size_locks;
+		afs_list = sbi->s_mb_avg_fragment_size;
+	}
+
 	if (grp->bb_avg_fragment_size_order != -1) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+		write_lock(&afs_locks[grp->bb_avg_fragment_size_order]);
 		list_del(&grp->bb_avg_fragment_size_node);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+		write_unlock(&afs_locks[grp->bb_avg_fragment_size_order]);
 	}
 	grp->bb_avg_fragment_size_order = new_order;
-	write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
-	list_add_tail(&grp->bb_avg_fragment_size_node,
-		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
-	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+	write_lock(&afs_locks[new_order]);
+	list_add_tail(&grp->bb_avg_fragment_size_node, &afs_list[new_order]);
+	write_unlock(&afs_locks[new_order]);
 }
 
 /*
@@ -863,6 +868,10 @@  static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_group_info *iter, *grp;
+	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
+		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
+	rwlock_t *lfo_locks;
+	struct list_head *lfo_list;
 	int i;
 
 	if (ac->ac_status == AC_STATUS_FOUND)
@@ -871,17 +880,25 @@  static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
 		atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
 
+	if (iops) {
+		lfo_locks = sbi->s_largest_free_orders_locks_iops;
+		lfo_list = sbi->s_largest_free_orders_list_iops;
+	} else {
+		lfo_locks = sbi->s_mb_largest_free_orders_locks;
+		lfo_list = sbi->s_mb_largest_free_orders;
+	}
+
 	grp = NULL;
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+		if (list_empty(&lfo_list[i]))
 			continue;
-		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
-		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
-			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+		read_lock(&lfo_locks[i]);
+		if (list_empty(&lfo_list[i])) {
+			read_unlock(&lfo_locks[i]);
 			continue;
 		}
 		grp = NULL;
-		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
+		list_for_each_entry(iter, &lfo_list[i],
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
@@ -890,7 +907,7 @@  static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
 				break;
 			}
 		}
-		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+		read_unlock(&lfo_locks[i]);
 		if (grp)
 			break;
 	}
@@ -913,6 +930,10 @@  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_group_info *grp = NULL, *iter;
+	bool iops = ac->ac_flags & EXT4_MB_HINT_METADATA &&
+		    ac->ac_sb->s_flags & EXT2_FLAGS_HAS_IOPS;
+	rwlock_t *afs_locks;
+	struct list_head *afs_list;
 	int i;
 
 	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
@@ -920,16 +941,24 @@  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
 	}
 
+	if (iops) {
+		afs_locks = sbi->s_avg_fragment_size_locks_iops;
+		afs_list = sbi->s_avg_fragment_size_list_iops;
+	} else {
+		afs_locks = sbi->s_mb_avg_fragment_size_locks;
+		afs_list = sbi->s_mb_avg_fragment_size;
+	}
+
 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		if (list_empty(&sbi->s_mb_avg_fragment_size[i]))
+		if (list_empty(&afs_list[i]))
 			continue;
-		read_lock(&sbi->s_mb_avg_fragment_size_locks[i]);
-		if (list_empty(&sbi->s_mb_avg_fragment_size[i])) {
-			read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		read_lock(&afs_locks[i]);
+		if (list_empty(&afs_list[i])) {
+			read_unlock(&afs_locks[i]);
 			continue;
 		}
-		list_for_each_entry(iter, &sbi->s_mb_avg_fragment_size[i],
+		list_for_each_entry(iter, &afs_list[i],
 				    bb_avg_fragment_size_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
@@ -938,7 +967,7 @@  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 				break;
 			}
 		}
-		read_unlock(&sbi->s_mb_avg_fragment_size_locks[i]);
+		read_unlock(&afs_locks[i]);
 		if (grp)
 			break;
 	}
@@ -947,7 +976,15 @@  static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
-		*new_cr = 2;
+		if (iops) {
+			/* cannot find proper group in IOPS storage,
+			 * fall back to cr0 for non-IOPS groups.
+			 */
+			ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
+			*new_cr = 0;
+		} else {
+			*new_cr = 2;
+		}
 	}
 }
 
@@ -1030,6 +1067,8 @@  static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	rwlock_t *lfo_locks;
+	struct list_head *lfo_list;
 	int i;
 
 	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
@@ -1042,21 +1081,24 @@  static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		return;
 	}
 
+	if (EXT4_MB_GRP_TEST_IOPS(grp)) {
+		lfo_locks = sbi->s_largest_free_orders_locks_iops;
+		lfo_list = sbi->s_largest_free_orders_list_iops;
+	} else {
+		lfo_locks = sbi->s_mb_largest_free_orders_locks;
+		lfo_list = sbi->s_mb_largest_free_orders;
+	}
+
 	if (grp->bb_largest_free_order >= 0) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+		write_lock(&lfo_locks[grp->bb_largest_free_order]);
 		list_del_init(&grp->bb_largest_free_order_node);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+		write_unlock(&lfo_locks[grp->bb_largest_free_order]);
 	}
 	grp->bb_largest_free_order = i;
 	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
-		list_add_tail(&grp->bb_largest_free_order_node,
-		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+		write_lock(&lfo_locks[i]);
+		list_add_tail(&grp->bb_largest_free_order_node, &lfo_list[i]);
+		write_unlock(&lfo_locks[i]);
 	}
 }
 
@@ -3150,6 +3192,8 @@  int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
+	if (desc->bg_flags & EXT4_BG_IOPS)
+		EXT4_MB_GRP_SET_IOPS(meta_group_info[i]);
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
@@ -3423,6 +3467,24 @@  int ext4_mb_init(struct super_block *sb)
 		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
 		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
 	}
+	sbi->s_avg_fragment_size_list_iops =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+			      GFP_KERNEL);
+	if (!sbi->s_avg_fragment_size_list_iops) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sbi->s_avg_fragment_size_locks_iops =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+			      GFP_KERNEL);
+	if (!sbi->s_avg_fragment_size_locks_iops) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+		INIT_LIST_HEAD(&sbi->s_avg_fragment_size_list_iops[i]);
+		rwlock_init(&sbi->s_avg_fragment_size_locks_iops[i]);
+	}
 	sbi->s_mb_largest_free_orders =
 		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
 			GFP_KERNEL);
@@ -3441,6 +3503,24 @@  int ext4_mb_init(struct super_block *sb)
 		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
 		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
 	}
+	sbi->s_largest_free_orders_list_iops =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+			      GFP_KERNEL);
+	if (!sbi->s_largest_free_orders_list_iops) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sbi->s_largest_free_orders_locks_iops =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+			      GFP_KERNEL);
+	if (!sbi->s_largest_free_orders_locks_iops) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+		INIT_LIST_HEAD(&sbi->s_largest_free_orders_list_iops[i]);
+		rwlock_init(&sbi->s_largest_free_orders_locks_iops[i]);
+	}
 
 	spin_lock_init(&sbi->s_md_lock);
 	sbi->s_mb_free_pending = 0;
@@ -3512,8 +3592,12 @@  int ext4_mb_init(struct super_block *sb)
 out:
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
+	kfree(sbi->s_avg_fragment_size_list_iops);
+	kfree(sbi->s_avg_fragment_size_locks_iops);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
+	kfree(sbi->s_largest_free_orders_list_iops);
+	kfree(sbi->s_largest_free_orders_locks_iops);
 	kfree(sbi->s_mb_offsets);
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
@@ -3582,8 +3666,12 @@  int ext4_mb_release(struct super_block *sb)
 	}
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
+	kfree(sbi->s_avg_fragment_size_list_iops);
+	kfree(sbi->s_avg_fragment_size_locks_iops);
 	kfree(sbi->s_mb_largest_free_orders);
 	kfree(sbi->s_mb_largest_free_orders_locks);
+	kfree(sbi->s_largest_free_orders_list_iops);
+	kfree(sbi->s_largest_free_orders_locks_iops);
 	kfree(sbi->s_mb_offsets);
 	kfree(sbi->s_mb_maxs);
 	iput(sbi->s_buddy_cache);