[RFC,08/11] ext4: Don't skip prefetching BLOCK_UNINIT groups

Message ID 4881693a4f5ba1fed367310b27c793e4e78520d3.1674822311.git.ojaswin@linux.ibm.com
State: Superseded
Series: multiblock allocator improvements

Commit Message

Ojaswin Mujoo Jan. 27, 2023, 12:37 p.m. UTC
Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
As a consequence, we end up not initializing the buddy structures and
CR0/1 lists for these BGs, even though this can be done without any
disk IO overhead. Hence, don't skip such BGs during prefetch and
prefetch_fini.

This improves the accuracy of CR0/1 allocation: earlier, essentially
empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
structures were not initialized, leading to slower CR2 allocations.
With this patch, CR0/1 can discover these groups as well, thus
improving performance.
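
To illustrate the check change, here is a minimal userspace sketch
(simplified stand-ins for the kernel types and EXT4_MB_GRP_* macros;
only the EXT4_BG_BLOCK_UNINIT value is taken from fs/ext4/ext4.h):

/*
 * Standalone model of the prefetch predicate before and after this
 * patch. struct bg and the helpers are simplified stand-ins, not the
 * kernel's group descriptor and macros.
 */
#include <stdbool.h>
#include <stdio.h>

#define EXT4_BG_BLOCK_UNINIT 0x0002	/* flag value from fs/ext4/ext4.h */

struct bg {
	unsigned int free_clusters;	/* ext4_free_group_clusters() */
	unsigned short bg_flags;	/* gdp->bg_flags (host order here) */
	bool need_init;			/* EXT4_MB_GRP_NEED_INIT() */
};

/* Old check: BLOCK_UNINIT groups were skipped when the fs has group
 * descriptor checksums. */
static bool prefetch_old(const struct bg *g, bool has_desc_csum)
{
	return g->need_init && g->free_clusters > 0 &&
	       !(has_desc_csum && (g->bg_flags & EXT4_BG_BLOCK_UNINIT));
}

/* New check: BLOCK_UNINIT groups are taken too, so their buddy
 * structures and CR0/1 lists get set up with no disk IO. */
static bool prefetch_new(const struct bg *g)
{
	return g->need_init && g->free_clusters > 0;
}

int main(void)
{
	struct bg empty_bg = {
		.free_clusters = 32768,		/* fully free group */
		.bg_flags = EXT4_BG_BLOCK_UNINIT,
		.need_init = true,
	};

	/* Prints "old: 0, new: 1": the empty group is now discoverable. */
	printf("old: %d, new: %d\n",
	       prefetch_old(&empty_bg, true), prefetch_new(&empty_bg));
	return 0;
}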

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/mballoc.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Comments

Jan Kara March 9, 2023, 2:14 p.m. UTC | #1
On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> As a consequence, we end up not initializing the buddy structures and
> CR0/1 lists for these BGs, even though this can be done without any
> disk IO overhead. Hence, don't skip such BGs during prefetch and
> prefetch_fini.
> 
> This improves the accuracy of CR0/1 allocation: earlier, essentially
> empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
> structures were not initialized, leading to slower CR2 allocations.
> With this patch, CR0/1 can discover these groups as well, thus
> improving performance.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

The patch looks good. I just somewhat wonder - this change may result in
uninitialized groups being initialized and used earlier (previously we'd
rather search in other already initialized groups), which may spread
allocations more. But I suppose that's fine; uninit groups are not
really a feature meant to limit fragmentation, and as the filesystem
ages the differences should be minimal. So feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/mballoc.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 14529d2fe65f..48726a831264 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2557,9 +2557,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
>  		 */
>  		if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
>  		    EXT4_MB_GRP_NEED_INIT(grp) &&
> -		    ext4_free_group_clusters(sb, gdp) > 0 &&
> -		    !(ext4_has_group_desc_csum(sb) &&
> -		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> +		    ext4_free_group_clusters(sb, gdp) > 0) {
>  			bh = ext4_read_block_bitmap_nowait(sb, group, true);
>  			if (bh && !IS_ERR(bh)) {
>  				if (!buffer_uptodate(bh) && cnt)
> @@ -2600,9 +2598,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
>  		grp = ext4_get_group_info(sb, group);
>  
>  		if (EXT4_MB_GRP_NEED_INIT(grp) &&
> -		    ext4_free_group_clusters(sb, gdp) > 0 &&
> -		    !(ext4_has_group_desc_csum(sb) &&
> -		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> +		    ext4_free_group_clusters(sb, gdp) > 0) {
>  			if (ext4_mb_init_group(sb, group, GFP_NOFS))
>  				break;
>  		}
> -- 
> 2.31.1
>
Ojaswin Mujoo March 17, 2023, 10:55 a.m. UTC | #2
On Thu, Mar 09, 2023 at 03:14:22PM +0100, Jan Kara wrote:
> On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> > Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> > BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> > As a consequence, we end up not initializing the buddy structures and
> > CR0/1 lists for these BGs, even though this can be done without any
> > disk IO overhead. Hence, don't skip such BGs during prefetch and
> > prefetch_fini.
> > 
> > This improves the accuracy of CR0/1 allocation: earlier, essentially
> > empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
> > structures were not initialized, leading to slower CR2 allocations.
> > With this patch, CR0/1 can discover these groups as well, thus
> > improving performance.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> 
> The patch looks good. I just somewhat wonder - this change may result in
> uninitialized groups being initialized and used earlier (previously we'd
> rather search in other already initialized groups), which may spread
> allocations more. But I suppose that's fine; uninit groups are not
> really a feature meant to limit fragmentation, and as the filesystem
> ages the differences should be minimal. So feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> 								Honza
Thanks for the review. As for the allocation spread, I agree that this
is something our goal determination logic should take care of, rather
than something we handle by limiting the BGs available to the
allocator.

Another point I wanted to discuss wrt this patch series was why
BLOCK_UNINIT groups were not being prefetched earlier. One reason I can
think of is that this might lead to memory pressure when we have too
many empty BGs on a very large (say, terabytes) disk.
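
A rough worked estimate of that cost (my assumptions: 4KiB blocks, so
each group spans 128MiB, and caching a group costs roughly one bitmap
page plus one buddy page; the real per-group overhead is approximate):

#include <stdio.h>

int main(void)
{
	unsigned long long disk = 1ULL << 40;		/* 1 TiB */
	unsigned long long group_span = 128ULL << 20;	/* 8*4096*4096 B */
	unsigned long long groups = disk / group_span;	/* 8192 groups */
	unsigned long long per_group = 2 * 4096;	/* bitmap + buddy */

	/* Prints "8192 groups -> ~64 MiB of page cache" */
	printf("%llu groups -> ~%llu MiB of page cache\n",
	       groups, (groups * per_group) >> 20);
	return 0;
}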

But I'd still like to know if there's some history behind not
prefetching BLOCK_UNINIT groups.

Cc'ing Andreas as well to check if they came across anything in Lustre
in the past.
> 
> > ---
> >  fs/ext4/mballoc.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 14529d2fe65f..48726a831264 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2557,9 +2557,7 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
> >  		 */
> >  		if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
> >  		    EXT4_MB_GRP_NEED_INIT(grp) &&
> > -		    ext4_free_group_clusters(sb, gdp) > 0 &&
> > -		    !(ext4_has_group_desc_csum(sb) &&
> > -		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> > +		    ext4_free_group_clusters(sb, gdp) > 0) {
> >  			bh = ext4_read_block_bitmap_nowait(sb, group, true);
> >  			if (bh && !IS_ERR(bh)) {
> >  				if (!buffer_uptodate(bh) && cnt)
> > @@ -2600,9 +2598,7 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
> >  		grp = ext4_get_group_info(sb, group);
> >  
> >  		if (EXT4_MB_GRP_NEED_INIT(grp) &&
> > -		    ext4_free_group_clusters(sb, gdp) > 0 &&
> > -		    !(ext4_has_group_desc_csum(sb) &&
> > -		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
> > +		    ext4_free_group_clusters(sb, gdp) > 0) {
> >  			if (ext4_mb_init_group(sb, group, GFP_NOFS))
> >  				break;
> >  		}
> > -- 
> > 2.31.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara March 23, 2023, 10:57 a.m. UTC | #3
On Fri 17-03-23 16:25:04, Ojaswin Mujoo wrote:
> On Thu, Mar 09, 2023 at 03:14:22PM +0100, Jan Kara wrote:
> > On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> > > Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> > > BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> > > As a consequence, we end up not initializing the buddy structures and
> > > CR0/1 lists for these BGs, even though this can be done without any
> > > disk IO overhead. Hence, don't skip such BGs during prefetch and
> > > prefetch_fini.
> > > 
> > > This improves the accuracy of CR0/1 allocation: earlier, essentially
> > > empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
> > > structures were not initialized, leading to slower CR2 allocations.
> > > With this patch, CR0/1 can discover these groups as well, thus
> > > improving performance.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > 
> > The patch looks good. I just somewhat wonder - this change may result in
> > uninitialized groups being initialized and used earlier (previously we'd
> > rather search in other already initialized groups), which may spread
> > allocations more. But I suppose that's fine; uninit groups are not
> > really a feature meant to limit fragmentation, and as the filesystem
> > ages the differences should be minimal. So feel free to add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > 								Honza
> Thanks for the review. As for the allocation spread, I agree that this
> is something our goal determination logic should take care of, rather
> than something we handle by limiting the BGs available to the
> allocator.
> 
> Another point I wanted to discuss wrt this patch series was why
> BLOCK_UNINIT groups were not being prefetched earlier. One reason I can
> think of is that this might lead to memory pressure when we have too
> many empty BGs on a very large (say, terabytes) disk.
> 
> But I'd still like to know if there's some history behind not
> prefetching BLOCK_UNINIT groups.

Hum, I don't remember anything. Maybe Ted will. You can ask him today on a
call.

								Honza
Ojaswin Mujoo March 25, 2023, 2:43 p.m. UTC | #4
On Thu, Mar 23, 2023 at 11:57:10AM +0100, Jan Kara wrote:
> On Fri 17-03-23 16:25:04, Ojaswin Mujoo wrote:
> > On Thu, Mar 09, 2023 at 03:14:22PM +0100, Jan Kara wrote:
> > > On Fri 27-01-23 18:07:35, Ojaswin Mujoo wrote:
> > > > Currently, ext4_mb_prefetch() and ext4_mb_prefetch_fini() skip
> > > > BLOCK_UNINIT groups since fetching their bitmaps doesn't need disk IO.
> > > > As a consequence, we end up not initializing the buddy structures and
> > > > CR0/1 lists for these BGs, even though this can be done without any
> > > > disk IO overhead. Hence, don't skip such BGs during prefetch and
> > > > prefetch_fini.
> > > > 
> > > > This improves the accuracy of CR0/1 allocation: earlier, essentially
> > > > empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
> > > > structures were not initialized, leading to slower CR2 allocations.
> > > > With this patch, CR0/1 can discover these groups as well, thus
> > > > improving performance.
> > > > 
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> > > 
> > > The patch looks good. I just somewhat wonder - this change may result in
> > > uninitialized groups being initialized and used earlier (previously we'd
> > > rather search in other already initialized groups), which may spread
> > > allocations more. But I suppose that's fine; uninit groups are not
> > > really a feature meant to limit fragmentation, and as the filesystem
> > > ages the differences should be minimal. So feel free to add:
> > > 
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > 
> > > 								Honza
> > Thanks for the review. As for the allocation spread, I agree that this
> > is something our goal determination logic should take care of, rather
> > than something we handle by limiting the BGs available to the
> > allocator.
> > 
> > Another point I wanted to discuss wrt this patch series was why
> > BLOCK_UNINIT groups were not being prefetched earlier. One reason I can
> > think of is that this might lead to memory pressure when we have too
> > many empty BGs on a very large (say, terabytes) disk.
> > 
> > But I'd still like to know if there's some history behind not
> > prefetching BLOCK_UNINIT groups.
> 
> Hum, I don't remember anything. Maybe Ted will. You can ask him today on a
> call.
Unfortunately, I couldn't join it last time :) I'll check with him this
upcoming Thursday.

Regards,
ojaswin
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Theodore Ts'o March 26, 2023, 3:54 a.m. UTC | #5
On Fri, Mar 17, 2023 at 04:25:04PM +0530, Ojaswin Mujoo wrote:
> > > This improves the accuracy of CR0/1 allocation: earlier, essentially
> > > empty BLOCK_UNINIT groups could be ignored by CR0/1 because their buddy
> > > structures were not initialized, leading to slower CR2 allocations.
> > > With this patch, CR0/1 can discover these groups as well, thus
> > > improving performance.
> >
> > The patch looks good. I just somewhat wonder - this change may result in
> > uninitialized groups being initialized and used earlier (previously we'd
> > rather search in other already initialized groups), which may spread
> > allocations more. But I suppose that's fine; uninit groups are not
> > really a feature meant to limit fragmentation, and as the filesystem
> > ages the differences should be minimal. So feel free to add:
> 
> Another point I wanted to discuss wrt this patch series was why
> BLOCK_UNINIT groups were not being prefetched earlier. One reason I can
> think of is that this might lead to memory pressure when we have too
> many empty BGs on a very large (say, terabytes) disk.

Originally the prefetch logic was simply something to optimize I/O ---
that is, normally, all of the block bitmaps for a flex_bg are
contiguous, so why not just read them all in a single I/O which is
issued all at once, instead of doing them as separate 4k reads.

Skipping block groups that hadn't yet been prefetched was something
which was added later, in order to improve performance of the
allocator for freshly mounted file systems where the prefetch hadn't
yet had a chance to pull in block bitmaps. The problem was that if the
block groups hadn't been prefetched yet, then the cr0 scan would fetch
them, and if you have a storage device where blocks with monotonically
increasing LBA numbers aren't necessarily stored adjacently on disk
(for example, on a dm-thin volume; but if one were to do an experiment
on certain emulated block devices in certain hyperscaler cloud
environments, one might find a similar performance profile), the cr0
scan could end up issuing a series of 16 separate 4k I/O's, which could
be substantially worse from a performance standpoint than doing a
single sequential 64k I/O.
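
A small standalone illustration of that batching (my sketch, assuming a
flex_bg factor of 16 and one 4KiB bitmap block per group stored
adjacently; this is not the kernel's actual prefetch code):

#include <stdio.h>

int main(void)
{
	unsigned int flex_factor = 16;		/* groups per flex group */
	unsigned int bitmap_bytes = 4096;	/* one 4KiB block per bitmap */
	unsigned int group = 37;		/* group whose bitmap we want */

	/* All bitmaps in a flex group sit adjacently on disk, so round
	 * down to the flex group boundary and read them all at once. */
	unsigned int first = group & ~(flex_factor - 1);

	/* Prints "prefetch groups 32..47 in one 64 KiB read" */
	printf("prefetch groups %u..%u in one %u KiB read\n",
	       first, first + flex_factor - 1,
	       flex_factor * bitmap_bytes / 1024);
	return 0;
}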

When this change was made, the focus was on *initialized* bitmaps
taking a long time if they were issued as individual 4k I/O's; the fix
was to skip scanning them initially, since the hope was that the
prefetch would pull them in fairly quickly, and a few bad allocations
when the file system was freshly mounted were an acceptable tradeoff.

But prefetching BLOCK_UNINIT groups makes sense; that should fix the
problem that you've identified (at least for BLOCK_UNINIT groups; for
initialized block bitmaps, we'll still have less optimal allocation
patterns until we've managed to prefetch those block groups).

Cheers,

					- Ted

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 14529d2fe65f..48726a831264 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2557,9 +2557,7 @@  ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 		 */
 		if (!EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
 		    EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 &&
-		    !(ext4_has_group_desc_csum(sb) &&
-		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+		    ext4_free_group_clusters(sb, gdp) > 0) {
 			bh = ext4_read_block_bitmap_nowait(sb, group, true);
 			if (bh && !IS_ERR(bh)) {
 				if (!buffer_uptodate(bh) && cnt)
@@ -2600,9 +2598,7 @@  void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 		grp = ext4_get_group_info(sb, group);
 
 		if (EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 &&
-		    !(ext4_has_group_desc_csum(sb) &&
-		      (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) {
+		    ext4_free_group_clusters(sb, gdp) > 0) {
 			if (ext4_mb_init_group(sb, group, GFP_NOFS))
 				break;
 		}