Message ID | 20230811071519.1094-1-teawaterz@linux.alibaba.com |
---|---|
State | Rejected |
Series | ext4_sb_breadahead_unmovable: Change to be no-blocking |
Oops, this is version 2 of the patch.

Best,
Hui

On Fri, Aug 11, 2023 at 15:15, Hui Zhu <teawaterz@linux.alibaba.com> wrote:
>
> From: Hui Zhu <teawater@antgroup.com>
>
> This version fixes the gfp flags in the callers instead of threading a
> new "bool" flag through the buffer head layers, following the comments
> from Matthew Wilcox.
>
> We encountered an issue where a large number of filesystem reads and
> writes occurred suddenly within a container. At the same time, other
> tasks on the same host that were performing filesystem read and write
> operations became blocked. It was observed that many of the blocked
> tasks were blocked on the ext4 journal lock. For example:
> PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread"
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] schedule at xxxxxxxxxxxxxxx
> 2 [] wait_transaction_locked at xxxxxxxxxxxxxxx
> 3 [] add_transaction_credits at xxxxxxxxxxxxxxx
> 4 [] start_this_handle at xxxxxxxxxxxxxxx
> 5 [] jbd2__journal_start at xxxxxxxxxxxxxxx
> 6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx
> 7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx
> 8 [] mark_inode_dirty at xxxxxxxxxxxxxxx
> 9 [] generic_update_time at xxxxxxxxxxxxxxx
>
> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period of time in "shrink_page_list" due to
> "ext4_sb_breadahead_unmovable".
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx
> 16 [] __getblk_slow at xxxxxxxxxxxxxxx
> 17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx
> 18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx
> 21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx
> 22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx
> 23 [] ext4_add_entry at xxxxxxxxxxxxxxx
> 24 [] ext4_add_nondir at xxxxxxxxxxxxxxx
> 25 [] ext4_create at xxxxxxxxxxxxxxx
> 26 [] vfs_create at xxxxxxxxxxxxxxx
>
> The function "grow_dev_page" extends the gfp mask with "__GFP_NOFAIL",
> causing longer blocking times:
>
> 	/*
> 	 * XXX: __getblk_slow() can not really deal with failure and
> 	 * will endlessly loop on improvised global reclaim.  Prefer
> 	 * looping in the allocator rather than here, at least that
> 	 * code knows what it's doing.
> 	 */
> 	gfp_mask |= __GFP_NOFAIL;
>
> However, "ext4_sb_breadahead_unmovable" is a prefetch function and
> failures are acceptable.
>
> Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be
> non-blocking:
> - Change gfp to ~__GFP_DIRECT_RECLAIM when ext4_sb_breadahead_unmovable
>   calls sb_getblk_gfp.
> - Modify grow_dev_page so that it is not blocked by folio allocation
>   when gfp is ~__GFP_DIRECT_RECLAIM.
>
> Signed-off-by: Hui Zhu <teawater@antgroup.com>
> ---
>  fs/buffer.c     | 27 +++++++++++++++++++--------
>  fs/ext4/super.c |  3 ++-
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index bd091329026c..330cf19c77b1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1038,6 +1038,7 @@ static sector_t folio_init_buffers(struct folio *folio,
>   * Create the page-cache page that contains the requested block.
>   *
>   * This is used purely for blockdev mappings.
> + * Will not blocking by allocate folio if gfp is ~__GFP_DIRECT_RECLAIM.
>   */
>  static int
>  grow_dev_page(struct block_device *bdev, sector_t block,
> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	int ret = 0;
>  	gfp_t gfp_mask;
>
> -	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> +	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> +	if (gfp == ~__GFP_DIRECT_RECLAIM)
> +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +	else {
> +		gfp_mask |= gfp;
>
> -	/*
> -	 * XXX: __getblk_slow() can not really deal with failure and
> -	 * will endlessly loop on improvised global reclaim.  Prefer
> -	 * looping in the allocator rather than here, at least that
> -	 * code knows what it's doing.
> -	 */
> -	gfp_mask |= __GFP_NOFAIL;
> +		/*
> +		 * XXX: __getblk_slow() can not really deal with failure and
> +		 * will endlessly loop on improvised global reclaim.  Prefer
> +		 * looping in the allocator rather than here, at least that
> +		 * code knows what it's doing.
> +		 */
> +		gfp_mask |= __GFP_NOFAIL;
> +	}
>
>  	folio = __filemap_get_folio(inode->i_mapping, index,
>  			FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
> +	if (IS_ERR(folio)) {
> +		ret = PTR_ERR(folio);
> +		goto out;
> +	}
>
>  	bh = folio_buffers(folio);
>  	if (bh) {
> @@ -1091,6 +1101,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  failed:
>  	folio_unlock(folio);
>  	folio_put(folio);
> +out:
>  	return ret;
>  }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c94ebf704616..6a529509b83b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -254,7 +254,8 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>
>  void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
>  {
> -	struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
> +	struct buffer_head *bh = sb_getblk_gfp(sb, block,
> +			~__GFP_DIRECT_RECLAIM);
>
>  	if (likely(bh)) {
>  		if (trylock_buffer(bh))
> --
> 2.19.1.6.gb485710b
>
On Fri, Aug 11, 2023 at 07:15:19AM +0000, Hui Zhu wrote:
> From: Hui Zhu <teawater@antgroup.com>
>
> This version fixes the gfp flags in the callers instead of threading a
> new "bool" flag through the buffer head layers, following the comments
> from Matthew Wilcox.

FYI, this paragraph should have been below the "---" so it gets excluded
from the commit log.

> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period of time in "shrink_page_list" due to
> "ext4_sb_breadahead_unmovable".
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx

After applying your patch, we'd still get into trouble with
folio_alloc_buffers() also specifying __GFP_NOFAIL.  So I decided to
pass the GFP flags into folio_alloc_buffers() -- see the patch series I
just sent out.

> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	int ret = 0;
>  	gfp_t gfp_mask;
>
> -	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> +	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> +	if (gfp == ~__GFP_DIRECT_RECLAIM)
> +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;

This isn't how we normally use gfp_mask.  OTOH, how buffer.c uses GFP
masks is also a bit weird.  The bdev_getblk() I just added is more
normal.
Please try the patchset I cc'd you on (with the __GFP_ACCOUNT added); I'm currently running it through xfstests and it's holding up fine. I suppose I should play around with memcgs to try to make it happen a bit more often.