diff mbox series

ext4_sb_breadahead_unmovable: Change to be no-blocking

Message ID 20230811071519.1094-1-teawaterz@linux.alibaba.com
State Rejected
Headers show
Series ext4_sb_breadahead_unmovable: Change to be no-blocking | expand

Commit Message

Hui Zhu Aug. 11, 2023, 7:15 a.m. UTC
From: Hui Zhu <teawater@antgroup.com>

This version fix the gfp flags in the callers instead of working this
new "bool" flag through the buffer head layers according to the comments
from Matthew Wilcox.

Encountered an issue where a large number of filesystem reads and writes
occurred suddenly within a container.  At the same time, other tasks on
the same host that were performing filesystem read and write operations
became blocked.  It was observed that many of the blocked tasks were
blocked on the ext4 journal lock. For example:
PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread"
0 [] __schedule at xxxxxxxxxxxxxxx
1 [] schedule at xxxxxxxxxxxxxxx
2 [] wait_transaction_locked at xxxxxxxxxxxxxxx
3 [] add_transaction_credits at xxxxxxxxxxxxxxx
4 [] start_this_handle at xxxxxxxxxxxxxxx
5 [] jbd2__journal_start at xxxxxxxxxxxxxxx
6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx
7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx
8 [] mark_inode_dirty at xxxxxxxxxxxxxxx
9 [] generic_update_time at xxxxxxxxxxxxxxx

Meanwhile, it was observed that the task holding the ext4 journal lock
was blocked for an extended period of time on "shrink_page_list" due to
"ext4_sb_breadahead_unmovable".
0 [] __schedule at xxxxxxxxxxxxxxx
1 [] _cond_resched at xxxxxxxxxxxxxxx
2 [] shrink_page_list at xxxxxxxxxxxxxxx
3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
4 [] shrink_lruvec at xxxxxxxxxxxxxxx
5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
6 [] shrink_node at xxxxxxxxxxxxxxx
7 [] shrink_zones at xxxxxxxxxxxxxxx
8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
10 [] try_charge at xxxxxxxxxxxxxxx
11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
14 [] pagecache_get_page at xxxxxxxxxxxxxxx
15 [] grow_dev_page at xxxxxxxxxxxxxxx
16 [] __getblk_slow at xxxxxxxxxxxxxxx
17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx
18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx
19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx
20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx
21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx
22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx
23 [] ext4_add_entry at xxxxxxxxxxxxxxx
24 [] ext4_add_nondir at xxxxxxxxxxxxxxx
25 [] ext4_create at xxxxxxxxxxxxxxx
26 [] vfs_create at xxxxxxxxxxxxxxx

The function "grow_dev_page" increased the gfp mask with "__GFP_NOFAIL",
causing longer blocking times.
	/*
	 * XXX: __getblk_slow() can not really deal with failure and
	 * will endlessly loop on improvised global reclaim.  Prefer
	 * looping in the allocator rather than here, at least that
	 * code knows what it's doing.
	 */
	gfp_mask |= __GFP_NOFAIL;
However, "ext4_sb_breadahead_unmovable" is a prefetch function and
failures are acceptable.

Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be
non-blocking.
Change gfp to ~__GFP_DIRECT_RECLAIM when ext4_sb_breadahead_unmovable
calls sb_getblk_gfp.
Modify grow_dev_page to will not be blocked by the allocation of folio
if gfp is ~__GFP_DIRECT_RECLAIM.

Signed-off-by: Hui Zhu <teawater@antgroup.com>
---
 fs/buffer.c     | 27 +++++++++++++++++++--------
 fs/ext4/super.c |  3 ++-
 2 files changed, 21 insertions(+), 9 deletions(-)

Comments

Hui Zhu Aug. 11, 2023, 7:21 a.m. UTC | #1
Oops, this is the version 2 of the patch.

Best,
Hui

Hui Zhu <teawaterz@linux.alibaba.com> 于2023年8月11日周五 15:15写道:
>
> From: Hui Zhu <teawater@antgroup.com>
>
> This version fix the gfp flags in the callers instead of working this
> new "bool" flag through the buffer head layers according to the comments
> from Matthew Wilcox.
>
> Encountered an issue where a large number of filesystem reads and writes
> occurred suddenly within a container.  At the same time, other tasks on
> the same host that were performing filesystem read and write operations
> became blocked.  It was observed that many of the blocked tasks were
> blocked on the ext4 journal lock. For example:
> PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread"
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] schedule at xxxxxxxxxxxxxxx
> 2 [] wait_transaction_locked at xxxxxxxxxxxxxxx
> 3 [] add_transaction_credits at xxxxxxxxxxxxxxx
> 4 [] start_this_handle at xxxxxxxxxxxxxxx
> 5 [] jbd2__journal_start at xxxxxxxxxxxxxxx
> 6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx
> 7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx
> 8 [] mark_inode_dirty at xxxxxxxxxxxxxxx
> 9 [] generic_update_time at xxxxxxxxxxxxxxx
>
> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period of time on "shrink_page_list" due to
> "ext4_sb_breadahead_unmovable".
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx
> 16 [] __getblk_slow at xxxxxxxxxxxxxxx
> 17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx
> 18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx
> 21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx
> 22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx
> 23 [] ext4_add_entry at xxxxxxxxxxxxxxx
> 24 [] ext4_add_nondir at xxxxxxxxxxxxxxx
> 25 [] ext4_create at xxxxxxxxxxxxxxx
> 26 [] vfs_create at xxxxxxxxxxxxxxx
>
> The function "grow_dev_page" increased the gfp mask with "__GFP_NOFAIL",
> causing longer blocking times.
>         /*
>          * XXX: __getblk_slow() can not really deal with failure and
>          * will endlessly loop on improvised global reclaim.  Prefer
>          * looping in the allocator rather than here, at least that
>          * code knows what it's doing.
>          */
>         gfp_mask |= __GFP_NOFAIL;
> However, "ext4_sb_breadahead_unmovable" is a prefetch function and
> failures are acceptable.
>
> Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be
> non-blocking.
> Change gfp to ~__GFP_DIRECT_RECLAIM when ext4_sb_breadahead_unmovable
> calls sb_getblk_gfp.
> Modify grow_dev_page to will not be blocked by the allocation of folio
> if gfp is ~__GFP_DIRECT_RECLAIM.
>
> Signed-off-by: Hui Zhu <teawater@antgroup.com>
> ---
>  fs/buffer.c     | 27 +++++++++++++++++++--------
>  fs/ext4/super.c |  3 ++-
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index bd091329026c..330cf19c77b1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1038,6 +1038,7 @@ static sector_t folio_init_buffers(struct folio *folio,
>   * Create the page-cache page that contains the requested block.
>   *
>   * This is used purely for blockdev mappings.
> + * Will not blocking by allocate folio if gfp is ~__GFP_DIRECT_RECLAIM.
>   */
>  static int
>  grow_dev_page(struct block_device *bdev, sector_t block,
> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>         int ret = 0;
>         gfp_t gfp_mask;
>
> -       gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> +       gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> +       if (gfp == ~__GFP_DIRECT_RECLAIM)
> +               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +       else {
> +               gfp_mask |= gfp;
>
> -       /*
> -        * XXX: __getblk_slow() can not really deal with failure and
> -        * will endlessly loop on improvised global reclaim.  Prefer
> -        * looping in the allocator rather than here, at least that
> -        * code knows what it's doing.
> -        */
> -       gfp_mask |= __GFP_NOFAIL;
> +               /*
> +                * XXX: __getblk_slow() can not really deal with failure and
> +                * will endlessly loop on improvised global reclaim.  Prefer
> +                * looping in the allocator rather than here, at least that
> +                * code knows what it's doing.
> +                */
> +               gfp_mask |= __GFP_NOFAIL;
> +       }
>
>         folio = __filemap_get_folio(inode->i_mapping, index,
>                         FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
> +       if (IS_ERR(folio)) {
> +               ret = PTR_ERR(folio);
> +               goto out;
> +       }
>
>         bh = folio_buffers(folio);
>         if (bh) {
> @@ -1091,6 +1101,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  failed:
>         folio_unlock(folio);
>         folio_put(folio);
> +out:
>         return ret;
>  }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c94ebf704616..6a529509b83b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -254,7 +254,8 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>
>  void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
>  {
> -       struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
> +       struct buffer_head *bh = sb_getblk_gfp(sb, block,
> +                                              ~__GFP_DIRECT_RECLAIM);
>
>         if (likely(bh)) {
>                 if (trylock_buffer(bh))
> --
> 2.19.1.6.gb485710b
>
Matthew Wilcox Aug. 11, 2023, 4:49 p.m. UTC | #2
On Fri, Aug 11, 2023 at 07:15:19AM +0000, Hui Zhu wrote:
> From: Hui Zhu <teawater@antgroup.com>
> 
> This version fix the gfp flags in the callers instead of working this
> new "bool" flag through the buffer head layers according to the comments
> from Matthew Wilcox.

FYI, this paragraph should have been below the --- so it gets excluded
from the commit log.

> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period of time on "shrink_page_list" due to
> "ext4_sb_breadahead_unmovable".
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx

After applying your patch, we'd still get into trouble with
folio_alloc_buffers() also specifying __GFP_NOWAIT.  So I decided
to pass the GFP flags into folio_alloc_buffers() -- see the patch
series I just sent out.

> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	int ret = 0;
>  	gfp_t gfp_mask;
>  
> -	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> +	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> +	if (gfp == ~__GFP_DIRECT_RECLAIM)
> +		gfp_mask &= ~__GFP_DIRECT_RECLAIM;

This isn't how we normally use gfp_mask.  OTOH, how buffer.c uses GFP
masks is also a bit weird.  The bdev_getblk() I just added is more
normal.

Please try the patchset I cc'd you on (with the __GFP_ACCOUNT added);
I'm currently running it through xfstests and it's holding up fine.
I suppose I should play around with memcgs to try to make it happen a
bit more often.
diff mbox series

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index bd091329026c..330cf19c77b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1038,6 +1038,7 @@  static sector_t folio_init_buffers(struct folio *folio,
  * Create the page-cache page that contains the requested block.
  *
  * This is used purely for blockdev mappings.
+ * Will not blocking by allocate folio if gfp is ~__GFP_DIRECT_RECLAIM.
  */
 static int
 grow_dev_page(struct block_device *bdev, sector_t block,
@@ -1050,18 +1051,27 @@  grow_dev_page(struct block_device *bdev, sector_t block,
 	int ret = 0;
 	gfp_t gfp_mask;
 
-	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
+	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
+	if (gfp == ~__GFP_DIRECT_RECLAIM)
+		gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+	else {
+		gfp_mask |= gfp;
 
-	/*
-	 * XXX: __getblk_slow() can not really deal with failure and
-	 * will endlessly loop on improvised global reclaim.  Prefer
-	 * looping in the allocator rather than here, at least that
-	 * code knows what it's doing.
-	 */
-	gfp_mask |= __GFP_NOFAIL;
+		/*
+		 * XXX: __getblk_slow() can not really deal with failure and
+		 * will endlessly loop on improvised global reclaim.  Prefer
+		 * looping in the allocator rather than here, at least that
+		 * code knows what it's doing.
+		 */
+		gfp_mask |= __GFP_NOFAIL;
+	}
 
 	folio = __filemap_get_folio(inode->i_mapping, index,
 			FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
+	if (IS_ERR(folio)) {
+		ret = PTR_ERR(folio);
+		goto out;
+	}
 
 	bh = folio_buffers(folio);
 	if (bh) {
@@ -1091,6 +1101,7 @@  grow_dev_page(struct block_device *bdev, sector_t block,
 failed:
 	folio_unlock(folio);
 	folio_put(folio);
+out:
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c94ebf704616..6a529509b83b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -254,7 +254,8 @@  struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
 
 void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
 {
-	struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
+	struct buffer_head *bh = sb_getblk_gfp(sb, block,
+					       ~__GFP_DIRECT_RECLAIM);
 
 	if (likely(bh)) {
 		if (trylock_buffer(bh))