[8/8] Revert "ext4: fix wrong gfp type under transaction"
diff mbox

Message ID 20170119083956.GE30786@dhcp22.suse.cz
State New
Headers show

Commit Message

Michal Hocko Jan. 19, 2017, 8:39 a.m. UTC
On Tue 17-01-17 18:29:25, Jan Kara wrote:
> On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > But before going to play with that I am really wondering whether we need
> > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > point of view. What would cause a deadlock in no journal mode?
> > > 
> > > We still have the original problem for why we need GFP_NOFS even in
> > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > we don't want to recurse back into the file system's writeback path.
> > 
> > But we do not enter the writeback path from the direct reclaim. Or do
> > you mean something other than pageout()'s mapping->a_ops->writepage?
> > There is only try_to_release_page where we get back to the filesystems
> > but I do not see any NOFS protection in ext4_releasepage.
> 
> Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> shrinkers. That's it. So the recursion possibilities are rather more limited
> than they used to be several years ago and we likely do not need as much
> GFP_NOFS protection as we used to.

Thanks for making my remark more clear Jack! I would just want to add
that I was playing with the patch below (it is basically
GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
debugging patch which means they are called from within transaction) and
it didn't hit the lockdep when running xfstests both with or without the
enabled journal.

So am I still missing something or the nojournal mode is safe and the
current series is OK wrt. ext*?

The following patch in its current form is WIP and needs a proper review
before I post it.
---
 fs/ext4/inode.c       |  4 ++--
 fs/ext4/mballoc.c     | 14 +++++++-------
 fs/ext4/xattr.c       |  2 +-
 fs/jbd2/journal.c     |  4 ++--
 fs/jbd2/revoke.c      |  2 +-
 fs/jbd2/transaction.c |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

Comments

Jan Kara Jan. 19, 2017, 9:22 a.m. UTC | #1
On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > But before going to play with that I am really wondering whether we need
> > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > point of view. What would cause a deadlock in no journal mode?
> > > > 
> > > > We still have the original problem for why we need GFP_NOFS even in
> > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > we don't want to recurse back into the file system's writeback path.
> > > 
> > > But we do not enter the writeback path from the direct reclaim. Or do
> > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > There is only try_to_release_page where we get back to the filesystems
> > > but I do not see any NOFS protection in ext4_releasepage.
> > 
> > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > shrinkers. That's it. So the recursion possibilities are rather more limited
> > than they used to be several years ago and we likely do not need as much
> > GFP_NOFS protection as we used to.
> 
> Thanks for making my remark more clear Jack! I would just want to add
> that I was playing with the patch below (it is basically
> GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> debugging patch which means they are called from within transaction) and
> it didn't hit the lockdep when running xfstests both with or without the
> enabled journal.
> 
> So am I still missing something or the nojournal mode is safe and the
> current series is OK wrt. ext*?

I'm convinced the current series is OK, only real life will tell us whether
we missed something or not ;)

> The following patch in its current form is WIP and needs a proper review
> before I post it.

So jbd2 changes look confusing (although technically correct) to me - we
*always* should run in NOFS context in those place so having GFP_KERNEL
there looks like it is unnecessarily hiding what is going on. So in those
places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
allocations are expected to be GFP_NOFS (and assert that). Otherwise the
changes look good to me.

								Honza

> ---
>  fs/ext4/inode.c       |  4 ++--
>  fs/ext4/mballoc.c     | 14 +++++++-------
>  fs/ext4/xattr.c       |  2 +-
>  fs/jbd2/journal.c     |  4 ++--
>  fs/jbd2/revoke.c      |  2 +-
>  fs/jbd2/transaction.c |  2 +-
>  6 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b7d141c3b810..841cb8c4cb5e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
>  		return __ext4_journalled_writepage(page, len);
>  
>  	ext4_io_submit_init(&io_submit, wbc);
> -	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>  	if (!io_submit.io_end) {
>  		redirty_page_for_writepage(wbc, page);
>  		unlock_page(page);
> @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	int err = 0;
>  
>  	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
> -				   mapping_gfp_constraint(mapping, ~__GFP_FS));
> +				   mapping_gfp_mask(mapping));
>  	if (!page)
>  		return -ENOMEM;
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d9fd184b049e..67b97cd6e3d6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
>  static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
>  			      struct ext4_buddy *e4b)
>  {
> -	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
> +	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
>  }
>  
>  static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
> @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  
>  	/* We only do this if the grp has never been initialized */
>  	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> -		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> +		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
>  		if (ret)
>  			return ret;
>  	}
> @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
>  	BUG_ON(ext4_pspace_cachep == NULL);
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  		}
>  	}
>  
> -	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
> +	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
>  	if (!ac) {
>  		ar->len = 0;
>  		*errp = -ENOMEM;
> @@ -4813,7 +4813,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  
>  	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
>  	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
> -				     GFP_NOFS|__GFP_NOFAIL);
> +				     GFP_KERNEL|__GFP_NOFAIL);
>  	if (err)
>  		goto error_return;
>  
> @@ -4832,7 +4832,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  		 * to fail.
>  		 */
>  		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
> -				GFP_NOFS|__GFP_NOFAIL);
> +				GFP_KERNEL|__GFP_NOFAIL);
>  		new_entry->efd_start_cluster = bit;
>  		new_entry->efd_group = block_group;
>  		new_entry->efd_count = count_clusters;
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 172317462238..f68e8c87f9f2 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1650,7 +1650,7 @@ ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
>  		       EXT4_XATTR_REFCOUNT_MAX;
>  	int error;
>  
> -	error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash,
> +	error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash,
>  				      bh->b_blocknr, reusable);
>  	if (error) {
>  		if (error == -EBUSY)
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 3a449150f834..bd29daa975a5 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -379,7 +379,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  	 */
>  	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
>  
> -	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> +	new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL);
>  
>  	/* keep subsequent assertions sane */
>  	atomic_set(&new_bh->b_count, 1);
> @@ -2375,7 +2375,7 @@ static struct journal_head *journal_alloc_journal_head(void)
>  #ifdef CONFIG_JBD2_DEBUG
>  	atomic_inc(&nr_journal_heads);
>  #endif
> -	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
> +	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL);
>  	if (!ret) {
>  		jbd_debug(1, "out of memory for journal_head\n");
>  		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index cfc38b552118..c9c347468c5b 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -141,7 +141,7 @@ static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr,
>  {
>  	struct list_head *hash_list;
>  	struct jbd2_revoke_record_s *record;
> -	gfp_t gfp_mask = GFP_NOFS;
> +	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	if (journal_oom_retry)
>  		gfp_mask |= __GFP_NOFAIL;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 35a5d3d76182..a7e50eb330a8 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -974,7 +974,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  			JBUFFER_TRACE(jh, "allocate memory for buffer");
>  			jbd_unlock_bh_state(bh);
>  			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
> -						   GFP_NOFS | __GFP_NOFAIL);
> +						   GFP_KERNEL | __GFP_NOFAIL);
>  			goto repeat;
>  		}
>  		jh->b_frozen_data = frozen_buffer;
> -- 
> 2.11.0
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 19, 2017, 9:44 a.m. UTC | #2
On Thu 19-01-17 10:22:36, Jan Kara wrote:
> On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > But before going to play with that I am really wondering whether we need
> > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > 
> > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > we don't want to recurse back into the file system's writeback path.
> > > > 
> > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > There is only try_to_release_page where we get back to the filesystems
> > > > but I do not see any NOFS protection in ext4_releasepage.
> > > 
> > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > shrinkers. That's it. So the recursion possibilities are rather more limited
> > > than they used to be several years ago and we likely do not need as much
> > > GFP_NOFS protection as we used to.
> > 
> > Thanks for making my remark more clear Jack! I would just want to add
> > that I was playing with the patch below (it is basically
> > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > debugging patch which means they are called from within transaction) and
> > it didn't hit the lockdep when running xfstests both with or without the
> > enabled journal.
> > 
> > So am I still missing something or the nojournal mode is safe and the
> > current series is OK wrt. ext*?
> 
> I'm convinced the current series is OK, only real life will tell us whether
> we missed something or not ;)

I would like to extend the changelog of "jbd2: mark the transaction
context with the scope GFP_NOFS context".

"
Please note that setups without journal do not suffer from potential
recursion problems and so they do not need the scope protection because
neither ->releasepage nor ->evict_inode (which are the only fs entry
points from the direct reclaim) can reenter a locked context which is
doing the allocation currently.
"
 
> > The following patch in its current form is WIP and needs a proper review
> > before I post it.
> 
> So jbd2 changes look confusing (although technically correct) to me - we
> *always* should run in NOFS context in those place so having GFP_KERNEL
> there looks like it is unnecessarily hiding what is going on. So in those
> places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
> allocations are expected to be GFP_NOFS (and assert that). Otherwise the
> changes look good to me.

I would really like to get rid most of NOFS direct usage and only
dictate it via the scope API otherwise I suspect we will just grow more
users and end up in the same situation as we are now currently over time.
In principle only the context which changes the reclaim reentrancy policy
should care about NOFS and everybody else should just pretend nothing
like that exists. There might be few exceptions of course, I am not yet
sure whether jbd2 is that case. But I am not proposing this change yet
(thanks for checking anyway)...
Michal Hocko Jan. 26, 2017, 7:44 a.m. UTC | #3
On Thu 19-01-17 10:44:05, Michal Hocko wrote:
> On Thu 19-01-17 10:22:36, Jan Kara wrote:
> > On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > > But before going to play with that I am really wondering whether we need
> > > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > > 
> > > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > > we don't want to recurse back into the file system's writeback path.
> > > > > 
> > > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > > There is only try_to_release_page where we get back to the filesystems
> > > > > but I do not see any NOFS protection in ext4_releasepage.
> > > > 
> > > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > > shrinkers. That's it. So the recursion possibilities are rather more limited
> > > > than they used to be several years ago and we likely do not need as much
> > > > GFP_NOFS protection as we used to.
> > > 
> > > Thanks for making my remark more clear Jack! I would just want to add
> > > that I was playing with the patch below (it is basically
> > > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > > debugging patch which means they are called from within transaction) and
> > > it didn't hit the lockdep when running xfstests both with or without the
> > > enabled journal.
> > > 
> > > So am I still missing something or the nojournal mode is safe and the
> > > current series is OK wrt. ext*?
> > 
> > I'm convinced the current series is OK, only real life will tell us whether
> > we missed something or not ;)
> 
> I would like to extend the changelog of "jbd2: mark the transaction
> context with the scope GFP_NOFS context".
> 
> "
> Please note that setups without journal do not suffer from potential
> recursion problems and so they do not need the scope protection because
> neither ->releasepage nor ->evict_inode (which are the only fs entry
> points from the direct reclaim) can reenter a locked context which is
> doing the allocation currently.
> "

Could you comment on this Ted, please?
Theodore Y. Ts'o Jan. 27, 2017, 6:13 a.m. UTC | #4
On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > I'm convinced the current series is OK, only real life will tell us whether
> > > we missed something or not ;)
> > 
> > I would like to extend the changelog of "jbd2: mark the transaction
> > context with the scope GFP_NOFS context".
> > 
> > "
> > Please note that setups without journal do not suffer from potential
> > recursion problems and so they do not need the scope protection because
> > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > points from the direct reclaim) can reenter a locked context which is
> > doing the allocation currently.
> > "
> 
> Could you comment on this Ted, please?

I guess....   so there still is one way this could screw us, and it's this reason for GFP_NOFS:

        - to prevent from stack overflows during the reclaim because
	          the allocation is performed from a deep context already

The writepages call stack can be pretty deep.  (Especially if we're
using ext4 in no journal mode over, say, iSCSI.)

How much stack space can get consumed by a reclaim?

						- Ted
Michal Hocko Jan. 27, 2017, 9:37 a.m. UTC | #5
On Fri 27-01-17 01:13:18, Theodore Ts'o wrote:
> On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > > I'm convinced the current series is OK, only real life will tell us whether
> > > > we missed something or not ;)
> > > 
> > > I would like to extend the changelog of "jbd2: mark the transaction
> > > context with the scope GFP_NOFS context".
> > > 
> > > "
> > > Please note that setups without journal do not suffer from potential
> > > recursion problems and so they do not need the scope protection because
> > > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > > points from the direct reclaim) can reenter a locked context which is
> > > doing the allocation currently.
> > > "
> > 
> > Could you comment on this Ted, please?
> 
> I guess....   so there still is one way this could screw us, and it's this reason for GFP_NOFS:
> 
>         - to prevent from stack overflows during the reclaim because
> 	          the allocation is performed from a deep context already
> 
> The writepages call stack can be pretty deep.  (Especially if we're
> using ext4 in no journal mode over, say, iSCSI.)
> 
> How much stack space can get consumed by a reclaim?

./scripts/stackusage with allyesconfig says:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node     	352     static
./mm/vmscan.c:2317      shrink_node_memcg       560     static
./mm/vmscan.c:1692      shrink_inactive_list    688     static
./mm/vmscan.c:908       shrink_page_list        608     static

So this would be 3512 for the standard LRUs reclaim whether we have
GFP_FS or not. shrink_page_list can recurse to releasepage but there is
no NOFS protection there so it doesn't make much sense to check this
path. So we are left with the slab shrinkers path

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node     	352     static
./mm/vmscan.c:427       shrink_slab     	336     static
./fs/super.c:56 	super_cache_scan        104     static << here we have the NOFS protection
./fs/dcache.c:1089      prune_dcache_sb 	152     static
./fs/dcache.c:939       shrink_dentry_list      96      static
./fs/dcache.c:509       __dentry_kill   	72      static
./fs/dcache.c:323       dentry_unlink_inode     64      static
./fs/inode.c:1527       iput    		80      static
./fs/inode.c:532        evict   		72      static

This is where the fs specific callbacks play role and I am not sure
which paths can pass through for ext4 in the nojournal mode and how much
of the stack this can eat. But currently we are at +536 wrt. NOFS
context. This is quite a lot but still much less (2632 vs. 3512) than
the regular reclaim. So there is quite some stack space to eat... I am
wondering whether we have to really treat nojournal mode any special
just because of the stack usage?

If this ever turn out to be a problem and with the vmapped stacks we
have good chances to get a proper stack traces on a potential overflow
we can add the scope API around the problematic code path with the
explanation why it is needed.

Does that make sense to you?

Thanks!
Theodore Y. Ts'o Jan. 27, 2017, 4:40 p.m. UTC | #6
On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> If this ever turn out to be a problem and with the vmapped stacks we
> have good chances to get a proper stack traces on a potential overflow
> we can add the scope API around the problematic code path with the
> explanation why it is needed.

Yeah, or maybe we can automate it?  Can the reclaim code check how
much stack space is left and do the right thing automatically?

The reason why I'm nervous is that nojournal mode is not a common
configuration, and "wait until production systems start failing" is
not a strategy that I or many SRE-types find.... comforting.

So if we can assure ourselves that the right thing will happen
automatically, or that lockdep will detect a required GFP_NOFS when
running tests, the happier I'll be.

					- Ted
Christoph Hellwig Jan. 28, 2017, 7:32 a.m. UTC | #7
On Fri, Jan 27, 2017 at 11:40:42AM -0500, Theodore Ts'o wrote:
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find.... comforting.

What does SRE stand for?
David Lang Jan. 28, 2017, 8:17 a.m. UTC | #8
On Fri, 27 Jan 2017, Christoph Hellwig wrote:

> On Fri, Jan 27, 2017 at 11:40:42AM -0500, Theodore Ts'o wrote:
>> The reason why I'm nervous is that nojournal mode is not a common
>> configuration, and "wait until production systems start failing" is
>> not a strategy that I or many SRE-types find.... comforting.
>
> What does SRE stand for?

Site Reliability Engineer, a mix of operations and engineering (DevOps++)

David Lang
Michal Hocko Jan. 30, 2017, 8:12 a.m. UTC | #9
On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > If this ever turn out to be a problem and with the vmapped stacks we
> > have good chances to get a proper stack traces on a potential overflow
> > we can add the scope API around the problematic code path with the
> > explanation why it is needed.
> 
> Yeah, or maybe we can automate it?  Can the reclaim code check how
> much stack space is left and do the right thing automatically?

I am not sure how to do that. Checking for some magic value sounds quite
fragile to me. It also sounds a bit strange to focus only on the reclaim
while other code paths might suffer from the same problem.

What is actually the deepest possible call chain from the slab reclaim
where I stopped? I have tried to follow that path but hit the callback
wall quite early.
 
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find.... comforting.

I understand that but I would be much more happier if we did the
decision based on the actual data rather than a fear something would
break down.
Michal Hocko Feb. 3, 2017, 3:32 p.m. UTC | #10
On Mon 30-01-17 09:12:10, Michal Hocko wrote:
> On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> > On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > > If this ever turn out to be a problem and with the vmapped stacks we
> > > have good chances to get a proper stack traces on a potential overflow
> > > we can add the scope API around the problematic code path with the
> > > explanation why it is needed.
> > 
> > Yeah, or maybe we can automate it?  Can the reclaim code check how
> > much stack space is left and do the right thing automatically?
> 
> I am not sure how to do that. Checking for some magic value sounds quite
> fragile to me. It also sounds a bit strange to focus only on the reclaim
> while other code paths might suffer from the same problem.
> 
> What is actually the deepest possible call chain from the slab reclaim
> where I stopped? I have tried to follow that path but hit the callback
> wall quite early.
>  
> > The reason why I'm nervous is that nojournal mode is not a common
> > configuration, and "wait until production systems start failing" is
> > not a strategy that I or many SRE-types find.... comforting.
> 
> I understand that but I would be much more happier if we did the
> decision based on the actual data rather than a fear something would
> break down.

ping on this. I would really like to move forward here and target 4.11
merge window. Is your concern so serious to block this patch?

Patch
diff mbox

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b7d141c3b810..841cb8c4cb5e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2085,7 +2085,7 @@  static int ext4_writepage(struct page *page,
 		return __ext4_journalled_writepage(page, len);
 
 	ext4_io_submit_init(&io_submit, wbc);
-	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
 	if (!io_submit.io_end) {
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
@@ -3794,7 +3794,7 @@  static int __ext4_block_zero_page_range(handle_t *handle,
 	int err = 0;
 
 	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
-				   mapping_gfp_constraint(mapping, ~__GFP_FS));
+				   mapping_gfp_mask(mapping));
 	if (!page)
 		return -ENOMEM;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..67b97cd6e3d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1251,7 +1251,7 @@  ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 			      struct ext4_buddy *e4b)
 {
-	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
 }
 
 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
@@ -2054,7 +2054,7 @@  static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
 		if (ret)
 			return ret;
 	}
@@ -3600,7 +3600,7 @@  ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
 
@@ -3694,7 +3694,7 @@  ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
 	BUG_ON(ext4_pspace_cachep == NULL);
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
 
@@ -4479,7 +4479,7 @@  ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		}
 	}
 
-	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
+	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
 	if (!ac) {
 		ar->len = 0;
 		*errp = -ENOMEM;
@@ -4813,7 +4813,7 @@  void ext4_free_blocks(handle_t *handle, struct inode *inode,
 
 	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
 	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
-				     GFP_NOFS|__GFP_NOFAIL);
+				     GFP_KERNEL|__GFP_NOFAIL);
 	if (err)
 		goto error_return;
 
@@ -4832,7 +4832,7 @@  void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		 * to fail.
 		 */
 		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
-				GFP_NOFS|__GFP_NOFAIL);
+				GFP_KERNEL|__GFP_NOFAIL);
 		new_entry->efd_start_cluster = bit;
 		new_entry->efd_group = block_group;
 		new_entry->efd_count = count_clusters;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 172317462238..f68e8c87f9f2 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1650,7 +1650,7 @@  ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 		       EXT4_XATTR_REFCOUNT_MAX;
 	int error;
 
-	error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash,
+	error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash,
 				      bh->b_blocknr, reusable);
 	if (error) {
 		if (error == -EBUSY)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 3a449150f834..bd29daa975a5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -379,7 +379,7 @@  int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
+	new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL);
 
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
@@ -2375,7 +2375,7 @@  static struct journal_head *journal_alloc_journal_head(void)
 #ifdef CONFIG_JBD2_DEBUG
 	atomic_inc(&nr_journal_heads);
 #endif
-	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
+	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL);
 	if (!ret) {
 		jbd_debug(1, "out of memory for journal_head\n");
 		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index cfc38b552118..c9c347468c5b 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -141,7 +141,7 @@  static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr,
 {
 	struct list_head *hash_list;
 	struct jbd2_revoke_record_s *record;
-	gfp_t gfp_mask = GFP_NOFS;
+	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (journal_oom_retry)
 		gfp_mask |= __GFP_NOFAIL;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 35a5d3d76182..a7e50eb330a8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -974,7 +974,7 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
 			JBUFFER_TRACE(jh, "allocate memory for buffer");
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS | __GFP_NOFAIL);
+						   GFP_KERNEL | __GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;