diff mbox

How to handle TIF_MEMDIE stalls?

Message ID 20150221011907.2d26c979.akpm@linux-foundation.org
State Not Applicable, archived
Headers show

Commit Message

Andrew Morton Feb. 21, 2015, 9:19 a.m. UTC
On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:

> +akpm

I was hoping not to have to read this thread ;)

afaict there are two (main) issues:

a) whether to oom-kill when __GFP_FS is not set.  The kernel hasn't
   been doing this for ages and nothing has changed recently.

b) whether to keep looping when __GFP_NOFAIL is not set and __GFP_FS
   is not set and we can't oom-kill anything (which goes without
   saying, because __GFP_FS isn't set!).

   And 9879de7373fc ("mm: page_alloc: embed OOM killing naturally
   into allocation slowpath") somewhat inadvertently changed this policy
   - the allocation attempt will now promptly return ENOMEM if
   !__GFP_NOFAIL and !__GFP_FS.

Correct enough?

Question a) seems a bit of red herring and we can park it for now.


What I'm not really understanding is why the pre-3.19 implementation
actually worked.  We've exhausted the free pages, we're not succeeding
at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
*does* work - we eventually find that memory and everything proceeds.

How come?  Where did that memory come from?


Short term, we need to fix 3.19.x and 3.20 and that appears to be by
applying Johannes's akpm-doesnt-know-why-it-works patch:


Have people adequately confirmed that this gets us out of trouble?


And yes, I agree that sites such as xfs's kmem_alloc() should be
passing __GFP_NOFAIL to tell the page allocator what's going on.  I
don't think it matters a lot whether kmem_alloc() retains its retry
loop.  If __GFP_NOFAIL is working correctly then it will never loop
anyway...


Also, this:

On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:

> Right now, the oom killer is a liability. Over the past 6 months
> I've slowly had to exclude filesystem regression tests from running
> on small memory machines because the OOM killer is now so unreliable
> that it kills the test harness regularly rather than the process
> generating memory pressure.

David, I did not know this!  If you've been telling us about this then
perhaps it wasn't loud enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tetsuo Handa Feb. 21, 2015, 1:48 p.m. UTC | #1
Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

Sorry for getting so complicated.

> What I'm not really understanding is why the pre-3.19 implementation
> actually worked.  We've exhausted the free pages, we're not succeeding
> at reclaiming anything, we aren't able to oom-kill anyone.  Yet it
> *does* work - we eventually find that memory and everything proceeds.
> 
> How come?  Where did that memory come from?
> 

Even without __GFP_NOFAIL, GFP_NOFS / GFP_NOIO allocations retried forever
(without invoking the OOM killer) if order <= PAGE_ALLOC_COSTLY_ORDER and
TIF_MEMDIE is not set. Somebody else volunteered that memory while retrying.
This implies silent hang-up forever if nobody volunteers memory.

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath") inadvertently changed GFP_NOFS / GFP_NOIO allocations
not to retry unless __GFP_NOFAIL is specified. Therefore, either applying
Johannes's akpm-doesnt-know-why-it-works patch or passing __GFP_NOFAIL
will restore the pre-3.19 behavior (with possibility of silent hang-up).
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 21, 2015, 9:38 p.m. UTC | #2
On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> On Fri, 20 Feb 2015 22:20:00 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> > +akpm
> 
> I was hoping not to have to read this thread ;)

ditto....

> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

I'm not about to change behaviour "just because". Any sort of change
like this requires a *lot* of low memory regression testing because
we'd be replacing long standing known behaviour with behaviour that
changes without warning. e.g the ext4 low memory failures starting because of
changes made in 3.19-rc6 due to changes in oom-killer behaviour.
Those changes *did not affect XFS* and that's the way I'd like
things to remain.

Put simply: right now I don't trust the mm subsystem to get low memory
behaviour right, and this thread has done nothing to convince me
that it's going to improve any time soon.

> Also, this:
> 
> On Wed, 18 Feb 2015 09:54:30 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Right now, the oom killer is a liability. Over the past 6 months
> > I've slowly had to exclude filesystem regression tests from running
> > on small memory machines because the OOM killer is now so unreliable
> > that it kills the test harness regularly rather than the process
> > generating memory pressure.
> 
> David, I did not know this!  If you've been telling us about this then
> perhaps it wasn't loud enough.

IME, such bug reports get ignored.

Instead, over the past few months I have been pointing out bugs and
problems in the oom-killer in threads like this because it seems to
be the only way to get any attention to the issues I'm seeing. Bug
reports simply get ignored.  From this process, I've managed to
learn that low order memory allocation now never fails (contrary to
documentation and long standing behavioural expectations) and
pointed out bugs that cause the oom killer to get invoked when the
filesystem is saying "I can handle ENOMEM!" (commit 45f87de ("mm:
get rid of radix tree gfp mask for pagecache_get_page").

And yes, I've definitely mentioned in these discussions that, for
example, xfstests::generic/224 is triggering the oom killer far more
often than it used to on my 1GB RAM vm. The only fix that has been
made recently that's made any difference is 45f87de, so it's a slow
process of raising awareness and trying to ensure things don't get
worse before they get better....

Cheers,

Dave.
Johannes Weiner Feb. 22, 2015, 12:20 a.m. UTC | #3
On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> applying Johannes's akpm-doesnt-know-why-it-works patch:
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for light reclaim */
> -		if (!(gfp_mask & __GFP_FS))
> +		if (!(gfp_mask & __GFP_FS)) {
> +			/*
> +			 * XXX: Page reclaim didn't yield anything,
> +			 * and the OOM killer can't be invoked, but
> +			 * keep looping as per should_alloc_retry().
> +			 */
> +			*did_some_progress = 1;
>  			goto out;
> +		}
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> 
> Have people adequately confirmed that this gets us out of trouble?

I'd be interested in this too.  Who is seeing these failures?

Andrew, can you please use the following changelog for this patch?

---
From: Johannes Weiner <hannes@cmpxchg.org>

mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change

Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
into allocation slowpath"), which should have been a simple cleanup
patch, accidentally changed the behavior to aborting the allocation at
that point.  This creates problems with filesystem callers (?) that
currently rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Feb. 23, 2015, 10:48 a.m. UTC | #4
On Sat 21-02-15 19:20:58, Johannes Weiner wrote:
> On Sat, Feb 21, 2015 at 01:19:07AM -0800, Andrew Morton wrote:
> > Short term, we need to fix 3.19.x and 3.20 and that appears to be by
> > applying Johannes's akpm-doesnt-know-why-it-works patch:
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> >  		/* The OOM killer does not compensate for light reclaim */
> > -		if (!(gfp_mask & __GFP_FS))
> > +		if (!(gfp_mask & __GFP_FS)) {
> > +			/*
> > +			 * XXX: Page reclaim didn't yield anything,
> > +			 * and the OOM killer can't be invoked, but
> > +			 * keep looping as per should_alloc_retry().
> > +			 */
> > +			*did_some_progress = 1;
> >  			goto out;
> > +		}
> >  		/*
> >  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> >  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > 
> > Have people adequately confirmed that this gets us out of trouble?
> 
> I'd be interested in this too.  Who is seeing these failures?
> 
> Andrew, can you please use the following changelog for this patch?
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.

OK, if this a _short term_ change. I really think that all the requests
except for __GFP_NOFAIL should be able to fail. I would argue that it
should be the caller who should be fixed but it is true that the patch
was introduced too late (rc7) and so it caught other subsystems
unprepared so backporting to stable makes sense to me. But can we please
move on and stop pretending that allocations do not fail for the
upcoming release?

> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>
David Rientjes Feb. 23, 2015, 9:33 p.m. UTC | #5
On Sat, 21 Feb 2015, Johannes Weiner wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
> 
> Historically, !__GFP_FS allocations were not allowed to invoke the OOM
> killer once reclaim had failed, but nevertheless kept looping in the
> allocator.  9879de7373fc ("mm: page_alloc: embed OOM killing naturally
> into allocation slowpath"), which should have been a simple cleanup
> patch, accidentally changed the behavior to aborting the allocation at
> that point.  This creates problems with filesystem callers (?) that
> currently rely on the allocator waiting for other tasks to intervene.
> 
> Revert the behavior as it shouldn't have been changed as part of a
> cleanup patch.
> 
> Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Cc: stable@vger.kernel.org [3.19]
Acked-by: David Rientjes <rientjes@google.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2382,8 +2382,15 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
-		if (!(gfp_mask & __GFP_FS))
+		if (!(gfp_mask & __GFP_FS)) {
+			/*
+			 * XXX: Page reclaim didn't yield anything,
+			 * and the OOM killer can't be invoked, but
+			 * keep looping as per should_alloc_retry().
+			 */
+			*did_some_progress = 1;
 			goto out;
+		}
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.