Message ID | 20150701133715.GA6287@dhcp22.suse.cz |
---|---|
State | Not Applicable, archived |
Headers | show |
On 07/01/2015 09:37 AM, Michal Hocko wrote: > Fix this issue by limiting the wait to reclaim triggered by __GFP_FS > allocations to make sure we are not called from filesystem paths which > might be doing exactly this kind of IO optimizations. The page fault > path, which is the only path that triggers memcg oom killer since 3.12, > shouldn't require GFP_NOFS and so we shouldn't reintroduce the premature > OOM killer issue which was originally addressed by the heuristic. > > As per David Chinner the xfs is doing similar thing since 2.6.15 already > so ext4 is not the only affected filesystem. Moreover he notes: > : For example: IO completion might require unwritten extent conversion > : which executes filesystem transactions and GFP_NOFS allocations. The > : writeback flag on the pages can not be cleared until unwritten > : extent conversion completes. Hence memory reclaim cannot wait on > : page writeback to complete in GFP_NOFS context because it is not > : safe to do so, memcg reclaim or otherwise. I remember fixing something like this back in the 2.2 days. Funny how these bugs keep coming back. > Cc: stable # 3.6+ > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") > Reported-by: Nikolay Borisov <kernel@kyup.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com>
On Wed, Jul 01, 2015 at 03:37:15PM +0200, Michal Hocko wrote: > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 37e90db1520b..6c44d424968e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -995,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > goto keep_locked; > > /* Case 3 above */ > - } else { > + } else if (sc->gfp_mask & __GFP_FS) { > wait_on_page_writeback(page); > } > } Um, I've just taken a closer look at this code now that I'm back from vacation, and I'm not sure this is right. This Case 3 code occurs inside an if (PageWriteback(page)) { ... } conditional, and if I'm not mistaken, if the flow of control exits this conditional, it is assumed that the page is *not* under writeback. This patch will assume the page has been cleaned if __GFP_FS is set, which could lead to a dirty page getting dropped, so I believe this is a bug. No? It would seem to me that a better fix would be to change the Case 2 handling: /* Case 2 above */ } else if (global_reclaim(sc) || - !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { + !PageReclaim(page) || !(sc->gfp_mask & __GFP_FS)) { /* * This is slightly racy - end_page_writeback() * might have just cleared PageReclaim, then * setting PageReclaim here end up interpreted * as PageReadahead - but that does not matter * enough to care. What we do want is for this * page to have PageReclaim set next time memcg * reclaim reaches the tests above, so it will * then wait_on_page_writeback() to avoid OOM; * and it's also appropriate in global reclaim. */ SetPageReclaim(page); nr_writeback++; goto keep_locked; Am I missing something? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/mm/vmscan.c b/mm/vmscan.c index 37e90db1520b..6c44d424968e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -995,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto keep_locked; /* Case 3 above */ - } else { + } else if (sc->gfp_mask & __GFP_FS) { wait_on_page_writeback(page); } }