diff mbox

[3.16.y-ckt,stable] Patch "mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress" has been added to staging queue

Message ID 1452882191-20977-1-git-send-email-luis.henriques@canonical.com
State New
Headers show

Commit Message

Luis Henriques Jan. 15, 2016, 6:23 p.m. UTC
This is a note to let you know that I have just added a patch titled

    mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress

to the linux-3.16.y-queue branch of the 3.16.y-ckt extended stable tree 
which can be found at:

    http://kernel.ubuntu.com/git/ubuntu/linux.git/log/?h=linux-3.16.y-queue

This patch is scheduled to be released in version 3.16.7-ckt23.

If you, or anyone else, feels it should not be added to this tree, please 
reply to this email.

For more information about the 3.16.y-ckt tree, see
https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable

Thanks.
-Luis

---8<------------------------------------------------------------

From d7cea155b3080f3c29616bbe8b7743a6948c973f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 11 Dec 2015 13:40:32 -0800
Subject: mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't
 make any progress

commit 373ccbe5927034b55bdc80b0f8b54d6e13fe8d12 upstream.

Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.

The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context.  If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much.  WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation.  This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.

In order to fix this issue we need to do two things.  First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep.  In order to prevent from issues handled by 0e093d99763e
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
[ luis: backported to 3.16, based on Ben's backport to 3.2:
  - use schedule_delayed_work instead of queue_delayed_work_on function
    vmstat_update()
  - change start_cpu_timer() instead of vmstat_shepherd()
  - adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
---
 mm/backing-dev.c | 19 ++++++++++++++++---
 mm/vmstat.c      |  6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

Comments

Ben Hutchings Jan. 17, 2016, 1:08 a.m. UTC | #1
On Fri, 2016-01-15 at 18:23 +0000, Luis Henriques wrote:
> This is a note to let you know that I have just added a patch titled
> 
>     mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
> 
> to the linux-3.16.y-queue branch of the 3.16.y-ckt extended stable tree 
> which can be found at:
> 
>     http://kernel.ubuntu.com/git/ubuntu/linux.git/log/?h=linux-3.16.y-queue
> 
> This patch is scheduled to be released in version 3.16.7-ckt23.
> 
> If you, or anyone else, feels it should not be added to this tree, please 
> reply to this email.
[...]

You also need this fix:

commit 751e5f5c753e8d447bcf89f9e96b9616ac081628
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jan 8 11:18:29 2016 +0100

    vmstat: allocate vmstat_wq before it is used

Ben.
Luis Henriques Jan. 18, 2016, 10:21 p.m. UTC | #2
On Sun, Jan 17, 2016 at 01:08:14AM +0000, Ben Hutchings wrote:
> On Fri, 2016-01-15 at 18:23 +0000, Luis Henriques wrote:
> > This is a note to let you know that I have just added a patch titled
> > 
> >     mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
> > 
> > to the linux-3.16.y-queue branch of the 3.16.y-ckt extended stable tree 
> > which can be found at:
> > 
> >     http://kernel.ubuntu.com/git/ubuntu/linux.git/log/?h=linux-3.16.y-queue
> > 
> > This patch is scheduled to be released in version 3.16.7-ckt23.
> > 
> > If you, or anyone else, feels it should not be added to this tree, please 
> > reply to this email.
> [...]
> 
> You also need this fix:
> 
> commit 751e5f5c753e8d447bcf89f9e96b9616ac081628
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Fri Jan 8 11:18:29 2016 +0100
> 
>     vmstat: allocate vmstat_wq before it is used
> 

Thank you Ben and Tetsuo.  I'll also queue this for the next 3.16
kernel release.

Cheers,
--
Luís

> Ben.
> 
> -- 
> Ben Hutchings
> Theory and practice are closer in theory than in practice.
>                                 - John Levine, moderator of comp.compilers
diff mbox

Patch

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1706cbbdf5f0..035be81fb150 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -611,8 +611,9 @@  EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -632,7 +633,19 @@  long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
 			!zone_is_reclaim_congested(zone)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();

 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49bfd55..7e913ba2b9e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1226,13 +1226,14 @@  static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;

 static void vmstat_update(struct work_struct *w)
 {
 	refresh_cpu_vm_stats();
-	schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+	queue_delayed_work(vmstat_wq, this_cpu_ptr(&vmstat_work),
 		round_jiffies_relative(sysctl_stat_interval));
 }

@@ -1241,7 +1242,7 @@  static void start_cpu_timer(int cpu)
 	struct delayed_work *work = &per_cpu(vmstat_work, cpu);

 	INIT_DEFERRABLE_WORK(work, vmstat_update);
-	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
+	queue_delayed_work_on(cpu, vmstat_wq, work, __round_jiffies_relative(HZ, cpu));
 }

 static void vmstat_cpu_dead(int node)
@@ -1312,6 +1313,7 @@  static int __init setup_vmstat(void)
 		node_set_state(cpu_to_node(cpu), N_CPU);
 	}
 	cpu_notifier_register_done();
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);