
workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue

Message ID 20160128101210.GC6357@twins.programming.kicks-ass.net
State Not Applicable, archived

Commit Message

Peter Zijlstra Jan. 28, 2016, 10:12 a.m. UTC
On Tue, Jan 26, 2016 at 06:38:43PM +0100, Thierry Reding wrote:
> > Task or work item involved in memory reclaim trying to flush a
> > non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> > deadlock.  Trigger WARN_ONCE() if such conditions are detected.
> I've started noticing the following during boot on some of the devices I
> work with:
> 
> [    4.723705] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
> [    4.736818] workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
> [    4.748099] Modules linked in:
> [    4.751342] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.5.0-rc1-00018-g420fc292d9c7 #1
> [    4.759504] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> [    4.765762] Workqueue: deferwq deferred_probe_work_func
> [    4.771004] [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
> [    4.778746] [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
> [    4.785966] [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
> [    4.794048] [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
> [    4.802736] [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
> [    4.811769] [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
> [    4.820112] [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
> [    4.828110] [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)

Right, also, I think it makes sense to do lru_add_drain_all() from a
WQ_MEM_RECLAIM workqueue; it is, after all, aiding in getting memory
freed.

Does something like the below cure things?

TJ does this make sense to you?

---
 mm/swap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Thierry Reding Jan. 28, 2016, 12:47 p.m. UTC | #1
On Thu, Jan 28, 2016 at 11:12:10AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 06:38:43PM +0100, Thierry Reding wrote:
> > > Task or work item involved in memory reclaim trying to flush a
> > > non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> > > deadlock.  Trigger WARN_ONCE() if such conditions are detected.
> > I've started noticing the following during boot on some of the devices I
> > work with:
> > 
> > [    4.723705] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
> > [    4.736818] workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
> > [    4.748099] Modules linked in:
> > [    4.751342] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.5.0-rc1-00018-g420fc292d9c7 #1
> > [    4.759504] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> > [    4.765762] Workqueue: deferwq deferred_probe_work_func
> > [    4.771004] [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
> > [    4.778746] [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
> > [    4.785966] [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
> > [    4.794048] [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
> > [    4.802736] [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
> > [    4.811769] [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
> > [    4.820112] [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
> > [    4.828110] [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
> 
> Right, also, I think it makes sense to do lru_add_drain_all() from a
> WQ_MEM_RECLAIM workqueue, it is, after all, aiding in getting memory
> freed.
> 
> Does something like the below cure things?
> 
> TJ does this make sense to you?
> 
> ---
>  mm/swap.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 09fe5e97714a..a3de016b2a9d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -666,6 +666,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>  
>  static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
>  
> +static struct workqueue_struct *lru_wq;
> +
> +static int __init lru_init(void)
> +{
> +	lru_wq = create_workqueue("lru");
> +	return 0;
> +}
> +early_initcall(lru_init);
> +
>  void lru_add_drain_all(void)
>  {
>  	static DEFINE_MUTEX(lock);
> @@ -685,7 +694,7 @@ void lru_add_drain_all(void)
>  		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
> -			schedule_work_on(cpu, work);
> +			queue_work_on(cpu, &lru_wq, work);
                                           ^

This ampersand is too much here and causes a compile-time warning.
Removing it and booting the resulting kernel doesn't trigger the
WQ_MEM_RECLAIM warning anymore, though.

Tested on top of next-20160128.

Thanks,
Thierry
Thierry Reding Jan. 28, 2016, 12:48 p.m. UTC | #2
On Thu, Jan 28, 2016 at 01:47:00PM +0100, Thierry Reding wrote:
> On Thu, Jan 28, 2016 at 11:12:10AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 26, 2016 at 06:38:43PM +0100, Thierry Reding wrote:
> > > > Task or work item involved in memory reclaim trying to flush a
> > > > non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> > > > deadlock.  Trigger WARN_ONCE() if such conditions are detected.
> > > I've started noticing the following during boot on some of the devices I
> > > work with:
> > > 
> > > [    4.723705] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
> > > [    4.736818] workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
> > > [    4.748099] Modules linked in:
> > > [    4.751342] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.5.0-rc1-00018-g420fc292d9c7 #1
> > > [    4.759504] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> > > [    4.765762] Workqueue: deferwq deferred_probe_work_func
> > > [    4.771004] [<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
> > > [    4.778746] [<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
> > > [    4.785966] [<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
> > > [    4.794048] [<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
> > > [    4.802736] [<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
> > > [    4.811769] [<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
> > > [    4.820112] [<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
> > > [    4.828110] [<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
> > 
> > Right, also, I think it makes sense to do lru_add_drain_all() from a
> > WQ_MEM_RECLAIM workqueue, it is, after all, aiding in getting memory
> > freed.
> > 
> > Does something like the below cure things?
> > 
> > TJ does this make sense to you?
> > 
> > ---
> >  mm/swap.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 09fe5e97714a..a3de016b2a9d 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -666,6 +666,15 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >  
> >  static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
> >  
> > +static struct workqueue_struct *lru_wq;
> > +
> > +static int __init lru_init(void)
> > +{
> > +	lru_wq = create_workqueue("lru");
> > +	return 0;
> > +}
> > +early_initcall(lru_init);
> > +
> >  void lru_add_drain_all(void)
> >  {
> >  	static DEFINE_MUTEX(lock);
> > @@ -685,7 +694,7 @@ void lru_add_drain_all(void)
> >  		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> >  		    need_activate_page_drain(cpu)) {
> >  			INIT_WORK(work, lru_add_drain_per_cpu);
> > -			schedule_work_on(cpu, work);
> > +			queue_work_on(cpu, &lru_wq, work);
>                                            ^
> 
> This ampersand is too much here and causes a compile-time warning.
> Removing it and booting the resulting kernel doesn't trigger the
> WQ_MEM_RECLAIM warning anymore, though.
> 
> Tested on top of next-20160128.

This implies that if you want to turn this into a proper patch:

Tested-by: Thierry Reding <treding@nvidia.com>

Alternatively, if you come up with a different way to fix things,
please let me know and I'll be happy to test again.

Thierry
Tejun Heo Jan. 29, 2016, 11:09 a.m. UTC | #3
Hello, Peter.

On Thu, Jan 28, 2016 at 11:12:10AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 06:38:43PM +0100, Thierry Reding wrote:
> > > Task or work item involved in memory reclaim trying to flush a
> > > non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> > > deadlock.  Trigger WARN_ONCE() if such conditions are detected.
> > I've started noticing the following during boot on some of the devices I
> > work with:
> > 
> > [    4.723705] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
> > [    4.736818] workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
...
> Right, also, I think it makes sense to do lru_add_drain_all() from a
> WQ_MEM_RECLAIM workqueue, it is, after all, aiding in getting memory
> freed.
> 
> Does something like the below cure things?
> 
> TJ does this make sense to you?

The problem here is that deferwq, which has nothing to do with memory
reclaim, is marked with WQ_MEM_RECLAIM because it was created with the
old create_singlethread_workqueue(), which doesn't distinguish
workqueues that may be used on the memory reclaim path and thus has to
mark all of them as needing the forward progress guarantee.  I posted
a patch to disable flush dependency checks on those workqueues, and
there's an Outreachy project to weed out the users of the old
interface, so hopefully this won't be an issue for long.

As for whether lru drain should have WQ_MEM_RECLAIM, I'm not sure.
Cc'ing linux-mm.  The rule here is that any workqueue which is depended
upon during memory reclaim should have WQ_MEM_RECLAIM set.  IOW, if a
work item failing to start execution under memory pressure can prevent
memory from being reclaimed, it should be scheduled on a
WQ_MEM_RECLAIM workqueue.  Would this be the case for
lru_add_drain_per_cpu()?

Thanks.
Peter Zijlstra Jan. 29, 2016, 3:17 p.m. UTC | #4
On Fri, Jan 29, 2016 at 06:09:41AM -0500, Tejun Heo wrote:
>  I posted a patch to disable flush dependency checks on those
> workqueues and there's an Outreachy project to weed out the users of
> the old interface, so
> hopefully this won't be an issue soon.

Will that same project review all workqueue users for the strict per-cpu
stuff, so we can finally kill that weird stuff you do on hotplug?
Tejun Heo Jan. 29, 2016, 6:28 p.m. UTC | #5
Hey, Peter.

On Fri, Jan 29, 2016 at 04:17:39PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 29, 2016 at 06:09:41AM -0500, Tejun Heo wrote:
> >  I posted a patch to disable flush dependency checks on those
> > workqueues and there's an Outreachy project to weed out the users
> > of the old interface, so
> > hopefully this won't be an issue soon.
> 
> Will that same project review all workqueue users for the strict per-cpu
> stuff, so we can finally kill that weird stuff you do on hotplug?

Unfortunately not.  We do want to distinguish work items that are
cpu-affine for correctness from those that are cpu-affine merely as an
optimization; however, making that distinction is unlikely to make the
dynamic worker affinity binding go away.  We can't forcefully shut
down workers which are executing work items that are affine only as an
optimization when the CPU goes down.

Thanks.

Patch

diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..a3de016b2a9d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -666,6 +666,15 @@  static void lru_add_drain_per_cpu(struct work_struct *dummy)
 
 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
 
+static struct workqueue_struct *lru_wq;
+
+static int __init lru_init(void)
+{
+	lru_wq = create_workqueue("lru");
+	return 0;
+}
+early_initcall(lru_init);
+
 void lru_add_drain_all(void)
 {
 	static DEFINE_MUTEX(lock);
@@ -685,7 +694,7 @@  void lru_add_drain_all(void)
 		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-			schedule_work_on(cpu, work);
+			queue_work_on(cpu, &lru_wq, work);
 			cpumask_set_cpu(cpu, &has_work);
 		}
 	}