diff mbox series

[v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root

Message ID 20201022122858.8638-1-rpalethorpe@suse.com
State Not Applicable
Headers show
Series [v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root | expand

Commit Message

Richard Palethorpe Oct. 22, 2020, 12:28 p.m. UTC
When use_hierarchy=1, SLAB objects which outlive their descendant
memcg are moved to their parent memcg where they may be uncharged
because charges are made recursively from leaf to root nodes.

However when use_hierarchy=0, they are reparented directly to root and
charging is not made recursively. Therefor uncharging will result in a
counter underflow on the root memcg, but no other ancestors.

To prevent this, we check whether we are about to uncharge the root
memcg and whether use_hierarchy=0. If this is the case then we skip
uncharging. The root memcg does not have its own objcg, so any objcg
which is uncharging to it must have been reparented.

Note that on the default hierarchy (CGroupV2 now) root always has
use_hierarchy=1. So this only effects CGroupV1.

The warning can be, unreliably, reproduced with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
is present. Although the listed commit in 'fixes' appears to introduce
the bug, I can not reproduce it with just that commit and bisecting
runs into other bugs.

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

Reported-by: ltp@lists.linux.it
Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---

V4: Same as V3, but with hopefully better/correct commit message.

 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Shakeel Butt Oct. 22, 2020, 4:37 p.m. UTC | #1
On Thu, Oct 22, 2020 at 5:29 AM Richard Palethorpe <rpalethorpe@suse.com> wrote:
>
> When use_hierarchy=1, SLAB objects which outlive their descendant
> memcg are moved to their parent memcg where they may be uncharged
> because charges are made recursively from leaf to root nodes.
>
> However when use_hierarchy=0, they are reparented directly to root and
> charging is not made recursively. Therefor uncharging will result in a
> counter underflow on the root memcg, but no other ancestors.
>
> To prevent this, we check whether we are about to uncharge the root
> memcg and whether use_hierarchy=0. If this is the case then we skip
> uncharging. The root memcg does not have its own objcg, so any objcg
> which is uncharging to it must have been reparented.
>
> Note that on the default hierarchy (CGroupV2 now) root always has
> use_hierarchy=1. So this only effects CGroupV1.
>
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
>
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> [   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
> [   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
> [   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
> [   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
> [   12.033059] kthread (kernel/kthread.c:292)
> [   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
> [   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [   12.033357] ---[ end trace 961dbfc01c109d1f ]---
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> Reported-by: ltp@lists.linux.it
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")

This looks fine to me but I would prefer Roman's approach of charging
to root and for use_hierarchy=0, linking them to root.

Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
for 5.11.

What does everyone think?
Roman Gushchin Oct. 22, 2020, 5:25 p.m. UTC | #2
On Thu, Oct 22, 2020 at 09:37:02AM -0700, Shakeel Butt wrote:
> On Thu, Oct 22, 2020 at 5:29 AM Richard Palethorpe <rpalethorpe@suse.com> wrote:
> >
> > When use_hierarchy=1, SLAB objects which outlive their descendant
> > memcg are moved to their parent memcg where they may be uncharged
> > because charges are made recursively from leaf to root nodes.
> >
> > However when use_hierarchy=0, they are reparented directly to root and
> > charging is not made recursively. Therefor uncharging will result in a
> > counter underflow on the root memcg, but no other ancestors.
> >
> > To prevent this, we check whether we are about to uncharge the root
> > memcg and whether use_hierarchy=0. If this is the case then we skip
> > uncharging. The root memcg does not have its own objcg, so any objcg
> > which is uncharging to it must have been reparented.
> >
> > Note that on the default hierarchy (CGroupV2 now) root always has
> > use_hierarchy=1. So this only effects CGroupV1.
> >
> > The warning can be, unreliably, reproduced with the LTP test
> > madvise06 if the entire patch series
> > https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> > is present. Although the listed commit in 'fixes' appears to introduce
> > the bug, I can not reproduce it with just that commit and bisecting
> > runs into other bugs.
> >
> > [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [   12.029539] Modules linked in:
> > [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> > [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> > [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> > [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> > [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> > [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> > [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> > [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> > [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> > [   12.031209] Call Trace:
> > [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> > [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> > [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> > [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> > [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> > [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> > [   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
> > [   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
> > [   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
> > [   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
> > [   12.033059] kthread (kernel/kthread.c:292)
> > [   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [   12.033357] ---[ end trace 961dbfc01c109d1f ]---
> >
> > [    9.841552] ------------[ cut here ]------------
> > [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [    9.841982] Modules linked in:
> > [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> > [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [    9.842571] Workqueue: events drain_local_stock
> > [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> > [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> > [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> > [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> > [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> > [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> > [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> > [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> > [    9.845319] Call Trace:
> > [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> > [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> > [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> > [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> > [    9.846162] kthread (kernel/kthread.c:292)
> > [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> >
> > Reported-by: ltp@lists.linux.it
> > Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> 
> This looks fine to me but I would prefer Roman's approach of charging
> to root and for use_hierarchy=0, linking them to root.

+1

I think it's safer because it has no effect on use_hierarchy == true case,
and this is what we should really care about. But the latest Richard's version
looks correct to me, so I'm fine with using it for stable too.

> 
> Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> for 5.11.
> 
> What does everyone think?

I think we should use the link to the root approach both for stable backports
and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
is not directly related to this problem, and we can keep it for 5.11+ only.

Thanks!
Shakeel Butt Oct. 22, 2020, 11:59 p.m. UTC | #3
On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
>
[snip]
> >
> > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > for 5.11.
> >
> > What does everyone think?
>
> I think we should use the link to the root approach both for stable backports
> and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> is not directly related to this problem, and we can keep it for 5.11+ only.
>
> Thanks!

Roman, can you send the signed-off patch for the root linking for
use_hierarchy=0?
Roman Gushchin Oct. 23, 2020, 12:40 a.m. UTC | #4
On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> >
> [snip]
> > >
> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > > for 5.11.
> > >
> > > What does everyone think?
> >
> > I think we should use the link to the root approach both for stable backports
> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> > is not directly related to this problem, and we can keep it for 5.11+ only.
> >
> > Thanks!
> 
> Roman, can you send the signed-off patch for the root linking for
> use_hierarchy=0?

Sure, here we are.

Thanks!

--

From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Thu, 22 Oct 2020 17:12:32 -0700
Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false

Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to root page counters, so the charge is not
propagated to the root memory cgroup.

After the removal of the original memory cgroup and reparenting of the
object cgroup, the root cgroup might be uncharged by draining a objcg
stock, for example. It leads to an eventual underflow of the charge
and triggers a warning.

Fix it by linking all page counters to corresponding root page
counters in the non-hierarchical mode.

The patch doesn't affect how the hierarchical mode is working,
which is the only sane and truly supported mode now.

Thanks to Richard for reporting, debugging and providing an
alternative version of the fix!

Reported-by: ltp@lists.linux.it
Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
---
 mm/memcontrol.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..009297017c87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->tcpmem, NULL);
+	} else if (parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
-		page_counter_init(&memcg->memory, NULL);
-		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->kmem, NULL);
-		page_counter_init(&memcg->tcpmem, NULL);
+		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
Johannes Weiner Oct. 23, 2020, 3:44 p.m. UTC | #5
On Thu, Oct 22, 2020 at 05:40:26PM -0700, Roman Gushchin wrote:
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
> 
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
> 
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> 
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
> 
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg
> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
> 
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
> 
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
> 
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
> 
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Shakeel Butt Oct. 23, 2020, 4:41 p.m. UTC | #6
On Thu, Oct 22, 2020 at 5:40 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> > On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > [snip]
> > > >
> > > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > > > for 5.11.
> > > >
> > > > What does everyone think?
> > >
> > > I think we should use the link to the root approach both for stable backports
> > > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> > > is not directly related to this problem, and we can keep it for 5.11+ only.
> > >
> > > Thanks!
> >
> > Roman, can you send the signed-off patch for the root linking for
> > use_hierarchy=0?
>
> Sure, here we are.
>
> Thanks!
>
> --
>
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
>
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
>
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg
> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
>
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
>
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
>
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
>
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Richard Palethorpe Oct. 26, 2020, 7:32 a.m. UTC | #7
Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
>> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
>> >
>> [snip]
>> > >
>> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
>> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
>> > > for 5.11.
>> > >
>> > > What does everyone think?
>> >
>> > I think we should use the link to the root approach both for stable backports
>> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
>> > is not directly related to this problem, and we can keep it for 5.11+ only.
>> >
>> > Thanks!
>> 
>> Roman, can you send the signed-off patch for the root linking for
>> use_hierarchy=0?
>
> Sure, here we are.
>
> Thanks!
>
> --
>
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
>
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
>
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg

I think it is worth mentioning that reparenting will always be to root
to avoid any confusion about what may happen with deeper, broken,
hierarchies.

> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
>
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
>
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
>
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
>
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>

Much appreciated, thanks!
Roman Gushchin Oct. 26, 2020, 11:14 p.m. UTC | #8
On Mon, Oct 26, 2020 at 07:32:39AM +0000, Richard Palethorpe wrote:
> Hello Roman,
> 
> Roman Gushchin <guro@fb.com> writes:
> 
> > On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> >> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> >> >
> >> [snip]
> >> > >
> >> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> >> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> >> > > for 5.11.
> >> > >
> >> > > What does everyone think?
> >> >
> >> > I think we should use the link to the root approach both for stable backports
> >> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> >> > is not directly related to this problem, and we can keep it for 5.11+ only.
> >> >
> >> > Thanks!
> >> 
> >> Roman, can you send the signed-off patch for the root linking for
> >> use_hierarchy=0?
> >
> > Sure, here we are.
> >
> > Thanks!
> >
> > --
> >
> > From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Thu, 22 Oct 2020 17:12:32 -0700
> > Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
> >
> > Richard reported a warning which can be reproduced by running the LTP
> > madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
> >
> > [    9.841552] ------------[ cut here ]------------
> > [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [    9.841982] Modules linked in:
> > [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> > [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [    9.842571] Workqueue: events drain_local_stock
> > [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> > [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> > [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> > [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> > [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> > [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> > [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> > [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> > [    9.845319] Call Trace:
> > [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> > [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> > [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> > [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> > [    9.846162] kthread (kernel/kthread.c:292)
> > [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> >
> > The problem occurs because in the non-hierarchical mode non-root page
> > counters are not linked to root page counters, so the charge is not
> > propagated to the root memory cgroup.
> >
> > After the removal of the original memory cgroup and reparenting of the
> > object cgroup, the root cgroup might be uncharged by draining a objcg
> 
> I think it is worth mentioning that reparenting will always be to root
> to avoid any confusion about what may happen with deeper, broken,
> hierarchies.

I agree. Added and sent v2.

> 
> > stock, for example. It leads to an eventual underflow of the charge
> > and triggers a warning.
> >
> > Fix it by linking all page counters to corresponding root page
> > counters in the non-hierarchical mode.
> >
> > The patch doesn't affect how the hierarchical mode is working,
> > which is the only sane and truly supported mode now.
> >
> > Thanks to Richard for reporting, debugging and providing an
> > alternative version of the fix!
> >
> > Reported-by: ltp@lists.linux.it
> > Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> 
> Much appreciated, thanks!

You did most of the work. Thank you!

Roman
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..34b8c4a66853 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@  static void obj_cgroup_release(struct percpu_ref *ref)
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	memcg = obj_cgroup_memcg(objcg);
-	if (nr_pages)
+	if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
 		__memcg_kmem_uncharge(memcg, nr_pages);
 	list_del(&objcg->list);
 	mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
+	struct mem_cgroup *memcg;
 
 	if (!old)
 		return;
@@ -3110,7 +3111,9 @@  static void drain_obj_stock(struct memcg_stock_pcp *stock)
 
 		if (nr_pages) {
 			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			memcg = obj_cgroup_memcg(old);
+			if (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy)
+				__memcg_kmem_uncharge(memcg, nr_pages);
 			rcu_read_unlock();
 		}