
[net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

Message ID 20200616180352.18602-1-xiyou.wangcong@gmail.com
State Changes Requested
Delegated to: David Miller
Series [net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

Commit Message

Cong Wang June 16, 2020, 6:03 p.m. UTC
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway; the in_interrupt() check
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup/cgroup.c | 26 ++++++++++++++------------
 net/core/sock.c        |  2 +-
 3 files changed, 17 insertions(+), 13 deletions(-)

Comments

Zefan Li June 18, 2020, 1:44 a.m. UTC | #1
Cc: Roman Gushchin <guro@fb.com>

Thanks for fixing this.

On 2020/6/17 2:03, Cong Wang wrote:
> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> copied, so the cgroup refcnt must be taken too. And, unlike the
> sk_alloc() path, sock_update_netprioidx() is not called here.
> Therefore, it is safe and necessary to grab the cgroup refcnt
> even when cgroup_sk_alloc is disabled.
> 
> sk_clone_lock() is in BH context anyway, the in_interrupt()
> would terminate this function if called there. And for sk_alloc()
> skcd->val is always zero. So it's safe to factor out the code
> to make it more readable.
> 
> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")

but I don't think the bug was introduced by this commit, because there
are already calls to cgroup_sk_alloc_disable() in write_priomap() and
write_classid(), which can be triggered by writing to ifpriomap or
classid in cgroupfs. This commit just made it much easier to trigger
with systemd involved.

I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
which added cgroup_bpf_get() in cgroup_sk_alloc().

> Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
> Reported-by: Peter Geis <pgwipeout@gmail.com>
> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
> Reported-by: Daniël Sonck <dsonck92@gmail.com>
> Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Zefan Li <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
Cong Wang June 18, 2020, 7:19 p.m. UTC | #2
On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>
> Cc: Roman Gushchin <guro@fb.com>
>
> Thanks for fixing this.
>
> On 2020/6/17 2:03, Cong Wang wrote:
> > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > copied, so the cgroup refcnt must be taken too. And, unlike the
> > sk_alloc() path, sock_update_netprioidx() is not called here.
> > Therefore, it is safe and necessary to grab the cgroup refcnt
> > even when cgroup_sk_alloc is disabled.
> >
> > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > would terminate this function if called there. And for sk_alloc()
> > skcd->val is always zero. So it's safe to factor out the code
> > to make it more readable.
> >
> > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>
> but I don't think the bug was introduced by this commit, because there
> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> write_classid(), which can be triggered by writing to ifpriomap or
> classid in cgroupfs. This commit just made it much easier to happen
> with systemd invovled.
>
> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> which added cgroup_bpf_get() in cgroup_sk_alloc().

Good point.

I took a deeper look, and it looks like commit d979a39d7242e06
is the one to blame, because it is the first commit that began to
hold the cgroup refcnt in cgroup_sk_alloc().

The commit you mentioned above merely adds a refcnt for
cgroup bpf on top of the cgroup refcnt.

Thanks.
Roman Gushchin June 18, 2020, 7:36 p.m. UTC | #3
On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >
> > Cc: Roman Gushchin <guro@fb.com>
> >
> > Thanks for fixing this.
> >
> > On 2020/6/17 2:03, Cong Wang wrote:
> > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > even when cgroup_sk_alloc is disabled.
> > >
> > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > would terminate this function if called there. And for sk_alloc()
> > > skcd->val is always zero. So it's safe to factor out the code
> > > to make it more readable.
> > >
> > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >
> > but I don't think the bug was introduced by this commit, because there
> > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > write_classid(), which can be triggered by writing to ifpriomap or
> > classid in cgroupfs. This commit just made it much easier to happen
> > with systemd invovled.
> >
> > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > which added cgroup_bpf_get() in cgroup_sk_alloc().
> 
> Good point.
> 
> I take a deeper look, it looks like commit d979a39d7242e06
> is the one to blame, because it is the first commit that began to
> hold cgroup refcnt in cgroup_sk_alloc().

I agree, it seems that the issue is not related to bpf and probably
can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
seems closer to the origin.

Btw, based on the number of reported-by tags it seems that there was
a real issue which the patch is fixing. Maybe you'll add a couple of
words about how it reveals itself in real life?

Thanks!
Cong Wang June 18, 2020, 9:09 p.m. UTC | #4
On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > >
> > > Cc: Roman Gushchin <guro@fb.com>
> > >
> > > Thanks for fixing this.
> > >
> > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > even when cgroup_sk_alloc is disabled.
> > > >
> > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > would terminate this function if called there. And for sk_alloc()
> > > > skcd->val is always zero. So it's safe to factor out the code
> > > > to make it more readable.
> > > >
> > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > >
> > > but I don't think the bug was introduced by this commit, because there
> > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > write_classid(), which can be triggered by writing to ifpriomap or
> > > classid in cgroupfs. This commit just made it much easier to happen
> > > with systemd invovled.
> > >
> > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> >
> > Good point.
> >
> > I take a deeper look, it looks like commit d979a39d7242e06
> > is the one to blame, because it is the first commit that began to
> > hold cgroup refcnt in cgroup_sk_alloc().
>
> I agree, ut seems that the issue is not related to bpf and probably
> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> seems closer to the origin.

Yeah, I will update the Fixes tag and send V2.

>
> Btw, based on the number of reported-by tags it seems that there was
> a real issue which the patch is fixing. Maybe you'll a couple of words
> about how it reveals itself in the real life?

I still have no idea how exactly this is triggered. According to the
people who reported this bug, they just need to wait a few hours
for it to trigger. So I am not sure what to add here, just the stack trace?

Thanks.
Roman Gushchin June 18, 2020, 9:26 p.m. UTC | #5
On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > >
> > > > Cc: Roman Gushchin <guro@fb.com>
> > > >
> > > > Thanks for fixing this.
> > > >
> > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > even when cgroup_sk_alloc is disabled.
> > > > >
> > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > would terminate this function if called there. And for sk_alloc()
> > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > to make it more readable.
> > > > >
> > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > >
> > > > but I don't think the bug was introduced by this commit, because there
> > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > with systemd invovled.
> > > >
> > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > >
> > > Good point.
> > >
> > > I take a deeper look, it looks like commit d979a39d7242e06
> > > is the one to blame, because it is the first commit that began to
> > > hold cgroup refcnt in cgroup_sk_alloc().
> >
> > I agree, ut seems that the issue is not related to bpf and probably
> > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > seems closer to the origin.
> 
> Yeah, I will update the Fixes tag and send V2.
> 
> >
> > Btw, based on the number of reported-by tags it seems that there was
> > a real issue which the patch is fixing. Maybe you'll a couple of words
> > about how it reveals itself in the real life?
> 
> I still have no idea how exactly this is triggered. According to the
> people who reported this bug, they just need to wait for some hours
> to trigger. So I am not sure what to add here, just the stack trace?

Yeah, the stack trace is definitely useful. That way, if someone encounters the same
error in the future, they can search for the stack trace and find the fix.

Thanks!
Peter Geis June 18, 2020, 10:45 p.m. UTC | #6
On Thu, Jun 18, 2020 at 5:26 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > > >
> > > > > Cc: Roman Gushchin <guro@fb.com>
> > > > >
> > > > > Thanks for fixing this.
> > > > >
> > > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > > even when cgroup_sk_alloc is disabled.
> > > > > >
> > > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > > would terminate this function if called there. And for sk_alloc()
> > > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > > to make it more readable.
> > > > > >
> > > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > > >
> > > > > but I don't think the bug was introduced by this commit, because there
> > > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > > with systemd invovled.
> > > > >
> > > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > > >
> > > > Good point.
> > > >
> > > > I take a deeper look, it looks like commit d979a39d7242e06
> > > > is the one to blame, because it is the first commit that began to
> > > > hold cgroup refcnt in cgroup_sk_alloc().
> > >
> > > I agree, ut seems that the issue is not related to bpf and probably
> > > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > > seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
> > >
> > > Btw, based on the number of reported-by tags it seems that there was
> > > a real issue which the patch is fixing. Maybe you'll a couple of words
> > > about how it reveals itself in the real life?
> >
> > I still have no idea how exactly this is triggered. According to the
> > people who reported this bug, they just need to wait for some hours
> > to trigger. So I am not sure what to add here, just the stack trace?
>
> Yeah, stack trace is definitely useful. So at least if someone will encounter the same
> error in the future, they can search for the stacktrace and find the fix.

I can provide at least a little insight into my configuration that
allowed the bug to trigger.
I'm running the following configuration:
Ubuntu 20.04 - aarch64
Mainline 5.6.17 kernel using Ubuntu's config.
LXD with several containers active.
Docker with several containers active.

Every crash was the same, with nfsd triggering the sequence.
My only nfs client at the time was a Windows 10 device.
Disabling nfsd prevented the bug from occurring.

Here was the last stack trace:
[23352.431106] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[23352.431938] rockpro64 kernel: Mem abort info:
[23352.432199] rockpro64 kernel:   ESR = 0x96000004
[23352.432485] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[23352.432965] rockpro64 kernel:   SET = 0, FnV = 0
[23352.433248] rockpro64 kernel:   EA = 0, S1PTW = 0
[23352.433536] rockpro64 kernel: Data abort info:
[23352.433803] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[23352.434153] rockpro64 kernel:   CM = 0, WnR = 0
[23352.434475] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=0000000094d4c000
[23352.435174] rockpro64 kernel: [0000000000000010]
pgd=0000000094f3d003, pud=00000000bdb7f003, pmd=0000000000000000
[23352.435963] rockpro64 kernel: Internal error: Oops: 96000004 [#1] SMP
[23352.436396] rockpro64 kernel: Modules linked in: xt_TCPMSS
nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter ip_set_hash_ip
ip_set_hash_net xt_set ip_set cfg80211 nft_counter xt_length
xt_connmark xt_multiport xt_mark nf_log_ip>
[23352.436519] rockpro64 kernel:  ghash_ce enclosure snd_soc_es8316
scsi_transport_sas snd_seq_midi sha2_ce snd_seq_midi_event
snd_soc_simple_card snd_rawmidi snd_soc_audio_graph_card sha256_arm64
panfrost snd_soc_simple_card_utils sha1>
[23352.444216] rockpro64 kernel:  async_pq async_xor xor xor_neon
async_tx uas raid6_pq raid1 raid0 multipath linear usb_storage
xhci_plat_hcd dwc3 rtc_rk808 clk_rk808 rk808_regulator ulpi udc_core
fusb302 tcpm typec fan53555 rk808 pwm_>
[23352.455532] rockpro64 kernel: CPU: 3 PID: 1237 Comm: nfsd Not
tainted 5.6.17+ #74
[23352.456054] rockpro64 kernel: Hardware name: pine64
rockpro64_rk3399/rockpro64_rk3399, BIOS
2020.07-rc2-00124-g515f613253-dirty 05/19/2020
[23352.457010] rockpro64 kernel: pstate: 60400005 (nZCv daif +PAN -UAO)
[23352.457445] rockpro64 kernel: pc : __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.457918] rockpro64 kernel: lr : ip_finish_output+0x98/0xd0
[23352.458287] rockpro64 kernel: sp : ffff80001325b900
[23352.458581] rockpro64 kernel: x29: ffff80001325b900 x28: ffff000012f0fae0
[23352.459051] rockpro64 kernel: x27: 0000000000000001 x26: ffff00005f0ddc00
[23352.459521] rockpro64 kernel: x25: 0000000000000118 x24: ffff0000dcd3c270
[23352.459990] rockpro64 kernel: x23: 0000000000000010 x22: ffff800011b1aec0
[23352.460458] rockpro64 kernel: x21: ffff0000efcacc40 x20: 0000000000000010
[23352.460928] rockpro64 kernel: x19: ffff0000dcd3bf00 x18: 0000000000000000
[23352.461396] rockpro64 kernel: x17: 0000000000000000 x16: 0000000000000000
[23352.461863] rockpro64 kernel: x15: 0000000000000000 x14: 0000000000000004
[23352.462332] rockpro64 kernel: x13: 0000000000000001 x12: 0000000000201400
[23352.462802] rockpro64 kernel: x11: 0000000000000000 x10: 0000000000000000
[23352.463271] rockpro64 kernel: x9 : ffff800010b6f6d0 x8 : 0000000000000260
[23352.463738] rockpro64 kernel: x7 : 0000000000000000 x6 : ffff0000dc12a000
[23352.464208] rockpro64 kernel: x5 : ffff0000dcd3bf00 x4 : 0000000000000028
[23352.464677] rockpro64 kernel: x3 : 0000000000000000 x2 : ffff000012f0fb08
[23352.465145] rockpro64 kernel: x1 : ffff00005f0ddd40 x0 : 0000000000000000
[23352.465616] rockpro64 kernel: Call trace:
[23352.465843] rockpro64 kernel:  __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.466283] rockpro64 kernel:  ip_finish_output+0x98/0xd0
[23352.466625] rockpro64 kernel:  ip_output+0xb0/0x130
[23352.466920] rockpro64 kernel:  ip_local_out+0x4c/0x60
[23352.467233] rockpro64 kernel:  __ip_queue_xmit+0x128/0x380
[23352.467584] rockpro64 kernel:  ip_queue_xmit+0x10/0x18
[23352.467903] rockpro64 kernel:  __tcp_transmit_skb+0x470/0xaf0
[23352.468274] rockpro64 kernel:  tcp_write_xmit+0x39c/0x1110
[23352.468623] rockpro64 kernel:  __tcp_push_pending_frames+0x40/0x100
[23352.469040] rockpro64 kernel:  tcp_send_fin+0x6c/0x240
[23352.469358] rockpro64 kernel:  tcp_shutdown+0x60/0x68
[23352.469669] rockpro64 kernel:  inet_shutdown+0xb0/0x120
[23352.469997] rockpro64 kernel:  kernel_sock_shutdown+0x1c/0x28
[23352.470464] rockpro64 kernel:  svc_tcp_sock_detach+0xd0/0x110 [sunrpc]
[23352.470980] rockpro64 kernel:  svc_delete_xprt+0x74/0x240 [sunrpc]
[23352.471445] rockpro64 kernel:  svc_recv+0x45c/0xb10 [sunrpc]
[23352.471864] rockpro64 kernel:  nfsd+0xdc/0x150 [nfsd]
[23352.472179] rockpro64 kernel:  kthread+0xfc/0x128
[23352.472461] rockpro64 kernel:  ret_from_fork+0x10/0x18
[23352.472785] rockpro64 kernel: Code: 9100c0c6 17ffff7b f9431cc0
91004017 (f9400814)
[23352.473324] rockpro64 kernel: ---[ end trace 978df9e144fd1235 ]---
[29973.397069] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[29973.397966] rockpro64 kernel: Mem abort info:
[29973.398224] rockpro64 kernel:   ESR = 0x96000004
[29973.398503] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[29973.398976] rockpro64 kernel:   SET = 0, FnV = 0
[29973.399254] rockpro64 kernel:   EA = 0, S1PTW = 0
[29973.399537] rockpro64 kernel: Data abort info:
[29973.399799] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[29973.400143] rockpro64 kernel:   CM = 0, WnR = 0
[29973.400416] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=00000000dcdd1000
[29973.400989] rockpro64 kernel: [0000000000000010] pgd=0000000000000000
[29973.401490] rockpro64 kernel: Internal error: Oops: 96000004 [#2] SMP

I hope this helps.

>
> Thanks!
Zefan Li June 19, 2020, 6:40 a.m. UTC | #7
On 2020/6/19 5:09, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>
>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>
>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>
>>>> Thanks for fixing this.
>>>>
>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>> even when cgroup_sk_alloc is disabled.
>>>>>
>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>> would terminate this function if called there. And for sk_alloc()
>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>> to make it more readable.
>>>>>
>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>
>>>> but I don't think the bug was introduced by this commit, because there
>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>> with systemd invovled.
>>>>
>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>
>>> Good point.
>>>
>>> I take a deeper look, it looks like commit d979a39d7242e06
>>> is the one to blame, because it is the first commit that began to
>>> hold cgroup refcnt in cgroup_sk_alloc().
>>
>> I agree, ut seems that the issue is not related to bpf and probably
>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>> seems closer to the origin.
> 
> Yeah, I will update the Fixes tag and send V2.
> 

Commit d979a39d7242e06 looks innocent to me. With this commit, when cgroup_sk_alloc
is disabled and a socket is then cloned, the cgroup refcnt will not be incremented,
but this is fine, because when the socket is to be freed:

 sk_prot_free()
   cgroup_sk_free()
     cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)

cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.

But cgroup_bpf_put() will decrement the bpf refcnt even though this refcnt was not
incremented, as cgroup_sk_alloc had already been disabled. That's why I think it's
4bfc0bb2c60e2f4c that needs to be fixed.

>>
>> Btw, based on the number of reported-by tags it seems that there was
>> a real issue which the patch is fixing. Maybe you'll a couple of words
>> about how it reveals itself in the real life?
> 
> I still have no idea how exactly this is triggered. According to the
> people who reported this bug, they just need to wait for some hours
> to trigger. So I am not sure what to add here, just the stack trace?
> 
> Thanks.
> .
>
Cong Wang June 19, 2020, 7:51 p.m. UTC | #8
On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd invovled.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, ut seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
>
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
>
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.

But skcd->val can be a pointer to a non-root cgroup:

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
        unsigned long v;

        /*
         * @skcd->val is 64bit but the following is safe on 32bit too as we
         * just need the lower ulong to be written and read atomically.
         */
        v = READ_ONCE(skcd->val);

        if (v & 1)
                return &cgrp_dfl_root.cgrp;

        return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
#else
        return (struct cgroup *)(unsigned long)skcd->val;
#endif
}
Zefan Li June 20, 2020, 12:45 a.m. UTC | #9
On 2020/6/20 3:51, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd invovled.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> But skcd->val can be a pointer to a non-root cgroup:

It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
when cgroup_sk_alloc is disabled.

> 
> static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> {
> #if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
>         unsigned long v;
> 
>         /*
>          * @skcd->val is 64bit but the following is safe on 32bit too as we
>          * just need the lower ulong to be written and read atomically.
>          */
>         v = READ_ONCE(skcd->val);
> 
>         if (v & 1)
>                 return &cgrp_dfl_root.cgrp;
> 
>         return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> #else
>         return (struct cgroup *)(unsigned long)skcd->val;
> #endif
> }
> .
>
Roman Gushchin June 20, 2020, 12:51 a.m. UTC | #10
On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd invovled.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, ut seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> > 
> > Yeah, I will update the Fixes tag and send V2.
> > 
> 
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
> 
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> 
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> that needs to be fixed.

Hm, does it mean that the problem always happens with the root cgroup?

From the stacktrace provided by Peter it looks like the problem
is bpf-related, but the original patch says nothing about it.

So from the test above it sounds like the problem is that we're trying
to release root's cgroup_bpf, which is a bad idea, I totally agree.
Is this the problem? If so, we might wanna fix it in a different way,
just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
like in cgroup_put(). It feels more reliable to me.

Thanks!
Zefan Li June 20, 2020, 12:51 a.m. UTC | #11
在 2020/6/20 8:45, Zefan Li 写道:
> On 2020/6/20 3:51, Cong Wang wrote:
>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>
>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>
>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>
>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>
>>>>>>> Thanks for fixing this.
>>>>>>>
>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>
>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>> to make it more readable.
>>>>>>>>
>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>
>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>> with systemd invovled.
>>>>>>>
>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>
>>>>>> Good point.
>>>>>>
>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>> is the one to blame, because it is the first commit that began to
>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>
>>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>> seems closer to the origin.
>>>>
>>>> Yeah, I will update the Fixes tag and send V2.
>>>>
>>>
>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>> but this is fine, because when the socket is to be freed:
>>>
>>>  sk_prot_free()
>>>    cgroup_sk_free()
>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>
>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> But skcd->val can be a pointer to a non-root cgroup:
> 
> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> when cgroup_sk_alloc is disabled.
> 

And please read those recent bug reports, they all happened when bpf cgroup was in use,
and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
Zefan Li June 20, 2020, 1 a.m. UTC | #12
On 2020/6/20 8:51, Roman Gushchin wrote:
> On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd invovled.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
>> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
>> that needs to be fixed.
> 
> Hm, does it mean that the problem always happens with the root cgroup?
> 
> From the stacktrace provided by Peter it looks like that the problem
> is bpf-related, but the original patch says nothing about it.
> 
> So from the test above it sounds like the problem is that we're trying
> to release root's cgroup_bpf, which is a bad idea, I totally agree.
> Is this the problem?

I think so, though I'm not familiar with the bpf cgroup code.

> If so, we might wanna fix it in a different way,
> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> like in cgroup_put(). It feels more reliable to me.
> 

Yeah I also have this idea in my mind.
Roman Gushchin June 20, 2020, 1:14 a.m. UTC | #13
On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> On 2020/6/20 8:51, Roman Gushchin wrote:
> > On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> >> On 2020/6/19 5:09, Cong Wang wrote:
> >>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>
> >>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>
> >>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>
> >>>>>> Thanks for fixing this.
> >>>>>>
> >>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>
> >>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>> to make it more readable.
> >>>>>>>
> >>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>
> >>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>> with systemd invovled.
> >>>>>>
> >>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>
> >>>>> Good point.
> >>>>>
> >>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>> is the one to blame, because it is the first commit that began to
> >>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>
> >>>> I agree, ut seems that the issue is not related to bpf and probably
> >>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>> seems closer to the origin.
> >>>
> >>> Yeah, I will update the Fixes tag and send V2.
> >>>
> >>
> >> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >> but this is fine, because when the socket is to be freed:
> >>
> >>  sk_prot_free()
> >>    cgroup_sk_free()
> >>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>
> >> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>
> >> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
> >> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> >> that needs to be fixed.
> > 
> > Hm, does it mean that the problem always happens with the root cgroup?
> > 
> > From the stacktrace provided by Peter it looks like that the problem
> > is bpf-related, but the original patch says nothing about it.
> > 
> > So from the test above it sounds like the problem is that we're trying
> > to release root's cgroup_bpf, which is a bad idea, I totally agree.
> > Is this the problem?
> 
> I think so, though I'm not familiar with the bfp cgroup code.
> 
> > If so, we might wanna fix it in a different way,
> > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > like in cgroup_put(). It feels more reliable to me.
> > 
> 
> Yeah I also have this idea in my mind.

I wonder if the following patch will fix the issue?

--

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..7eb51137d896 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
 #ifdef CONFIG_CGROUP_BPF
 static inline void cgroup_bpf_get(struct cgroup *cgrp)
 {
-       percpu_ref_get(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_get(&cgrp->bpf.refcnt);
 }
 
 static inline void cgroup_bpf_put(struct cgroup *cgrp)
 {
-       percpu_ref_put(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_put(&cgrp->bpf.refcnt);
 }
 
 #else /* CONFIG_CGROUP_BPF */
Zefan Li June 20, 2020, 2:48 a.m. UTC | #14
>>> If so, we might wanna fix it in a different way,
>>> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
>>> like in cgroup_put(). It feels more reliable to me.
>>>
>>
>> Yeah I also have this idea in my mind.
> 
> I wonder if the following patch will fix the issue?
> 

I guess so, but it would be better to have someone who reported this
bug test it.

> --
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..7eb51137d896 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
>  #ifdef CONFIG_CGROUP_BPF
>  static inline void cgroup_bpf_get(struct cgroup *cgrp)
>  {
> -       percpu_ref_get(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_get(&cgrp->bpf.refcnt);
>  }
>  
>  static inline void cgroup_bpf_put(struct cgroup *cgrp)
>  {
> -       percpu_ref_put(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_put(&cgrp->bpf.refcnt);
>  }
>  
>  #else /* CONFIG_CGROUP_BPF */
>
Cong Wang June 20, 2020, 3 a.m. UTC | #15
On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > I think so, though I'm not familiar with the bfp cgroup code.
> >
> > > If so, we might wanna fix it in a different way,
> > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > like in cgroup_put(). It feels more reliable to me.
> > >
> >
> > Yeah I also have this idea in my mind.
>
> I wonder if the following patch will fix the issue?

Interesting, AFAIU, this refcnt is for bpf programs attached
to the cgroup. By this suggestion, do you mean the root
cgroup does not need to refcnt the bpf programs attached
to it? This seems odd, as I don't see how root is different
from others in terms of bpf programs which can be attached
and detached in the same way.

I certainly understand the root cgroup is never gone, but this
does not mean the bpf programs attached to it too.

What am I missing?

Thanks.
Cong Wang June 20, 2020, 3:31 a.m. UTC | #16
On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>
> 在 2020/6/20 8:45, Zefan Li 写道:
> > On 2020/6/20 3:51, Cong Wang wrote:
> >> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>
> >>> On 2020/6/19 5:09, Cong Wang wrote:
> >>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>>
> >>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>>
> >>>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>>
> >>>>>>> Thanks for fixing this.
> >>>>>>>
> >>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>>
> >>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>>> to make it more readable.
> >>>>>>>>
> >>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>>
> >>>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>>> with systemd invovled.
> >>>>>>>
> >>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>>
> >>>>>> Good point.
> >>>>>>
> >>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>>> is the one to blame, because it is the first commit that began to
> >>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>>
> >>>>> I agree, ut seems that the issue is not related to bpf and probably
> >>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>>> seems closer to the origin.
> >>>>
> >>>> Yeah, I will update the Fixes tag and send V2.
> >>>>
> >>>
> >>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >>> but this is fine, because when the socket is to be freed:
> >>>
> >>>  sk_prot_free()
> >>>    cgroup_sk_free()
> >>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>>
> >>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>
> >> But skcd->val can be a pointer to a non-root cgroup:
> >
> > It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> > when cgroup_sk_alloc is disabled.
> >
>
> And please read those recent bug reports, they all happened when bpf cgroup was in use,
> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.

I am totally aware of this. My concern is whether the cgroup refcnt
has the same bug, as it always pairs with the bpf refcnt.

But, after a second look, the non-root cgroup refcnt is immediately
overwritten by sock_update_classid() or sock_update_netprioidx(),
which effectively turns into a root cgroup again. :-/

(It seems we leak a refcnt here, but this is not related to my patch).

Thanks.
Zefan Li June 20, 2020, 7:52 a.m. UTC | #17
在 2020/6/20 11:31, Cong Wang 写道:
> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>>
>> 在 2020/6/20 8:45, Zefan Li 写道:
>>> On 2020/6/20 3:51, Cong Wang wrote:
>>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>
>>>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>>>
>>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>>>
>>>>>>>>> Thanks for fixing this.
>>>>>>>>>
>>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>>>
>>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>>>> to make it more readable.
>>>>>>>>>>
>>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>>>
>>>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>>>> with systemd invovled.
>>>>>>>>>
>>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>>>
>>>>>>>> Good point.
>>>>>>>>
>>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>>>> is the one to blame, because it is the first commit that began to
>>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>>>
>>>>>>> I agree, ut seems that the issue is not related to bpf and probably
>>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>>>> seems closer to the origin.
>>>>>>
>>>>>> Yeah, I will update the Fixes tag and send V2.
>>>>>>
>>>>>
>>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>>>> but this is fine, because when the socket is to be freed:
>>>>>
>>>>>  sk_prot_free()
>>>>>    cgroup_sk_free()
>>>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>>>
>>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>>>
>>>> But skcd->val can be a pointer to a non-root cgroup:
>>>
>>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
>>> when cgroup_sk_alloc is disabled.
>>>
>>
>> And please read those recent bug reports, they all happened when bpf cgroup was in use,
>> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> 
> I am totally aware of this. My concern is whether cgroup
> has the same refcnt bug as it always pairs with the bpf refcnt.
> 
> But, after a second look, the non-root cgroup refcnt is immediately
> overwritten by sock_update_classid() or sock_update_netprioidx(),
> which effectively turns into a root cgroup again. :-/
> 
> (It seems we leak a refcnt here, but this is not related to my patch).
> 

Indeed, but it's well known, see bd1060a1d67128bb8fbe2. But now bpf cgroup comes into play...

Your patch doesn't seem to fix the bug completely. If cgroup_sk_alloc_disable happens after
socket cloning, then we will decref the bpf refcnt of the root cgroup while we incref-ed
the bpf refcnt of a non-root cgroup.
Roman Gushchin June 20, 2020, 3:57 p.m. UTC | #18
On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > I think so, though I'm not familiar with the bfp cgroup code.
> > >
> > > > If so, we might wanna fix it in a different way,
> > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > like in cgroup_put(). It feels more reliable to me.
> > > >
> > >
> > > Yeah I also have this idea in my mind.
> >
> > I wonder if the following patch will fix the issue?
> 
> Interesting, AFAIU, this refcnt is for bpf programs attached
> to the cgroup. By this suggestion, do you mean the root
> cgroup does not need to refcnt the bpf programs attached
> to it? This seems odd, as I don't see how root is different
> from others in terms of bpf programs which can be attached
> and detached in the same way.
> 
> I certainly understand the root cgroup is never gone, but this
> does not mean the bpf programs attached to it too.
> 
> What am I missing?

It's different because the root cgroup can't be deleted.

All this reference counting is required to automatically detach bpf programs
from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
because a cgroup can be in dying state for a long time being pinned by a
pagecache page, for example. Only a user can detach a bpf program from
an existing cgroup.

Thanks!
Roman Gushchin June 20, 2020, 4:04 p.m. UTC | #19
On Sat, Jun 20, 2020 at 03:52:49PM +0800, Zefan Li wrote:
> 在 2020/6/20 11:31, Cong Wang 写道:
> > On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
> >>
> >> 在 2020/6/20 8:45, Zefan Li 写道:
> >>> On 2020/6/20 3:51, Cong Wang wrote:
> >>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>
> >>>>> On 2020/6/19 5:09, Cong Wang wrote:
> >>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>>>>
> >>>>>>>>> Thanks for fixing this.
> >>>>>>>>>
> >>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>>>>
> >>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>>>>> to make it more readable.
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>>>>
> >>>>>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>>>>> with systemd invovled.
> >>>>>>>>>
> >>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>>>>
> >>>>>>>> Good point.
> >>>>>>>>
> >>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>>>>> is the one to blame, because it is the first commit that began to
> >>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>>>>
> >>>>>>> I agree, ut seems that the issue is not related to bpf and probably
> >>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>>>>> seems closer to the origin.
> >>>>>>
> >>>>>> Yeah, I will update the Fixes tag and send V2.
> >>>>>>
> >>>>>
> >>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >>>>> but this is fine, because when the socket is to be freed:
> >>>>>
> >>>>>  sk_prot_free()
> >>>>>    cgroup_sk_free()
> >>>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>>>>
> >>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>>>
> >>>> But skcd->val can be a pointer to a non-root cgroup:
> >>>
> >>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> >>> when cgroup_sk_alloc is disabled.
> >>>
> >>
> >> And please read those recent bug reports, they all happened when bpf cgroup was in use,
> >> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> > 
> > I am totally aware of this. My concern is whether cgroup
> > has the same refcnt bug as it always pairs with the bpf refcnt.
> > 
> > But, after a second look, the non-root cgroup refcnt is immediately
> > overwritten by sock_update_classid() or sock_update_netprioidx(),
> > which effectively turns into a root cgroup again. :-/
> > 
> > (It seems we leak a refcnt here, but this is not related to my patch).
> > 
> 
> Indead, but it's well known, see bd1060a1d67128bb8fbe2. But now bpf cgroup comes into play...
> 
> Your patch doesn't seem to fix the bug completely. If cgroup_sk_alloc_disable happens after
> socket cloning, then we will deref the bpf of the root cgroup while incref-ed the bpf of a
> non-root cgroup.

Hm, so it seems that disabling the refcounting for the root cgroup's cgroup_bpf
is a good idea anyway, as it makes cgroup_bpf refcounting work similarly to cgroup
refcounting. But it's not completely clear to me whether that's enough, or whether
we have something else to fix here?

Thanks!
Cong Wang June 22, 2020, 6:14 p.m. UTC | #20
On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > > I think so, though I'm not familiar with the bfp cgroup code.
> > > >
> > > > > If so, we might wanna fix it in a different way,
> > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > > like in cgroup_put(). It feels more reliable to me.
> > > > >
> > > >
> > > > Yeah I also have this idea in my mind.
> > >
> > > I wonder if the following patch will fix the issue?
> >
> > Interesting, AFAIU, this refcnt is for bpf programs attached
> > to the cgroup. By this suggestion, do you mean the root
> > cgroup does not need to refcnt the bpf programs attached
> > to it? This seems odd, as I don't see how root is different
> > from others in terms of bpf programs which can be attached
> > and detached in the same way.
> >
> > I certainly understand the root cgroup is never gone, but this
> > does not mean the bpf programs attached to it too.
> >
> > What am I missing?
>
> It's different because the root cgroup can't be deleted.
>
> All this reference counting is required to automatically detach bpf programs
> from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
> because a cgroup can be in dying state for a long time being pinned by a
> pagecache page, for example. Only a user can detach a bpf program from
> an existing cgroup.

Yeah, but users can still detach the bpf programs from the root cgroup.
IIUC, after detaching, the pointer in the bpf array will be empty_prog_array
which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
deref it without checking NULL (as check_non_null == false).

This matches the 0000000000000010 pointer seen in the bug reports,
the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
So looks like we have to add a NULL check there regardless of refcnt.

Also, I am not sure whether your suggested patch makes a difference
for percpu refcnt, as percpu_ref_put() will never call ->release() until
percpu_ref_kill(), which is never called on root cgroup?

Thanks.
Roman Gushchin June 22, 2020, 8:39 p.m. UTC | #21
On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote:
> On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > > > I think so, though I'm not familiar with the bfp cgroup code.
> > > > >
> > > > > > If so, we might wanna fix it in a different way,
> > > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > > > like in cgroup_put(). It feels more reliable to me.
> > > > > >
> > > > >
> > > > > Yeah I also have this idea in my mind.
> > > >
> > > > I wonder if the following patch will fix the issue?
> > >
> > > Interesting, AFAIU, this refcnt is for bpf programs attached
> > > to the cgroup. By this suggestion, do you mean the root
> > > cgroup does not need to refcnt the bpf programs attached
> > > to it? This seems odd, as I don't see how root is different
> > > from others in terms of bpf programs which can be attached
> > > and detached in the same way.
> > >
> > > I certainly understand the root cgroup is never gone, but this
> > > does not mean the bpf programs attached to it too.
> > >
> > > What am I missing?
> >
> > It's different because the root cgroup can't be deleted.
> >
> > All this reference counting is required to automatically detach bpf programs
> > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
> > because a cgroup can be in dying state for a long time being pinned by a
> > pagecache page, for example. Only a user can detach a bpf program from
> > an existing cgroup.
> 
> Yeah, but users can still detach the bpf programs from root cgroup.
> IIUC, after detaching, the pointer in the bpf array will be empty_prog_array
> which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
> deref it without checking NULL (as check_non_null == false).
> 
> This matches the 0000000000000010 pointer seen in the bug reports,
> the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
> So looks like we have to add a NULL check there regardless of refcnt.
> 
> Also, I am not sure whether your suggested patch makes a difference
> for percpu refcnt, as percpu_ref_put() will never call ->release() until
> percpu_ref_kill(), which is never called on root cgroup?

Hm, true. But it means that the problem is not with the root cgroup's bpf?

How easy is to reproduce the problem? Is it possible to bisect the problematic
commit?

Thanks!
Zhang, Qiang June 23, 2020, 8:45 a.m. UTC | #22
There are some messages from kernel v5.4; I don't know if they will help.

dmesg:

cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or 
net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0
Zhang, Qiang June 23, 2020, 8:54 a.m. UTC | #23
The tester found the following information during the test.

The dmesg output is as follows (kernel v5.4); I don't know if it helps
with this question.


root@intel-x86-64:~# cgroup: cgroup: disabling cgroup2 socket matching 
due to net_prio or net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0
Call Trace:
<IRQ>
rcu_core+0x227/0x870
? timerqueue_add+0x68/0xa0
rcu_core_si+0xe/0x10
__do_softirq+0x102/0x358
? tick_program_event+0x4d/0x90
irq_exit+0xa0/0x110
smp_apic_timer_interrupt+0xa1/0x1b0
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:cpuidle_enter_state+0xc0/0x3c0
Code: 85 c0 0f 8f 28 02 00 00 31 ff e8 5b 1a 5f ff 45 84 ff 74 12 9c 58 
f6 c4 02 0f 85 d0 02 00 00 31 ff e8 34 ba 65 ff fb 45 85 e4 <0f> 88 cc 
00 00 00 49 63 cc 4c 2b 75 d0 48 6b c1 68 48 6b d1 38 48
RSP: 0018:ffff9961831c3e48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff958563c00000 RBX: ffffffffa9515e60 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000003286e833 RDI: 0000000000000000
RBP: ffff9961831c3e88 R08: 0000000000000002 R09: 0000000000000018
R10: 0000000000000364 R11: ffff958563c2a284 R12: 0000000000000003
R13: ffffb9617fc3fd00 R14: 00000194af870de5 R15: 0000000000000000
? cpuidle_enter_state+0xa5/0x3c0
cpuidle_enter+0x2e/0x40
call_cpuidle+0x23/0x40
do_idle+0x1c6/0x240
cpu_startup_entry+0x20/0x30
start_secondary+0x15b/0x190
secondary_startup_64+0xb6/0xc0
--[ end trace 805031dac04b28f5 ]--
Zhang, Qiang June 23, 2020, 9:01 a.m. UTC | #24
On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote:
 > On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin <guro@fb.com> wrote:
 > >
 > > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
 > > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
 > > > >
 > > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
 > > > > > I think so, though I'm not familiar with the bpf cgroup code.
 > > > > >
 > > > > > > If so, we might wanna fix it in a different way,
 > > > > > > just checking if (!(css->flags & CSS_NO_REF)) in 
cgroup_bpf_put()
 > > > > > > like in cgroup_put(). It feels more reliable to me.
 > > > > > >
 > > > > >
 > > > > > Yeah I also have this idea in my mind.
 > > > >
 > > > > I wonder if the following patch will fix the issue?
 > > >
 > > > Interesting, AFAIU, this refcnt is for bpf programs attached
 > > > to the cgroup. By this suggestion, do you mean the root
 > > > cgroup does not need to refcnt the bpf programs attached
 > > > to it? This seems odd, as I don't see how root is different
 > > > from others in terms of bpf programs which can be attached
 > > > and detached in the same way.
 > > >
 > > > I certainly understand the root cgroup is never gone, but this
 > > > does not mean the bpf programs attached to it too.
 > > >
 > > > What am I missing?
 > >
 > > It's different because the root cgroup can't be deleted.
 > >
 > > All this reference counting is required to automatically detach bpf 
programs
 > > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
 > > because a cgroup can be in dying state for a long time being pinned 
by a
 > > pagecache page, for example. Only a user can detach a bpf program from
 > > an existing cgroup.
 >
 > Yeah, but users can still detach the bpf programs from root cgroup.
 > IIUC, after detaching, the pointer in the bpf array will be 
empty_prog_array
 > which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
 > deref it without checking NULL (as check_non_null == false).
 >
 > This matches the 0000000000000010 pointer seen in the bug reports,
 > the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
 > So looks like we have to add a NULL check there regardless of refcnt.
 >
 > Also, I am not sure whether your suggested patch makes a difference
 > for percpu refcnt, as percpu_ref_put() will never call ->release() until
 > percpu_ref_kill(), which is never called on root cgroup?

 > Hm, true. But it means that the problem is not with the root cgroup's 
bpf?

 >How easy is to reproduce the problem? Is it possible to bisect the 
problematic
 >commit?

 >Thanks!

[same tester information and dmesg output as in the previous message snipped]
Cong Wang June 23, 2020, 5:56 p.m. UTC | #25
On Tue, Jun 23, 2020 at 1:45 AM Zhang,Qiang <qiang.zhang@windriver.com> wrote:
>
> There are some message in kernelv5.4, I don't know if it will help.
>
> demsg:
>
> cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or
> net_cls activation
...
> -----------[ cut here ]-----------
> percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
> WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161
> percpu_ref_switch_to_atomic_rcu+0x12a/0x140

Yes, this proves we have the refcnt bug which my patch tries to fix.
The negative refcnt is exactly a consequence of the bug, as without
my patch we just put the refcnt without holding it first.
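The imbalance can be modeled in a few lines (a toy counter standing in for the cgroup/bpf refcount; clone_sock_buggy/clone_sock_fixed/free_sock are illustrative names for the sk_clone_lock()/sk_prot_free() paths, not kernel functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the refcount behind sk_cgrp_data's cgroup. */
struct toy_cgroup {
	int refcnt;
};

static bool cgroup_sk_alloc_disabled;

/* Before the fix: the clone path skipped the get once
 * cgroup_sk_alloc was disabled... */
void clone_sock_buggy(struct toy_cgroup *cg)
{
	if (!cgroup_sk_alloc_disabled)
		cg->refcnt++;		/* cgroup_bpf_get() */
}

/* ...but the free path unconditionally does the put. */
void free_sock(struct toy_cgroup *cg)
{
	cg->refcnt--;			/* cgroup_bpf_put() */
}

/* After the fix: always take the reference when sk_cgrp_data
 * is copied in the clone path. */
void clone_sock_fixed(struct toy_cgroup *cg)
{
	cg->refcnt++;
}
```

With cgroup_sk_alloc disabled, cloning one socket and then freeing both the clone and its parent drives the toy count negative, mirroring the underflow in the warning quoted above.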

If you can reproduce it, please test my patch:
https://patchwork.ozlabs.org/project/netdev/patch/20200616180352.18602-1-xiyou.wangcong@gmail.com/

But, so far I still don't have a good explanation to the NULL
pointer deref. I think that one is an older bug, and we need to check
for NULL even after we fix the refcnt bug, but I do not know how it is
just exposed recently with Zefan's patch. I am still trying to find an
explanation.

Thanks!
Roman Gushchin June 23, 2020, 10:21 p.m. UTC | #26
On Fri, Jun 19, 2020 at 08:31:14PM -0700, Cong Wang wrote:
> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
> >
> > On 2020/6/20 8:45, Zefan Li wrote:
> > > On 2020/6/20 3:51, Cong Wang wrote:
> > >> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
> > >>>
> > >>> On 2020/6/19 5:09, Cong Wang wrote:
> > >>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> > >>>>>
> > >>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > >>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > >>>>>>>
> > >>>>>>> Cc: Roman Gushchin <guro@fb.com>
> > >>>>>>>
> > >>>>>>> Thanks for fixing this.
> > >>>>>>>
> > >>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> > >>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > >>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> > >>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> > >>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> > >>>>>>>> even when cgroup_sk_alloc is disabled.
> > >>>>>>>>
> > >>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> > >>>>>>>> would terminate this function if called there. And for sk_alloc()
> > >>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> > >>>>>>>> to make it more readable.
> > >>>>>>>>
> > >>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > >>>>>>>
> > >>>>>>> but I don't think the bug was introduced by this commit, because there
> > >>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > >>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> > >>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> > >>>>>>> with systemd involved.
> > >>>>>>>
> > >>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > >>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> > >>>>>>
> > >>>>>> Good point.
> > >>>>>>
> > >>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> > >>>>>> is the one to blame, because it is the first commit that began to
> > >>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> > >>>>>
> > >>>>> I agree, it seems that the issue is not related to bpf and probably
> > >>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > >>>>> seems closer to the origin.
> > >>>>
> > >>>> Yeah, I will update the Fixes tag and send V2.
> > >>>>
> > >>>
> > >>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> > >>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> > >>> but this is fine, because when the socket is to be freed:
> > >>>
> > >>>  sk_prot_free()
> > >>>    cgroup_sk_free()
> > >>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> > >>>
> > >>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> > >>
> > >> But skcd->val can be a pointer to a non-root cgroup:
> > >
> > > It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> > > when cgroup_sk_alloc is disabled.
> > >
> >
> > And please read those recent bug reports, they all happened when bpf cgroup was in use,
> > and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
> 
> I am totally aware of this. My concern is whether cgroup
> has the same refcnt bug as it always pairs with the bpf refcnt.
> 
> But, after a second look, the non-root cgroup refcnt is immediately
> overwritten by sock_update_classid() or sock_update_netprioidx(),
> which effectively turns into a root cgroup again. :-/
> 
> (It seems we leak a refcnt here, but this is not related to my patch).

Yeah, I looked over this code, and I have the same suspicion.
Especially in sk_alloc(), where cgroup_sk_alloc() is followed by
sock_update_classid() and sock_update_netprioidx().

I also think your original patch is good, but there are probably
some other problems which it doesn't fix.

I looked over cgroup bpf code again, and the only difference with cgroup
refcounting I see (behind the root cgroup, which is a non-issue) is
here:

void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
	...
	while (true) {
		struct css_set *cset;

		cset = task_css_set(current);
		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
			  ^^^^^^^^^^^^^^^
			skcd->val = (unsigned long)cset->dfl_cgrp;
			cgroup_bpf_get(cset->dfl_cgrp);
			^^^^^^^^^^^^^^
			break;
			...

So, in theory, cgroup_bpf_get() can be called here after cgroup_bpf_release().
We might wanna introduce something like cgroup_bpf_tryget_live().
Idk if it can happen in reality, because it would require opening a new socket
in a deleted cgroup (without any other associated sockets).

Other than that I don't see any differences between cgroup and cgroup bpf
reference counting.
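A tryget-style guard of the kind suggested could look roughly like this (a userspace sketch with a plain C11 atomic; toy_tryget_live is a hypothetical name, not an existing kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_ref {
	atomic_int count;	/* <= 0 once released */
};

/* Take a reference only if the object is still live; the caller
 * must handle failure instead of blindly bumping the count. */
bool toy_tryget_live(struct toy_ref *r)
{
	int old = atomic_load(&r->count);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;	/* got a reference */
	}
	return false;	/* already released */
}
```

In cgroup_sk_alloc() such a guard would close the window described above: if the bpf refcount has already been released, the tryget fails and the caller can retry instead of resurrecting a dead reference.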

Thanks!

PS I'll be completely offline till the end of the week. I'll answer all
e-mails on Monday (Jun 29th).
Cameron Berkenpas June 26, 2020, 5:23 a.m. UTC | #27
Hello,

Somewhere along the way I got the impression that it generally takes 
those affected hours before their systems lock up. I'm (generally) able 
to reproduce this issue much faster than that. Regardless, I can help test.

Are there any patches that need testing, or is this all still pending
discussion around the best way to resolve the issue?

Thanks!

On 6/23/20 3:21 PM, Roman Gushchin wrote:
> On Fri, Jun 19, 2020 at 08:31:14PM -0700, Cong Wang wrote:
>> On Fri, Jun 19, 2020 at 5:51 PM Zefan Li <lizefan@huawei.com> wrote:
>>> On 2020/6/20 8:45, Zefan Li wrote:
>>>> On 2020/6/20 3:51, Cong Wang wrote:
>>>>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>>>>
>>>>>>>>>> Thanks for fixing this.
>>>>>>>>>>
>>>>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>>>>
>>>>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>>>>> to make it more readable.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>>>>> with systemd involved.
>>>>>>>>>>
>>>>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>>>> Good point.
>>>>>>>>>
>>>>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>>>>> is the one to blame, because it is the first commit that began to
>>>>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>>>> I agree, it seems that the issue is not related to bpf and probably
>>>>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>>>>> seems closer to the origin.
>>>>>>> Yeah, I will update the Fixes tag and send V2.
>>>>>>>
>>>>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>>>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>>>>> but this is fine, because when the socket is to be freed:
>>>>>>
>>>>>>   sk_prot_free()
>>>>>>     cgroup_sk_free()
>>>>>>       cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>>>>
>>>>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>>>> But skcd->val can be a pointer to a non-root cgroup:
>>>> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
>>>> when cgroup_sk_alloc is disabled.
>>>>
>>> And please read those recent bug reports, they all happened when bpf cgroup was in use,
>>> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.
>> I am totally aware of this. My concern is whether cgroup
>> has the same refcnt bug as it always pairs with the bpf refcnt.
>>
>> But, after a second look, the non-root cgroup refcnt is immediately
>> overwritten by sock_update_classid() or sock_update_netprioidx(),
>> which effectively turns into a root cgroup again. :-/
>>
>> (It seems we leak a refcnt here, but this is not related to my patch).
> Yeah, I looked over this code, and I have the same suspicion.
> Especially in sk_alloc(), where cgroup_sk_alloc() is followed by
> sock_update_classid() and sock_update_netprioidx().
>
> I also think your original patch is good, but there are probably
> some other problems which it doesn't fix.
>
> I looked over cgroup bpf code again, and the only difference with cgroup
> refcounting I see (behind the root cgroup, which is a non-issue) is
> here:
>
> void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
> {
> 	...
> 	while (true) {
> 		struct css_set *cset;
>
> 		cset = task_css_set(current);
> 		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
> 			  ^^^^^^^^^^^^^^^
> 			skcd->val = (unsigned long)cset->dfl_cgrp;
> 			cgroup_bpf_get(cset->dfl_cgrp);
> 			^^^^^^^^^^^^^^
> 			break;
> 			...
>
> So, in theory, cgroup_bpf_get() can be called here after cgroup_bpf_release().
> We might wanna introduce something like cgroup_bpf_tryget_live().
> Idk if it can happen in reality, because it would require opening a new socket
> in a deleted cgroup (without any other associated sockets).
>
> Other than that I don't see any differences between cgroup and cgroup bpf
> reference counting.
>
> Thanks!
>
> PS I'll be completely offline till the end of the week. I'll answer all
> e-mails on Monday (Jun 29th).
Cong Wang June 26, 2020, 5:58 p.m. UTC | #28
On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>
> Hello,
>
> Somewhere along the way I got the impression that it generally takes
> those affected hours before their systems lock up. I'm (generally) able
> to reproduce this issue much faster than that. Regardless, I can help test.
>
> Are there any patches that need testing or is this all still pending
> discussion around the  best way to resolve the issue?

Yes. I came up with a (hopefully) much better patch in the attachment.
Can you help test it? You need to unapply the previous patch before
applying this one.

(Just in case of any confusion: I still believe we should check NULL on
top of this refcnt fix. But it should be a separate patch.)

Thank you!
Cameron Berkenpas June 26, 2020, 10:03 p.m. UTC | #29
Box has been up for 25 minutes without issue. Probably the patch works, 
but I can further confirm by tomorrow.

On 6/26/2020 10:58 AM, Cong Wang wrote:
> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>> Hello,
>>
>> Somewhere along the way I got the impression that it generally takes
>> those affected hours before their systems lock up. I'm (generally) able
>> to reproduce this issue much faster than that. Regardless, I can help test.
>>
>> Are there any patches that need testing or is this all still pending
>> discussion around the  best way to resolve the issue?
> Yes. I come up with a (hopefully) much better patch in the attachment.
> Can you help to test it? You need to unapply the previous patch before
> applying this one.
>
> (Just in case of any confusion: I still believe we should check NULL on
> top of this refcnt fix. But it should be a separate patch.)
>
> Thank you!
Cameron Berkenpas June 27, 2020, 10:59 p.m. UTC | #30
The box has been up without issue for over 25 hours now. The patch seems 
solid.

On 6/26/20 3:03 PM, Cameron Berkenpas wrote:
> Box has been up for 25 minutes without issue. Probably the patch 
> works, but I can further confirm by tomorrow.
>
> On 6/26/2020 10:58 AM, Cong Wang wrote:
>> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> 
>> wrote:
>>> Hello,
>>>
>>> Somewhere along the way I got the impression that it generally takes
>>> those affected hours before their systems lock up. I'm (generally) able
>>> to reproduce this issue much faster than that. Regardless, I can 
>>> help test.
>>>
>>> Are there any patches that need testing or is this all still pending
>>> discussion around the  best way to resolve the issue?
>> Yes. I come up with a (hopefully) much better patch in the attachment.
>> Can you help to test it? You need to unapply the previous patch before
>> applying this one.
>>
>> (Just in case of any confusion: I still believe we should check NULL on
>> top of this refcnt fix. But it should be a separate patch.)
>>
>> Thank you!
>
Roman Gushchin June 27, 2020, 11:41 p.m. UTC | #31
On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> >
> > Hello,
> >
> > Somewhere along the way I got the impression that it generally takes
> > those affected hours before their systems lock up. I'm (generally) able
> > to reproduce this issue much faster than that. Regardless, I can help test.
> >
> > Are there any patches that need testing or is this all still pending
> > discussion around the  best way to resolve the issue?
> 
> Yes. I come up with a (hopefully) much better patch in the attachment.
> Can you help to test it? You need to unapply the previous patch before
> applying this one.
> 
> (Just in case of any confusion: I still believe we should check NULL on
> top of this refcnt fix. But it should be a separate patch.)
> 
> Thank you!

Not opposing the patch, but the Fixes tag is still confusing me.
Do we have an explanation for what's wrong with 4bfc0bb2c60e?

It looks like we have cgroup_bpf_get()/put() exactly where we have
cgroup_get()/put(), so it would be nice to understand what's different
if the problem is bpf-related.

Thanks!

> commit 259150604c0b77c717fdaab057da5722e2dfd922
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Sat Jun 13 12:34:40 2020 -0700
> 
>     cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
>     
>     When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>     copied, so the cgroup refcnt must be taken too. And, unlike the
>     sk_alloc() path, sock_update_netprioidx() is not called here.
>     Therefore, it is safe and necessary to grab the cgroup refcnt
>     even when cgroup_sk_alloc is disabled.
>     
>     sk_clone_lock() is in BH context anyway, the in_interrupt()
>     would terminate this function if called there. And for sk_alloc()
>     skcd->val is always zero. So it's safe to factor out the code
>     to make it more readable.
>     
>     Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>     Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
>     Reported-by: Peter Geis <pgwipeout@gmail.com>
>     Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
>     Reported-by: Daniël Sonck <dsonck92@gmail.com>
>     Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
>     Cc: Daniel Borkmann <daniel@iogearbox.net>
>     Cc: Zefan Li <lizefan@huawei.com>
>     Cc: Tejun Heo <tj@kernel.org>
>     Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 52661155f85f..4f1cd0edc57d 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -790,7 +790,8 @@ struct sock_cgroup_data {
>  	union {
>  #ifdef __LITTLE_ENDIAN
>  		struct {
> -			u8	is_data;
> +			u8	is_data : 1;
> +			u8	no_refcnt : 1;
>  			u8	padding;
>  			u16	prioidx;
>  			u32	classid;
> @@ -800,7 +801,8 @@ struct sock_cgroup_data {
>  			u32	classid;
>  			u16	prioidx;
>  			u8	padding;
> -			u8	is_data;
> +			u8	no_refcnt : 1;
> +			u8	is_data : 1;
>  		} __packed;
>  #endif
>  		u64		val;
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..618838c48313 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
>  
>  void cgroup_sk_alloc_disable(void);
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd);
>  void cgroup_sk_free(struct sock_cgroup_data *skcd);
>  
>  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
> @@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  	 */
>  	v = READ_ONCE(skcd->val);
>  
> -	if (v & 1)
> +	if (v & 3)
>  		return &cgrp_dfl_root.cgrp;
>  
>  	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> @@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  #else	/* CONFIG_CGROUP_DATA */
>  
>  static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
> +static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
>  static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
>  
>  #endif	/* CONFIG_CGROUP_DATA */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1ea181a58465..dd247747ec14 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
>  
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  {
> -	if (cgroup_sk_alloc_disabled)
> -		return;
> -
> -	/* Socket clone path */
> -	if (skcd->val) {
> -		/*
> -		 * We might be cloning a socket which is left in an empty
> -		 * cgroup and the cgroup might have already been rmdir'd.
> -		 * Don't use cgroup_get_live().
> -		 */
> -		cgroup_get(sock_cgroup_ptr(skcd));
> -		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	if (cgroup_sk_alloc_disabled) {
> +		skcd->no_refcnt = 1;
>  		return;
>  	}
>  
> @@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  	rcu_read_unlock();
>  }
>  
> +void cgroup_sk_clone(struct sock_cgroup_data *skcd)
> +{
> +	if (skcd->val) {
> +		if (skcd->no_refcnt)
> +			return;
> +		/*
> +		 * We might be cloning a socket which is left in an empty
> +		 * cgroup and the cgroup might have already been rmdir'd.
> +		 * Don't use cgroup_get_live().
> +		 */
> +		cgroup_get(sock_cgroup_ptr(skcd));
> +		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> +	}
> +}
> +
>  void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  {
>  	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
>  
> +	if (skcd->no_refcnt)
> +		return;
>  	cgroup_bpf_put(cgrp);
>  	cgroup_put(cgrp);
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index d832c650287c..2e5b7870e5d3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1926,7 +1926,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		/* sk->sk_memcg will be populated at accept() time */
>  		newsk->sk_memcg = NULL;
>  
> -		cgroup_sk_alloc(&newsk->sk_cgrp_data);
> +		cgroup_sk_clone(&newsk->sk_cgrp_data);
>  
>  		rcu_read_lock();
>  		filter = rcu_dereference(sk->sk_filter);
Cong Wang June 30, 2020, 10:16 p.m. UTC | #32
On Sat, Jun 27, 2020 at 3:59 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>
> The box has been up without issue for over 25 hours now. The patch seems
> solid.

That's great! Thanks for testing!
Cong Wang June 30, 2020, 10:22 p.m. UTC | #33
On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> > >
> > > Hello,
> > >
> > > Somewhere along the way I got the impression that it generally takes
> > > those affected hours before their systems lock up. I'm (generally) able
> > > to reproduce this issue much faster than that. Regardless, I can help test.
> > >
> > > Are there any patches that need testing or is this all still pending
> > > discussion around the best way to resolve the issue?
> >
> > Yes. I came up with a (hopefully) much better patch in the attachment.
> > Can you help to test it? You need to unapply the previous patch before
> > applying this one.
> >
> > (Just in case of any confusion: I still believe we should check NULL on
> > top of this refcnt fix. But it should be a separate patch.)
> >
> > Thank you!
>
> Not opposing the patch, but the Fixes tag is still confusing me.
> Do we have an explanation for what's wrong with 4bfc0bb2c60e?
>
> It looks like we have cgroup_bpf_get()/put() exactly where we have
> cgroup_get()/put(), so it would be nice to understand what's different
> if the problem is bpf-related.

Hmm, in our previous discussion it was Zefan who believed the cgroup
refcnt is fine and the bug is just in the cgroup bpf refcnt.

Although I agree the cgroup refcnt is buggy too, it may not necessarily
cause any real problem; otherwise we would have received bug reports
much earlier than just recently, right?

If the Fixes tag is confusing, I can certainly remove it, but this also
means the patch will not be backported to stable. I am fine either
way; this crash was only reported after Zefan's recent change anyway.

Thanks.
Roman Gushchin June 30, 2020, 10:48 p.m. UTC | #34
On Tue, Jun 30, 2020 at 03:22:34PM -0700, Cong Wang wrote:
> On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Somewhere along the way I got the impression that it generally takes
> > > > those affected hours before their systems lock up. I'm (generally) able
> > > > to reproduce this issue much faster than that. Regardless, I can help test.
> > > >
> > > > Are there any patches that need testing or is this all still pending
> > > > discussion around the  best way to resolve the issue?
> > >
> > > Yes. I came up with a (hopefully) much better patch in the attachment.
> > > Can you help to test it? You need to unapply the previous patch before
> > > applying this one.
> > >
> > > (Just in case of any confusion: I still believe we should check NULL on
> > > top of this refcnt fix. But it should be a separate patch.)
> > >
> > > Thank you!
> >
> > Not opposing the patch, but the Fixes tag is still confusing me.
> > Do we have an explanation for what's wrong with 4bfc0bb2c60e?
> >
> > It looks like we have cgroup_bpf_get()/put() exactly where we have
> > cgroup_get()/put(), so it would be nice to understand what's different
> > if the problem is bpf-related.
> 
> Hmm, in our previous discussion it was Zefan who believed the cgroup
> refcnt is fine and the bug is just in the cgroup bpf refcnt.
> 
> Although I agree the cgroup refcnt is buggy too, it may not necessarily
> cause any real problem; otherwise we would have received bug reports
> much earlier than just recently, right?
> 
> If the Fixes tag is confusing, I can certainly remove it, but this also
> means the patch will not be backported to stable. I am fine either
> way; this crash was only reported after Zefan's recent change anyway.

Well, I'm not trying to protect my commit, I just don't understand
the whole picture and what I see doesn't make complete sense to me.

I understand a problem which can be described as copying the cgroup pointer
on socket cloning without bumping the reference counter.
It seems that this problem is not caused by the bpf changes, so if we're adding
a Fixes tag, it must point at an earlier commit. Most likely, it was there from
the start, i.e. from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup").
Do we know why Zefan's change made it reproducible?

Btw if we want to backport the fix but can't blame a specific commit,
we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".

Thanks!
Zefan Li July 1, 2020, 1:18 a.m. UTC | #35
On 2020/7/1 6:48, Roman Gushchin wrote:
> On Tue, Jun 30, 2020 at 03:22:34PM -0700, Cong Wang wrote:
>> On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@fb.com> wrote:
>>>
>>> On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
>>>> On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@neo-zeon.de> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Somewhere along the way I got the impression that it generally takes
>>>>> those affected hours before their systems lock up. I'm (generally) able
>>>>> to reproduce this issue much faster than that. Regardless, I can help test.
>>>>>
>>>>> Are there any patches that need testing or is this all still pending
>>>>> discussion around the best way to resolve the issue?
>>>>
>>>> Yes. I came up with a (hopefully) much better patch in the attachment.
>>>> Can you help to test it? You need to unapply the previous patch before
>>>> applying this one.
>>>>
>>>> (Just in case of any confusion: I still believe we should check NULL on
>>>> top of this refcnt fix. But it should be a separate patch.)
>>>>
>>>> Thank you!
>>>
>>> Not opposing the patch, but the Fixes tag is still confusing me.
>>> Do we have an explanation for what's wrong with 4bfc0bb2c60e?
>>>
>>> It looks like we have cgroup_bpf_get()/put() exactly where we have
>>> cgroup_get()/put(), so it would be nice to understand what's different
>>> if the problem is bpf-related.
>>
>> Hmm, in our previous discussion it was Zefan who believed the cgroup
>> refcnt is fine and the bug is just in the cgroup bpf refcnt.
>>
>> Although I agree the cgroup refcnt is buggy too, it may not necessarily
>> cause any real problem; otherwise we would have received bug reports
>> much earlier than just recently, right?
>>
>> If the Fixes tag is confusing, I can certainly remove it, but this also
>> means the patch will not be backported to stable. I am fine either
>> way; this crash was only reported after Zefan's recent change anyway.
> 
> Well, I'm not trying to protect my commit, I just don't understand
> the whole picture and what I see doesn't make complete sense to me.
> 
> I understand a problem which can be described as copying the cgroup pointer
> on socket cloning without bumping the reference counter.
> It seems that this problem is not caused by the bpf changes, so if we're adding
> a Fixes tag, it must point at an earlier commit. Most likely, it was there from
> the start, i.e. from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup").
> Do we know why Zefan's change made it reproducible?
> 

Actually I've explained it in my earlier reply:

https://www.spinics.net/lists/netdev/msg661004.html

With my change, cgroup_sk_alloc will be disabled when we create a cgroup in
the netprio cgroup subsystem, and systemd keeps creating cgroups at high
frequency.

Before this change, it would be disabled only when someone wrote to ifpriomap
or classid in cgroupfs, and they are very unlikely to do this if they are
using bpf in the default cgroup tree, I think.

So the bug has been there for some time.

> Btw if we want to backport the problem but can't blame a specific commit,
> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
>
Cong Wang July 2, 2020, 4:48 a.m. UTC | #36
On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>
> Btw if we want to backport the fix but can't blame a specific commit,
> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".

Sure, but if we don't know which is the right commit to blame, then how
do we know which stable version the patch should target? :)

I am open to all options here, including not backporting to stable at all.

Thanks.
Thomas Lamprecht July 2, 2020, 8:12 a.m. UTC | #37
On 02.07.20 06:48, Cong Wang wrote:
> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>>
>> Btw if we want to backport the fix but can't blame a specific commit,
>> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> 
> Sure, but if we don't know which is the right commit to blame, then how
> do we know which stable version the patch should target? :)

We ran into a similar issue here after updating from the 5.4.41 to the
5.4.44 stable kernel. This patch addresses the issue; we have been running
stable at >17 hours uptime with this patch, whereas without it we normally
ran into issues at <6 hours uptime.

That update newly included commit 090e28b229af92dc5b ("netprio_cgroup: Fix
unlimited memory leak of v2 cgroups"), which this patch originally mentions as
"Fixes", whereas the other mentioned possible culprit 4bfc0bb2c60e2f4c ("bpf:
decouple the lifetime of cgroup_bpf from cgroup itself") was already included
with 5.2 here and did *not* cause problems.

So, while the real culprit may be something else, a mix of them, or even more
complex, the race is at least triggered far more frequently by
090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups"),
or possibly by something else from the v5.4.41..v5.4.44 commit range - I have
not looked into that in detail yet.

> 
> I am open to all options here, including not backporting to stable at all.

As said, the stable-5.4.y tree benefits from having this patch, so there's
that.

Also, FWIW:
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Roman Gushchin July 2, 2020, 4:02 p.m. UTC | #38
On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Btw if we want to backport the fix but can't blame a specific commit,
> > we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> 
> Sure, but if we don't know which is the right commit to blame, then how
> > do we know which stable version the patch should target? :)
> 
> I am open to all options here, including not backporting to stable at all.

It seems that the issue has been there since bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
so I'd go with it. Otherwise we can go with 5.4+; as I understand it, before that it was
hard to reproduce.

Thanks!
Peter Geis July 2, 2020, 4:24 p.m. UTC | #39
On Thu, Jul 2, 2020 at 12:03 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
> > On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Btw if we want to backport the fix but can't blame a specific commit,
> > > we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
> >
> > Sure, but if we don't know which is the right commit to blame, then how
> > > do we know which stable version the patch should target? :)
> >
> > I am open to all options here, including not backporting to stable at all.
>
> It seems that the issue has been there since bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
> so I'd go with it. Otherwise we can go with 5.4+; as I understand it, before that it was
> hard to reproduce.
>
> Thanks!

Just wanted to let you know, this patch has been running for ~36 hours
or so without a crash.
So:
Tested-by: Peter Geis <pgwipeout@gmail.com>
Zefan Li July 3, 2020, 1:17 a.m. UTC | #40
On 2020/7/3 0:02, Roman Gushchin wrote:
> On Wed, Jul 01, 2020 at 09:48:48PM -0700, Cong Wang wrote:
>> On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin <guro@fb.com> wrote:
>>>
>>> Btw if we want to backport the fix but can't blame a specific commit,
>>> we can always use something like "Cc: <stable@vger.kernel.org>    [3.1+]".
>>
>> Sure, but if we don't know which is the right commit to blame, then how
>> do we know which stable version the patch should target? :)
>>
>> I am open to all options here, including not backporting to stable at all.
> 
> It seems that the issue has been there since bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup"),
> so I'd go with it. Otherwise we can go with 5.4+; as I understand it, before that it was
> hard to reproduce.
> 

Actually I think it should be very easy to reproduce the bug.

Suppose the default cgroup hierarchy and the netcls cgroup are mounted at
/cgroup/default and /cgroup/netcls respectively, and then:

1. mkdir /cgroup/default/sub1
2. mkdir /cgroup/default/sub2
3. attach some tasks into sub1/ and sub2/
4. attach bpf program to sub1/ and sub2/ # get bpf refcnt for those cgroups
5. echo 1 > /cgroup/netcls/classid       # this will disable cgroup_sk_alloc
6. kill all tasks in sub1/ and sub2/
7. rmdir sub1/ sub2/

The last step will drop the bpf refcount of the default root cgroup instead of
sub1/ and sub2/, and should trigger the bug.

FYI I never use bpf, so I might be wrong.

Patch

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..818dc7b3ed6c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@  extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -847,6 +848,7 @@  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 #else	/* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..6377045b7096 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6442,18 +6442,6 @@  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	if (cgroup_sk_alloc_disabled)
 		return;
 
-	/* Socket clone path */
-	if (skcd->val) {
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
-		return;
-	}
-
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;
@@ -6475,6 +6463,20 @@  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+	/* Socket clone path */
+	if (skcd->val) {
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	}
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..b62f06fa5e37 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1925,7 +1925,7 @@  struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		/* sk->sk_memcg will be populated at accept() time */
 		newsk->sk_memcg = NULL;
 
-		cgroup_sk_alloc(&newsk->sk_cgrp_data);
+		cgroup_sk_clone(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
 		filter = rcu_dereference(sk->sk_filter);