[net,v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

Message ID 20200702185256.17917-1-xiyou.wangcong@gmail.com
State Accepted
Delegated to: David Miller
Series [net,v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

Commit Message

Cong Wang July 2, 2020, 6:52 p.m. UTC
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway; the in_interrupt() check would
terminate this function if it were called there. And for sk_alloc()
skcd->val is always zero, so it is safe to factor out the code to make
it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes;
ARCH_KMALLOC_MINALIGN is certainly larger than that.
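
For illustration only, here is a minimal user-space sketch of the low-bit
trick this relies on (the names below are made up; the real structure is
the sock_cgroup_data union touched by the patch further down). With at
least 4-byte alignment the two low bits of a cgroup pointer are always
zero, so a reader can tell a raw pointer from packed flag/data bits:

#include <assert.h>
#include <stdint.h>

#define SKCD_IS_DATA	0x1UL	/* bit 0: val carries prioidx/classid data */
#define SKCD_NO_REFCNT	0x2UL	/* bit 1: socket took no cgroup refcount */

static void *skcd_to_cgroup(uint64_t val)
{
	/* Either flag set: val is not a pointer, fall back to a default. */
	if (val & (SKCD_IS_DATA | SKCD_NO_REFCNT))
		return NULL;

	/* Flags clear: val is the aligned cgroup pointer itself. */
	return (void *)(uintptr_t)val;
}

int main(void)
{
	uint64_t cgrp = 0x1224040;	/* any address aligned to >= 4 bytes */

	assert((cgrp & 3) == 0);	/* low bits free thanks to alignment */
	assert(skcd_to_cgroup(cgrp) == (void *)(uintptr_t)0x1224040);
	assert(skcd_to_cgroup(cgrp | SKCD_NO_REFCNT) == NULL);
	return 0;
}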

This bug seems to have been present since the beginning; commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not completely. It did not seem easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/cgroup-defs.h |  6 ++++--
 include/linux/cgroup.h      |  4 +++-
 kernel/cgroup/cgroup.c      | 31 +++++++++++++++++++------------
 net/core/sock.c             |  2 +-
 4 files changed, 27 insertions(+), 16 deletions(-)

Comments

David Miller July 7, 2020, 8:34 p.m. UTC | #1
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu,  2 Jul 2020 11:52:56 -0700

> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> copied, so the cgroup refcnt must be taken too. And, unlike the
> sk_alloc() path, sock_update_netprioidx() is not called here.
> Therefore, it is safe and necessary to grab the cgroup refcnt
> even when cgroup_sk_alloc is disabled.
> 
> sk_clone_lock() is in BH context anyway, the in_interrupt()
> would terminate this function if called there. And for sk_alloc()
> skcd->val is always zero. So it's safe to factor out the code
> to make it more readable.
> 
> The global variable 'cgroup_sk_alloc_disabled' is used to determine
> whether to take these reference counts. It is impossible to make
> the reference counting correct unless we save this bit of information
> in skcd->val. So, add a new bit there to record whether the socket
> has already taken the reference counts. This obviously relies on
> kmalloc() to align cgroup pointers to at least 4 bytes,
> ARCH_KMALLOC_MINALIGN is certainly larger than that.
> 
> This bug seems to be introduced since the beginning, commit
> d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
> tried to fix it but not completely. It seems not easy to trigger until
> the recent commit 090e28b229af
> ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
> 
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
> Reported-by: Peter Geis <pgwipeout@gmail.com>
> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
> Reported-by: Daniël Sonck <dsonck92@gmail.com>
> Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
> Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
> Tested-by: Peter Geis <pgwipeout@gmail.com>
> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Zefan Li <lizefan@huawei.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied and queued up for -stable, thanks!
Guenter Roeck July 8, 2020, 3:33 p.m. UTC | #2
Hi,

On Thu, Jul 02, 2020 at 11:52:56AM -0700, Cong Wang wrote:
> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> copied, so the cgroup refcnt must be taken too. And, unlike the
> sk_alloc() path, sock_update_netprioidx() is not called here.
> Therefore, it is safe and necessary to grab the cgroup refcnt
> even when cgroup_sk_alloc is disabled.
> 
> sk_clone_lock() is in BH context anyway, the in_interrupt()
> would terminate this function if called there. And for sk_alloc()
> skcd->val is always zero. So it's safe to factor out the code
> to make it more readable.
> 
> The global variable 'cgroup_sk_alloc_disabled' is used to determine
> whether to take these reference counts. It is impossible to make
> the reference counting correct unless we save this bit of information
> in skcd->val. So, add a new bit there to record whether the socket
> has already taken the reference counts. This obviously relies on
> kmalloc() to align cgroup pointers to at least 4 bytes,
> ARCH_KMALLOC_MINALIGN is certainly larger than that.
> 
> This bug seems to be introduced since the beginning, commit
> d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
> tried to fix it but not completely. It seems not easy to trigger until
> the recent commit 090e28b229af
> ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
> 

This patch causes all my s390 boot tests to crash. Reverting it fixes
the problem. Please see bisect results and crash log below.

Guenter

---
bisect results (from pending-fixes branch in -next repository):

# bad: [1432f824c2db44ef35b26caa9f81dd05211a75fc] Merge remote-tracking branch 'drm-misc-fixes/for-linux-next-fixes'
# good: [dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258] Linux 5.8-rc4
git bisect start 'HEAD' 'v5.8-rc4'
# bad: [fe12f8184e7265e2d24e5ed5b255275dfe4c1c04] Merge remote-tracking branch 'net/master'
git bisect bad fe12f8184e7265e2d24e5ed5b255275dfe4c1c04
# good: [474112d57c70520ebd81a5ca578fee1d93fafd07] Documentation: networking: ipvs-sysctl: drop doubled word
git bisect good 474112d57c70520ebd81a5ca578fee1d93fafd07
# good: [6d12075ddeedc38d25c5b74e929e686158da728c] Merge tag 'mtd/fixes-for-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
git bisect good 6d12075ddeedc38d25c5b74e929e686158da728c
# good: [74478ea4ded519db35cb1f059948b1e713bb4abf] net: ipa: fix QMI structure definition bugs
git bisect good 74478ea4ded519db35cb1f059948b1e713bb4abf
# bad: [9c29e36152748fd623fcff6cc8f538550f9eeafc] mptcp: fix DSS map generation on fin retransmission
git bisect bad 9c29e36152748fd623fcff6cc8f538550f9eeafc
# good: [aea23c323d89836bcdcee67e49def997ffca043b] ipv6: Fix use of anycast address with loopback
git bisect good aea23c323d89836bcdcee67e49def997ffca043b
# bad: [28b18e4eb515af7c6661c3995c6e3c34412c2874] net: sky2: initialize return of gm_phy_read
git bisect bad 28b18e4eb515af7c6661c3995c6e3c34412c2874
# bad: [ad0f75e5f57ccbceec13274e1e242f2b5a6397ed] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
git bisect bad ad0f75e5f57ccbceec13274e1e242f2b5a6397ed
# first bad commit: [ad0f75e5f57ccbceec13274e1e242f2b5a6397ed] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

---
Crash log:

[   22.390674] Run /sbin/init as init process
[   22.497551] Unable to handle kernel pointer dereference in virtual kernel address space
[   22.497738] Failing address: 5010f0b45fa93000 TEID: 5010f0b45fa93803
[   22.497813] Fault in home space mode while using kernel ASCE.
[   22.497958] AS:0000000001774007 R3:0000000000000024
[   22.498300] Oops: 0038 ilc:3 [#1] SMP
[   22.498405] Modules linked in:
[   22.499027] CPU: 0 PID: 153 Comm: init Not tainted 5.8.0-rc4-00328-g1432f824c2db4 #1
[   22.499112] Hardware name: QEMU 2964 QEMU (KVM/Linux)
[   22.499261] Krnl PSW : 0704e00180000000 0000000000259be0 (cgroup_sk_free+0xa8/0x1e8)
[   22.499405]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[   22.499506] Krnl GPRS: 0000000048a38585 5010f0b45fa93094 0000000000000002 000000001c228bd8
[   22.499585]            000000001c228bb0 0000000000000000 0000000000000000 00000000011c2eda
[   22.499665]            fffffffffffff000 00000000011c1f72 00000000011deef0 0000000000014040
[   22.499744]            000000001c228100 0000000000e76bf0 0000000000259c82 000003e0002c3c00
[   22.500270] Krnl Code: 0000000000259bd2: a72a0001		ahi	%r2,1
[   22.500270]            0000000000259bd6: 502003a8		st	%r2,936
[   22.500270]           #0000000000259bda: e31003b80008	ag	%r1,952
[   22.500270]           >0000000000259be0: e32010000004	lg	%r2,0(%r1)
[   22.500270]            0000000000259be6: a7f40004		brc	15,0000000000259bee
[   22.500270]            0000000000259bea: b9040023		lgr	%r2,%r3
[   22.500270]            0000000000259bee: b9040032		lgr	%r3,%r2
[   22.500270]            0000000000259bf2: b9040042		lgr	%r4,%r2
[   22.500635] Call Trace:
[   22.500748]  [<0000000000259be0>] cgroup_sk_free+0xa8/0x1e8
[   22.500835] ([<0000000000259bb4>] cgroup_sk_free+0x7c/0x1e8)
[   22.500914]  [<0000000000b24e16>] __sk_destruct+0x196/0x260
[   22.500999]  [<0000000000cadc18>] unix_release_sock+0x358/0x460
[   22.501073]  [<0000000000cadd5a>] unix_release+0x3a/0x60
[   22.501149]  [<0000000000b1a63a>] __sock_release+0x62/0xf8
[   22.501223]  [<0000000000b1a6f8>] sock_close+0x28/0x38
[   22.501299]  [<000000000045101e>] __fput+0x126/0x2a8
[   22.501374]  [<000000000017e088>] task_work_run+0x78/0xc8
[   22.501449]  [<000000000010a596>] do_notify_resume+0x9e/0xa8
[   22.501526]  [<0000000000de555a>] system_call+0xe6/0x2d4
[   22.501657] INFO: lockdep is turned off.
[   22.501736] Last Breaking-Event-Address:
[   22.501814]  [<0000000000259c86>] cgroup_sk_free+0x14e/0x1e8
[   22.502169] Kernel panic - not syncing: Fatal exception: panic_on_oops
Cong Wang July 8, 2020, 6:34 p.m. UTC | #3
Hi,

On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck <linux@roeck-us.net> wrote:
> This patch causes all my s390 boot tests to crash. Reverting it fixes
> the problem. Please see bisect results and crash log below.
>
...
> Crash log:

Interesting. I don't see how a unix socket is special here; it creates
a peer sock with sk_alloc(), but this is no different from two separate
sockets.

What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
if you don't enable either of them but enable CONFIG_CGROUP_BPF.

And if you have the full kernel log, it would be helpful too.

Thanks.
Guenter Roeck July 8, 2020, 7:07 p.m. UTC | #4
On 7/8/20 11:34 AM, Cong Wang wrote:
> Hi,
> 
> On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck <linux@roeck-us.net> wrote:
>> This patch causes all my s390 boot tests to crash. Reverting it fixes
>> the problem. Please see bisect results and crash log below.
>>
> ...
>> Crash log:
> 
> Interesting. I don't see how unix socket is any special here, it creates
> a peer sock with sk_alloc(), but this is not any different from two separated
> sockets.
> 
> What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
> or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
> if you don't enable either of them but enable CONFIG_CGROUP_BPF.
> 

cgroup specific configuration bits:

CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_CGROUP_IOLATENCY=y
CONFIG_BLK_CGROUP_IOCOST=y
# CONFIG_BFQ_CGROUP_DEBUG is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
CONFIG_NET_CLS_CGROUP=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

This originates from arch/s390/configs/defconfig; I don't touch
any cgroup specific configuration in my tests.

> And if you have the full kernel log, it would be helpful too.
> 

https://kerneltests.org/builders/qemu-s390-pending-fixes/builds/222/steps/qemubuildcommand/logs/stdio

Interestingly, enabling CONFIG_BFQ_CGROUP_DEBUG makes the problem disappear.

Guenter
Cong Wang July 8, 2020, 7:32 p.m. UTC | #5
On Wed, Jul 8, 2020 at 12:07 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 7/8/20 11:34 AM, Cong Wang wrote:
> > Hi,
> >
> > On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >> This patch causes all my s390 boot tests to crash. Reverting it fixes
> >> the problem. Please see bisect results and crash log below.
> >>
> > ...
> >> Crash log:
> >
> > Interesting. I don't see how unix socket is any special here, it creates
> > a peer sock with sk_alloc(), but this is not any different from two separated
> > sockets.
> >
> > What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
> > or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
> > if you don't enable either of them but enable CONFIG_CGROUP_BPF.
> >
>
> cgroup specific configuration bits:
>
> CONFIG_CGROUPS=y
> CONFIG_BLK_CGROUP=y
> CONFIG_CGROUP_WRITEBACK=y
> CONFIG_CGROUP_SCHED=y
> CONFIG_CGROUP_PIDS=y
> CONFIG_CGROUP_RDMA=y
> CONFIG_CGROUP_FREEZER=y
> CONFIG_CGROUP_HUGETLB=y
> CONFIG_CGROUP_DEVICE=y
> CONFIG_CGROUP_CPUACCT=y
> CONFIG_CGROUP_PERF=y
> CONFIG_CGROUP_BPF=y
> # CONFIG_CGROUP_DEBUG is not set
> CONFIG_SOCK_CGROUP_DATA=y
> CONFIG_BLK_CGROUP_RWSTAT=y
> CONFIG_BLK_CGROUP_IOLATENCY=y
> CONFIG_BLK_CGROUP_IOCOST=y
> # CONFIG_BFQ_CGROUP_DEBUG is not set
> # CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
> CONFIG_NET_CLS_CGROUP=y
> CONFIG_CGROUP_NET_PRIO=y
> CONFIG_CGROUP_NET_CLASSID=y
>
> This originates from arch/s390/configs/defconfig; I don't touch
> any cgroup specific configuration in my tests.

Good to know you enable everything related here.

>
> > And if you have the full kernel log, it would be helpful too.
> >
>
> https://kerneltests.org/builders/qemu-s390-pending-fixes/builds/222/steps/qemubuildcommand/logs/stdio

It looks like cgroup_sk_alloc_disabled is always false in your case.
This makes the bug even more weird. Unless there is a refcnt bug
prior to my commit, I don't see how it could happen.

>
> Interestingly, enabling CONFIG_BFQ_CGROUP_DEBUG makes the problem disappear.

Yeah, I guess there might be some cgroup refcnt bug which is just
papered over with CONFIG_BFQ_CGROUP_DEBUG=y.

If you can test patches, I can send you a debugging patch for you
to collect more data. I assume this is 100% reproducible on your
side?

Thanks.
Guenter Roeck July 8, 2020, 8 p.m. UTC | #6
On 7/8/20 12:32 PM, Cong Wang wrote:
> On Wed, Jul 8, 2020 at 12:07 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> On 7/8/20 11:34 AM, Cong Wang wrote:
>>> Hi,
>>>
>>> On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck <linux@roeck-us.net> wrote:
>>>> This patch causes all my s390 boot tests to crash. Reverting it fixes
>>>> the problem. Please see bisect results and crash log below.
>>>>
>>> ...
>>>> Crash log:
>>>
>>> Interesting. I don't see how unix socket is any special here, it creates
>>> a peer sock with sk_alloc(), but this is not any different from two separated
>>> sockets.
>>>
>>> What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
>>> or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
>>> if you don't enable either of them but enable CONFIG_CGROUP_BPF.
>>>
>>
>> cgroup specific configuration bits:
>>
>> CONFIG_CGROUPS=y
>> CONFIG_BLK_CGROUP=y
>> CONFIG_CGROUP_WRITEBACK=y
>> CONFIG_CGROUP_SCHED=y
>> CONFIG_CGROUP_PIDS=y
>> CONFIG_CGROUP_RDMA=y
>> CONFIG_CGROUP_FREEZER=y
>> CONFIG_CGROUP_HUGETLB=y
>> CONFIG_CGROUP_DEVICE=y
>> CONFIG_CGROUP_CPUACCT=y
>> CONFIG_CGROUP_PERF=y
>> CONFIG_CGROUP_BPF=y
>> # CONFIG_CGROUP_DEBUG is not set
>> CONFIG_SOCK_CGROUP_DATA=y
>> CONFIG_BLK_CGROUP_RWSTAT=y
>> CONFIG_BLK_CGROUP_IOLATENCY=y
>> CONFIG_BLK_CGROUP_IOCOST=y
>> # CONFIG_BFQ_CGROUP_DEBUG is not set
>> # CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
>> CONFIG_NET_CLS_CGROUP=y
>> CONFIG_CGROUP_NET_PRIO=y
>> CONFIG_CGROUP_NET_CLASSID=y
>>
>> This originates from arch/s390/configs/defconfig; I don't touch
>> any cgroup specific configuration in my tests.
> 
> Good to know you enable everything related here.
> 
Maybe, but on the other hand I would argue that not having everything
enabled should not result in a crash.

>>
>>> And if you have the full kernel log, it would be helpful too.
>>>
>>
>> https://kerneltests.org/builders/qemu-s390-pending-fixes/builds/222/steps/qemubuildcommand/logs/stdio
> 
> It looks like cgroup_sk_alloc_disabled is always false in your case.
> This makes the bug even more weird. Unless there is a refcnt bug
> prior to my commit, I don't see how it could happen.
> 
>>
>> Interestingly, enabling CONFIG_BFQ_CGROUP_DEBUG makes the problem disappear.
> 
> Yeah, I guess there might be some cgroup refcnt bug which is just
> papered over with CONFIG_BFQ_CGROUP_DEBUG=y.
> 
> If you can test patches, I can send you a debugging patch for you
> to collect more data. I assume this is 100% reproducible on your
> side?
> 
Sure, I'll be happy to do that. And, yes, it is always reproducible.

Guenter
Guenter Roeck July 9, 2020, 5:10 p.m. UTC | #7
On 7/8/20 12:32 PM, Cong Wang wrote:
> On Wed, Jul 8, 2020 at 12:07 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> On 7/8/20 11:34 AM, Cong Wang wrote:
>>> Hi,
>>>
>>> On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck <linux@roeck-us.net> wrote:
>>>> This patch causes all my s390 boot tests to crash. Reverting it fixes
>>>> the problem. Please see bisect results and crash log below.
>>>>
>>> ...
>>>> Crash log:
>>>
>>> Interesting. I don't see how unix socket is any special here, it creates
>>> a peer sock with sk_alloc(), but this is not any different from two separated
>>> sockets.
>>>
>>> What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
>>> or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
>>> if you don't enable either of them but enable CONFIG_CGROUP_BPF.
>>>
>>
>> cgroup specific configuration bits:
>>
>> CONFIG_CGROUPS=y
>> CONFIG_BLK_CGROUP=y
>> CONFIG_CGROUP_WRITEBACK=y
>> CONFIG_CGROUP_SCHED=y
>> CONFIG_CGROUP_PIDS=y
>> CONFIG_CGROUP_RDMA=y
>> CONFIG_CGROUP_FREEZER=y
>> CONFIG_CGROUP_HUGETLB=y
>> CONFIG_CGROUP_DEVICE=y
>> CONFIG_CGROUP_CPUACCT=y
>> CONFIG_CGROUP_PERF=y
>> CONFIG_CGROUP_BPF=y
>> # CONFIG_CGROUP_DEBUG is not set
>> CONFIG_SOCK_CGROUP_DATA=y
>> CONFIG_BLK_CGROUP_RWSTAT=y
>> CONFIG_BLK_CGROUP_IOLATENCY=y
>> CONFIG_BLK_CGROUP_IOCOST=y
>> # CONFIG_BFQ_CGROUP_DEBUG is not set
>> # CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
>> CONFIG_NET_CLS_CGROUP=y
>> CONFIG_CGROUP_NET_PRIO=y
>> CONFIG_CGROUP_NET_CLASSID=y
>>
>> This originates from arch/s390/configs/defconfig; I don't touch
>> any cgroup specific configuration in my tests.
> 
> Good to know you enable everything related here.
> 
>>
>>> And if you have the full kernel log, it would be helpful too.
>>>
>>
>> https://kerneltests.org/builders/qemu-s390-pending-fixes/builds/222/steps/qemubuildcommand/logs/stdio
> 
> It looks like cgroup_sk_alloc_disabled is always false in your case.
> This makes the bug even more weird. Unless there is a refcnt bug
> prior to my commit, I don't see how it could happen.
> 
>>
>> Interestingly, enabling CONFIG_BFQ_CGROUP_DEBUG makes the problem disappear.
> 
> Yeah, I guess there might be some cgroup refcnt bug which is just
> papered over with CONFIG_BFQ_CGROUP_DEBUG=y.
> 
> If you can test patches, I can send you a debugging patch for you
> to collect more data. I assume this is 100% reproducible on your
> side?
> 

Something seems fishy with the use of skcd->val on big endian systems.

Some debug output:

[   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
[   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
[   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
[   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
[   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
[   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
[   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
[   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
[   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
[   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)

Guenter
Cong Wang July 9, 2020, 6:51 p.m. UTC | #8
On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> Something seems fishy with the use of skcd->val on big endian systems.
>
> Some debug output:
>
> [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
> [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
> [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
> [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
> [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
> [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
> [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
> [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
> [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
> [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)

Excellent debugging! I thought it was a double put, but it seems to
be an endian issue. I didn't realize the big endian machine actually
packs bitfields in a big endian way too...

Does the attached patch address this?

Thank you!
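
A stand-alone sketch of the pitfall being discussed here, with hypothetical
names (this is not the kernel's sock_cgroup_data layout): a bitfield
overlaid on a u64 only lands in the low-order bits of the integer on
little-endian targets. On a big-endian machine such as s390, the first byte
of the struct is the most significant byte of the u64, and the ABI also
allocates bitfields starting from the most significant bit of that byte, so
a test like "v & 3" on the u64 never sees the flags:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

union skcd_sketch {		/* hypothetical stand-in, not the real layout */
	struct {
		uint8_t is_data   : 1;
		uint8_t no_refcnt : 1;
	};
	uint64_t val;
};

int main(void)
{
	union skcd_sketch s;

	memset(&s, 0, sizeof(s));
	s.no_refcnt = 1;

	/*
	 * Little-endian: prints val = 0x2, so "val & 3" catches the flag.
	 * Big-endian (e.g. s390): the flag typically ends up in the top
	 * byte instead, something like 0x4000000000000000, and "val & 3"
	 * stays zero.
	 */
	printf("val = %#llx\n", (unsigned long long)s.val);
	return 0;
}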
Cong Wang July 9, 2020, 6:59 p.m. UTC | #9
On Thu, Jul 9, 2020 at 11:51 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > Something seems fishy with the use of skcd->val on big endian systems.
> >
> > Some debug output:
> >
> > [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
> > [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
> > [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
> > [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
> > [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
> > [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
> > [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
> > [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
> > [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
> > [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)
>
> Excellent debugging! I thought it was a double put, but it seems to
> be an endian issue. I didn't realize the big endian machine actually
> packs bitfields in a big endian way too...
>
> Does the attached patch address this?

Ah, this is too ugly. We just have to always make them the last two bits.

Please test this attached patch instead and ignore the previous one.

Thanks.
Guenter Roeck July 9, 2020, 7:13 p.m. UTC | #10
On 7/9/20 11:51 AM, Cong Wang wrote:
> On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> Something seems fishy with the use of skcd->val on big endian systems.
>>
>> Some debug output:
>>
>> [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
>> [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
>> [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
>> [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
>> [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
>> [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
>> [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
>> [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
>> [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
>> [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)
> 
> Excellent debugging! I thought it was a double put, but it seems to
> be an endian issue. I didn't realize the big endian machine actually
> packs bitfields in a big endian way too...
> 
> Does the attached patch address this?
> 

Partially. I don't see the crash anymore, but something is still odd - some of my
tests require a retry with this patch applied, which previously never happened.
I don't know if this is another problem with this patch, or a different problem.
Unfortunately, I'll be unable to debug this further until next Tuesday.

Guenter
Cong Wang July 9, 2020, 7:19 p.m. UTC | #11
On Thu, Jul 9, 2020 at 12:13 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 7/9/20 11:51 AM, Cong Wang wrote:
> > On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >>
> >> Something seems fishy with the use of skcd->val on big endian systems.
> >>
> >> Some debug output:
> >>
> >> [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
> >> [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
> >> [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
> >> [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
> >> [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
> >> [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
> >> [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
> >> [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
> >> [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
> >> [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)
> >
> > Excellent debugging! I thought it was a double put, but it seems to
> > be an endian issue. I didn't realize the big endian machine actually
> > packs bitfields in a big endian way too...
> >
> > Does the attached patch address this?
> >
>
> Partially. I don't see the crash anymore, but something is still odd - some of my
> tests require a retry with this patch applied, which previously never happened.
> I don't know if this is another problem with this patch, or a different problem.
> Unfortunately, I'll be unable to debug this further until next Tuesday.

Make sure you test the second patch I sent, not the first one. The first one
is still incomplete and ugly too. The two bits must be the last two, so
correcting the if test is not sufficient; we have to fix the whole
bitfield packing.

Thanks!
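
Purely as an illustration of "the two bits must be the last two" (the
follow-up patch itself is not shown on this page, so the layout below is a
sketch, not necessarily what was applied): on big-endian the flag byte has
to be the last byte of the struct so it overlays the least significant byte
of val, and the flags have to be declared last within that byte so that
MSB-first bitfield allocation leaves them in bits 1 and 0:

#include <stdint.h>

struct sock_cgroup_data_sketch {
	union {
		struct {			/* big-endian variant only */
			uint32_t classid;
			uint16_t prioidx;
			uint8_t  padding;
			uint8_t  unused    : 6;
			uint8_t  no_refcnt : 1;	/* bit 1 of val */
			uint8_t  is_data   : 1;	/* bit 0 of val */
		} __attribute__((packed));
		uint64_t val;
	};
};

/* The flags still share the same 8 bytes as the packed data / pointer. */
_Static_assert(sizeof(struct sock_cgroup_data_sketch) == 8,
	       "sketch must stay 8 bytes");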
David Miller July 9, 2020, 11:26 p.m. UTC | #12
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 9 Jul 2020 12:19:40 -0700

> Make sure you test the second patch I sent, not the first one. The
> first one is still incomplete and ugly too. The two bits must be the
> last two, so correcting the if test is not sufficient, we have to
> fix the whole bitfield packing.

That is definitely the correct thing to do. I'm going to apply that
bitfield packing patch as-is because I have to get a pull request to
Linus and I don't want to break big endian like this.
Guenter Roeck July 10, 2020, 4:42 p.m. UTC | #13
On Thu, Jul 09, 2020 at 11:59:09AM -0700, Cong Wang wrote:
> On Thu, Jul 9, 2020 at 11:51 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
> > >
> > > Something seems fishy with the use of skcd->val on big endian systems.
> > >
> > > Some debug output:
> > >
> > > [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
> > > [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
> > > [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
> > > [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
> > > [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
> > > [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
> > > [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
> > > [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
> > > [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
> > > [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)
> >
> > Excellent debugging! I thought it was a double put, but it seems to
> > be an endian issue. I didn't realize the big endian machine actually
> > packs bitfields in a big endian way too...
> >
> > Does the attached patch address this?
> 
> Ah, this is too ugly. We just have to always make them the last two bits.
> 
> Please test this attached patch instead and ignore the previous one.
> 

Sorry, that one came too late; I was already about to leave when I got the first
one. It looks correct, though. I'll be back Monday night and will have another
look then (and I guess my builders will pick it up after Dave sends it all to
Linus). I'll let you know if I still see a problem.

Thanks,
Guenter
Guenter Roeck July 14, 2020, 1:55 p.m. UTC | #14
On Thu, Jul 09, 2020 at 11:59:09AM -0700, Cong Wang wrote:
> On Thu, Jul 9, 2020 at 11:51 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck <linux@roeck-us.net> wrote:
> > >
> > > Something seems fishy with the use of skcd->val on big endian systems.
> > >
> > > Some debug output:
> > >
> > > [   22.643703] sock: ##### sk_alloc(sk=000000001be28100): Calling cgroup_sk_alloc(000000001be28550)
> > > [   22.643807] cgroup: ##### cgroup_sk_alloc(skcd=000000001be28550): cgroup_sk_alloc_disabled=0, in_interrupt: 0
> > > [   22.643886] cgroup:  #### cgroup_sk_alloc(skcd=000000001be28550): cset->dfl_cgrp=0000000001224040, skcd->val=0x1224040
> > > [   22.643957] cgroup: ###### cgroup_bpf_get(cgrp=0000000001224040)
> > > [   22.646451] sock: ##### sk_prot_free(sk=000000001be28100): Calling cgroup_sk_free(000000001be28550)
> > > [   22.646607] cgroup:  #### sock_cgroup_ptr(skcd=000000001be28550) -> 0000000000014040 [v=14040, skcd->val=14040]
> > > [   22.646632] cgroup: ####### cgroup_sk_free(): skcd=000000001be28550, cgrp=0000000000014040
> > > [   22.646739] cgroup: ####### cgroup_sk_free(): skcd->no_refcnt=0
> > > [   22.646814] cgroup: ####### cgroup_sk_free(): Calling cgroup_bpf_put(cgrp=0000000000014040)
> > > [   22.646886] cgroup: ###### cgroup_bpf_put(cgrp=0000000000014040)
> >
> > Excellent debugging! I thought it was a double put, but it seems to
> > be an endian issue. I didn't realize the big endian machine actually
> > packs bitfields in a big endian way too...
> >
> > Does the attached patch address this?
> 
> Ah, this is too ugly. We just have to always make them the last two bits.
> 
> Please test this attached patch instead and ignore the previous one.
> 
FWIW: Tested and working.

Guenter

Patch

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..4f1cd0edc57d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,7 +790,8 @@  struct sock_cgroup_data {
 	union {
 #ifdef __LITTLE_ENDIAN
 		struct {
-			u8	is_data;
+			u8	is_data : 1;
+			u8	no_refcnt : 1;
 			u8	padding;
 			u16	prioidx;
 			u32	classid;
@@ -800,7 +801,8 @@  struct sock_cgroup_data {
 			u32	classid;
 			u16	prioidx;
 			u8	padding;
-			u8	is_data;
+			u8	no_refcnt : 1;
+			u8	is_data : 1;
 		} __packed;
 #endif
 		u64		val;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..618838c48313 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@  extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -835,7 +836,7 @@  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 	 */
 	v = READ_ONCE(skcd->val);
 
-	if (v & 1)
+	if (v & 3)
 		return &cgrp_dfl_root.cgrp;
 
 	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
@@ -847,6 +848,7 @@  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 #else	/* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..dd247747ec14 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6439,18 +6439,8 @@  void cgroup_sk_alloc_disable(void)
 
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled)
-		return;
-
-	/* Socket clone path */
-	if (skcd->val) {
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	if (cgroup_sk_alloc_disabled) {
+		skcd->no_refcnt = 1;
 		return;
 	}
 
@@ -6475,10 +6465,27 @@  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+	if (skcd->val) {
+		if (skcd->no_refcnt)
+			return;
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	}
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
+	if (skcd->no_refcnt)
+		return;
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index d832c650287c..2e5b7870e5d3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1926,7 +1926,7 @@  struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		/* sk->sk_memcg will be populated at accept() time */
 		newsk->sk_memcg = NULL;
 
-		cgroup_sk_alloc(&newsk->sk_cgrp_data);
+		cgroup_sk_clone(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
 		filter = rcu_dereference(sk->sk_filter);