diff mbox series

[bpf-next,2/4] bpf: fix lockdep false positive in stackmap

Message ID 20190130040458.2544340-3-ast@kernel.org
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series bpf: fixes for lockdep and deadlock | expand

Commit Message

Alexei Starovoitov Jan. 30, 2019, 4:04 a.m. UTC
Lockdep warns about false positive:
[   11.211460] ------------[ cut here ]------------
[   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
[   11.213134] Modules linked in:
[   11.213413] CPU: 0 PID: 141 Comm: systemd-journal Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #476
[   11.214191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   11.214954] RIP: 0010:lock_release+0x1ad/0x280
[   11.217036] RSP: 0018:ffff88813ba03f50 EFLAGS: 00010086
[   11.217516] RAX: 000000000000001f RBX: ffff8881378d8000 RCX: 0000000000000000
[   11.218179] RDX: ffffffff810d3e9e RSI: 0000000000000001 RDI: ffffffff810d3eb3
[   11.218851] RBP: ffff8881393e2b08 R08: 0000000000000002 R09: 0000000000000000
[   11.219504] R10: 0000000000000000 R11: ffff88813ba03d9d R12: ffffffff8118dfa2
[   11.220162] R13: 0000000000000086 R14: 0000000000000000 R15: 0000000000000000
[   11.220717] FS:  00007f3c8cf35780(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
[   11.221348] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.221822] CR2: 00007f5825d92080 CR3: 00000001378c8005 CR4: 00000000003606f0
[   11.222381] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.222951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.223508] Call Trace:
[   11.223705]  <IRQ>
[   11.223874]  ? __local_bh_enable+0x7a/0x80
[   11.224199]  up_read+0x1c/0xa0
[   11.224446]  do_up_read+0x12/0x20
[   11.224713]  irq_work_run_list+0x43/0x70
[   11.225030]  irq_work_run+0x26/0x50
[   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
[   11.225662]  irq_work_interrupt+0xf/0x20

since rw_semaphore is released in a different task vs task that locked the sema.
It is expected behavior.
Silence the warning by using up_read_non_owner().

Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/stackmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Peter Zijlstra Jan. 30, 2019, 10:15 a.m. UTC | #1
On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> Lockdep warns about false positive:

This is not a false positive, and you probably also need to use
down_read_non_owner() to match this up_read_non_owner().

{up,down}_read() and {up,down}_read_non_owner() are not only different
in the lockdep annotation; there is also optimistic spin stuff that
relies on 'owner' tracking.

> [   11.211460] ------------[ cut here ]------------
> [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
> [   11.213134] Modules linked in:
> [   11.213413] CPU: 0 PID: 141 Comm: systemd-journal Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #476
> [   11.214191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
> [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
> [   11.217036] RSP: 0018:ffff88813ba03f50 EFLAGS: 00010086
> [   11.217516] RAX: 000000000000001f RBX: ffff8881378d8000 RCX: 0000000000000000
> [   11.218179] RDX: ffffffff810d3e9e RSI: 0000000000000001 RDI: ffffffff810d3eb3
> [   11.218851] RBP: ffff8881393e2b08 R08: 0000000000000002 R09: 0000000000000000
> [   11.219504] R10: 0000000000000000 R11: ffff88813ba03d9d R12: ffffffff8118dfa2
> [   11.220162] R13: 0000000000000086 R14: 0000000000000000 R15: 0000000000000000
> [   11.220717] FS:  00007f3c8cf35780(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
> [   11.221348] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.221822] CR2: 00007f5825d92080 CR3: 00000001378c8005 CR4: 00000000003606f0
> [   11.222381] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   11.222951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   11.223508] Call Trace:
> [   11.223705]  <IRQ>
> [   11.223874]  ? __local_bh_enable+0x7a/0x80
> [   11.224199]  up_read+0x1c/0xa0
> [   11.224446]  do_up_read+0x12/0x20
> [   11.224713]  irq_work_run_list+0x43/0x70
> [   11.225030]  irq_work_run+0x26/0x50
> [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
> [   11.225662]  irq_work_interrupt+0xf/0x20
> 
> since rw_semaphore is released in a different task vs task that locked the sema.
> It is expected behavior.
> Silence the warning by using up_read_non_owner().
> 
> Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
> Acked-by: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/stackmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index d43b14535827..4b79e7c251e5 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -44,7 +44,7 @@ static void do_up_read(struct irq_work *entry)
>  	struct stack_map_irq_work *work;
>  
>  	work = container_of(entry, struct stack_map_irq_work, irq_work);
> -	up_read(work->sem);
> +	up_read_non_owner(work->sem);
>  	work->sem = NULL;
>  }
>  
> -- 
> 2.20.0
>
Alexei Starovoitov Jan. 30, 2019, 7:30 p.m. UTC | #2
On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> > Lockdep warns about false positive:
> 
> This is not a false positive, and you probably also need to use
> down_read_non_owner() to match this up_read_non_owner().
> 
> {up,down}_read() and {up,down}_read_non_owner() are not only different
> in the lockdep annotation; there is also optimistic spin stuff that
> relies on 'owner' tracking.

Can you point out in the code the spin bit?
As far as I can see sem->owner is debug only feature.
All owner checks are done under CONFIG_DEBUG_RWSEMS.

Also there is no down_read_trylock_non_owner() at the moment.
We can argue about it for -next, but I'd rather silence lockdep
with this patch today.
Waiman Long Jan. 30, 2019, 7:42 p.m. UTC | #3
On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>> Lockdep warns about false positive:
>> This is not a false positive, and you probably also need to use
>> down_read_non_owner() to match this up_read_non_owner().
>>
>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>> in the lockdep annotation; there is also optimistic spin stuff that
>> relies on 'owner' tracking.
> Can you point out in the code the spin bit?
> As far as I can see sem->owner is debug only feature.
> All owner checks are done under CONFIG_DEBUG_RWSEMS.

No, sem->owner is mainly for performing optimistic spinning which is a
performance feature to make rwsem writer-lock performs similar to mutex.
The debugging part is just an add-on. It is not the reason for the
presence of sem->owner.

> Also there is no down_read_trylock_non_owner() at the moment.
> We can argue about it for -next, but I'd rather silence lockdep
> with this patch today.
>
We can add down_read_trylock_non_owner() if there is a need for it. It
should be easy to do.

Cheers,
Longman

down_read_trylock_non_owner(
Peter Zijlstra Jan. 30, 2019, 7:44 p.m. UTC | #4
On Wed, Jan 30, 2019 at 11:30:41AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> > > Lockdep warns about false positive:
> > 
> > This is not a false positive, and you probably also need to use
> > down_read_non_owner() to match this up_read_non_owner().
> > 
> > {up,down}_read() and {up,down}_read_non_owner() are not only different
> > in the lockdep annotation; there is also optimistic spin stuff that
> > relies on 'owner' tracking.
> 
> Can you point out in the code the spin bit?

Hurmph, looks like you're right. I got lost in that stuff again. I hate
that rwsem code :/

Rewriting that all is on the todo list somewhere, but it's far down :/

> As far as I can see sem->owner is debug only feature.
> All owner checks are done under CONFIG_DEBUG_RWSEMS.
> 
> Also there is no down_read_trylock_non_owner() at the moment.
> We can argue about it for -next, but I'd rather silence lockdep
> with this patch today.

I don't agree with calling is silencing; it fixes an mis-use of the API.
But yes, keep the patch as is for now.
Waiman Long Jan. 30, 2019, 8:05 p.m. UTC | #5
On 01/30/2019 02:44 PM, Peter Zijlstra wrote:
> On Wed, Jan 30, 2019 at 11:30:41AM -0800, Alexei Starovoitov wrote:
>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>> Lockdep warns about false positive:
>>> This is not a false positive, and you probably also need to use
>>> down_read_non_owner() to match this up_read_non_owner().
>>>
>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>> in the lockdep annotation; there is also optimistic spin stuff that
>>> relies on 'owner' tracking.
>> Can you point out in the code the spin bit?
> Hurmph, looks like you're right. I got lost in that stuff again. I hate
> that rwsem code :/

It is actually related to how the lockdep code track if a lock is
acquired or released. It is not actually related to how the rwsem code work.

>
> Rewriting that all is on the todo list somewhere, but it's far down :/
>

I am actually planning to rewrite it. Hopefully, I can send out the
patch soon.

Cheers,
Longman
Alexei Starovoitov Jan. 30, 2019, 8:10 p.m. UTC | #6
On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
> >> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> >>> Lockdep warns about false positive:
> >> This is not a false positive, and you probably also need to use
> >> down_read_non_owner() to match this up_read_non_owner().
> >>
> >> {up,down}_read() and {up,down}_read_non_owner() are not only different
> >> in the lockdep annotation; there is also optimistic spin stuff that
> >> relies on 'owner' tracking.
> > Can you point out in the code the spin bit?
> > As far as I can see sem->owner is debug only feature.
> > All owner checks are done under CONFIG_DEBUG_RWSEMS.
> 
> No, sem->owner is mainly for performing optimistic spinning which is a
> performance feature to make rwsem writer-lock performs similar to mutex.
> The debugging part is just an add-on. It is not the reason for the
> presence of sem->owner.

I see. Got it.

> > Also there is no down_read_trylock_non_owner() at the moment.
> > We can argue about it for -next, but I'd rather silence lockdep
> > with this patch today.
> >
> We can add down_read_trylock_non_owner() if there is a need for it. It
> should be easy to do.

Yes, but looking through the code it's not clear to me that it's safe
to mix non_owner() versions with regular.
bpf/stackmap.c does down_read_trylock + up_read.
If we add new down_read_trylock_non_owner that set the owner to
NULL | RWSEM_* bits is this safe with conccurent read/write
that do regular versions?
rwsem_can_spin_on_owner() does:
        if (owner) {
                ret = is_rwsem_owner_spinnable(owner) &&
                      owner_on_cpu(owner);
that looks correct.
For a second I thought there could be fault here due to non_owner.
But there could be other places where it's assumed that owner
is never null?

May be we should live with this lockdep warn in bpf tree
and fix it only in bpf-next?
Waiman Long Jan. 30, 2019, 9:11 p.m. UTC | #7
On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>> Lockdep warns about false positive:
>>>> This is not a false positive, and you probably also need to use
>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>
>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>> relies on 'owner' tracking.
>>> Can you point out in the code the spin bit?
>>> As far as I can see sem->owner is debug only feature.
>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>> No, sem->owner is mainly for performing optimistic spinning which is a
>> performance feature to make rwsem writer-lock performs similar to mutex.
>> The debugging part is just an add-on. It is not the reason for the
>> presence of sem->owner.
> I see. Got it.
>
>>> Also there is no down_read_trylock_non_owner() at the moment.
>>> We can argue about it for -next, but I'd rather silence lockdep
>>> with this patch today.
>>>
>> We can add down_read_trylock_non_owner() if there is a need for it. It
>> should be easy to do.
> Yes, but looking through the code it's not clear to me that it's safe
> to mix non_owner() versions with regular.
> bpf/stackmap.c does down_read_trylock + up_read.
> If we add new down_read_trylock_non_owner that set the owner to
> NULL | RWSEM_* bits is this safe with conccurent read/write
> that do regular versions?
> rwsem_can_spin_on_owner() does:
>         if (owner) {
>                 ret = is_rwsem_owner_spinnable(owner) &&
>                       owner_on_cpu(owner);
> that looks correct.
> For a second I thought there could be fault here due to non_owner.
> But there could be other places where it's assumed that owner
> is never null?

The content of owner is not the cause of the lockdep warning. The
lockdep code assumes that the task that acquires the lock will release
it some time later. That is not the case when you need to acquire the
lock by one task and released by another. In this case, you have to use
the non_owner version of down/up_read which disable the lockdep
acquire/release tracking. That will be the only difference between the
two set of APIs.

Cheers,
Longman

>
> May be we should live with this lockdep warn in bpf tree
> and fix it only in bpf-next?
>
Waiman Long Jan. 30, 2019, 9:32 p.m. UTC | #8
On 01/30/2019 04:11 PM, Waiman Long wrote:
> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>> Lockdep warns about false positive:
>>>>> This is not a false positive, and you probably also need to use
>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>
>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>> relies on 'owner' tracking.
>>>> Can you point out in the code the spin bit?
>>>> As far as I can see sem->owner is debug only feature.
>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>> No, sem->owner is mainly for performing optimistic spinning which is a
>>> performance feature to make rwsem writer-lock performs similar to mutex.
>>> The debugging part is just an add-on. It is not the reason for the
>>> presence of sem->owner.
>> I see. Got it.
>>
>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>> with this patch today.
>>>>
>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>> should be easy to do.
>> Yes, but looking through the code it's not clear to me that it's safe
>> to mix non_owner() versions with regular.
>> bpf/stackmap.c does down_read_trylock + up_read.
>> If we add new down_read_trylock_non_owner that set the owner to
>> NULL | RWSEM_* bits is this safe with conccurent read/write
>> that do regular versions?
>> rwsem_can_spin_on_owner() does:
>>         if (owner) {
>>                 ret = is_rwsem_owner_spinnable(owner) &&
>>                       owner_on_cpu(owner);
>> that looks correct.
>> For a second I thought there could be fault here due to non_owner.
>> But there could be other places where it's assumed that owner
>> is never null?
> The content of owner is not the cause of the lockdep warning. The
> lockdep code assumes that the task that acquires the lock will release
> it some time later. That is not the case when you need to acquire the
> lock by one task and released by another. In this case, you have to use
> the non_owner version of down/up_read which disable the lockdep
> acquire/release tracking. That will be the only difference between the
> two set of APIs.
>
> Cheers,
> Longman

BTW, you may want to do something like that to make sure that the lock
ownership is probably tracked.

-Longman

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d43b145..79eef9d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
bpf_stack_
        } else {
                work->sem = &current->mm->mmap_sem;
                irq_work_queue(&work->irq_work);
+
+               /*
+                * The irq_work will release the mmap_sem with
+                * up_read_non_owner(). The rwsem_release() is called
+                * here to release the lock from lockdep's perspective.
+                */
+               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
        }
 }
Alexei Starovoitov Jan. 31, 2019, 2:01 a.m. UTC | #9
On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
> On 01/30/2019 04:11 PM, Waiman Long wrote:
> > On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
> >> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
> >>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
> >>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
> >>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
> >>>>>> Lockdep warns about false positive:
> >>>>> This is not a false positive, and you probably also need to use
> >>>>> down_read_non_owner() to match this up_read_non_owner().
> >>>>>
> >>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
> >>>>> in the lockdep annotation; there is also optimistic spin stuff that
> >>>>> relies on 'owner' tracking.
> >>>> Can you point out in the code the spin bit?
> >>>> As far as I can see sem->owner is debug only feature.
> >>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
> >>> No, sem->owner is mainly for performing optimistic spinning which is a
> >>> performance feature to make rwsem writer-lock performs similar to mutex.
> >>> The debugging part is just an add-on. It is not the reason for the
> >>> presence of sem->owner.
> >> I see. Got it.
> >>
> >>>> Also there is no down_read_trylock_non_owner() at the moment.
> >>>> We can argue about it for -next, but I'd rather silence lockdep
> >>>> with this patch today.
> >>>>
> >>> We can add down_read_trylock_non_owner() if there is a need for it. It
> >>> should be easy to do.
> >> Yes, but looking through the code it's not clear to me that it's safe
> >> to mix non_owner() versions with regular.
> >> bpf/stackmap.c does down_read_trylock + up_read.
> >> If we add new down_read_trylock_non_owner that set the owner to
> >> NULL | RWSEM_* bits is this safe with conccurent read/write
> >> that do regular versions?
> >> rwsem_can_spin_on_owner() does:
> >>         if (owner) {
> >>                 ret = is_rwsem_owner_spinnable(owner) &&
> >>                       owner_on_cpu(owner);
> >> that looks correct.
> >> For a second I thought there could be fault here due to non_owner.
> >> But there could be other places where it's assumed that owner
> >> is never null?
> > The content of owner is not the cause of the lockdep warning. The
> > lockdep code assumes that the task that acquires the lock will release
> > it some time later. That is not the case when you need to acquire the
> > lock by one task and released by another. In this case, you have to use
> > the non_owner version of down/up_read which disable the lockdep
> > acquire/release tracking. That will be the only difference between the
> > two set of APIs.
> >
> > Cheers,
> > Longman
> 
> BTW, you may want to do something like that to make sure that the lock
> ownership is probably tracked.
> 
> -Longman
> 
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index d43b145..79eef9d 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
> bpf_stack_
>         } else {
>                 work->sem = &current->mm->mmap_sem;
>                 irq_work_queue(&work->irq_work);
> +
> +               /*
> +                * The irq_work will release the mmap_sem with
> +                * up_read_non_owner(). The rwsem_release() is called
> +                * here to release the lock from lockdep's perspective.
> +                */
> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);

are you saying the above is enough coupled with up_read_non_owner?
Looking at how amdgpu is using this api... I think they just use non_owner
version when doing things from different task.
So I don't think pairing non_owner with non_owner is strictly necessary.
It seems fine to use down_read_trylock() with above rwsem_release() hack
plus up_read_non_owner().
Am I missing something?
Waiman Long Jan. 31, 2019, 2:48 a.m. UTC | #10
On 01/30/2019 09:01 PM, Alexei Starovoitov wrote:
> On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
>> On 01/30/2019 04:11 PM, Waiman Long wrote:
>>> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>>>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>>>> Lockdep warns about false positive:
>>>>>>> This is not a false positive, and you probably also need to use
>>>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>>>
>>>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>>>> relies on 'owner' tracking.
>>>>>> Can you point out in the code the spin bit?
>>>>>> As far as I can see sem->owner is debug only feature.
>>>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>>>> No, sem->owner is mainly for performing optimistic spinning which is a
>>>>> performance feature to make rwsem writer-lock performs similar to mutex.
>>>>> The debugging part is just an add-on. It is not the reason for the
>>>>> presence of sem->owner.
>>>> I see. Got it.
>>>>
>>>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>>>> with this patch today.
>>>>>>
>>>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>>>> should be easy to do.
>>>> Yes, but looking through the code it's not clear to me that it's safe
>>>> to mix non_owner() versions with regular.
>>>> bpf/stackmap.c does down_read_trylock + up_read.
>>>> If we add new down_read_trylock_non_owner that set the owner to
>>>> NULL | RWSEM_* bits is this safe with conccurent read/write
>>>> that do regular versions?
>>>> rwsem_can_spin_on_owner() does:
>>>>         if (owner) {
>>>>                 ret = is_rwsem_owner_spinnable(owner) &&
>>>>                       owner_on_cpu(owner);
>>>> that looks correct.
>>>> For a second I thought there could be fault here due to non_owner.
>>>> But there could be other places where it's assumed that owner
>>>> is never null?
>>> The content of owner is not the cause of the lockdep warning. The
>>> lockdep code assumes that the task that acquires the lock will release
>>> it some time later. That is not the case when you need to acquire the
>>> lock by one task and released by another. In this case, you have to use
>>> the non_owner version of down/up_read which disable the lockdep
>>> acquire/release tracking. That will be the only difference between the
>>> two set of APIs.
>>>
>>> Cheers,
>>> Longman
>> BTW, you may want to do something like that to make sure that the lock
>> ownership is probably tracked.
>>
>> -Longman
>>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index d43b145..79eef9d 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
>> bpf_stack_
>>         } else {
>>                 work->sem = &current->mm->mmap_sem;
>>                 irq_work_queue(&work->irq_work);
>> +
>> +               /*
>> +                * The irq_work will release the mmap_sem with
>> +                * up_read_non_owner(). The rwsem_release() is called
>> +                * here to release the lock from lockdep's perspective.
>> +                */
>> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
> are you saying the above is enough coupled with up_read_non_owner?
> Looking at how amdgpu is using this api... I think they just use non_owner
> version when doing things from different task.
> So I don't think pairing non_owner with non_owner is strictly necessary.
> It seems fine to use down_read_trylock() with above rwsem_release() hack
> plus up_read_non_owner().
> Am I missing something?
>
The statement above is to clear the state for the lockdep so that it
knows that the task no longer owns the lock. Without doing that, there
is a possibility of some other kind of incorrect lockdep warning may be
produced because the code will still think the task hold a read-lock on
the mmap_sem. It is also possible no warning will be reported.

The bpf code is kind of special that it acquires the mmap_sem. Later on,
it either releases it itself (non-NMI) or request irq_work to release it
(NMI). So either, you use the _non_owner versions for both acquire and
release or fake the release like the code segment above.

Peter, please chime in if you have other suggestion.

Cheers,
Longman
Eric Dumazet Feb. 6, 2019, 3:21 a.m. UTC | #11
On 01/30/2019 06:48 PM, Waiman Long wrote:
> On 01/30/2019 09:01 PM, Alexei Starovoitov wrote:
>> On Wed, Jan 30, 2019 at 04:32:12PM -0500, Waiman Long wrote:
>>> On 01/30/2019 04:11 PM, Waiman Long wrote:
>>>> On 01/30/2019 03:10 PM, Alexei Starovoitov wrote:
>>>>> On Wed, Jan 30, 2019 at 02:42:23PM -0500, Waiman Long wrote:
>>>>>> On 01/30/2019 02:30 PM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jan 30, 2019 at 11:15:30AM +0100, Peter Zijlstra wrote:
>>>>>>>> On Tue, Jan 29, 2019 at 08:04:56PM -0800, Alexei Starovoitov wrote:
>>>>>>>>> Lockdep warns about false positive:
>>>>>>>> This is not a false positive, and you probably also need to use
>>>>>>>> down_read_non_owner() to match this up_read_non_owner().
>>>>>>>>
>>>>>>>> {up,down}_read() and {up,down}_read_non_owner() are not only different
>>>>>>>> in the lockdep annotation; there is also optimistic spin stuff that
>>>>>>>> relies on 'owner' tracking.
>>>>>>> Can you point out in the code the spin bit?
>>>>>>> As far as I can see sem->owner is debug only feature.
>>>>>>> All owner checks are done under CONFIG_DEBUG_RWSEMS.
>>>>>> No, sem->owner is mainly for performing optimistic spinning which is a
>>>>>> performance feature to make rwsem writer-lock performs similar to mutex.
>>>>>> The debugging part is just an add-on. It is not the reason for the
>>>>>> presence of sem->owner.
>>>>> I see. Got it.
>>>>>
>>>>>>> Also there is no down_read_trylock_non_owner() at the moment.
>>>>>>> We can argue about it for -next, but I'd rather silence lockdep
>>>>>>> with this patch today.
>>>>>>>
>>>>>> We can add down_read_trylock_non_owner() if there is a need for it. It
>>>>>> should be easy to do.
>>>>> Yes, but looking through the code it's not clear to me that it's safe
>>>>> to mix non_owner() versions with regular.
>>>>> bpf/stackmap.c does down_read_trylock + up_read.
>>>>> If we add new down_read_trylock_non_owner that set the owner to
>>>>> NULL | RWSEM_* bits is this safe with conccurent read/write
>>>>> that do regular versions?
>>>>> rwsem_can_spin_on_owner() does:
>>>>>         if (owner) {
>>>>>                 ret = is_rwsem_owner_spinnable(owner) &&
>>>>>                       owner_on_cpu(owner);
>>>>> that looks correct.
>>>>> For a second I thought there could be fault here due to non_owner.
>>>>> But there could be other places where it's assumed that owner
>>>>> is never null?
>>>> The content of owner is not the cause of the lockdep warning. The
>>>> lockdep code assumes that the task that acquires the lock will release
>>>> it some time later. That is not the case when you need to acquire the
>>>> lock by one task and released by another. In this case, you have to use
>>>> the non_owner version of down/up_read which disable the lockdep
>>>> acquire/release tracking. That will be the only difference between the
>>>> two set of APIs.
>>>>
>>>> Cheers,
>>>> Longman
>>> BTW, you may want to do something like that to make sure that the lock
>>> ownership is probably tracked.
>>>
>>> -Longman
>>>
>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>> index d43b145..79eef9d 100644
>>> --- a/kernel/bpf/stackmap.c
>>> +++ b/kernel/bpf/stackmap.c
>>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
>>> bpf_stack_
>>>         } else {
>>>                 work->sem = &current->mm->mmap_sem;
>>>                 irq_work_queue(&work->irq_work);
>>> +
>>> +               /*
>>> +                * The irq_work will release the mmap_sem with
>>> +                * up_read_non_owner(). The rwsem_release() is called
>>> +                * here to release the lock from lockdep's perspective.
>>> +                */
>>> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
>> are you saying the above is enough coupled with up_read_non_owner?
>> Looking at how amdgpu is using this api... I think they just use non_owner
>> version when doing things from different task.
>> So I don't think pairing non_owner with non_owner is strictly necessary.
>> It seems fine to use down_read_trylock() with above rwsem_release() hack
>> plus up_read_non_owner().
>> Am I missing something?
>>
> The statement above is to clear the state for the lockdep so that it
> knows that the task no longer owns the lock. Without doing that, there
> is a possibility of some other kind of incorrect lockdep warning may be
> produced because the code will still think the task hold a read-lock on
> the mmap_sem. It is also possible no warning will be reported.
> 
> The bpf code is kind of special that it acquires the mmap_sem. Later on,
> it either releases it itself (non-NMI) or request irq_work to release it
> (NMI). So either, you use the _non_owner versions for both acquire and
> release or fake the release like the code segment above.
> 
> Peter, please chime in if you have other suggestion.
> 

Hi guys

What are the plans to address the issue then ?

Latest net tree exhibits this trace :

Thanks

[  546.116982] =====================================
[  546.121688] WARNING: bad unlock balance detected!
[  546.126393] 5.0.0-dbg-DEV #559 Not tainted
[  546.130491] -------------------------------------
[  546.135196] taskset/15409 is trying to release lock (&mm->mmap_sem) at:
[  546.141844] [<ffffffffb0233246>] do_up_read+0x16/0x30
[  546.146919] but there are no more locks to release!
[  546.151790] 
               other info that might help us debug this:
[  546.158325] 1 lock held by taskset/15409:
[  546.162325]  #0: 00000000683db857 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.38+0x13e/0xbc0
[  546.171978] 
               stack backtrace:
[  546.176327] CPU: 0 PID: 15409 Comm: taskset Not tainted 5.0.0-dbg-DEV #559
[  546.183214] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.54.0 06/07/2018
[  546.190182] Call Trace:
[  546.192633]  <IRQ>
[  546.194672]  dump_stack+0x67/0x95
[  546.198006]  ? do_up_read+0x16/0x30
[  546.201491]  print_unlock_imbalance_bug.part.33+0xd0/0xd7
[  546.206914]  ? do_up_read+0x16/0x30
[  546.210406]  lock_release+0x213/0x2d0
[  546.214088]  up_read+0x20/0xa0
[  546.217138]  do_up_read+0x16/0x30
[  546.220457]  irq_work_run_list+0x4a/0x70
[  546.224408]  irq_work_run+0x18/0x40
[  546.227911]  smp_irq_work_interrupt+0x54/0x1d0
[  546.232362]  irq_work_interrupt+0xf/0x20
[  546.236280]  </IRQ>
[  546.238396] RIP: 0010:release_pages+0x69/0x650
[  546.242839] Code: 01 4c 8b 2f 4c 8d 67 08 48 8d 44 f7 08 45 31 f6 48 89 04 24 eb 04 49 83 c4 08 48 8b 05 20 fe 31 01 49 39 c5 74 26 4d 8b 7d 08 <4d> 8d 47 ff 41 83 e7 01 4d 0f 44 c5 41 8b 40 34 4d 89 c7 85 c0 0f
[  546.261615] RSP: 0018:ffffac604732bb00 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff09
[  546.269190] RAX: ffffe4c9c0ca8000 RBX: 0000000000000020 RCX: 0000000000000006
[  546.276352] RDX: 0000000000000006 RSI: ffff8cf7facb6aa8 RDI: ffff8cf7facb6240
[  546.283482] RBP: ffffac604732bb70 R08: ffffe4c9c0ba3d80 R09: 0000000000000000
[  546.290613] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8cf7db3292d8
[  546.297752] R13: ffffe4c9c0ba3dc0 R14: 0000000000000000 R15: ffffe4c9c0ba3d88
[  546.304907]  free_pages_and_swap_cache+0xe6/0x100
[  546.309618]  tlb_flush_mmu_free+0x36/0x60
[  546.313625]  arch_tlb_finish_mmu+0x28/0x1e0
[  546.317801]  tlb_finish_mmu+0x23/0x30
[  546.321459]  exit_mmap+0xc9/0x1f0
[  546.324787]  ? __mutex_unlock_slowpath+0x11/0x2e0
[  546.329502]  mmput+0x62/0x130
[  546.332481]  flush_old_exec+0x60e/0x810
[  546.336314]  load_elf_binary+0x830/0x18c0
[  546.340323]  ? search_binary_handler+0x8a/0x200
[  546.344848]  search_binary_handler+0x99/0x200
[  546.349221]  __do_execve_file.isra.38+0x76d/0xbc0
[  546.353918]  __x64_sys_execve+0x39/0x50
[  546.357757]  do_syscall_64+0x5a/0x460
[  546.361440]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  546.366489] RIP: 0033:0x7fd599e3c067
[  546.370069] Code: Bad RIP value.
[  546.373306] RSP: 002b:00007ffcf7092e58 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[  546.380880] RAX: ffffffffffffffda RBX: 00007ffcf7093058 RCX: 00007fd599e3c067
[  546.388010] RDX: 00007ffcf7093070 RSI: 00007ffcf7093058 RDI: 00007ffcf70937c0
[  546.395132] RBP: 00007ffcf7092ec0 R08: 00000000ffffffff R09: 0000000001b1ae40
[  546.402265] R10: 00007ffcf7092c80 R11: 0000000000000202 R12: 00007ffcf70937c0
[  546.409385] R13: 00007ffcf7093070 R14: 0000000000000000 R15: 00007ffcf7093048
Alexei Starovoitov Feb. 6, 2019, 3:30 a.m. UTC | #12
On Tue, Feb 05, 2019 at 07:21:08PM -0800, Eric Dumazet wrote:
> >>>
> >>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> >>> index d43b145..79eef9d 100644
> >>> --- a/kernel/bpf/stackmap.c
> >>> +++ b/kernel/bpf/stackmap.c
> >>> @@ -338,6 +338,13 @@ static void stack_map_get_build_id_offset(struct
> >>> bpf_stack_
> >>>         } else {
> >>>                 work->sem = &current->mm->mmap_sem;
> >>>                 irq_work_queue(&work->irq_work);
> >>> +
> >>> +               /*
> >>> +                * The irq_work will release the mmap_sem with
> >>> +                * up_read_non_owner(). The rwsem_release() is called
> >>> +                * here to release the lock from lockdep's perspective.
> >>> +                */
> >>> +               rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
> >> are you saying the above is enough coupled with up_read_non_owner?
> >> Looking at how amdgpu is using this api... I think they just use non_owner
> >> version when doing things from different task.
> >> So I don't think pairing non_owner with non_owner is strictly necessary.
> >> It seems fine to use down_read_trylock() with above rwsem_release() hack
> >> plus up_read_non_owner().
> >> Am I missing something?
> >>
> > The statement above is to clear the state for the lockdep so that it
> > knows that the task no longer owns the lock. Without doing that, there
> > is a possibility of some other kind of incorrect lockdep warning may be
> > produced because the code will still think the task hold a read-lock on
> > the mmap_sem. It is also possible no warning will be reported.
> > 
> > The bpf code is kind of special that it acquires the mmap_sem. Later on,
> > it either releases it itself (non-NMI) or request irq_work to release it
> > (NMI). So either, you use the _non_owner versions for both acquire and
> > release or fake the release like the code segment above.
> > 
> > Peter, please chime in if you have other suggestion.
> > 
> 
> Hi guys
> 
> What are the plans to address the issue then ?
> Latest net tree exhibits this trace :

Thanks for the reminder :)

I've been waiting for Peter's direction on this one.
Happy to fix it whichever way.

To recap:
Approach 1:
s/up_read/up_read_non_owner/ from irq_work + rwsem_release as Longman proposed.

Apporach 2:
introduce down_read_trylock_non_owner and pair it with up_read_non_owner
in both irq_work and normal.

My preference is 1. I think Longman's as well.
Eric Dumazet Feb. 6, 2019, 3:40 a.m. UTC | #13
On 02/05/2019 07:30 PM, Alexei Starovoitov wrote:

> Thanks for the reminder :)
> 
> I've been waiting for Peter's direction on this one.
> Happy to fix it whichever way.
> 
> To recap:
> Approach 1:
> s/up_read/up_read_non_owner/ from irq_work + rwsem_release as Longman proposed.
> 
> Apporach 2:
> introduce down_read_trylock_non_owner and pair it with up_read_non_owner
> in both irq_work and normal.
> 
> My preference is 1. I think Longman's as well.
> 


Unless we envision other uses for down_read_trylock_non_owner() and up_read_non_owner(),
I agree 1) seems good enough, at least for the short term.

Thanks !
diff mbox series

Patch

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d43b14535827..4b79e7c251e5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -44,7 +44,7 @@  static void do_up_read(struct irq_work *entry)
 	struct stack_map_irq_work *work;
 
 	work = container_of(entry, struct stack_map_irq_work, irq_work);
-	up_read(work->sem);
+	up_read_non_owner(work->sem);
 	work->sem = NULL;
 }