
[bpf-next,2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()

Message ID: 20191010061916.198761-3-songliubraving@fb.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: bpf/stackmap: fix A-A deadlock in bpf_get_stack()

Commit Message

Song Liu Oct. 10, 2019, 6:19 a.m. UTC
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This can be reproduced by:
1. Start a multi-thread program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach bpf program to trace_sched_switch and gather stackmap with
   build-id, e.g. with trace.py from bcc tools:
   trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch

A sample reproducer is attached at the end.

Fix this by checking whether rq_lock is already locked. If so, postpone
the up_read() to irq_work, just like the in_nmi() case.

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Cc: stable@vger.kernel.org # v4.17+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>

Reproducer:
============================ 8< ============================

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Number of worker threads; the original posting does not define this. */
#define THREAD_COUNT 20

char *filename;

void *worker(void *p)
{
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
}

int main(int argc, char *argv[])
{
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
}
============================ 8< ============================
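
A possible way to build and run the reproducer (these exact commands and
paths are not part of the original posting; any reasonably large binary
or shared library works as the argument):

    gcc -o stackmap-repro repro.c -lpthread
    ./stackmap-repro /usr/lib/x86_64-linux-gnu/libc.so.6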
---
 kernel/bpf/stackmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
2.17.1
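
The diff itself appears at the bottom of this page. It relies on
this_rq_is_locked() from patch 1/2 of the series, which is not included
here; the body below is only a guess at its shape, based on the name and
how it is used:

    /* sketch of the helper added by patch 1/2 (assumed, not verbatim) */
    bool this_rq_is_locked(void)
    {
            struct rq *rq;

            rq = this_rq();
            return raw_spin_is_locked(&rq->lock);
    }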

Comments

Peter Zijlstra Oct. 10, 2019, 7:36 a.m. UTC | #1
On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> deadlock on rq_lock():
> 
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [...]
> Call Trace:
>  try_to_wake_up+0x1ad/0x590
>  wake_up_q+0x54/0x80
>  rwsem_wake+0x8a/0xb0
>  bpf_get_stack+0x13c/0x150
>  bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>  bpf_overflow_handler+0x60/0x100
>  __perf_event_overflow+0x4f/0xf0
>  perf_swevent_overflow+0x99/0xc0
>  ___perf_sw_event+0xe7/0x120
>  __schedule+0x47d/0x620
>  schedule+0x29/0x90
>  futex_wait_queue_me+0xb9/0x110
>  futex_wait+0x139/0x230
>  do_futex+0x2ac/0xa50
>  __x64_sys_futex+0x13c/0x180
>  do_syscall_64+0x42/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 

> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 052580c33d26..3b278f6b0c3e 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>  	bool irq_work_busy = false;
>  	struct stack_map_irq_work *work = NULL;
> 
> -	if (in_nmi()) {
> +	if (in_nmi() || this_rq_is_locked()) {
>  		work = this_cpu_ptr(&up_read_work);
>  		if (work->irq_work.flags & IRQ_WORK_BUSY)
>  			/* cannot queue more up_read, fallback */

This is horrific crap. Just say no to that get_build_id_offset()
trainwreck.
Alexei Starovoitov Oct. 10, 2019, 5:19 p.m. UTC | #2
On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>> deadlock on rq_lock():
>>
>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [...]
>> Call Trace:
>>   try_to_wake_up+0x1ad/0x590
>>   wake_up_q+0x54/0x80
>>   rwsem_wake+0x8a/0xb0
>>   bpf_get_stack+0x13c/0x150
>>   bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>>   bpf_overflow_handler+0x60/0x100
>>   __perf_event_overflow+0x4f/0xf0
>>   perf_swevent_overflow+0x99/0xc0
>>   ___perf_sw_event+0xe7/0x120
>>   __schedule+0x47d/0x620
>>   schedule+0x29/0x90
>>   futex_wait_queue_me+0xb9/0x110
>>   futex_wait+0x139/0x230
>>   do_futex+0x2ac/0xa50
>>   __x64_sys_futex+0x13c/0x180
>>   do_syscall_64+0x42/0x100
>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
> 
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 052580c33d26..3b278f6b0c3e 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>   	bool irq_work_busy = false;
>>   	struct stack_map_irq_work *work = NULL;
>>
>> -	if (in_nmi()) {
>> +	if (in_nmi() || this_rq_is_locked()) {
>>   		work = this_cpu_ptr(&up_read_work);
>>   		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>   			/* cannot queue more up_read, fallback */
> 
> This is horrific crap. Just say no to that get_build_id_offset()
> trainwreck.

this is not a helpful comment.
What issues do you see with this approach?
Peter Zijlstra Oct. 10, 2019, 5:46 p.m. UTC | #3
On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> > On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> >> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> >> deadlock on rq_lock():
> >>
> >> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> >> [...]
> >> Call Trace:
> >>   try_to_wake_up+0x1ad/0x590
> >>   wake_up_q+0x54/0x80
> >>   rwsem_wake+0x8a/0xb0
> >>   bpf_get_stack+0x13c/0x150
> >>   bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
> >>   bpf_overflow_handler+0x60/0x100
> >>   __perf_event_overflow+0x4f/0xf0
> >>   perf_swevent_overflow+0x99/0xc0
> >>   ___perf_sw_event+0xe7/0x120
> >>   __schedule+0x47d/0x620
> >>   schedule+0x29/0x90
> >>   futex_wait_queue_me+0xb9/0x110
> >>   futex_wait+0x139/0x230
> >>   do_futex+0x2ac/0xa50
> >>   __x64_sys_futex+0x13c/0x180
> >>   do_syscall_64+0x42/0x100
> >>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>
> > 
> >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> >> index 052580c33d26..3b278f6b0c3e 100644
> >> --- a/kernel/bpf/stackmap.c
> >> +++ b/kernel/bpf/stackmap.c
> >> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> >>   	bool irq_work_busy = false;
> >>   	struct stack_map_irq_work *work = NULL;
> >>
> >> -	if (in_nmi()) {
> >> +	if (in_nmi() || this_rq_is_locked()) {
> >>   		work = this_cpu_ptr(&up_read_work);
> >>   		if (work->irq_work.flags & IRQ_WORK_BUSY)
> >>   			/* cannot queue more up_read, fallback */
> > 
> > This is horrific crap. Just say no to that get_build_id_offset()
> > trainwreck.
> 
> this is not a helpful comment.
> What issues do you see with this approach?

It will still generate deadlocks if I place a tracepoint inside a lock
that nests inside rq->lock, and it won't ever be able to detect that.
Say do the very same thing on trace_hrtimer_start(), which is under
cpu_base->lock, which nests inside rq->lock. That should give you an
AB-BA.

tracepoints / perf-overflow should _never_ take locks.

All of stack_map_get_build_id_offset() is just disgusting games; I did
tell you guys how to do lockless vma lookups a few years ago -- and yes,
that is invasive core mm surgery. But this is just disgusting hacks for
not wanting to do it right.

Basically the only semi-sane thing to do with that trainwreck is
s/in_nmi()/true/ and pray.

On top of that I just hate buildids in general.
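
The AB-BA ordering described above can be sketched like this (lock names
are from the discussion; the call chains are illustrative, not taken from
an actual splat):

    Established nesting on the scheduler side (e.g. CFS bandwidth timers):
        rq->lock
          cpu_base->lock

    Nesting introduced by a BPF program on trace_hrtimer_start():
        cpu_base->lock                    (held around the tracepoint)
          bpf_get_stack()
            up_read() -> rwsem_wake() -> try_to_wake_up()
              rq->lock

    Two paths taking the same pair of locks in opposite order is the
    classic AB-BA pattern; a this_rq_is_locked() check only sees the
    local rq->lock, so it cannot detect the inverted ordering.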
Alexei Starovoitov Oct. 10, 2019, 6:06 p.m. UTC | #4
On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
>>> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>>>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>>>> deadlock on rq_lock():
>>>>
>>>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>>> [...]
>>>> Call Trace:
>>>>    try_to_wake_up+0x1ad/0x590
>>>>    wake_up_q+0x54/0x80
>>>>    rwsem_wake+0x8a/0xb0
>>>>    bpf_get_stack+0x13c/0x150
>>>>    bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>>>>    bpf_overflow_handler+0x60/0x100
>>>>    __perf_event_overflow+0x4f/0xf0
>>>>    perf_swevent_overflow+0x99/0xc0
>>>>    ___perf_sw_event+0xe7/0x120
>>>>    __schedule+0x47d/0x620
>>>>    schedule+0x29/0x90
>>>>    futex_wait_queue_me+0xb9/0x110
>>>>    futex_wait+0x139/0x230
>>>>    do_futex+0x2ac/0xa50
>>>>    __x64_sys_futex+0x13c/0x180
>>>>    do_syscall_64+0x42/0x100
>>>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>
>>>
>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>> index 052580c33d26..3b278f6b0c3e 100644
>>>> --- a/kernel/bpf/stackmap.c
>>>> +++ b/kernel/bpf/stackmap.c
>>>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>>    	bool irq_work_busy = false;
>>>>    	struct stack_map_irq_work *work = NULL;
>>>>
>>>> -	if (in_nmi()) {
>>>> +	if (in_nmi() || this_rq_is_locked()) {
>>>>    		work = this_cpu_ptr(&up_read_work);
>>>>    		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>>    			/* cannot queue more up_read, fallback */
>>>
>>> This is horrific crap. Just say no to that get_build_id_offset()
>>> trainwreck.
>>
>> this is not a helpful comment.
>> What issues do you see with this approach?
> 
> It will still generate deadlocks if I place a tracepoint inside a lock
> that nests inside rq->lock, and it won't ever be able to detect that.
> Say do the very same thing on trace_hrtimer_start(), which is under
> cpu_base->lock, which nests inside rq->lock. That should give you an
> AB-BA.
> 
> tracepoints / perf-overflow should _never_ take locks.
> 
> All of stack_map_get_build_id_offset() is just disgusting games; I did
> tell you guys how to do lockless vma lookups a few years ago -- and yes,
> that is invasive core mm surgery. But this is just disgusting hacks for
> not wanting to do it right.

you mean speculative page fault stuff?
That was my hope as well and I offered Laurent all the help to land it.
Yet after a year since we've talked the patches are not any closer
to landing.
Any other 'invasive mm surgery' you have in mind?

> Basically the only semi-sane thing to do with that trainwreck is
> s/in_nmi()/true/ and pray.
> 
> On top of that I just hate buildids in general.

Emotions aside... build_id is useful and used in production.
It's used widely because it solves real problems.
This deadlock is from real servers and not from some sanitizer wannabe.
Hence we need to fix it as cleanly as possible and quickly.
s/in_nmi/true/ is certainly an option.
I'm worried about the overhead of doing irq_work_queue() all the time.
But I'm not familiar enough with the mechanism to judge that concern.
Would it make sense to do s/in_nmi/irqs_disabled/ instead?
Peter Zijlstra Oct. 14, 2019, 9:09 a.m. UTC | #5
On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 10:46 AM, Peter Zijlstra wrote:

> > All of stack_map_get_build_id_offset() is just disgusting games; I did
> > tell you guys how to do lockless vma lookups a few years ago -- and yes,
> > that is invasive core mm surgery. But this is just disgusting hacks for
> > not wanting to do it right.
> 
> you mean speculative page fault stuff?
> That was my hope as well and I offered Laurent all the help to land it.
> Yet after a year since we've talked the patches are not any closer
> to landing.
> Any other 'invasive mm surgery' you have in mind?

Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
which is exactly what you need here.

> > Basically the only semi-sane thing to do with that trainwreck is
> > s/in_nmi()/true/ and pray.
> > 
> > On top of that I just hate buildids in general.
> 
> Emotions aside... build_id is useful and used in production.
> It's used widely because it solves real problems.

AFAIU it solves the problem of you not knowing what version of the
binary runs where; which I was hoping your cloud infrastructure thing
would actually know already.

Anyway, I know what it does, I just don't necessarily agree it is the
right way around that particular problem (also, the way I'm personally
affected is that perf-record is dead slow by default due to build-id
post processing).

And it obviously leads to horrible hacks like the code currently under
discussion :/

> This deadlock is from real servers and not from some sanitizer wannabe.

If you enable CFS bandwidth control and run this function on the
trace_hrtimer_start() tracepoint, you should be able to trigger a real
AB-BA lockup.

> Hence we need to fix it as cleanly as possible and quickly.
> s/in_nmi/true/ is certainly an option.

That is the best option; because tracepoints / perf-overflow handlers
really should not be taking any locks.

> I'm worried about the overhead of doing irq_work_queue() all the time.
> But I'm not familiar enough with the mechanism to judge that concern.
> Would it make sense to do s/in_nmi/irqs_disabled/ instead?

irqs_disabled() should work in this particular case because rq->lock
(and therefore all its nested locks) are IRQ-safe.
Song Liu Oct. 14, 2019, 8:12 p.m. UTC | #6
Thanks Peter!

> On Oct 14, 2019, at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> 
>>> All of stack_map_get_build_id_offset() is just disgusting games; I did
>>> tell you guys how to do lockless vma lookups a few years ago -- and yes,
>>> that is invasive core mm surgery. But this is just disgusting hacks for
>>> not wanting to do it right.
>> 
>> you mean speculative page fault stuff?
>> That was my hope as well and I offered Laurent all the help to land it.
>> Yet after a year since we've talked the patches are not any closer
>> to landing.
>> Any other 'invasive mm surgery' you have in mind?
> 
> Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
> which is exactly what you need here.

Lockless VMA lookups would be really useful. They would resolve all the
pain we are having here.

I remember the Google folks also mentioned at LPC that they would like
a better mechanism to confirm build-ids in perf.

> 
>>> Basically the only semi-sane thing to do with that trainwreck is
>>> s/in_nmi()/true/ and pray.
>>> 
>>> On top of that I just hate buildids in general.
>> 
>> Emotions aside... build_id is useful and used in production.
>> It's used widely because it solves real problems.
> 
> AFAIU it solves the problem of you not knowing what version of the
> binary runs where; which I was hoping your cloud infrastructure thing
> would actually know already.
> 
> Anyway, I know what it does, I just don't necessarily agree it is the
> right way around that particular problem (also, the way I'm personally
> affected is that perf-record is dead slow by default due to build-id
> post processing).
> 
> And it obviously leads to horrible hacks like the code currently under
> discussion :/
> 
>> This deadlock is from real servers and not from some sanitizer wannabe.
> 
> If you enable CFS bandwidth control and run this function on the
> trace_hrtimer_start() tracepoint, you should be able to trigger a real
> AB-BA lockup.
> 
>> Hence we need to fix it as cleanly as possible and quickly.
>> s/in_nmi/true/ is certainly an option.
> 
> That is the best option; because tracepoints / perf-overflow handlers
> really should not be taking any locks.
> 
>> I'm worried about the overhead of doing irq_work_queue() all the time.
>> But I'm not familiar enough with the mechanism to judge that concern.
>> Would it make sense to do s/in_nmi/irqs_disabled/ instead?
> 
> irqs_disabled() should work in this particular case because rq->lock
> (and therefore all its nested locks) are IRQ-safe.

We worry about the overhead of irq_work for every single stackmap 
lookup. So we would like to go with the irqs_disabled() check. I just 
sent v2 of the patch. 

Thanks again,
Song
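
v2 itself is not on this page; judging from the exchange above, the check
presumably becomes something along these lines (a sketch, not the actual
v2 hunk):

    --- a/kernel/bpf/stackmap.c
    +++ b/kernel/bpf/stackmap.c
    @@ ... @@ static void stack_map_get_build_id_offset(...)
    -	if (in_nmi()) {
    +	if (irqs_disabled()) {
     		work = this_cpu_ptr(&up_read_work);
     		if (work->irq_work.flags & IRQ_WORK_BUSY)
     			/* cannot queue more up_read, fallback */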

Patch

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@  static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;

-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
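
For context, a condensed sketch of how stack_map_get_build_id_offset()
already defers the rwsem release to irq_work when it cannot call
up_read() directly (simplified from the stackmap.c of that era; comments
and error handling are trimmed, so treat it as illustrative rather than
verbatim):

    /* irq_work callback: release mmap_sem on behalf of the tracing context */
    static void do_up_read(struct irq_work *entry)
    {
    	struct stack_map_irq_work *work;

    	work = container_of(entry, struct stack_map_irq_work, irq_work);
    	up_read_non_owner(work->sem);
    	work->sem = NULL;
    }

    /* tail of stack_map_get_build_id_offset() */
    if (!work) {
    	/* safe context: release mmap_sem directly */
    	up_read(&current->mm->mmap_sem);
    } else {
    	/*
    	 * Cannot release here (NMI, or with this patch, rq->lock held):
    	 * hand the rwsem to irq_work and tell lockdep the lock leaves
    	 * this context.
    	 */
    	work->sem = &current->mm->mmap_sem;
    	irq_work_queue(&work->irq_work);
    	rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);
    }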