
[V2,01/20] bpf: Enforce preallocation for all instrumentation programs

Message ID 20200220204617.440152945@linutronix.de
State Changes Requested
Delegated to: BPF Maintainers
Series: bpf: Make BPF and PREEMPT_RT co-exist

Commit Message

Thomas Gleixner Feb. 20, 2020, 8:45 p.m. UTC
The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:

 kmalloc() from regular non BPF context
  cache empty
   freelist empty
    lock(zone->lock);
     tracepoint or kprobe
      BPF()
       update_elem()
        lock(bucket)
          kmalloc()
           cache empty
            freelist empty
             lock(zone->lock);  <- DEADLOCK

There are other ways which do not involve locking to create wreckage:

 kmalloc() from regular non BPF context
  local_irq_save();
   ...
    obj = percpu_slab_first();
     kprobe()
      BPF()
       update_elem()
        lock(bucket)
         kmalloc()
          local_irq_save();
           ...
            obj = percpu_slab_first(); <- Same object as above ...

So preallocation _must_ be enforced for all variants of intrusive
instrumentation.
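
For illustration, nothing exotic is required on the BPF side to walk
into the callchains above; a minimal sketch (the probe point, map
sizing and names here are illustrative, not from the series):

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: a kprobe program whose map updates
 * allocate at run time because the map was created with
 * BPF_F_NO_PREALLOC.
 */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* run-time allocation */
	__type(key, __u64);
	__type(value, __u64);
} counts SEC(".maps");

/* Any probe that can fire while the allocator holds zone->lock or
 * sits in an irq-disabled percpu slab fastpath reproduces the
 * scenarios above.
 */
SEC("kprobe/kfree")
int trace_free(struct pt_regs *ctx)
{
	__u64 key = bpf_get_current_pid_tgid();
	__u64 one = 1;

	/* Inserting a new element means kmalloc() from inside the probe */
	bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";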

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: New patch
---
 kernel/bpf/verifier.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

Comments

Alexei Starovoitov Feb. 22, 2020, 4:29 a.m. UTC | #1
On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
> The assumption that only programs attached to perf NMI events can deadlock
> on memory allocators is wrong. Assume the following simplified callchain:
> 
>  kmalloc() from regular non BPF context
>   cache empty
>    freelist empty
>     lock(zone->lock);
>      tracepoint or kprobe
>       BPF()
>        update_elem()
>         lock(bucket)
>           kmalloc()
>            cache empty
>             freelist empty
>              lock(zone->lock);  <- DEADLOCK
> 
> There are other ways which do not involve locking to create wreckage:
> 
>  kmalloc() from regular non BPF context
>   local_irq_save();
>    ...
>     obj = percpu_slab_first();
>      kprobe()
>       BPF()
>        update_elem()
>         lock(bucket)
>          kmalloc()
>           local_irq_save();
>            ...
>             obj = percpu_slab_first(); <- Same object as above ...
> 
> So preallocation _must_ be enforced for all variants of intrusive
> instrumentation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: New patch
> ---
>  kernel/bpf/verifier.c |   18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8144,19 +8144,23 @@ static int check_map_prog_compatibility(
>  					struct bpf_prog *prog)
>  
>  {
> -	/* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
> -	 * preallocated hash maps, since doing memory allocation
> -	 * in overflow_handler can crash depending on where nmi got
> -	 * triggered.
> +	/*
> +	 * Make sure that trace type programs only use preallocated hash
> +	 * maps. Perf programs obviously can't do memory allocation in NMI
> +	 * context and all other types can deadlock on a memory allocator
> +	 * lock when a tracepoint/kprobe triggers a BPF program inside a
> +	 * lock held region or create inconsistent state when the probe is
> +	 * within an interrupts disabled critical region in the memory
> +	 * allocator.
>  	 */
> -	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
> +	if ((is_tracing_prog_type(prog->type)) {

This doesn't build.
I assumed the typo somehow sneaked in and proceeded, but it broke
a bunch of tests:
Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
One can argue that the tests are unsafe and broken.
We used to run all those tests with and without prealloc:
map_flags = 0;
run_all_tests();
map_flags = BPF_F_NO_PREALLOC;
run_all_tests();
Then 4 years ago commit 5aa5bd14c5f866 switched the hashmap to no_prealloc
always, and that's how it has stayed since then. We can adjust the tests to use
prealloc with tracing progs, but this breakage shows that there could be plenty
of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
be because they know that their kprobes are in a safe spot (and kmalloc is ok)
and they want to save memory. They could be using a large max_entries parameter
for worst case hash map usage while typical load is low. In general hashtables
don't perform well above 50% fill, so prealloc wastes half of the memory. Since
we cannot control where kprobes are placed I'm not sure what the right fix is
here. It feels like if we proceed with this patch somebody will complain and we
will have to revert, but I'm willing to take this risk if we cannot come up
with an alternative fix.
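
The pattern being described would plausibly look like this on the user
side, using the bpf_create_map() wrapper libbpf had at the time (the
function name and sizing are made up for illustration):

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Sized for the worst case but sparsely populated in practice. With
 * prealloc, roughly max_entries * (key + value + element overhead)
 * is pinned up front; BPF_F_NO_PREALLOC allocates only as entries
 * actually appear.
 */
int create_sparse_trace_map(void)
{
	return bpf_create_map(BPF_MAP_TYPE_HASH,
			      sizeof(__u64),	/* key size */
			      sizeof(__u64),	/* value size */
			      1 << 20,		/* worst case max_entries */
			      BPF_F_NO_PREALLOC);
}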

Going further with the patchset.

Patch 9 "bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites."
adds a new warning:
../kernel/seccomp.c: In function ‘seccomp_run_filters’:
../kernel/seccomp.c:272:50: warning: passing argument 2 of ‘bpf_prog_run_pin_on_cpu’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
   u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
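
Presumably the fix is to make the helper's context argument const,
since the program's bpf_func pointer already takes a const void *; a
hedged sketch of what that might look like (assuming the
migrate_disable() pinning this series introduces elsewhere):

static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
					  const void *ctx)
{
	u32 ret;

	/* Pin the program to this CPU for the duration of the run */
	migrate_disable();
	ret = BPF_PROG_RUN(prog, ctx);
	migrate_enable();
	return ret;
}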

I fixed it up and proceeded, but patch 16 failed to apply:

Applying: bpf: Factor out hashtab bucket lock operations
error: sha1 information is lacking or useless (kernel/bpf/hashtab.c).
error: could not build fake ancestor
Patch failed at 0001 bpf: Factor out hashtab bucket lock operations

I patched it in manually:
patch -p1 < a.patch
patching file kernel/bpf/hashtab.c
Hunk #1 succeeded at 1333 (offset 14 lines).
Hunk #2 succeeded at 1361 with fuzz 1 (offset 24 lines).
Hunk #3 succeeded at 1372 with fuzz 1 (offset 27 lines).
Hunk #4 succeeded at 1442 (offset 48 lines).
patching file kernel/bpf/syscall.c

and it looks correct.

But patch 17 failed completely:
patch -p1 < b.patch
patching file kernel/bpf/hashtab.c
Hunk #1 succeeded at 88 (offset 1 line).
Hunk #2 succeeded at 374 (offset 12 lines).
Hunk #3 succeeded at 437 (offset 12 lines).
Hunk #4 succeeded at 645 (offset 12 lines).
Hunk #5 succeeded at 653 (offset 12 lines).
Hunk #6 succeeded at 919 (offset 12 lines).
Hunk #7 succeeded at 960 (offset 12 lines).
Hunk #8 succeeded at 998 (offset 12 lines).
Hunk #9 succeeded at 1017 (offset 12 lines).
Hunk #10 succeeded at 1052 (offset 12 lines).
Hunk #11 succeeded at 1075 (offset 12 lines).
Hunk #12 succeeded at 1115 (offset 12 lines).
Hunk #13 succeeded at 1137 (offset 12 lines).
Hunk #14 succeeded at 1175 (offset 12 lines).
Hunk #15 succeeded at 1185 (offset 12 lines).
Hunk #16 succeeded at 1207 (offset 12 lines).
Hunk #17 succeeded at 1216 (offset 12 lines).
Hunk #18 FAILED at 1349.
Hunk #19 FAILED at 1358.
Hunk #20 FAILED at 1366.
Hunk #21 FAILED at 1407.
4 out of 21 hunks FAILED -- saving rejects to file kernel/bpf/hashtab.c.rej

That's where I gave up.

I pulled the sched-for-bpf-2020-02-20 branch from tip and pushed it into bpf-next.
Could you please rebase your set on top of bpf-next and repost?
The logic in all patches looks good.

For now I propose to drop patch 1 and get the rest merged while we're
figuring out what to do.

Thanks!
Thomas Gleixner Feb. 22, 2020, 8:40 a.m. UTC | #2
Alexei,

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
>> The assumption that only programs attached to perf NMI events can deadlock
>> on memory allocators is wrong. Assume the following simplified callchain:
>>  	 */
>> -	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
>> +	if ((is_tracing_prog_type(prog->type)) {
>
> This doesn't build.
> I assumed the typo somehow sneaked in and proceeded, but it broke
> a bunch of tests:
> Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
> One can argue that the tests are unsafe and broken.
> We used to run all those tests with and without prealloc:
> map_flags = 0;
> run_all_tests();
> map_flags = BPF_F_NO_PREALLOC;
> run_all_tests();
> Then 4 years ago commit 5aa5bd14c5f866 switched the hashmap to no_prealloc
> always, and that's how it has stayed since then. We can adjust the tests to use
> prealloc with tracing progs, but this breakage shows that there could be plenty
> of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
> be because they know that their kprobes are in a safe spot (and kmalloc is ok)
> and they want to save memory. They could be using a large max_entries parameter
> for worst case hash map usage while typical load is low. In general hashtables
> don't perform well above 50% fill, so prealloc wastes half of the memory. Since
> we cannot control where kprobes are placed I'm not sure what the right fix is
> here. It feels like if we proceed with this patch somebody will complain and we
> will have to revert, but I'm willing to take this risk if we cannot come up
> with an alternative fix.

Having something exposed which is known to be broken is not a good option
either.

Just assume that someone is investigating a kernel issue. A BOFH who is
stuck in the 90s uses perf, kprobes and tracepoints. Now he goes on
vacation and the new kid in the team decides to flip that over to BPF.
So now instead of getting information he deadlocks or crashes the
machine.

You can't just tell him 'don't do that then'. It's broken by design,
and you really can't tell which probes are safe and which are not,
because the allocator calls out into functions which might look
completely unrelated.

So one way to phase this out would be:

	if (is_tracing()) {
		if (is_perf() || IS_ENABLED(RT))
			return -EINVAL;
		WARN_ONCE(.....)
	}

And clearly write in the warning that this is dangerous, broken and
about to be forbidden. Hmm?

> Going further with the patchset.
>
> Patch 9 "bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites."
> adds a new warning:
> ../kernel/seccomp.c: In function ‘seccomp_run_filters’:
> ../kernel/seccomp.c:272:50: warning: passing argument 2 of ‘bpf_prog_run_pin_on_cpu’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
>    u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);

Uurgh. I'm sure I fixed that and then I must have lost it again while
reshuffling stuff. Sorry about that.

> That's where I gave up.

Fair enough.

> I pulled the sched-for-bpf-2020-02-20 branch from tip and pushed it into bpf-next.
> Could you please rebase your set on top of bpf-next and repost?
> The logic in all patches looks good.

Will do.

Thanks,

        tglx
Alexei Starovoitov Feb. 23, 2020, 10:40 p.m. UTC | #3
On Sat, Feb 22, 2020 at 09:40:10AM +0100, Thomas Gleixner wrote:
> Alexei,
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
> >> The assumption that only programs attached to perf NMI events can deadlock
> >> on memory allocators is wrong. Assume the following simplified callchain:
> >>  	 */
> >> -	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
> >> +	if ((is_tracing_prog_type(prog->type)) {
> >
> > This doesn't build.
> > I assumed the typo somehow sneaked in and proceeded, but it broke
> > a bunch of tests:
> > Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
> > One can argue that the tests are unsafe and broken.
> > We used to run all those tests with and without prealloc:
> > map_flags = 0;
> > run_all_tests();
> > map_flags = BPF_F_NO_PREALLOC;
> > run_all_tests();
> > Then 4 years ago commit 5aa5bd14c5f866 switched the hashmap to no_prealloc
> > always, and that's how it has stayed since then. We can adjust the tests to use
> > prealloc with tracing progs, but this breakage shows that there could be plenty
> > of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
> > be because they know that their kprobes are in a safe spot (and kmalloc is ok)
> > and they want to save memory. They could be using a large max_entries parameter
> > for worst case hash map usage while typical load is low. In general hashtables
> > don't perform well above 50% fill, so prealloc wastes half of the memory. Since
> > we cannot control where kprobes are placed I'm not sure what the right fix is
> > here. It feels like if we proceed with this patch somebody will complain and we
> > will have to revert, but I'm willing to take this risk if we cannot come up
> > with an alternative fix.
> 
> Having something exposed which is known to be broken is not a good option
> either.
> 
> Just assume that someone is investigating a kernel issue. A BOFH who is
> stuck in the 90s uses perf, kprobes and tracepoints. Now he goes on
> vacation and the new kid in the team decides to flip that over to BPF.
> So now instead of getting information he deadlocks or crashes the
> machine.
> 
> You can't just tell him 'don't do that then'. It's broken by design,
> and you really can't tell which probes are safe and which are not,
> because the allocator calls out into functions which might look
> completely unrelated.
> 
> So one way to phase this out would be:
> 
> 	if (is_tracing()) {
> 		if (is_perf() || IS_ENABLED(RT))
> 			return -EINVAL;
> 		WARN_ONCE(.....)
> 	}
> 
> And clearly write in the warning that this is dangerous, broken and
> about to be forbidden. Hmm?

Yeah. Let's start with WARN_ONCE and verbose(env, "dangerous, broken")
so the users see it in the verifier log and the people who maintain
servers (like the kernel teams in fb, goog, etc.) see it as well
in their dmesg logs. So the motivation will be on both sides.
Then in a few kernel releases we can flip it to disabled.
Or we'll find a way to make it work without pre-allocating.
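
Taken together, the plan would look something like this inside
check_map_prog_compatibility() (an illustrative sketch of the
discussion above, not the posted patch):

	if (is_tracing_prog_type(prog->type) && !check_map_prealloc(map)) {
		/* Contexts where run-time allocation can never work:
		 * perf programs run in NMI context, and PREEMPT_RT
		 * must not take the allocator locks here at all.
		 */
		if (prog->type == BPF_PROG_TYPE_PERF_EVENT ||
		    IS_ENABLED(CONFIG_PREEMPT_RT)) {
			verbose(env, "tracing programs can only use preallocated hash map\n");
			return -EINVAL;
		}
		/* Everyone else gets the warning in both channels: the
		 * verifier log for the program author, dmesg for
		 * whoever runs the machines.
		 */
		verbose(env, "trace type programs with run-time allocated hash maps are unsafe\n");
		WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
	}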

Patch

--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8144,19 +8144,23 @@  static int check_map_prog_compatibility(
 					struct bpf_prog *prog)
 
 {
-	/* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
-	 * preallocated hash maps, since doing memory allocation
-	 * in overflow_handler can crash depending on where nmi got
-	 * triggered.
+	/*
+	 * Make sure that trace type programs only use preallocated hash
+	 * maps. Perf programs obviously can't do memory allocation in NMI
+	 * context and all other types can deadlock on a memory allocator
+	 * lock when a tracepoint/kprobe triggers a BPF program inside a
+	 * lock held region or create inconsistent state when the probe is
+	 * within an interrupts disabled critical region in the memory
+	 * allocator.
 	 */
-	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
+	if ((is_tracing_prog_type(prog->type)) {
 		if (!check_map_prealloc(map)) {
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
+			verbose(env, "tracing programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
 		if (map->inner_map_meta &&
 		    !check_map_prealloc(map->inner_map_meta)) {
-			verbose(env, "perf_event programs can only use preallocated inner hash map\n");
+			verbose(env, "tracing programs can only use preallocated inner hash map\n");
 			return -EINVAL;
 		}
 	}