From patchwork Mon Feb 24 14:01:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243225 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R53C21Zvz9sRf for ; Tue, 25 Feb 2020 02:05:27 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727854AbgBXPFS (ORCPT ); Mon, 24 Feb 2020 10:05:18 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50177 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727359AbgBXPDU (ORCPT ); Mon, 24 Feb 2020 10:03:20 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFo-0004s2-S2; Mon, 24 Feb 2020 16:02:40 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 85B9410408E; Mon, 24 Feb 2020 16:02:40 +0100 (CET) Message-Id: <20200224145642.540542802@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:32 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 01/22] bpf: Tighten the requirements for preallocated hash maps References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The assumption that only programs attached to perf NMI events can deadlock on memory allocators is wrong. Assume the following simplified callchain: kmalloc() from regular non BPF context cache empty freelist empty lock(zone->lock); tracepoint or kprobe BPF() update_elem() lock(bucket) kmalloc() cache empty freelist empty lock(zone->lock); <- DEADLOCK There are other ways which do not involve locking to create wreckage: kmalloc() from regular non BPF context local_irq_save(); ... obj = slab_first(); kprobe() BPF() update_elem() lock(bucket) kmalloc() local_irq_save(); ... obj = slab_first(); <- Same object as above ... So preallocation _must_ be enforced for all variants of intrusive instrumentation. Unfortunately immediate enforcement would break backwards compatibility, so for now such programs still are allowed to run, but a one time warning is emitted in dmesg and the verifier emits a warning in the verifier log as well so developers are made aware about this and can fix their programs before the enforcement becomes mandatory. Signed-off-by: Thomas Gleixner --- V3: Still allow run-time allocation for !RT. Emit warnings. Split out the RT part as this really should be backported to stable kernels. 
V2: New patch --- kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++++++----------- 1 file changed, 28 insertions(+), 11 deletions(-) --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8143,26 +8143,43 @@ static bool is_tracing_prog_type(enum bp } } +static bool is_preallocated_map(struct bpf_map *map) +{ + if (!check_map_prealloc(map)) + return false; + if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta)) + return false; + return true; +} + static int check_map_prog_compatibility(struct bpf_verifier_env *env, struct bpf_map *map, struct bpf_prog *prog) { - /* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use - * preallocated hash maps, since doing memory allocation - * in overflow_handler can crash depending on where nmi got - * triggered. + /* + * Validate that trace type programs use preallocated hash maps. + * + * For programs attached to PERF events this is mandatory as the + * perf NMI can hit any arbitrary code sequence. + * + * All other trace types using preallocated hash maps are unsafe as + * well because tracepoint or kprobes can be inside locked regions + * of the memory allocator or at a place where a recursion into the + * memory allocator would see inconsistent state. + * + * For now running such programs is allowed for backwards + * compatibility reasons, but warnings are emitted so developers are + * made aware of the unsafety and can fix their programs before this + * is enforced. */ - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { - if (!check_map_prealloc(map)) { + if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { + if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } - if (map->inner_map_meta && - !check_map_prealloc(map->inner_map_meta)) { - verbose(env, "perf_event programs can only use preallocated inner hash map\n"); - return -EINVAL; - } + WARN_ONCE(1, "trace type BPF program uses run-time allocation\n"); + verbose(env, "trace type programs with run-time allocated hash maps are unsafe. 
Switch to preallocated hash maps.\n"); } if ((is_tracing_prog_type(prog->type) || From patchwork Mon Feb 24 14:01:33 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243203 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R5250Dxzz9sRh for ; Tue, 25 Feb 2020 02:04:29 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728260AbgBXPEJ (ORCPT ); Mon, 24 Feb 2020 10:04:09 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50215 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727938AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFp-0004s4-1b; Mon, 24 Feb 2020 16:02:41 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id BA5E010408F; Mon, 24 Feb 2020 16:02:40 +0100 (CET) Message-Id: <20200224145642.648784007@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:33 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 02/22] bpf: Enforce preallocation for instrumentation programs on RT References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Aside of the general unsafety of run-time map allocation for instrumentation type programs RT enabled kernels have another constraint: The instrumentation programs are invoked with preemption disabled, but the memory allocator spinlocks cannot be acquired in atomic context because they are converted to 'sleeping' spinlocks on RT. Therefore enforce map preallocation for these programs types when RT is enabled. Signed-off-by: Thomas Gleixner --- V3: New patch --- kernel/bpf/verifier.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8168,16 +8168,21 @@ static int check_map_prog_compatibility( * of the memory allocator or at a place where a recursion into the * memory allocator would see inconsistent state. * - * For now running such programs is allowed for backwards - * compatibility reasons, but warnings are emitted so developers are - * made aware of the unsafety and can fix their programs before this - * is enforced. + * On RT enabled kernels run-time allocation of all trace type + * programs is strictly prohibited due to lock type constraints. 
On + * !RT kernels it is allowed for backwards compatibility reasons for + * now, but warnings are emitted so developers are made aware of + * the unsafety and can fix their programs before this is enforced. */ if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + verbose(env, "trace type programs can only use preallocated hash map\n"); + return -EINVAL; + } WARN_ONCE(1, "trace type BPF program uses run-time allocation\n"); verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n"); } From patchwork Mon Feb 24 14:01:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243196 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51Y5PmRz9sNg for ; Tue, 25 Feb 2020 02:04:01 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728190AbgBXPD5 (ORCPT ); Mon, 24 Feb 2020 10:03:57 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50216 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727950AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFp-0004sA-7g; Mon, 24 Feb 2020 16:02:41 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id EEAC1FFB71; Mon, 24 Feb 2020 16:02:40 +0100 (CET) Message-Id: <20200224145642.755793061@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:34 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 03/22] bpf: Update locking comment in hashtab code References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The comment where the bucket lock is acquired says: /* bpf_map_update_elem() can be called in_irq() */ which is not really helpful and aside of that it does not explain the subtle details of the hash bucket locks expecially in the context of BPF and perf, kprobes and tracing. Add a comment at the top of the file which explains the protection scopes and the details how potential deadlocks are prevented. 
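To make the mechanism that the new comment refers to concrete, here is a minimal sketch of the per CPU recursion guard, simplified from the pattern used by trace_call_bpf() with bpf_prog_active; it is an illustration, not the literal kernel code:

	DECLARE_PER_CPU(int, bpf_prog_active);

	/*
	 * A trace type BPF program is only run when no other such program
	 * is already running on this CPU, so it cannot re-enter a code
	 * section which holds a hash bucket lock.
	 */
	static unsigned int run_trace_bpf_prog(struct bpf_prog *prog, void *ctx)
	{
		unsigned int ret = 1;

		preempt_disable();
		if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
			goto out;	/* Recursion on this CPU, skip the program */

		rcu_read_lock();
		ret = BPF_PROG_RUN(prog, ctx);
		rcu_read_unlock();
	out:
		__this_cpu_dec(bpf_prog_active);
		preempt_enable();
		return ret;
	}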
Signed-off-by: Thomas Gleixner --- kernel/bpf/hashtab.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -27,6 +27,26 @@ .map_delete_batch = \ generic_map_delete_batch +/* + * The bucket lock has two protection scopes: + * + * 1) Serializing concurrent operations from BPF programs on different + * CPUs + * + * 2) Serializing concurrent operations from BPF programs and sys_bpf() + * + * BPF programs can execute in any context including perf, kprobes and + * tracing. As there are almost no limits where perf, kprobes and tracing + * can be invoked from, the lock operations need to be protected against + * deadlocks. Deadlocks can be caused by recursion and by an invocation in + * the lock held section when functions which acquire this lock are invoked + * from sys_bpf(). BPF recursion is prevented by incrementing the per CPU + * variable bpf_prog_active, which prevents BPF programs attached to perf + * events, kprobes and tracing from being invoked before the prior invocation + * from one of these contexts has completed. sys_bpf() uses the same mechanism + * by pinning the task to the current CPU and incrementing the recursion + * protection across the map operation. + */ struct bucket { struct hlist_nulls_head head; raw_spinlock_t lock; @@ -884,7 +904,6 @@ static int htab_map_update_elem(struct b */ } - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -964,7 +983,6 @@ static int htab_lru_map_update_elem(stru return -ENOMEM; memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size); - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1019,7 +1037,6 @@ static int __htab_percpu_map_update_elem b = __select_bucket(htab, hash); head = &b->head; - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1083,7 +1100,6 @@ static int __htab_lru_percpu_map_update_ return -ENOMEM; } - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); From patchwork Mon Feb 24 14:01:35 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243186 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R50s4gJdz9sRk for ; Tue, 25 Feb 2020 02:03:25 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727992AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50210 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727901AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with 
esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFp-0004sE-F5; Mon, 24 Feb 2020 16:02:41 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 2F32710408E; Mon, 24 Feb 2020 16:02:41 +0100 (CET) Message-Id: <20200224145642.847220186@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:35 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 04/22] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation. This is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner --- V3: Add a cant_sleep() check. --- kernel/trace/bpf_trace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1516,10 +1516,9 @@ void bpf_put_raw_tracepoint(struct bpf_r static __always_inline void __bpf_trace_run(struct bpf_prog *prog, u64 *args) { + cant_sleep(); rcu_read_lock(); - preempt_disable(); (void) BPF_PROG_RUN(prog, args); - preempt_enable(); rcu_read_unlock(); } From patchwork Mon Feb 24 14:01:37 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243212 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52c5jh4z9sRh for ; Tue, 25 Feb 2020 02:04:56 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728359AbgBXPEv (ORCPT ); Mon, 24 Feb 2020 10:04:51 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50207 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726762AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFp-0004sK-Tv; Mon, 24 Feb 2020 16:02:42 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 9B151FFB71; Mon, 24 Feb 2020 16:02:41 +0100 (CET) Message-Id: <20200224145643.059995527@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:37 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter 
Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 06/22] bpf/trace: Remove redundant preempt_disable from trace_call_bpf() References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner --- V3: New patch. Replaces the previous one which converted this to migrate_disable() --- kernel/trace/bpf_trace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -83,7 +83,7 @@ unsigned int trace_call_bpf(struct trace if (in_nmi()) /* not supported yet */ return 1; - preempt_disable(); + cant_sleep(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { /* @@ -115,7 +115,6 @@ unsigned int trace_call_bpf(struct trace out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); return ret; } From patchwork Mon Feb 24 14:01:38 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243227 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R53F6yv5z9sRQ for ; Tue, 25 Feb 2020 02:05:29 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727792AbgBXPDU (ORCPT ); Mon, 24 Feb 2020 10:03:20 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50179 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727498AbgBXPDT (ORCPT ); Mon, 24 Feb 2020 10:03:19 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFt-0004sL-63; Mon, 24 Feb 2020 16:02:45 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id D0BC510408E; Mon, 24 Feb 2020 16:02:41 +0100 (CET) Message-Id: <20200224145643.151953573@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:38 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 07/22] perf/bpf: Remove preempt disable around BPF invocation References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The BPF invocation from the perf event overflow handler does not require to disable preemption because 
this is called from NMI or at least hard interrupt context which is already non-preemptible. Signed-off-by: Thomas Gleixner --- kernel/events/core.c | 2 -- 1 file changed, 2 deletions(-) --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9206,7 +9206,6 @@ static void bpf_overflow_handler(struct int ret = 0; ctx.regs = perf_arch_bpf_user_pt_regs(regs); - preempt_disable(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) goto out; rcu_read_lock(); @@ -9214,7 +9213,6 @@ static void bpf_overflow_handler(struct rcu_read_unlock(); out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); if (!ret) return; From patchwork Mon Feb 24 14:01:39 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243215 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52n2TNlz9sRf for ; Tue, 25 Feb 2020 02:05:05 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728387AbgBXPE7 (ORCPT ); Mon, 24 Feb 2020 10:04:59 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50209 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727899AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFv-0004sQ-BQ; Mon, 24 Feb 2020 16:02:47 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 11412FFB71; Mon, 24 Feb 2020 16:02:42 +0100 (CET) Message-Id: <20200224145643.259118710@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:39 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 08/22] bpf: Remove recursion prevention from rcu free callback References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org If an element is freed via RCU then recursion into BPF instrumentation functions is not a concern. The element is already detached from the map and the RCU callback does not hold any locks on which a kprobe, perf event or tracepoint attached BPF program could deadlock. 
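For context, a simplified sketch of the delete and free path this reasoning applies to (condensed from the existing kernel/bpf/hashtab.c code, not a verbatim copy):

	raw_spin_lock_irqsave(&b->lock, flags);
	l = lookup_elem_raw(head, hash, key, key_size);
	if (l) {
		hlist_nulls_del_rcu(&l->hash_node);	/* element now unreachable */
		free_htab_elem(htab, l);	/* run-time allocated maps: queues call_rcu() */
	}
	raw_spin_unlock_irqrestore(&b->lock, flags);

	/*
	 * htab_elem_free_rcu() runs after a grace period, outside of the
	 * bucket lock, so a kprobe hitting the allocator during the kfree()
	 * cannot deadlock on that lock.
	 */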
Signed-off-by: Thomas Gleixner --- V2: New patch --- kernel/bpf/hashtab.c | 8 -------- 1 file changed, 8 deletions(-) --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -706,15 +706,7 @@ static void htab_elem_free_rcu(struct rc struct htab_elem *l = container_of(head, struct htab_elem, rcu); struct bpf_htab *htab = l->htab; - /* must increment bpf_prog_active to avoid kprobe+bpf triggering while - * we're calling kfree, otherwise deadlock is possible if kprobes - * are placed somewhere inside of slub - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); htab_elem_free(htab, l); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); } static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) From patchwork Mon Feb 24 14:01:40 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243209 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52J0FVLz9sRk for ; Tue, 25 Feb 2020 02:04:40 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728012AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50191 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727718AbgBXPDV (ORCPT ); Mon, 24 Feb 2020 10:03:21 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFv-0004sR-IQ; Mon, 24 Feb 2020 16:02:47 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 46DF210408F; Mon, 24 Feb 2020 16:02:42 +0100 (CET) Message-Id: <20200224145643.365930116@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:40 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 09/22] bpf: Dont iterate over possible CPUs with interrupts disabled References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org pcpu_freelist_populate() is disabling interrupts and then iterates over the possible CPUs. The reason why this disables interrupts is to silence lockdep because the invoked ___pcpu_freelist_push() takes spin locks. Neither the interrupt disabling nor the locking are required in this function because it's called during initialization and the resulting map is not yet visible to anything. Split out the actual push assignement into an inline, call it from the loop and remove the interrupt disable. 
Signed-off-by: Thomas Gleixner --- kernel/bpf/percpu_freelist.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) --- a/kernel/bpf/percpu_freelist.c +++ b/kernel/bpf/percpu_freelist.c @@ -25,12 +25,18 @@ void pcpu_freelist_destroy(struct pcpu_f free_percpu(s->freelist); } +static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head, + struct pcpu_freelist_node *node) +{ + node->next = head->first; + head->first = node; +} + static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head, struct pcpu_freelist_node *node) { raw_spin_lock(&head->lock); - node->next = head->first; - head->first = node; + pcpu_freelist_push_node(head, node); raw_spin_unlock(&head->lock); } @@ -56,21 +62,16 @@ void pcpu_freelist_populate(struct pcpu_ u32 nr_elems) { struct pcpu_freelist_head *head; - unsigned long flags; int i, cpu, pcpu_entries; pcpu_entries = nr_elems / num_possible_cpus() + 1; i = 0; - /* disable irq to workaround lockdep false positive - * in bpf usage pcpu_freelist_populate() will never race - * with pcpu_freelist_push() - */ - local_irq_save(flags); for_each_possible_cpu(cpu) { again: head = per_cpu_ptr(s->freelist, cpu); - ___pcpu_freelist_push(head, buf); + /* No locking required as this is not visible yet. */ + pcpu_freelist_push_node(head, buf); i++; buf += elem_size; if (i == nr_elems) @@ -78,7 +79,6 @@ void pcpu_freelist_populate(struct pcpu_ if (i % pcpu_entries) goto again; } - local_irq_restore(flags); } struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s) From patchwork Mon Feb 24 14:01:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243224 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R53356Qbz9sNg for ; Tue, 25 Feb 2020 02:05:19 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728072AbgBXPFT (ORCPT ); Mon, 24 Feb 2020 10:05:19 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50184 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727699AbgBXPDU (ORCPT ); Mon, 24 Feb 2020 10:03:20 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFq-0004sT-Qt; Mon, 24 Feb 2020 16:02:42 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 7D244104090; Mon, 24 Feb 2020 16:02:42 +0100 (CET) Message-Id: <20200224145643.474592620@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:41 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 10/22] bpf: Provide bpf_prog_run_pin_on_cpu() 
helper References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org BPF programs require to run on one CPU to completion as they use per CPU storage, but according to Alexei they don't need reentrancy protection as obviously BPF programs running in thread context can always be 'preempted' by hard and soft interrupts and instrumentation and the same program can run concurrently on a different CPU. The currently used mechanism to ensure CPUness is to wrap the invocation into a preempt_disable/enable() pair. Disabling preemption is also disabling migration for a task. preempt_disable/enable() is used because there is no explicit way to reliably disable only migration. Provide a separate macro to invoke a BPF program which can be used in migrateable task context. It wraps BPF_PROG_RUN() in a migrate_disable/enable() pair which maps on non RT enabled kernels to preempt_disable/enable(). On RT enabled kernels this merely disables migration. Both methods ensure that the invoked BPF program runs on one CPU to completion. Signed-off-by: Thomas Gleixner --- V3: Make the 'ctx' argument const to unbreak the build V2: Use an inline function (Mathieu) --- include/linux/filter.h | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -576,8 +576,30 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl } \ ret; }) -#define BPF_PROG_RUN(prog, ctx) __BPF_PROG_RUN(prog, ctx, \ - bpf_dispatcher_nopfunc) +#define BPF_PROG_RUN(prog, ctx) \ + __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc) + +/* + * Use in preemptible and therefore migratable context to make sure that + * the execution of the BPF program runs on one CPU. + * + * This uses migrate_disable/enable() explicitly to document that the + * invocation of a BPF program does not require reentrancy protection + * against a BPF program which is invoked from a preempting task. + * + * For non RT enabled kernels migrate_disable/enable() maps to + * preempt_disable/enable(), i.e. it disables also preemption. 
+ */ +static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog, + void *ctx) +{ + u32 ret; + + migrate_disable(); + ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc); + migrate_enable(); + return ret; +} #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN From patchwork Mon Feb 24 14:01:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243185 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R50r2dntz9sRR for ; Tue, 25 Feb 2020 02:03:24 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727954AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50205 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727885AbgBXPDW (ORCPT ); Mon, 24 Feb 2020 10:03:22 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFs-0004sW-6V; Mon, 24 Feb 2020 16:02:44 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id B2A2E104092; Mon, 24 Feb 2020 16:02:42 +0100 (CET) Message-Id: <20200224145643.583038889@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:42 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 11/22] bpf: Replace cant_sleep() with cant_migrate() References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org As already discussed in the previous change, which introduced bpf_prog_run_pin_on_cpu(), BPF only requires migration to be disabled to guarantee that a program runs to completion on one CPU. If RT substitutes the preempt-disable based migration protection, the cant_sleep() check will obviously trigger because preemption is not disabled. Replace it with cant_migrate(), which maps to cant_sleep() on a non-RT kernel and verifies that migration is disabled on a full RT kernel. 
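As an illustration of the constraint (a simplified sketch, not the actual macro placement inside __BPF_PROG_RUN), this is the shape of the run path which bpf_prog_run_pin_on_cpu() enters with only migration disabled:

	static u32 run_prog_pinned(const struct bpf_prog *prog, void *ctx)
	{
		u32 ret;

		migrate_disable();	/* on RT preemption stays enabled      */
		cant_migrate();		/* was cant_sleep(), which would warn  */
					/* here on an RT kernel                */
		ret = BPF_PROG_RUN(prog, ctx);
		migrate_enable();
		return ret;
	}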
Signed-off-by: Thomas Gleixner --- include/linux/filter.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -561,7 +561,7 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl #define __BPF_PROG_RUN(prog, ctx, dfunc) ({ \ u32 ret; \ - cant_sleep(); \ + cant_migrate(); \ if (static_branch_unlikely(&bpf_stats_enabled_key)) { \ struct bpf_prog_stats *stats; \ u64 start = sched_clock(); \ From patchwork Mon Feb 24 14:01:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243204 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R5275HWvz9sRh for ; Tue, 25 Feb 2020 02:04:31 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728244AbgBXPEI (ORCPT ); Mon, 24 Feb 2020 10:04:08 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50217 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727952AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFx-0004th-LJ; Mon, 24 Feb 2020 16:02:49 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id E73EF104094; Mon, 24 Feb 2020 16:02:42 +0100 (CET) Message-Id: <20200224145643.691493094@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:43 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 12/22] bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Miller All of these cases are strictly of the form: preempt_disable(); BPF_PROG_RUN(...); preempt_enable(); Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN() with: migrate_disable(); BPF_PROG_RUN(...); migrate_enable(); On non RT enabled kernels this maps to preempt_disable/enable() and on RT enabled kernels this solely prevents migration, which is sufficient as there is no requirement to prevent reentrancy to any BPF program from a preempting task. The only requirement is that the program stays on the same CPU. Therefore, this is a trivially correct transformation. The seccomp loop does not need protection over the loop. It only needs protection per BPF filter program [ tglx: Converted to bpf_prog_run_pin_on_cpu() ] Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner --- V2: No change. Amended changelog vs. 
seccomp --- include/linux/filter.h | 4 +--- kernel/seccomp.c | 4 +--- net/core/flow_dissector.c | 4 +--- net/core/skmsg.c | 8 ++------ net/kcm/kcmsock.c | 4 +--- 5 files changed, 6 insertions(+), 18 deletions(-) --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -717,9 +717,7 @@ static inline u32 bpf_prog_run_clear_cb( if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ - preempt_disable(); for (; f; f = f->prev) { - u32 cur_ret = BPF_PROG_RUN(f->prog, sd); + u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd); if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) { ret = cur_ret; *match = f; } } - preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -920,9 +920,7 @@ bool bpf_flow_dissect(struct bpf_prog *p (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP); flow_keys->flags = flags; - preempt_disable(); - result = BPF_PROG_RUN(prog, ctx); - preempt_enable(); + result = bpf_prog_run_pin_on_cpu(prog, ctx); flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen); flow_keys->thoff = clamp_t(u16, flow_keys->thoff, --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -628,7 +628,6 @@ int sk_psock_msg_verdict(struct sock *sk struct bpf_prog *prog; int ret; - preempt_disable(); rcu_read_lock(); prog = READ_ONCE(psock->progs.msg_parser); if (unlikely(!prog)) { @@ -638,7 +637,7 @@ int sk_psock_msg_verdict(struct sock *sk sk_msg_compute_data_pointers(msg); msg->sk = sk; - ret = BPF_PROG_RUN(prog, msg); + ret = bpf_prog_run_pin_on_cpu(prog, msg); ret = sk_psock_map_verd(ret, msg->sk_redir); psock->apply_bytes = msg->apply_bytes; if (ret == __SK_REDIRECT) { @@ -653,7 +652,6 @@ int sk_psock_msg_verdict(struct sock *sk } out: rcu_read_unlock(); - preempt_enable(); return ret; } EXPORT_SYMBOL_GPL(sk_psock_msg_verdict); @@ -665,9 +663,7 @@ static int sk_psock_bpf_run(struct sk_ps skb->sk = psock->sk; bpf_compute_data_end_sk_skb(skb); - preempt_disable(); - ret = BPF_PROG_RUN(prog, skb); - preempt_enable(); + ret = bpf_prog_run_pin_on_cpu(prog, skb); /* strparser clones the skb before handing it to a upper layer, * meaning skb_orphan has been called. 
We NULL sk on the way out * to ensure we don't trigger a BUG_ON() in skb/sk operations --- a/net/kcm/kcmsock.c +++ b/net/kcm/kcmsock.c @@ -380,9 +380,7 @@ static int kcm_parse_func_strparser(stru struct bpf_prog *prog = psock->bpf_prog; int res; - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } From patchwork Mon Feb 24 14:01:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243218 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52v0sWmz9sRf for ; Tue, 25 Feb 2020 02:05:11 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727935AbgBXPDW (ORCPT ); Mon, 24 Feb 2020 10:03:22 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50196 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727498AbgBXPDV (ORCPT ); Mon, 24 Feb 2020 10:03:21 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFu-0004v8-Iz; Mon, 24 Feb 2020 16:02:46 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 28A4B104090; Mon, 24 Feb 2020 16:02:43 +0100 (CET) Message-Id: <20200224145643.785306549@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:44 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 13/22] bpf/tests: Use migrate disable instead of preempt disable References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Miller Replace the preemption disable/enable with migrate_disable/enable() to reflect the actual requirement and to allow PREEMPT_RT to substitute it with an actual migration disable mechanism which does not disable preemption. [ tglx: Switched it over to migrate disable ] Signed-off-by: David S. 
Miller Signed-off-by: Thomas Gleixner --- lib/test_bpf.c | 4 ++-- net/bpf/test_run.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -6660,14 +6660,14 @@ static int __run_one(const struct bpf_pr u64 start, finish; int ret = 0, i; - preempt_disable(); + migrate_disable(); start = ktime_get_ns(); for (i = 0; i < runs; i++) ret = BPF_PROG_RUN(fp, data); finish = ktime_get_ns(); - preempt_enable(); + migrate_enable(); *duration = finish - start; do_div(*duration, runs); --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -37,7 +37,7 @@ static int bpf_test_run(struct bpf_prog repeat = 1; rcu_read_lock(); - preempt_disable(); + migrate_disable(); time_start = ktime_get_ns(); for (i = 0; i < repeat; i++) { bpf_cgroup_storage_set(storage); @@ -54,18 +54,18 @@ static int bpf_test_run(struct bpf_prog if (need_resched()) { time_spent += ktime_get_ns() - time_start; - preempt_enable(); + migrate_enable(); rcu_read_unlock(); cond_resched(); rcu_read_lock(); - preempt_disable(); + migrate_disable(); time_start = ktime_get_ns(); } } time_spent += ktime_get_ns() - time_start; - preempt_enable(); + migrate_enable(); rcu_read_unlock(); do_div(time_spent, repeat); From patchwork Mon Feb 24 14:01:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243214 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52g5N3Jz9sRl for ; Tue, 25 Feb 2020 02:04:59 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728059AbgBXPE5 (ORCPT ); Mon, 24 Feb 2020 10:04:57 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50203 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727882AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFv-0004x0-0k; Mon, 24 Feb 2020 16:02:47 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 5E1C4104096; Mon, 24 Feb 2020 16:02:43 +0100 (CET) Message-Id: <20200224145643.891428873@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:45 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 14/22] bpf: Use migrate_disable/enabe() in trampoline code. References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org From: David Miller Instead of preemption disable/enable to reflect the purpose. 
This allows PREEMPT_RT to substitute it with an actual migration disable implementation. On non RT kernels this is still mapped to preempt_disable/enable(). Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner --- kernel/bpf/trampoline.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -367,8 +367,9 @@ void bpf_trampoline_put(struct bpf_tramp mutex_unlock(&trampoline_mutex); } -/* The logic is similar to BPF_PROG_RUN, but with explicit rcu and preempt that - * are needed for trampoline. The macro is split into +/* The logic is similar to BPF_PROG_RUN, but with an explicit + * rcu_read_lock() and migrate_disable() which are required + * for the trampoline. The macro is split into * call _bpf_prog_enter * call prog->bpf_func * call __bpf_prog_exit @@ -378,7 +379,7 @@ u64 notrace __bpf_prog_enter(void) u64 start = 0; rcu_read_lock(); - preempt_disable(); + migrate_disable(); if (static_branch_unlikely(&bpf_stats_enabled_key)) start = sched_clock(); return start; @@ -401,7 +402,7 @@ void notrace __bpf_prog_exit(struct bpf_ stats->nsecs += sched_clock() - start; u64_stats_update_end(&stats->syncp); } - preempt_enable(); + migrate_enable(); rcu_read_unlock(); } From patchwork Mon Feb 24 14:01:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243205 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52D3006z9sRk for ; Tue, 25 Feb 2020 02:04:36 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728227AbgBXPEH (ORCPT ); Mon, 24 Feb 2020 10:04:07 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50211 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727896AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FG0-0004y1-FQ; Mon, 24 Feb 2020 16:02:52 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 958E8104097; Mon, 24 Feb 2020 16:02:43 +0100 (CET) Message-Id: <20200224145643.998293311@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:46 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 15/22] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code. 
References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Miller Replace the preemption disable/enable with migrate_disable/enable() to reflect the actual requirement and to allow PREEMPT_RT to substitute it with an actual migration disable mechanism which does not disable preemption. Including the code paths that go via __bpf_prog_run_save_cb(). Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner --- include/linux/bpf.h | 8 ++++---- include/linux/filter.h | 5 +++-- 2 files changed, 7 insertions(+), 6 deletions(-) --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -885,7 +885,7 @@ int bpf_prog_array_copy(struct bpf_prog_ struct bpf_prog *_prog; \ struct bpf_prog_array *_array; \ u32 _ret = 1; \ - preempt_disable(); \ + migrate_disable(); \ rcu_read_lock(); \ _array = rcu_dereference(array); \ if (unlikely(check_non_null && !_array))\ @@ -898,7 +898,7 @@ int bpf_prog_array_copy(struct bpf_prog_ } \ _out: \ rcu_read_unlock(); \ - preempt_enable(); \ + migrate_enable(); \ _ret; \ }) @@ -932,7 +932,7 @@ int bpf_prog_array_copy(struct bpf_prog_ u32 ret; \ u32 _ret = 1; \ u32 _cn = 0; \ - preempt_disable(); \ + migrate_disable(); \ rcu_read_lock(); \ _array = rcu_dereference(array); \ _item = &_array->items[0]; \ @@ -944,7 +944,7 @@ int bpf_prog_array_copy(struct bpf_prog_ _item++; \ } \ rcu_read_unlock(); \ - preempt_enable(); \ + migrate_enable(); \ if (_ret) \ _ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS); \ else \ --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -677,6 +677,7 @@ static inline u8 *bpf_skb_cb(struct sk_b return qdisc_skb_cb(skb)->data; } +/* Must be invoked with migration disabled */ static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, struct sk_buff *skb) { @@ -702,9 +703,9 @@ static inline u32 bpf_prog_run_save_cb(c { u32 res; - preempt_disable(); + migrate_disable(); res = __bpf_prog_run_save_cb(prog, skb); - preempt_enable(); + migrate_enable(); return res; } From patchwork Mon Feb 24 14:01:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243199 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51g6ZRCz9sNg for ; Tue, 25 Feb 2020 02:04:07 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728225AbgBXPEH (ORCPT ); Mon, 24 Feb 2020 10:04:07 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50214 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727930AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFw-0004yv-0M; Mon, 24 Feb 2020 16:02:48 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 
CC5C2104098; Mon, 24 Feb 2020 16:02:43 +0100 (CET) Message-Id: <20200224145644.103910133@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:47 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 16/22] bpf: Provide recursion prevention helpers References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org The places which need to prevent the execution of trace type BPF programs to prevent deadlocks on the hash bucket lock do this open coded. Provide two inline functions, bpf_disable/enable_instrumentation() to replace these open coded protection constructs. Use migrate_disable/enable() instead of preempt_disable/enable() right away so this works on RT enabled kernels. On a !RT kernel migrate_disable / enable() are mapped to preempt_disable/enable(). These helpers use this_cpu_inc/dec() instead of __this_cpu_inc/dec() on an RT enabled kernel because migrate disabled regions are preemptible and preemption might hit in the middle of a RMW operation which can lead to inconsistent state. Signed-off-by: Thomas Gleixner --- V2: New patch. Use this_cpu_inc/dec() as pointed out by Mathieu. --- include/linux/bpf.h | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -961,6 +961,36 @@ int bpf_prog_array_copy(struct bpf_prog_ #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); +/* + * Block execution of BPF programs attached to instrumentation (perf, + * kprobes, tracepoints) to prevent deadlocks on map operations as any of + * these events can happen inside a region which holds a map bucket lock + * and can deadlock on it. + * + * Use the preemption safe inc/dec variants on RT because migrate disable + * is preemptible on RT and preemption in the middle of the RMW operation + * might lead to inconsistent state. Use the raw variants for non RT + * kernels as migrate_disable() maps to preempt_disable() so the slightly + * more expensive save operation can be avoided. 
+ */ +static inline void bpf_disable_instrumentation(void) +{ + migrate_disable(); + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + this_cpu_inc(bpf_prog_active); + else + __this_cpu_inc(bpf_prog_active); +} + +static inline void bpf_enable_instrumentation(void) +{ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + this_cpu_dec(bpf_prog_active); + else + __this_cpu_dec(bpf_prog_active); + migrate_enable(); +} + extern const struct file_operations bpf_map_fops; extern const struct file_operations bpf_prog_fops; From patchwork Mon Feb 24 14:01:48 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243188 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R5166GlSz9sNg for ; Tue, 25 Feb 2020 02:03:38 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727973AbgBXPDi (ORCPT ); Mon, 24 Feb 2020 10:03:38 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50222 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727995AbgBXPDZ (ORCPT ); Mon, 24 Feb 2020 10:03:25 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFz-0004za-7t; Mon, 24 Feb 2020 16:02:51 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 0F9B9104099; Mon, 24 Feb 2020 16:02:44 +0100 (CET) Message-Id: <20200224145644.211208533@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:48 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 17/22] bpf: Use recursion prevention helpers in hashtab code References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The required protection is that the caller cannot be migrated to a different CPU as these places take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_disable/enable() and this_cpu_inc/dec() pairs with the new recursion prevention helpers to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable in the helpers map to preempt_disable/enable(), i.e. no functional change. 
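For context, a simplified sketch of the consumer side that this per CPU
counter protects (loosely based on trace_call_bpf(); the perf specifics,
NMI handling and error paths are elided, so treat it as an illustration of
the mechanism rather than the exact upstream code): when a map operation
has already elevated bpf_prog_active on this CPU, the trace type BPF
program attached to the kprobe/tracepoint is simply not invoked, which is
what prevents the recursive map operation and the resulting bucket lock
deadlock.

	/*
	 * Illustrative sketch only. Callers are assumed to run with
	 * preemption or migration already disabled, so the per CPU
	 * operations below are safe.
	 */
	unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
	{
		unsigned int ret;

		if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
			/*
			 * A BPF program or a map operation is already
			 * active on this CPU: skip the program instead
			 * of recursing into it.
			 */
			ret = 0;
			goto out;
		}

		ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);
	out:
		__this_cpu_dec(bpf_prog_active);
		return ret;
	}

bpf_disable_instrumentation() / bpf_enable_instrumentation() bump and drop
the same counter from the map operation side, so the check above sees a
non-zero count and backs off.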
Signed-off-by: Thomas Gleixner --- V3: Use recursion prevention helpers consistently --- kernel/bpf/hashtab.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1333,8 +1333,7 @@ static int } again: - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); again_nocopy: dst_key = keys; @@ -1362,8 +1361,7 @@ static int */ raw_spin_unlock_irqrestore(&b->lock, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); goto after_loop; } @@ -1374,8 +1372,7 @@ static int */ raw_spin_unlock_irqrestore(&b->lock, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); kvfree(keys); kvfree(values); goto alloc; @@ -1445,8 +1442,7 @@ static int } rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys, key_size * bucket_cnt) || copy_to_user(uvalues + total * value_size, values, From patchwork Mon Feb 24 14:01:49 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243189 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51G2yZMz9sRQ for ; Tue, 25 Feb 2020 02:03:46 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728103AbgBXPDh (ORCPT ); Mon, 24 Feb 2020 10:03:37 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50219 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727973AbgBXPDZ (ORCPT ); Mon, 24 Feb 2020 10:03:25 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFw-00051R-NK; Mon, 24 Feb 2020 16:02:48 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 451AB10409A; Mon, 24 Feb 2020 16:02:44 +0100 (CET) Message-Id: <20200224145644.317843926@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:49 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 18/22] bpf: Replace open coded recursion prevention in sys_bpf() References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org The required protection is that the caller cannot be migrated to a different CPU as these functions end up in places which take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. 
The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs with the new helper functions. These functions are already prepared to make BPF work on PREEMPT_RT enabled kernels. No functional change for !RT kernels. Signed-off-by: Thomas Gleixner --- V3: Restrict it to sys_bpf(). Hashtab code is already handled. --- kernel/bpf/syscall.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -171,11 +171,7 @@ static int bpf_map_update_value(struct b flags); } - /* must increment bpf_prog_active to avoid kprobe+bpf triggering from - * inside bpf map update or delete otherwise deadlocks are possible - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_update(map, key, value, flags); @@ -206,8 +202,7 @@ static int bpf_map_update_value(struct b err = map->ops->map_update_elem(map, key, value, flags); rcu_read_unlock(); } - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @@ -222,8 +217,7 @@ static int bpf_map_copy_value(struct bpf if (bpf_map_is_dev_bound(map)) return bpf_map_offload_lookup_elem(map, key, value); - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); @@ -268,8 +262,7 @@ static int bpf_map_copy_value(struct bpf rcu_read_unlock(); } - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @@ -1136,13 +1129,11 @@ static int map_delete_elem(union bpf_att goto out; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); out: kfree(key); @@ -1254,13 +1245,11 @@ int generic_map_delete_batch(struct bpf_ break; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); if (err) break; From patchwork Mon Feb 24 14:01:50 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243210 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R52M69dfz9sRQ for ; Tue, 25 Feb 2020 02:04:43 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1727515AbgBXPEm (ORCPT ); Mon, 24 Feb 2020 10:04:42 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50213 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727902AbgBXPDX (ORCPT ); Mon, 24 Feb 2020 10:03:23 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FG2-00052k-6l; Mon, 24 Feb 2020 16:02:54 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 7B6B7104092; Mon, 24 Feb 2020 16:02:44 +0100 (CET) Message-Id: <20200224145644.420416916@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:50 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 19/22] bpf: Factor out hashtab bucket lock operations References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org As a preparation for making the BPF locking RT friendly, factor out the hash bucket lock operations into inline functions. This allows to do the necessary RT modification in one place instead of sprinkling it all over the place. No functional change. The now unused htab argument of the lock/unlock functions will be used in the next step which adds PREEMPT_RT support. Signed-off-by: Thomas Gleixner --- V3: Adopt to the lock changes in __htab_map_lookup_and_delete_batch() --- kernel/bpf/hashtab.c | 69 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 46 insertions(+), 23 deletions(-) --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -88,6 +88,32 @@ struct htab_elem { char key[0] __aligned(8); }; +static void htab_init_buckets(struct bpf_htab *htab) +{ + unsigned i; + + for (i = 0; i < htab->n_buckets; i++) { + INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); + raw_spin_lock_init(&htab->buckets[i].lock); + } +} + +static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab, + struct bucket *b) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&b->lock, flags); + return flags; +} + +static inline void htab_unlock_bucket(const struct bpf_htab *htab, + struct bucket *b, + unsigned long flags) +{ + raw_spin_unlock_irqrestore(&b->lock, flags); +} + static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); static bool htab_is_lru(const struct bpf_htab *htab) @@ -348,8 +374,8 @@ static struct bpf_map *htab_map_alloc(un bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU); bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC); struct bpf_htab *htab; - int err, i; u64 cost; + int err; htab = kzalloc(sizeof(*htab), GFP_USER); if (!htab) @@ -411,10 +437,7 @@ static struct bpf_map *htab_map_alloc(un else htab->hashrnd = get_random_int(); - for (i = 0; i < htab->n_buckets; i++) { - INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - raw_spin_lock_init(&htab->buckets[i].lock); - } + htab_init_buckets(htab); if (prealloc) { err = prealloc_init(htab); @@ -622,7 +645,7 @@ static bool htab_lru_map_delete_node(voi b = __select_bucket(htab, tgt_l->hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); 
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) if (l == tgt_l) { @@ -630,7 +653,7 @@ static bool htab_lru_map_delete_node(voi break; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return l == tgt_l; } @@ -896,7 +919,7 @@ static int htab_map_update_elem(struct b */ } - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -937,7 +960,7 @@ static int htab_map_update_elem(struct b } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -975,7 +998,7 @@ static int htab_lru_map_update_elem(stru return -ENOMEM; memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size); - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -994,7 +1017,7 @@ static int htab_lru_map_update_elem(stru ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (ret) bpf_lru_push_free(&htab->lru, &l_new->lru_node); @@ -1029,7 +1052,7 @@ static int __htab_percpu_map_update_elem b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1052,7 +1075,7 @@ static int __htab_percpu_map_update_elem } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -1092,7 +1115,7 @@ static int __htab_lru_percpu_map_update_ return -ENOMEM; } - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1114,7 +1137,7 @@ static int __htab_lru_percpu_map_update_ } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l_new) bpf_lru_push_free(&htab->lru, &l_new->lru_node); return ret; @@ -1152,7 +1175,7 @@ static int htab_map_delete_elem(struct b b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @@ -1162,7 +1185,7 @@ static int htab_map_delete_elem(struct b ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -1184,7 +1207,7 @@ static int htab_lru_map_delete_elem(stru b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @@ -1193,7 +1216,7 @@ static int htab_lru_map_delete_elem(stru ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l) bpf_lru_push_free(&htab->lru, &l->lru_node); return ret; @@ -1342,7 +1365,7 @@ static int head = &b->head; /* do not grab the lock unless need it (bucket_cnt > 0). */ if (locked) - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); bucket_cnt = 0; hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) @@ -1359,7 +1382,7 @@ static int /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. 
*/ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); bpf_enable_instrumentation(); goto after_loop; @@ -1370,7 +1393,7 @@ static int /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. */ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); bpf_enable_instrumentation(); kvfree(keys); @@ -1423,7 +1446,7 @@ static int dst_val += value_size; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); locked = false; while (node_to_free) { From patchwork Mon Feb 24 14:01:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243193 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51P1TBsz9sRk for ; Tue, 25 Feb 2020 02:03:53 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728149AbgBXPDt (ORCPT ); Mon, 24 Feb 2020 10:03:49 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50223 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728002AbgBXPDZ (ORCPT ); Mon, 24 Feb 2020 10:03:25 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFy-00052t-3b; Mon, 24 Feb 2020 16:02:50 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id B2F2610409C; Mon, 24 Feb 2020 16:02:44 +0100 (CET) Message-Id: <20200224145644.509685912@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:51 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 20/22] bpf: Prepare hashtab locking for PREEMPT_RT References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org PREEMPT_RT forbids certain operations like memory allocations (even with GFP_ATOMIC) from atomic contexts. This is required because even with GFP_ATOMIC the memory allocator calls into code pathes which acquire locks with long held lock sections. To ensure the deterministic behaviour these locks are regular spinlocks, which are converted to 'sleepable' spinlocks on RT. The only true atomic contexts on an RT kernel are the low level hardware handling, scheduling, low level interrupt handling, NMIs etc. None of these contexts should ever do memory allocations. As regular device interrupt handlers and soft interrupts are forced into thread context, the existing code which does spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); just works. 
In theory the BPF locks could be converted to regular spinlocks as well, but the bucket locks and percpu_freelist locks can be taken from arbitrary contexts (perf, kprobes, tracepoints) which are required to be atomic contexts even on RT. These mechanisms require preallocated maps, so there is no need to invoke memory allocations within the lock held sections. BPF maps which need dynamic allocation are only used from (forced) thread context on RT and can therefore use regular spinlocks which in turn allows to invoke memory allocations from the lock held section. To achieve this make the hash bucket lock a union of a raw and a regular spinlock and initialize and lock/unlock either the raw spinlock for preallocated maps or the regular variant for maps which require memory allocations. On a non RT kernel this distinction is neither possible nor required. spinlock maps to raw_spinlock and the extra code and conditional is optimized out by the compiler. No functional change. Signed-off-by: Thomas Gleixner --- kernel/bpf/hashtab.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 56 insertions(+), 9 deletions(-) --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -46,10 +46,43 @@ * from one of these contexts completed. sys_bpf() uses the same mechanism * by pinning the task to the current CPU and incrementing the recursion * protection accross the map operation. + * + * This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain + * operations like memory allocations (even with GFP_ATOMIC) from atomic + * contexts. This is required because even with GFP_ATOMIC the memory + * allocator calls into code pathes which acquire locks with long held lock + * sections. To ensure the deterministic behaviour these locks are regular + * spinlocks, which are converted to 'sleepable' spinlocks on RT. The only + * true atomic contexts on an RT kernel are the low level hardware + * handling, scheduling, low level interrupt handling, NMIs etc. None of + * these contexts should ever do memory allocations. + * + * As regular device interrupt handlers and soft interrupts are forced into + * thread context, the existing code which does + * spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); + * just works. + * + * In theory the BPF locks could be converted to regular spinlocks as well, + * but the bucket locks and percpu_freelist locks can be taken from + * arbitrary contexts (perf, kprobes, tracepoints) which are required to be + * atomic contexts even on RT. These mechanisms require preallocated maps, + * so there is no need to invoke memory allocations within the lock held + * sections. + * + * BPF maps which need dynamic allocation are only used from (forced) + * thread context on RT and can therefore use regular spinlocks which in + * turn allows to invoke memory allocations from the lock held section. + * + * On a non RT kernel this distinction is neither possible nor required. + * spinlock maps to raw_spinlock and the extra code is optimized out by the + * compiler. 
*/ struct bucket { struct hlist_nulls_head head; - raw_spinlock_t lock; + union { + raw_spinlock_t raw_lock; + spinlock_t lock; + }; }; struct bpf_htab { @@ -88,13 +121,26 @@ struct htab_elem { char key[0] __aligned(8); }; +static inline bool htab_is_prealloc(const struct bpf_htab *htab) +{ + return !(htab->map.map_flags & BPF_F_NO_PREALLOC); +} + +static inline bool htab_use_raw_lock(const struct bpf_htab *htab) +{ + return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab)); +} + static void htab_init_buckets(struct bpf_htab *htab) { unsigned i; for (i = 0; i < htab->n_buckets; i++) { INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - raw_spin_lock_init(&htab->buckets[i].lock); + if (htab_use_raw_lock(htab)) + raw_spin_lock_init(&htab->buckets[i].raw_lock); + else + spin_lock_init(&htab->buckets[i].lock); } } @@ -103,7 +149,10 @@ static inline unsigned long htab_lock_bu { unsigned long flags; - raw_spin_lock_irqsave(&b->lock, flags); + if (htab_use_raw_lock(htab)) + raw_spin_lock_irqsave(&b->raw_lock, flags); + else + spin_lock_irqsave(&b->lock, flags); return flags; } @@ -111,7 +160,10 @@ static inline void htab_unlock_bucket(co struct bucket *b, unsigned long flags) { - raw_spin_unlock_irqrestore(&b->lock, flags); + if (htab_use_raw_lock(htab)) + raw_spin_unlock_irqrestore(&b->raw_lock, flags); + else + spin_unlock_irqrestore(&b->lock, flags); } static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); @@ -128,11 +180,6 @@ static bool htab_is_percpu(const struct htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH; } -static bool htab_is_prealloc(const struct bpf_htab *htab) -{ - return !(htab->map.map_flags & BPF_F_NO_PREALLOC); -} - static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size, void __percpu *pptr) { From patchwork Mon Feb 24 14:01:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243200 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: incoming-bpf@patchwork.ozlabs.org Delivered-To: patchwork-incoming-bpf@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=bpf-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51j1X7jz9sRh for ; Tue, 25 Feb 2020 02:04:09 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728229AbgBXPEH (ORCPT ); Mon, 24 Feb 2020 10:04:07 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50220 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727978AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FG1-000547-JW; Mon, 24 Feb 2020 16:02:53 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id E769610409D; Mon, 24 Feb 2020 16:02:44 +0100 (CET) Message-Id: <20200224145644.602129531@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:52 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , 
Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 21/22] bpf, lpm: Make locking RT friendly References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org The LPM trie map cannot be used in contexts like perf, kprobes and tracing as this map type dynamically allocates memory. The memory allocation happens with a raw spinlock held which is a truly spinning lock on a PREEMPT RT enabled kernel which disables preemption and interrupts. As RT does not allow memory allocation from such a section for various reasons, convert the raw spinlock to a regular spinlock. On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks which provide the proper protection but keep the code preemptible. On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does not cause any functional change. Signed-off-by: Thomas Gleixner --- kernel/bpf/lpm_trie.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -34,7 +34,7 @@ struct lpm_trie { size_t n_entries; size_t max_prefixlen; size_t data_size; - raw_spinlock_t lock; + spinlock_t lock; }; /* This trie implements a longest prefix match algorithm that can be used to @@ -315,7 +315,7 @@ static int trie_update_elem(struct bpf_m if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Allocate and fill a new node */ @@ -422,7 +422,7 @@ static int trie_update_elem(struct bpf_m kfree(im_node); } - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @@ -442,7 +442,7 @@ static int trie_delete_elem(struct bpf_m if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Walk the tree looking for an exact key/length match and keeping * track of the path we traverse. 
We will need to know the node @@ -518,7 +518,7 @@ static int trie_delete_elem(struct bpf_m kfree_rcu(node, rcu); out: - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @@ -575,7 +575,7 @@ static struct bpf_map *trie_alloc(union if (ret) goto out_err; - raw_spin_lock_init(&trie->lock); + spin_lock_init(&trie->lock); return &trie->map; out_err: From patchwork Mon Feb 24 14:01:53 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 1243195 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linutronix.de Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 48R51S4qX0z9sRR for ; Tue, 25 Feb 2020 02:03:56 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728174AbgBXPDz (ORCPT ); Mon, 24 Feb 2020 10:03:55 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:50221 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727888AbgBXPDY (ORCPT ); Mon, 24 Feb 2020 10:03:24 -0500 Received: from [5.158.153.52] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1j6FFz-00054v-Ti; Mon, 24 Feb 2020 16:02:52 +0100 Received: from nanos.tec.linutronix.de (localhost [IPv6:::1]) by nanos.tec.linutronix.de (Postfix) with ESMTP id 28C3F10409E; Mon, 24 Feb 2020 16:02:45 +0100 (CET) Message-Id: <20200224145644.708960317@linutronix.de> User-Agent: quilt/0.65 Date: Mon, 24 Feb 2020 15:01:53 +0100 From: Thomas Gleixner To: LKML Cc: David Miller , bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Sebastian Sewior , Peter Zijlstra , Clark Williams , Steven Rostedt , Juri Lelli , Ingo Molnar , Mathieu Desnoyers , Vinicius Costa Gomes , Jakub Kicinski Subject: [patch V3 22/22] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled References: <20200224140131.461979697@linutronix.de> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Miller In a RT kernel down_read_trylock() cannot be used from NMI context and up_read_non_owner() is another problematic issue. So in such a configuration, simply elide the annotated stackmap and just report the raw IPs. In the longer term, it might be possible to provide a atomic friendly versions of the page cache traversal which will at least provide the info if the pages are resident and don't need to be paged in. [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work callback and add a comment ] Signed-off-by: David S. 
Miller Signed-off-by: Thomas Gleixner --- kernel/bpf/stackmap.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -40,6 +40,9 @@ static void do_up_read(struct irq_work * { struct stack_map_irq_work *work; + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT))) + return; + work = container_of(entry, struct stack_map_irq_work, irq_work); up_read_non_owner(work->sem); work->sem = NULL; @@ -288,10 +291,19 @@ static void stack_map_get_build_id_offse struct stack_map_irq_work *work = NULL; if (irqs_disabled()) { - work = this_cpu_ptr(&up_read_work); - if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) - /* cannot queue more up_read, fallback */ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { + work = this_cpu_ptr(&up_read_work); + if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) { + /* cannot queue more up_read, fallback */ + irq_work_busy = true; + } + } else { + /* + * PREEMPT_RT does not allow to trylock mmap sem in + * interrupt disabled context. Force the fallback code. + */ irq_work_busy = true; + } } /*