Message ID: 20200529043839.15824-3-alexei.starovoitov@gmail.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: bpf: Introduce minimal support for sleepable progs
On Thu, May 28, 2020 at 9:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce sleepable BPF programs that can request such property for themselves
> via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
> to use helpers like bpf_copy_from_user() that might sleep. At present only
> fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
> when they are attached to kernel functions that are known to allow sleeping.
>
> The non-sleepable programs are relying on implicit rcu_read_lock() and
> migrate_disable() to protect life time of programs, maps that they use and
> per-cpu kernel structures used to pass info between bpf programs and the
> kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
> migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
> should not be enclosed in migrate_disable() as well. Therefore bpf_srcu is used
> to protect the life time of sleepable progs.
>
> There are many networking and tracing program types. In many cases the
> 'struct bpf_prog *' pointer itself is rcu protected within some other kernel
> data structure and the kernel code is using rcu_dereference() to load that
> program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
> Instead sleepable bpf programs are allowed with bpf trampoline only. The
> program pointers are hard-coded into generated assembly of bpf trampoline and
> synchronize_srcu(&bpf_srcu) is used to protect the life time of the program.
> The same trampoline can hold both sleepable and non-sleepable progs.
>
> When bpf_srcu lock is held it means that some sleepable bpf program is running
> from bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru
> maps.
> These map types are waiting on programs to complete via
> synchronize_srcu(&bpf_srcu);
>
> Updates to trampoline now has to do synchronize_srcu + synchronize_rcu_tasks
> to wait for sleepable progs to finish and for trampoline assembly to finish.
>
> In the future srcu will be replaced with upcoming rcu_trace.
> That will complete the first step of introducing sleepable progs.
>
> After that dynamically allocated hash maps can be allowed. All map elements
> would have to be srcu protected instead of normal rcu.
> per-cpu maps will be allowed. Either via the following pattern:
>   void *elem = bpf_map_lookup_elem(map, key);
>   if (elem) {
>      // access elem
>      bpf_map_release_elem(map, elem);
>   }
> where modified lookup() helper will do migrate_disable() and
> new bpf_map_release_elem() will do corresponding migrate_enable().
> Or explicit bpf_migrate_disable/enable() helpers will be introduced.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: KP Singh <kpsingh@google.com>
> ---
>  arch/x86/net/bpf_jit_comp.c    | 36 ++++++++++++++------
>  include/linux/bpf.h            |  4 +++
>  include/uapi/linux/bpf.h       |  8 +++++
>  kernel/bpf/arraymap.c          |  5 +++
>  kernel/bpf/hashtab.c           | 19 +++++++----
>  kernel/bpf/syscall.c           | 12 +++++--
>  kernel/bpf/trampoline.c        | 33 +++++++++++++++++-
>  kernel/bpf/verifier.c          | 62 +++++++++++++++++++++++++++++++++-
>  tools/include/uapi/linux/bpf.h |  8 +++++
>  9 files changed, 165 insertions(+), 22 deletions(-)

[...]

> +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
> + * restrict map and helper usage for such programs. Sleepable BPF programs can
> + * only be attached to hooks where kernel execution context allows sleeping.
> + * Such programs are allowed to use helpers that may sleep like
> + * bpf_copy_from_user().
> + */
> +#define BPF_F_SLEEPABLE		(1U << 4)
> +
>  /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
>   * two extensions:
>   *
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 11584618e861..26b18b6a3dbc 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -393,6 +393,11 @@ static void array_map_free(struct bpf_map *map)
>  	 */
>  	synchronize_rcu();
>
> +	/* arrays could have been used by both sleepable and non-sleepable bpf
> +	 * progs. Make sure to wait for both prog types to finish executing.
> +	 */
> +	synchronize_srcu(&bpf_srcu);
> +

to minimize churn later on when you switch to rcu_trace, maybe extract
synchronize_rcu() + synchronize_srcu(&bpf_srcu) into a function (e.g.,
something like synchronize_sleepable_bpf?), exposed as an internal
API? That way you also wouldn't need to add bpf_srcu to linux/bpf.h?

>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
>  		bpf_array_free_percpu(array);
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4b288a3c3c9..b001957fdcbf 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -577,8 +577,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>  	struct htab_elem *l;
>  	u32 hash, key_size;
>
> -	/* Must be called with rcu_read_lock. */
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	/* Must be called with s?rcu_read_lock. */
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu));
>

Similar to above, might be worthwhile extracting into a function?

>  	key_size = map->key_size;
>
> @@ -935,7 +935,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu));
>
>  	key_size = map->key_size;

[...]
>  static int check_attach_btf_id(struct bpf_verifier_env *env)
>  {
>  	struct bpf_prog *prog = env->prog;
> @@ -10549,6 +10582,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  	long addr;
>  	u64 key;
>
> +	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> +	    prog->type != BPF_PROG_TYPE_LSM) {
> +		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> +		return -EINVAL;
> +	}

BPF_PROG_TYPE_TRACING also includes iterator and raw tracepoint
programs. You mention only fentry/fexit/fmod_ret are allowed. What
about those two? I don't see any explicit checks for iterator and
raw_tracepoint attach types in a switch below, so just checking if
they should be allowed to be sleepable?

Also seems like freplace ones are also sleepable, if they replace
sleepable programs, right?

> +
>  	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
>  		return check_struct_ops_btf_id(env);
>
> @@ -10762,8 +10801,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>  		if (ret)
>  			verbose(env, "%s() is not modifiable\n",
>  				prog->aux->attach_func_name);
> +	} else if (prog->aux->sleepable) {
> +		switch (prog->type) {
> +		case BPF_PROG_TYPE_TRACING:
> +			/* fentry/fexit progs can be sleepable only if they are
> +			 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
> +			 */
> +			ret = check_attach_modify_return(prog, addr);

I was so confused about this piece... check_attach_modify_return()
should probably be renamed to something else, it's not for fmod_ret
only anymore.

> +			if (!ret)
> +				ret = check_sleepable_blacklist(addr);
> +			break;
> +		case BPF_PROG_TYPE_LSM:
> +			/* LSM progs check that they are attached to bpf_lsm_*() funcs
> +			 * which are sleepable too.
> +			 */
> +			ret = check_sleepable_blacklist(addr);
> +			break;
> +		default:
> +			break;
> +		}
> +		if (ret)
> +			verbose(env, "%s is not sleepable\n",
> +				prog->aux->attach_func_name);
>  	}
> -
>  	if (ret)
>  		goto out;
>  	tr->func.addr = (void *)addr;

[...]
On Fri, May 29, 2020 at 01:25:06AM -0700, Andrii Nakryiko wrote:
> > index 11584618e861..26b18b6a3dbc 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -393,6 +393,11 @@ static void array_map_free(struct bpf_map *map)
> >  	 */
> >  	synchronize_rcu();
> >
> > +	/* arrays could have been used by both sleepable and non-sleepable bpf
> > +	 * progs. Make sure to wait for both prog types to finish executing.
> > +	 */
> > +	synchronize_srcu(&bpf_srcu);
> > +
>
> to minimize churn later on when you switch to rcu_trace, maybe extract
> synchronize_rcu() + synchronize_srcu(&bpf_srcu) into a function (e.g.,
> something like synchronize_sleepable_bpf?), exposed as an internal
> API? That way you also wouldn't need to add bpf_srcu to linux/bpf.h?

I think the opposite is must have actually. I think rcu operations should never
be hidden in helpers. All rcu/srcu/rcu_trace ops should always be open coded.

> > @@ -577,8 +577,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
> >  	struct htab_elem *l;
> >  	u32 hash, key_size;
> >
> > -	/* Must be called with rcu_read_lock. */
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > +	/* Must be called with s?rcu_read_lock. */
> > +	WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu));
> >
>
> Similar to above, might be worthwhile extracting into a function?

This one I'm 50/50, since this pattern will be in many places.
But what kind of helper that would be?
Clear name is very hard.
WARN_ON_ONCE(!bpf_specific_rcu_lock_held()) ?
Moving WARN into the helper would be even worse.

When rcu_trace is available the churn of patches to convert srcu to rcu_trace
will be a good thing. The patches will convey the difference.
Like bpf_srcu will disappear. They will give a way to do benchmarking before/after
and will help to go back to srcu in unlikely case there is some obscure bug
in rcu_trace. Hiding srcu vs rcu_trace details behind helpers is not how
the code should read.
The trade off with one and another will be different case by case.
Like synchronize_srcu() is ok, but synchronize_rcu_trace() may be too heavy
in the trampoline update code and extra counter would be needed.
Also there will be synchronize_multi() that I plan to use as well.

> >
> > +	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> > +	    prog->type != BPF_PROG_TYPE_LSM) {
> > +		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> > +		return -EINVAL;
> > +	}
>
> BPF_PROG_TYPE_TRACING also includes iterator and raw tracepoint
> programs. You mention only fentry/fexit/fmod_ret are allowed. What
> about those two? I don't see any explicit checks for iterator and
> raw_tracepoint attach types in a switch below, so just checking if
> they should be allowed to be sleepable?

good point. tp_btf and iter don't use trampoline, so sleepable flag
is ignored. which is wrong. I'll add a check to get the prog rejected.

> Also seems like freplace ones are also sleepable, if they replace
> sleepable programs, right?

freplace is a different program type. So it's rejected by this code already.
Eventually I'll add support to allow sleepable freplace prog that extend
sleepable target. But that's future.

> > +
> >  	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> >  		return check_struct_ops_btf_id(env);
> >
> > @@ -10762,8 +10801,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> >  		if (ret)
> >  			verbose(env, "%s() is not modifiable\n",
> >  				prog->aux->attach_func_name);
> > +	} else if (prog->aux->sleepable) {
> > +		switch (prog->type) {
> > +		case BPF_PROG_TYPE_TRACING:
> > +			/* fentry/fexit progs can be sleepable only if they are
> > +			 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
> > +			 */
> > +			ret = check_attach_modify_return(prog, addr);
>
> I was so confused about this piece... check_attach_modify_return()
> should probably be renamed to something else, it's not for fmod_ret
> only anymore.

why?
I think the name is correct. The helper checks whether target allows
modifying its return value. It's a first white list.
When that passes the black list applies via check_sleepable_blacklist() function.

I was considering using whitelist for sleepable as well, but that's overkill.
Too much overlap with mod_ret.
Imo check whitelist + check blacklist for white list exceptions is clean enough.

> > +			if (!ret)
> > +				ret = check_sleepable_blacklist(addr);
> > +			break;
> > +		case BPF_PROG_TYPE_LSM:
> > +			/* LSM progs check that they are attached to bpf_lsm_*() funcs
> > +			 * which are sleepable too.
> > +			 */
> > +			ret = check_sleepable_blacklist(addr);
> > +			break;
On Fri, May 29, 2020 at 1:12 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, May 29, 2020 at 01:25:06AM -0700, Andrii Nakryiko wrote:
> > > index 11584618e861..26b18b6a3dbc 100644
> > > --- a/kernel/bpf/arraymap.c
> > > +++ b/kernel/bpf/arraymap.c
> > > @@ -393,6 +393,11 @@ static void array_map_free(struct bpf_map *map)
> > >  	 */
> > >  	synchronize_rcu();
> > >
> > > +	/* arrays could have been used by both sleepable and non-sleepable bpf
> > > +	 * progs. Make sure to wait for both prog types to finish executing.
> > > +	 */
> > > +	synchronize_srcu(&bpf_srcu);
> > > +
> >
> > to minimize churn later on when you switch to rcu_trace, maybe extract
> > synchronize_rcu() + synchronize_srcu(&bpf_srcu) into a function (e.g.,
> > something like synchronize_sleepable_bpf?), exposed as an internal
> > API? That way you also wouldn't need to add bpf_srcu to linux/bpf.h?
>
> I think the opposite is must have actually. I think rcu operations should never
> be hidden in helpers. All rcu/srcu/rcu_trace ops should always be open coded.

Ok, that's fair.

> > > @@ -577,8 +577,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
> > >  	struct htab_elem *l;
> > >  	u32 hash, key_size;
> > >
> > > -	/* Must be called with rcu_read_lock. */
> > > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > > +	/* Must be called with s?rcu_read_lock. */
> > > +	WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu));
> > >
> >
> > Similar to above, might be worthwhile extracting into a function?
>
> This one I'm 50/50, since this pattern will be in many places.
> But what kind of helper that would be?
> Clear name is very hard.
> WARN_ON_ONCE(!bpf_specific_rcu_lock_held()) ?
> Moving WARN into the helper would be even worse.

yeah, naming is hard, it's fine to leave as is, I think

> When rcu_trace is available the churn of patches to convert srcu to rcu_trace
> will be a good thing. The patches will convey the difference.
> Like bpf_srcu will disappear. They will give a way to do benchmarking before/after
> and will help to go back to srcu in unlikely case there is some obscure bug
> in rcu_trace. Hiding srcu vs rcu_trace details behind helpers is not how
> the code should read. The trade off with one and another will be different
> case by case. Like synchronize_srcu() is ok, but synchronize_rcu_trace()
> may be too heavy in the trampoline update code and extra counter would be needed.
> Also there will be synchronize_multi() that I plan to use as well.

yeah, makes sense

> >
> > > +	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> > > +	    prog->type != BPF_PROG_TYPE_LSM) {
> > > +		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> > > +		return -EINVAL;
> > > +	}
> >
> > BPF_PROG_TYPE_TRACING also includes iterator and raw tracepoint
> > programs. You mention only fentry/fexit/fmod_ret are allowed. What
> > about those two? I don't see any explicit checks for iterator and
> > raw_tracepoint attach types in a switch below, so just checking if
> > they should be allowed to be sleepable?
>
> good point. tp_btf and iter don't use trampoline, so sleepable flag
> is ignored. which is wrong. I'll add a check to get the prog rejected.
>
> > Also seems like freplace ones are also sleepable, if they replace
> > sleepable programs, right?
>
> freplace is a different program type. So it's rejected by this code already.
> Eventually I'll add support to allow sleepable freplace prog that extend
> sleepable target. But that's future.

Yeah, I know they are rejected (because they are EXT, not LSM/TRACING).
But they do use trampoline and they run in the same context as replaced
programs, so they are effectively same type as replaced programs, which
is why I asked. And yes, it's ok to do it in the future, was mostly
curious whether freplace have anything specific precluding them to be
sleepable.
> > > +
> > >  	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> > >  		return check_struct_ops_btf_id(env);
> > >
> > > @@ -10762,8 +10801,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> > >  		if (ret)
> > >  			verbose(env, "%s() is not modifiable\n",
> > >  				prog->aux->attach_func_name);
> > > +	} else if (prog->aux->sleepable) {
> > > +		switch (prog->type) {
> > > +		case BPF_PROG_TYPE_TRACING:
> > > +			/* fentry/fexit progs can be sleepable only if they are
> > > +			 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
> > > +			 */
> > > +			ret = check_attach_modify_return(prog, addr);
> >
> > I was so confused about this piece... check_attach_modify_return()
> > should probably be renamed to something else, it's not for fmod_ret
> > only anymore.
>
> why? I think the name is correct. The helper checks whether target
> allows modifying its return value. It's a first white list.

check_attach_modify_return() name implies to me that it's strictly for
fmod_ret-specific attachment checks, that's all. It's minor, if you
feel like name is appropriate I'm fine with it.

> When that passes the black list applies via check_sleepable_blacklist() function.
>
> I was considering using whitelist for sleepable as well, but that's overkill.
> Too much overlap with mod_ret.
> Imo check whitelist + check blacklist for white list exceptions is clean enough.

I agree about whitelist+blacklist, my only point was that
check_attach_modify_return() is not communicating that it's a
whitelist. check_sleepable_blacklist() is clear as day,
check_sleepable_whitelist() would be as clear, even if internally it
(for now) just calls into check_attach_modify_return(). Eventually it
might be evolved beyond what's in check_attach_modify_return(). Not a
big deal and can be changed later, if necessary.
> >
> > > +			if (!ret)
> > > +				ret = check_sleepable_blacklist(addr);
> > > +			break;
> > > +		case BPF_PROG_TYPE_LSM:
> > > +			/* LSM progs check that they are attached to bpf_lsm_*() funcs
> > > +			 * which are sleepable too.
> > > +			 */
> > > +			ret = check_sleepable_blacklist(addr);
> > > +			break;
On Fri, May 29, 2020 at 01:38:40PM -0700, Andrii Nakryiko wrote:
> > > >  	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> > > >  		return check_struct_ops_btf_id(env);
> > > >
> > > > @@ -10762,8 +10801,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> > > >  		if (ret)
> > > >  			verbose(env, "%s() is not modifiable\n",
> > > >  				prog->aux->attach_func_name);
> > > > +	} else if (prog->aux->sleepable) {
> > > > +		switch (prog->type) {
> > > > +		case BPF_PROG_TYPE_TRACING:
> > > > +			/* fentry/fexit progs can be sleepable only if they are
> > > > +			 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
> > > > +			 */
> > > > +			ret = check_attach_modify_return(prog, addr);
> > >
> > > I was so confused about this piece... check_attach_modify_return()
> > > should probably be renamed to something else, it's not for fmod_ret
> > > only anymore.
> >
> > why? I think the name is correct. The helper checks whether target
> > allows modifying its return value. It's a first white list.
>
> check_attach_modify_return() name implies to me that it's strictly for
> fmod_ret-specific attachment checks, that's all. It's minor, if you
> feel like name is appropriate I'm fine with it.

ahh. i see the confusion. I've read check_attach_modify_return as whether
target kernel function allows tweaking its return value. whereas it sounds
that you've read it as a check whether target func is ok for modify_return
bpf program type.

> > When that passes the black list applies via check_sleepable_blacklist() function.
> >
> > I was considering using whitelist for sleepable as well, but that's overkill.
> > Too much overlap with mod_ret.
> > Imo check whitelist + check blacklist for white list exceptions is clean enough.
>
> I agree about whitelist+blacklist, my only point was that
> check_attach_modify_return() is not communicating that it's a
> whitelist.
> check_sleepable_blacklist() is clear as day,
> check_sleepable_whitelist() would be as clear, even if internally it
> (for now) just calls into check_attach_modify_return(). Eventually it
> might be evolved beyond what's in check_attach_modify_return(). Not a
> big deal and can be changed later, if necessary.

got it. I will wrap it into another helper.
Hi Alexei,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v5.7-rc7 next-20200529]
[cannot apply to bpf-next/master bpf/master net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alexei-Starovoitov/bpf-Introduce-minimal-support-for-sleepable-progs/20200531-055259
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git bc183dec08f9cb177cf5206a010b7a9e7b22e567
config: microblaze-randconfig-c003-20200531 (attached as .config)
compiler: microblaze-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

   microblaze-linux-ld: kernel/bpf/syscall.o: in function `__bpf_prog_put_noref':
   kernel/bpf/syscall.c:1729: undefined reference to `bpf_srcu'
   microblaze-linux-ld: kernel/bpf/hashtab.o: in function `lock_is_held':
>> include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
   microblaze-linux-ld: kernel/bpf/hashtab.o: in function `htab_map_free':
   kernel/bpf/hashtab.c:1305: undefined reference to `bpf_srcu'
   microblaze-linux-ld: kernel/bpf/hashtab.o: in function `lock_is_held':
>> include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
>> microblaze-linux-ld: include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
   microblaze-linux-ld: kernel/bpf/hashtab.o:include/linux/lockdep.h:406: more undefined references to `bpf_srcu' follow

vim +406 include/linux/lockdep.h

f8319483f57f1c Peter Zijlstra 2016-11-30  403
08f36ff642342f Matthew Wilcox 2018-01-17  404  static inline int lock_is_held(const struct lockdep_map *lock)
f8319483f57f1c Peter Zijlstra 2016-11-30  405  {
f8319483f57f1c Peter Zijlstra 2016-11-30 @406
  	return lock_is_held_type(lock, -1);
f8319483f57f1c Peter Zijlstra 2016-11-30  407  }
f607c668577481 Peter Zijlstra 2009-07-20  408

:::::: The code at line 406 was first introduced by commit
:::::: f8319483f57f1ca22370f4150bb990aca7728a67 locking/lockdep: Provide a type check for lock_is_held

:::::: TO: Peter Zijlstra <peterz@infradead.org>
:::::: CC: Dave Chinner <david@fromorbit.com>

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Alexei,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v5.7-rc7 next-20200529]
[cannot apply to bpf-next/master bpf/master net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alexei-Starovoitov/bpf-Introduce-minimal-support-for-sleepable-progs/20200531-055259
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git bc183dec08f9cb177cf5206a010b7a9e7b22e567
config: um-randconfig-r004-20200529 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=um

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

   /usr/bin/ld: kernel/bpf/syscall.o: in function `__bpf_prog_put_noref':
   kernel/bpf/syscall.c:1729: undefined reference to `bpf_srcu'
   /usr/bin/ld: kernel/bpf/hashtab.o: in function `htab_map_free':
   kernel/bpf/hashtab.c:1305: undefined reference to `bpf_srcu'
   /usr/bin/ld: kernel/bpf/hashtab.o: in function `lock_is_held':
   include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
>> /usr/bin/ld: include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
>> /usr/bin/ld: include/linux/lockdep.h:406: undefined reference to `bpf_srcu'
   /usr/bin/ld: kernel/bpf/hashtab.o:include/linux/lockdep.h:406: more undefined references to `bpf_srcu' follow
   collect2: error: ld returned 1 exit status

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
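[Editor's aside: both robot reports are consistent with one root cause: `bpf_srcu` is declared extern in include/linux/bpf.h but defined in kernel/bpf/trampoline.c, and these randconfigs appear to build the users (syscall.c, hashtab.c) without that translation unit. The shell sketch below reproduces the failure mode in miniature; the file and symbol names are illustrative only.]

```shell
# A symbol declared extern in a shared header (like bpf_srcu in
# include/linux/bpf.h) while the file that defines it (trampoline.c in
# the patch) is left out of the link.
cat > decl.h <<'EOF'
extern int bpf_srcu_demo;   /* declaration only */
EOF
cat > user.c <<'EOF'
#include "decl.h"
int main(void) { return bpf_srcu_demo; }
EOF
cat > def.c <<'EOF'
int bpf_srcu_demo;          /* the definition the failing configs lack */
EOF

# Without the defining translation unit the link fails, as in the report:
if cc user.c -o /dev/null 2>/dev/null; then
	echo "unexpected: linked"
else
	echo "link failed: undefined reference, as reported"
fi

# Adding the definition fixes the link:
cc user.c def.c -o user_ok && echo "linked"
```

In the kernel the equivalent fix is either to build the defining file unconditionally or to guard the declaration and its users behind the same config option.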
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 42b6709e6dc7..3fdb62c89d6f 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1379,10 +1379,17 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, u8 *prog = *pprog; int cnt = 0; - if (emit_call(&prog, __bpf_prog_enter, prog)) - return -EINVAL; - /* remember prog start time returned by __bpf_prog_enter */ - emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + if (p->aux->sleepable) { + if (emit_call(&prog, __bpf_prog_enter_sleepable, prog)) + return -EINVAL; + /* remember srcu idx returned by __bpf_prog_enter_sleepable */ + emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + } else { + if (emit_call(&prog, __bpf_prog_enter, prog)) + return -EINVAL; + /* remember prog start time returned by __bpf_prog_enter */ + emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + } /* arg1: lea rdi, [rbp - stack_size] */ EMIT4(0x48, 0x8D, 0x7D, -stack_size); @@ -1402,13 +1409,20 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, if (mod_ret) emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); - /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, - (u32) (long) p); - /* arg2: mov rsi, rbx <- start time in nsec */ - emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); - if (emit_call(&prog, __bpf_prog_exit, prog)) - return -EINVAL; + if (p->aux->sleepable) { + /* arg1: mov rdi, rbx <- srcu idx */ + emit_mov_reg(&prog, true, BPF_REG_1, BPF_REG_6); + if (emit_call(&prog, __bpf_prog_exit_sleepable, prog)) + return -EINVAL; + } else { + /* arg1: mov rdi, progs[i] */ + emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, + (u32) (long) p); + /* arg2: mov rsi, rbx <- start time in nsec */ + emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); + if (emit_call(&prog, __bpf_prog_exit, prog)) + return -EINVAL; + } *pprog = prog; return 0; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index efe8836b5c48..7b8291a96838 100644 
--- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -36,6 +36,7 @@ struct seq_operations; extern struct idr btf_idr; extern spinlock_t btf_idr_lock; +extern struct srcu_struct bpf_srcu; /* map is generic key/value storage optionally accesible by eBPF programs */ struct bpf_map_ops { @@ -476,6 +477,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, /* these two functions are called from generated trampoline */ u64 notrace __bpf_prog_enter(void); void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); +u64 notrace __bpf_prog_enter_sleepable(void); +void notrace __bpf_prog_exit_sleepable(u64 idx); struct bpf_ksym { unsigned long start; @@ -668,6 +671,7 @@ struct bpf_prog_aux { bool offload_requested; bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */ bool func_proto_unreliable; + bool sleepable; enum bpf_tramp_prog_type trampoline_prog_type; struct bpf_trampoline *trampoline; struct hlist_node tramp_hlist; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 54b93f8b49b8..cc08a2064d4e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -329,6 +329,14 @@ enum bpf_link_type { /* The verifier internal test flag. Behavior is undefined */ #define BPF_F_TEST_STATE_FREQ (1U << 3) +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will + * restrict map and helper usage for such programs. Sleepable BPF programs can + * only be attached to hooks where kernel execution context allows sleeping. + * Such programs are allowed to use helpers that may sleep like + * bpf_copy_from_user(). 
+ */ +#define BPF_F_SLEEPABLE (1U << 4) + /* When BPF ldimm64's insn[0].src_reg != 0 then this can have * two extensions: * diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 11584618e861..26b18b6a3dbc 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -393,6 +393,11 @@ static void array_map_free(struct bpf_map *map) */ synchronize_rcu(); + /* arrays could have been used by both sleepable and non-sleepable bpf + * progs. Make sure to wait for both prog types to finish executing. + */ + synchronize_srcu(&bpf_srcu); + if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) bpf_array_free_percpu(array); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index b4b288a3c3c9..b001957fdcbf 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -577,8 +577,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key) struct htab_elem *l; u32 hash, key_size; - /* Must be called with rcu_read_lock. */ - WARN_ON_ONCE(!rcu_read_lock_held()); + /* Must be called with s?rcu_read_lock. 
*/ + WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu)); key_size = map->key_size; @@ -935,7 +935,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu)); key_size = map->key_size; @@ -1026,7 +1026,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu)); key_size = map->key_size; @@ -1214,7 +1214,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) u32 hash, key_size; int ret = -ENOENT; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu)); key_size = map->key_size; @@ -1246,7 +1246,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) u32 hash, key_size; int ret = -ENOENT; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !srcu_read_lock_held(&bpf_srcu)); key_size = map->key_size; @@ -1297,6 +1297,13 @@ static void htab_map_free(struct bpf_map *map) */ synchronize_rcu(); + /* preallocated hash map could have been used by both sleepable and + * non-sleepable bpf progs. Make sure to wait for both prog types + * to finish executing. + */ + if (htab_is_prealloc(htab)) + synchronize_srcu(&bpf_srcu); + /* some of free_htab_elem() callbacks for elements of this map may * not have executed. Wait for them. 
*/ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 2c969a9b90d3..44d0c3ed744a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1725,10 +1725,14 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred) btf_put(prog->aux->btf); bpf_prog_free_linfo(prog); - if (deferred) - call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); - else + if (deferred) { + if (prog->aux->sleepable) + call_srcu(&bpf_srcu, &prog->aux->rcu, __bpf_prog_put_rcu); + else + call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); + } else { __bpf_prog_put_rcu(&prog->aux->rcu); + } } static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) @@ -2093,6 +2097,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT | BPF_F_TEST_STATE_FREQ | + BPF_F_SLEEPABLE | BPF_F_TEST_RND_HI32)) return -EINVAL; @@ -2148,6 +2153,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) } prog->aux->offload_requested = !!attr->prog_ifindex; + prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE; err = security_bpf_prog_alloc(prog->aux); if (err) diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 9be85aa4ec5f..d35d26e9693d 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -205,13 +205,21 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs) flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; + /* the same trampoline can hold both sleepable and non-sleepable progs. + * synchronize_srcu() is cheap when sleepable progs are not running, so + * just call it here to make sure all sleepable progs finish executing. + * Then call synchronize_rcu_tasks() to make sure that the rest of + * generated tramopline assembly finishes too before updating + * trampoline. 
+	 */
+	synchronize_srcu(&bpf_srcu);
+
 	/* Though the second half of trampoline page is unused a task could be
 	 * preempted in the middle of the first half of trampoline and two
 	 * updates to trampoline would change the code from underneath the
 	 * preempted task. Hence wait for tasks to voluntarily schedule or go
 	 * to userspace.
 	 */
-	synchronize_rcu_tasks();
 
 	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
@@ -344,6 +352,11 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
 	if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
 		goto out;
 	bpf_image_ksym_del(&tr->ksym);
+	/* This code will be executed when all bpf progs (both sleepable and
+	 * non-sleepable) went through
+	 * bpf_prog_put()->call_s?rcu()->bpf_prog_free_deferred().
+	 * Hence no need for another synchronize_srcu(&bpf_srcu) here.
+	 */
 	/* wait for tasks to get out of trampoline before freeing it */
 	synchronize_rcu_tasks();
 	bpf_jit_free_exec(tr->image);
@@ -394,6 +407,23 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
 	rcu_read_unlock();
 }
 
+/* when bpf_srcu lock is held it means that some sleepable bpf program is
+ * running. Those programs can use bpf arrays and preallocated hash maps. These
+ * map types are waiting on programs to complete via
+ * synchronize_srcu(&bpf_srcu);
+ */
+struct srcu_struct bpf_srcu;
+
+u64 notrace __bpf_prog_enter_sleepable(void)
+{
+	return srcu_read_lock(&bpf_srcu);
+}
+
+void notrace __bpf_prog_exit_sleepable(u64 idx)
+{
+	srcu_read_unlock(&bpf_srcu, idx);
+}
+
 int __weak arch_prepare_bpf_trampoline(void *image, void *image_end,
 				       const struct btf_func_model *m, u32 flags,
@@ -409,6 +439,7 @@ static int __init init_trampolines(void)
 
 	for (i = 0; i < TRAMPOLINE_TABLE_SIZE; i++)
 		INIT_HLIST_HEAD(&trampoline_table[i]);
+	init_srcu_struct(&bpf_srcu);
 	return 0;
 }
 late_initcall(init_trampolines);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8cf8dae86f00..f4d41a9d692c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8922,6 +8922,23 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
+	if (prog->aux->sleepable)
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_HASH:
+		case BPF_MAP_TYPE_LRU_HASH:
+		case BPF_MAP_TYPE_ARRAY:
+			if (!is_preallocated_map(map)) {
+				verbose(env,
+					"Sleepable programs can only use preallocated hash maps\n");
+				return -EINVAL;
+			}
+			break;
+		default:
+			verbose(env,
+				"Sleepable programs can only use array and hash maps\n");
+			return -EINVAL;
+		}
+
 	return 0;
 }
 
@@ -10532,6 +10549,22 @@ static int check_attach_modify_return(struct bpf_prog *prog, unsigned long addr)
 	return -EINVAL;
 }
 
+/* list of non-sleepable kernel functions that are otherwise
+ * available to attach by bpf_lsm or fmod_ret progs.
+ */
+static int check_sleepable_blacklist(unsigned long addr)
+{
+#ifdef CONFIG_BPF_LSM
+	if (addr == (long)bpf_lsm_task_free)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_SECURITY
+	if (addr == (long)security_task_free)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
 static int check_attach_btf_id(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog;
@@ -10549,6 +10582,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	long addr;
 	u64 key;
 
+	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
+	    prog->type != BPF_PROG_TYPE_LSM) {
+		verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
+		return -EINVAL;
+	}
+
 	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
 		return check_struct_ops_btf_id(env);
 
@@ -10762,8 +10801,29 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 			if (ret)
 				verbose(env, "%s() is not modifiable\n",
 					prog->aux->attach_func_name);
+		} else if (prog->aux->sleepable) {
+			switch (prog->type) {
+			case BPF_PROG_TYPE_TRACING:
+				/* fentry/fexit progs can be sleepable only if they are
+				 * attached to ALLOW_ERROR_INJECTION or security_*() funcs.
+				 */
+				ret = check_attach_modify_return(prog, addr);
+				if (!ret)
+					ret = check_sleepable_blacklist(addr);
+				break;
+			case BPF_PROG_TYPE_LSM:
+				/* LSM progs check that they are attached to bpf_lsm_*() funcs
+				 * which are sleepable too.
+				 */
+				ret = check_sleepable_blacklist(addr);
+				break;
+			default:
+				break;
+			}
+			if (ret)
+				verbose(env, "%s is not sleepable\n",
+					prog->aux->attach_func_name);
 		}
-
 		if (ret)
 			goto out;
 		tr->func.addr = (void *)addr;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 54b93f8b49b8..cc08a2064d4e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -329,6 +329,14 @@ enum bpf_link_type {
 
 /* The verifier internal test flag. Behavior is undefined */
 #define BPF_F_TEST_STATE_FREQ	(1U << 3)
 
+/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
+ * restrict map and helper usage for such programs. Sleepable BPF programs can
+ * only be attached to hooks where kernel execution context allows sleeping.
+ * Such programs are allowed to use helpers that may sleep like
+ * bpf_copy_from_user().
+ */
+#define BPF_F_SLEEPABLE		(1U << 4)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * two extensions:
 *