Message ID: 20190724165803.87470-3-brianvv@google.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: bpf: add BPF_MAP_DUMP command to dump more than one entry per call
On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> To start dumping the map from the beginning you must specify NULL as
> the prev_key.
>
> The new API returns 0 when it successfully copied all the elements
> requested or it copied less because there weren't more elements to
> retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.

I think I understand this, but perhaps it can be explained a bit more
concisely without reference to ENOENT and error masking. The function
returns the min of the number of requested elements and the number of
remaining elements in the map from the given starting point (prev_key).

> On a successful call buf and buf_len will contain correct data and in
> case prev_key was provided (not for the first walk, since prev_key is
> NULL) it will contain the last_key copied into the prev_key which will
> simplify next call.
>
> Only when it can't find a single element it will return -ENOENT meaning
> that the map has been entirely walked. When an error is return buf,
> buf_len and prev_key shouldn't be read nor used.

That's common for error handling. No need to state explicitly.

> Because maps can be called from userspace and kernel code, this function
> can have a scenario where the next_key was found but by the time we
> try to retrieve the value the element is not there, in this case the
> function continues and tries to get a new next_key value, skipping the
> deleted key. If at some point the function find itself trap in a loop,
> it will return -EINTR.

Good to point this out! I don't think that unbounded continue;
statements until an interrupt happens is sufficient. Please bound the
number of retries to a low number.

> The function will try to fit as much as possible in the buf provided and
> will return -EINVAL if buf_len is smaller than elem_size.
>
> QUEUE and STACK maps are not supported.
>
> Note that map_dump doesn't guarantee that reading the entire table is
> consistent since this function is always racing with kernel and user code
> but the same behaviour is found when the entire table is walked using
> the current interfaces: map_get_next_key + map_lookup_elem.
> It is also important to note that with a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.

Would it be informative to signal to the caller if the read was
complete and consistent (because the entire table was read while the
lock was held)?

> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
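Willem's restatement of the contract (return the min of the requested and the remaining elements, write the last copied key back for the next call, and mask "no more entries" to 0 when anything was copied) can be sketched with an in-memory stand-in. This is only a toy model of the proposed semantics, not the real syscall: mock_map_dump, its integer-keyed "map", and the index-as-prev_key convention are all invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in for the proposed BPF_MAP_DUMP semantics:
 * copy at most buf_len_elems entries starting after *prev_idx,
 * return 0 if anything was copied (the -ENOENT is masked),
 * and -ENOENT only when not a single entry could be copied.
 */
static int mock_map_dump(const int *map, int map_len, int *prev_idx,
                         int *buf, int buf_len_elems, int *out_copied)
{
    int start = (*prev_idx < 0) ? 0 : *prev_idx + 1; /* <0 plays NULL */
    int copied = 0;

    while (copied < buf_len_elems && start + copied < map_len) {
        buf[copied] = map[start + copied];
        copied++;
    }
    if (copied == 0)
        return -ENOENT;             /* map entirely walked */
    *prev_idx = start + copied - 1; /* last key copied back for next call */
    *out_copied = copied;
    return 0;                       /* a partial fill is still success */
}
```

A caller dumping a 5-element map through a 2-element buffer would then see three successful calls (2, 2 and 1 elements) followed by -ENOENT once the walk is complete.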
On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
>
> This introduces a new command to retrieve multiple number of entries
> from a bpf map, wrapping the existing bpf methods:
> map_get_next_key and map_lookup_elem
>
> [...]
> It is also important to note that with a locked map, the lock is grabbed
> for 1 entry at the time, meaning that the returned buf might or might not
> be consistent.
>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Brian Vazquez <brianvv@google.com>
> ---
>  include/uapi/linux/bpf.h |   9 +++
>  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 126 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fa1c753dcdbc7..66dab5385170d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -106,6 +106,7 @@ enum bpf_cmd {
>         BPF_TASK_FD_QUERY,
>         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>         BPF_MAP_FREEZE,
> +       BPF_MAP_DUMP,
>  };
>
>  enum bpf_map_type {
> @@ -388,6 +389,14 @@ union bpf_attr {
>                 __u64   flags;
>         };
>
> +       struct { /* struct used by BPF_MAP_DUMP command */
> +               __aligned_u64   prev_key;
> +               __aligned_u64   buf;
> +               __aligned_u64   buf_len; /* input/output: len of buf */
> +               __u64           flags;

Please add explanation of flags here. Also, we need to update the
comments of BPF_F_LOCK for BPF_MAP_DUMP.
> +               __u32           map_fd;
> +       } dump;
> +
>         struct { /* anonymous struct used by BPF_PROG_LOAD command */
>                 __u32           prog_type;      /* one of enum bpf_prog_type */
>                 __u32           insn_cnt;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 86cdc2f7bb56e..0c35505aa219f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
>         return err;
>  }
>
> +/* last field in 'union bpf_attr' used by this command */
> +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> +
> +static int map_dump(union bpf_attr *attr)
> +{
> +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> +       int ufd = attr->dump.map_fd;
> +       struct bpf_map *map;
> +       void *buf, *prev_key, *key, *value;
> +       u32 value_size, elem_size, buf_len, cp_len;
> +       struct fd f;
> +       int err;
> +       bool first_key = false;
> +
> +       if (CHECK_ATTR(BPF_MAP_DUMP))
> +               return -EINVAL;
> +
> +       if (attr->dump.flags & ~BPF_F_LOCK)
> +               return -EINVAL;
> +
> +       f = fdget(ufd);
> +       map = __bpf_map_get(f);
> +       if (IS_ERR(map))
> +               return PTR_ERR(map);
> +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> +               err = -EPERM;
> +               goto err_put;
> +       }
> +
> +       if ((attr->dump.flags & BPF_F_LOCK) &&
> +           !map_value_has_spin_lock(map)) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }

We can share these lines with map_lookup_elem(). Maybe
add another helper function?
> +
> +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> +           map->map_type == BPF_MAP_TYPE_STACK) {
> +               err = -ENOTSUPP;
> +               goto err_put;
> +       }
> +
> +       value_size = bpf_map_value_size(map);
> +
> +       err = get_user(buf_len, ubuf_len);
> +       if (err)
> +               goto err_put;
> +
> +       elem_size = map->key_size + value_size;
> +       if (buf_len < elem_size) {
> +               err = -EINVAL;
> +               goto err_put;
> +       }
> +
> +       if (ukey) {
> +               prev_key = __bpf_copy_key(ukey, map->key_size);
> +               if (IS_ERR(prev_key)) {
> +                       err = PTR_ERR(prev_key);
> +                       goto err_put;
> +               }
> +       } else {
> +               prev_key = NULL;
> +               first_key = true;
> +       }
> +
> +       err = -ENOMEM;
> +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> +       if (!buf)
> +               goto err_put;
> +
> +       key = buf;
> +       value = key + map->key_size;
> +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> +               if (signal_pending(current)) {
> +                       err = -EINTR;
> +                       break;
> +               }
> +
> +               rcu_read_lock();
> +               err = map->ops->map_get_next_key(map, prev_key, key);

If prev_key is deleted before map_get_next_key(), we get the first key
again. This is pretty weird.
> +               rcu_read_unlock();
> +
> +               if (err)
> +                       break;
> +
> +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> +
> +               if (err == -ENOENT)
> +                       continue;
> +               if (err)
> +                       goto free_buf;
> +
> +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> +                       err = -EFAULT;
> +                       goto free_buf;
> +               }
> +
> +               prev_key = key;
> +               cp_len += elem_size;
> +       }
> +
> +       if (err == -ENOENT && cp_len)
> +               err = 0;
> +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> +                    (!first_key && copy_to_user(ukey, key, map->key_size))))
> +               err = -EFAULT;
> +free_buf:
> +       kfree(buf);
> +err_put:
> +       fdput(f);
> +       return err;
> +}
> +
>  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
>
>  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
>                 err = map_lookup_and_delete_elem(&attr);
>                 break;
> +       case BPF_MAP_DUMP:
> +               err = map_dump(&attr);
> +               break;
>         default:
>                 err = -EINVAL;
>                 break;
> --
> 2.22.0.657.g960e92d24f-goog
On Wed, Jul 24, 2019 at 12:55 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > [...]
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> I think I understand this, but perhaps it can be explained a bit more
> concisely without reference to ENOENT and error masking. The function
> returns the min of the number of requested elements and the number of
> remaining elements in the map from the given starting point
> (prev_key).

That sounds better to me. Thanks!

> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
>
> That's common for error handling. No need to state explicitly.

Ack

> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
>
> Good to point this out! I don't think that unbounded continue;
> statements until an interrupt happens is sufficient. Please bound the
> number of retries to a low number.

And what would be a good number? Maybe 3 attempts? And in that case
what error should be reported?

> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
> > It is also important to note that with a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
>
> Would it be informative to signal to the caller if the read was
> complete and consistent (because the entire table was read while the
> lock was held)?

Mmm... not sure how we could signal that to the caller, and I don't
think there is a way to know it was consistent (i.e. if one element is
removed from bucket 20 while you are copying the keys in bucket 15,
there's no way to know, when you get to bucket 20, that some entries
were removed before you traversed it). The lock is held for just one
entry at a time, not the entire table. Maybe clarify that more in the
commit message?

Thanks for reviewing!
> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> >
> > Good to point this out! I don't think that unbounded continue;
> > statements until an interrupt happens is sufficient. Please bound the
> > number of retries to a low number.
>
> And what would it be a good number? Maybe 3 attempts?

3 sounds good to me.

> And in that case what error should be reported?

One that's unambiguous and somewhat intuitive for the given issue.
Perhaps EBUSY?

> > > [...]
> > >
> > > It is also important to note that with a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> >
> > Would it be informative to signal to the caller if the read was
> > complete and consistent (because the entire table was read while the
> > lock was held)?
>
> Mmm.. not sure how we could signal that to the caller. But I don't
> think there's a way to know it was consistent

Okay, that makes for a simple answer :) No need to try to add a
signal, then.
On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > [...]
> > [...]
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
>
> Please add explanation of flags here.

got it!

> Also, we need to update the
> comments of BPF_F_LOCK for BPF_MAP_DUMP.

What do you mean? I didn't get this part.
> > [...]
> >
> > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > +           !map_value_has_spin_lock(map)) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
>
> We can share these lines with map_lookup_elem(). Maybe
> add another helper function?

Which lines are you referring to? The dump.flags checks? That makes
sense: that way, when a new flag is added, you only need to modify
them in one spot.
> > [...]
> >
> > +               rcu_read_lock();
> > +               err = map->ops->map_get_next_key(map, prev_key, key);
>
> If prev_key is deleted before map_get_next_key(), we get the first key
> again. This is pretty weird.

Yes, I know. But note that this scenario happens even with the old
interface (imagine you are walking a map from userspace and, when you
call get_next_key, the prev_key has been removed: you will start again
from the beginning without noticing it).
I tried to send a patch for this in the past but I was missing some
context: before NULL was used to get the very first key, the interface
relied on a random (non-existent) key to retrieve the first key in the
map, and I was told that we still have to support that scenario.
> > [...]

Thanks for reviewing it!
On Wed, Jul 24, 2019 at 3:44 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
> >
> > On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> > >
> > > This introduces a new command to retrieve multiple number of entries
> > > from a bpf map, wrapping the existing bpf methods:
> > > map_get_next_key and map_lookup_elem
> > >
> > > [...]
> > > [...]
> > >
> > > +       struct { /* struct used by BPF_MAP_DUMP command */
> > > +               __aligned_u64   prev_key;
> > > +               __aligned_u64   buf;
> > > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > > +               __u64           flags;
> >
> > Please add explanation of flags here.
>
> got it!
>
> > Also, we need to update the
> > comments of BPF_F_LOCK for BPF_MAP_DUMP.
>
> What do you mean? I didn't get this part.

I meant that the current comment says BPF_F_LOCK is for
BPF_MAP_UPDATE_ELEM, but it is also used by BPF_MAP_LOOKUP_ELEM and
BPF_MAP_DUMP. So the current comment is not accurate either. Maybe fix
it while you are at it?
> > > [...]
> > >
> > > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > > +           !map_value_has_spin_lock(map)) {
> > > +               err = -EINVAL;
> > > +               goto err_put;
> > > +       }
> >
> > We can share these lines with map_lookup_elem(). Maybe
> > add another helper function?
>
> Which are the lines you are referring to? the dump.flags? It makes
> sense so that way when a new flag is added you only need to modify
> them in one spot.

I think I misread it. attr->dump.flags is not the same as attr->flags.
So never mind.
> > > [...]
> > >
> > > +               rcu_read_lock();
> > > +               err = map->ops->map_get_next_key(map, prev_key, key);
> >
> > If prev_key is deleted before map_get_next_key(), we get the first key
> > again. This is pretty weird.
>
> Yes, I know. But note that the current scenario happens even for the
> old interface (imagine you are walking a map from userspace and you
> tried get_next_key the prev_key was removed, you will start again from
> the beginning without noticing it).
> I tried to sent a patch in the past but I was missing some context:
> before NULL was used to get the very first_key the interface relied in
> a random (non existent) key to retrieve the first_key in the map, and
> I was told what we still have to support that scenario.

BPF_MAP_DUMP is slightly different, as you may return the first key
multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
don't have to support legacy scenarios.
Since BPF_MAP_DUMP keeps a list of elements, it is possible to try to
look up previous keys. Would something down this direction work?

Thanks,
Song
> > > If prev_key is deleted before map_get_next_key(), we get the first key > > > again. This is pretty weird. > > > > Yes, I know. But note that the current scenario happens even for the > > old interface (imagine you are walking a map from userspace and you > > tried get_next_key the prev_key was removed, you will start again from > > the beginning without noticing it). > > I tried to sent a patch in the past but I was missing some context: > > before NULL was used to get the very first_key the interface relied in > > a random (non existent) key to retrieve the first_key in the map, and > > I was told what we still have to support that scenario. > > BPF_MAP_DUMP is slightly different, as you may return the first key > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > don't have to support legacy scenarios. > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > to look up previous keys. Would something down this direction work? I've been thinking about it and I think first we need a way to detect that since key was not present we got the first_key instead: - One solution I had in mind was to explicitly asked for the first key with map_get_next_key(map, NULL, first_key) and while walking the map check that map_get_next_key(map, prev_key, key) doesn't return the same key. This could be done using memcmp. - Discussing with Stan, he mentioned that another option is to support a flag in map_get_next_key to let it know that we want an error instead of the first_key. After detecting the problem we also need to define what we want to do, here some options: a) Return the error to the caller b) Try with previous keys if any (which be limited to the keys that we have traversed so far in this dump call) c) continue with next entries in the map. 
An array is easy: just get the next valid key (starting at i+1). But an hmap might be difficult, since starting at the next bucket could potentially skip some keys that were concurrently added to the same bucket where the key used to be, and starting at the same bucket could lead us to return repeated elements. Or maybe we could support those 3 cases via flags and let the caller decide which one to use? Let me know your thoughts. Thanks, Brian
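As a concrete illustration of the memcmp-based detection idea above, here is a toy userspace sketch. No real bpf syscalls are involved: the map, the key values, and every function name below are hypothetical stand-ins, chosen only to model the behavior under discussion (a get_next_key that falls back to the first key when prev is gone).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a kernel map: int keys in a fixed array. Like the
 * real bpf hash map, get_next_key() falls back to the FIRST live key
 * when prev is no longer present -- the behavior under discussion. */
#define NKEYS 4
static const int keys[NKEYS] = { 10, 20, 30, 40 };
static bool present[NKEYS] = { true, true, true, true };

static int get_next_key(const int *prev, int *next)
{
	int i;

	if (prev) {
		for (i = 0; i < NKEYS; i++)
			if (present[i] && keys[i] == *prev)
				break;
		if (i < NKEYS) {	/* prev found: return next live key */
			for (i++; i < NKEYS; i++)
				if (present[i]) {
					*next = keys[i];
					return 0;
				}
			return -1;	/* end of map (-ENOENT in kernel) */
		}
		/* prev was deleted: fall through to "first key" behavior */
	}
	for (i = 0; i < NKEYS; i++)
		if (present[i]) {
			*next = keys[i];
			return 0;
		}
	return -1;			/* map is empty */
}

/* memcmp-based detection: remember the first key of the walk; if a
 * non-NULL prev ever yields that key again, prev must have vanished. */
static int walk_detecting_restart(int *visited, int max)
{
	int first, cur, n = 0;

	if (get_next_key(NULL, &first))
		return 0;
	cur = first;
	visited[n++] = cur;
	while (n < max && get_next_key(&cur, &cur) == 0) {
		if (memcmp(&cur, &first, sizeof(cur)) == 0)
			return -2;	/* wrapped to first key: restart */
		visited[n++] = cur;
	}
	return n;
}
```

Note the caveat from the thread still applies: this only detects a wrap to the *first* key, so a key deleted and re-added elsewhere can still cause repeats.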
On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: > > > > If prev_key is deleted before map_get_next_key(), we get the first key > > > > again. This is pretty weird. > > > > > > Yes, I know. But note that the current scenario happens even for the > > > old interface (imagine you are walking a map from userspace and you > > > tried get_next_key the prev_key was removed, you will start again from > > > the beginning without noticing it). > > > I tried to sent a patch in the past but I was missing some context: > > > before NULL was used to get the very first_key the interface relied in > > > a random (non existent) key to retrieve the first_key in the map, and > > > I was told what we still have to support that scenario. > > > > BPF_MAP_DUMP is slightly different, as you may return the first key > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > > don't have to support legacy scenarios. > > > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > > to look up previous keys. Would something down this direction work? > > I've been thinking about it and I think first we need a way to detect > that since key was not present we got the first_key instead: > > - One solution I had in mind was to explicitly asked for the first key > with map_get_next_key(map, NULL, first_key) and while walking the map > check that map_get_next_key(map, prev_key, key) doesn't return the > same key. This could be done using memcmp. > - Discussing with Stan, he mentioned that another option is to support > a flag in map_get_next_key to let it know that we want an error > instead of the first_key. > > After detecting the problem we also need to define what we want to do, > here some options: > > a) Return the error to the caller > b) Try with previous keys if any (which be limited to the keys that we > have traversed so far in this dump call) > c) continue with next entries in the map. 
array is easy just get the > next valid key (starting on i+1), but hmap might be difficult since > starting on the next bucket could potentially skip some keys that were > concurrently added to the same bucket where key used to be, and > starting on the same bucket could lead us to return repeated elements. > > Or maybe we could support those 3 cases via flags and let the caller > decide which one to use? this type of indecision is the reason why I wasn't excited about batch dumping in the first place and gave a 'soft yes' when Stan mentioned it during the lsf/mm/bpf uconf. We probably shouldn't do it. It feels like this map_dump makes the API more complex and doesn't really give much benefit to the user other than making large map dumps faster. I think we gotta solve this problem differently.
On Thu, Jul 25, 2019 at 7:54 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: > > > > > If prev_key is deleted before map_get_next_key(), we get the first key > > > > > again. This is pretty weird. > > > > > > > > Yes, I know. But note that the current scenario happens even for the > > > > old interface (imagine you are walking a map from userspace and you > > > > tried get_next_key the prev_key was removed, you will start again from > > > > the beginning without noticing it). > > > > I tried to sent a patch in the past but I was missing some context: > > > > before NULL was used to get the very first_key the interface relied in > > > > a random (non existent) key to retrieve the first_key in the map, and > > > > I was told what we still have to support that scenario. > > > > > > BPF_MAP_DUMP is slightly different, as you may return the first key > > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > > > don't have to support legacy scenarios. > > > > > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > > > to look up previous keys. Would something down this direction work? > > > > I've been thinking about it and I think first we need a way to detect > > that since key was not present we got the first_key instead: > > > > - One solution I had in mind was to explicitly asked for the first key > > with map_get_next_key(map, NULL, first_key) and while walking the map > > check that map_get_next_key(map, prev_key, key) doesn't return the > > same key. This could be done using memcmp. > > - Discussing with Stan, he mentioned that another option is to support > > a flag in map_get_next_key to let it know that we want an error > > instead of the first_key. 
> > > > After detecting the problem we also need to define what we want to do, > > here some options: > > > > a) Return the error to the caller > > b) Try with previous keys if any (which be limited to the keys that we > > have traversed so far in this dump call) > > c) continue with next entries in the map. array is easy just get the > > next valid key (starting on i+1), but hmap might be difficult since > > starting on the next bucket could potentially skip some keys that were > > concurrently added to the same bucket where key used to be, and > > starting on the same bucket could lead us to return repeated elements. > > > > Or maybe we could support those 3 cases via flags and let the caller > > decide which one to use? > > this type of indecision is the reason why I wasn't excited about > batch dumping in the first place and gave 'soft yes' when Stan > mentioned it during lsf/mm/bpf uconf. > We probably shouldn't do it. > It feels this map_dump makes api more complex and doesn't really > give much benefit to the user other than large map dump becomes faster. > I think we gotta solve this problem differently. Multiple variants with flags indeed make the API complex. I think the kernel should expose only the simplest, most obvious behavior that allows the application to recover. In this case, that sounds like option (a) and restart. In practice, the common use case is to allocate enough user memory to read an entire table in one go, in which case the entire issue is moot. The cycle savings of dump are significant for large tables. I'm not sure how we could achieve that differently, and even more simply? We originally looked at shared memory, but that is obviously much more complex.
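A sketch of what "option (a) and restart" could look like from the caller's side. BPF_MAP_DUMP does not exist in this form, so map_dump() below is a hypothetical stub that fails once with ENOENT (as if prev_key had been deleted concurrently) and then succeeds; the wrapper discards partial results and restarts from NULL, with a retry budget so a churning map cannot livelock the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stub standing in for the proposed BPF_MAP_DUMP command: fails once
 * with ENOENT when called with a non-NULL prev_key, then succeeds.
 * Entirely hypothetical; no real syscall is made. */
static int fail_once = 1;

static int map_dump(const int *prev_key, char *buf, size_t *buf_len)
{
	if (prev_key && fail_once) {
		fail_once = 0;
		errno = ENOENT;		/* prev_key no longer in the map */
		return -1;
	}
	*buf_len = 64;			/* pretend we copied 64 bytes out */
	return 0;
}

/* Option (a) from the caller's side: on ENOENT for a non-NULL
 * prev_key, throw away partial results and restart from the top. */
static int dump_with_restart(const int *prev, char *buf, size_t *len,
			     int max_restarts)
{
	while (max_restarts-- > 0) {
		if (map_dump(prev, buf, len) == 0)
			return 0;
		if (errno != ENOENT)
			return -1;
		prev = NULL;		/* restart the walk from the start */
	}
	return -1;
}
```

As noted above, when the buffer is sized for the whole table the common case is a single call with prev == NULL, and the restart path never triggers.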
On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: > > > > > If prev_key is deleted before map_get_next_key(), we get the first key > > > > > again. This is pretty weird. > > > > > > > > Yes, I know. But note that the current scenario happens even for the > > > > old interface (imagine you are walking a map from userspace and you > > > > tried get_next_key the prev_key was removed, you will start again from > > > > the beginning without noticing it). > > > > I tried to sent a patch in the past but I was missing some context: > > > > before NULL was used to get the very first_key the interface relied in > > > > a random (non existent) key to retrieve the first_key in the map, and > > > > I was told what we still have to support that scenario. > > > > > > BPF_MAP_DUMP is slightly different, as you may return the first key > > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > > > don't have to support legacy scenarios. > > > > > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > > > to look up previous keys. Would something down this direction work? > > > > I've been thinking about it and I think first we need a way to detect > > that since key was not present we got the first_key instead: > > > > - One solution I had in mind was to explicitly asked for the first key > > with map_get_next_key(map, NULL, first_key) and while walking the map > > check that map_get_next_key(map, prev_key, key) doesn't return the > > same key. This could be done using memcmp. > > - Discussing with Stan, he mentioned that another option is to support > > a flag in map_get_next_key to let it know that we want an error > > instead of the first_key. 
> > > > After detecting the problem we also need to define what we want to do, > > here some options: > > > > a) Return the error to the caller > > b) Try with previous keys if any (which be limited to the keys that we > > have traversed so far in this dump call) > > c) continue with next entries in the map. array is easy just get the > > next valid key (starting on i+1), but hmap might be difficult since > > starting on the next bucket could potentially skip some keys that were > > concurrently added to the same bucket where key used to be, and > > starting on the same bucket could lead us to return repeated elements. > > > > Or maybe we could support those 3 cases via flags and let the caller > > decide which one to use? > > this type of indecision is the reason why I wasn't excited about > batch dumping in the first place and gave 'soft yes' when Stan > mentioned it during lsf/mm/bpf uconf. > We probably shouldn't do it. > It feels this map_dump makes api more complex and doesn't really > give much benefit to the user other than large map dump becomes faster. > I think we gotta solve this problem differently. Some users are working around the dumping problems with the existing API by creating a bpf_map_get_next_key_and_delete userspace function (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/), which in my opinion is actually a good idea. The only problem with that is that calling bpf_map_get_next_key(fd, key, next_key) and then bpf_map_delete_elem(fd, key) from userspace races with kernel code and might lose some information when deleting. We could then do map_dump_and_delete using that idea, but in the kernel, where we could better handle the race condition. In that scenario, even if we retrieve the same key, it will contain different info (the delta between the old and new value). Would that work?
On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote: > > On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: > > > > > > If prev_key is deleted before map_get_next_key(), we get the first key > > > > > > again. This is pretty weird. > > > > > > > > > > Yes, I know. But note that the current scenario happens even for the > > > > > old interface (imagine you are walking a map from userspace and you > > > > > tried get_next_key the prev_key was removed, you will start again from > > > > > the beginning without noticing it). > > > > > I tried to sent a patch in the past but I was missing some context: > > > > > before NULL was used to get the very first_key the interface relied in > > > > > a random (non existent) key to retrieve the first_key in the map, and > > > > > I was told what we still have to support that scenario. > > > > > > > > BPF_MAP_DUMP is slightly different, as you may return the first key > > > > multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > > > > don't have to support legacy scenarios. > > > > > > > > Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > > > > to look up previous keys. Would something down this direction work? > > > > > > I've been thinking about it and I think first we need a way to detect > > > that since key was not present we got the first_key instead: > > > > > > - One solution I had in mind was to explicitly asked for the first key > > > with map_get_next_key(map, NULL, first_key) and while walking the map > > > check that map_get_next_key(map, prev_key, key) doesn't return the > > > same key. This could be done using memcmp. > > > - Discussing with Stan, he mentioned that another option is to support > > > a flag in map_get_next_key to let it know that we want an error > > > instead of the first_key. 
> > > > > > After detecting the problem we also need to define what we want to do, > > > here some options: > > > > > > a) Return the error to the caller > > > b) Try with previous keys if any (which be limited to the keys that we > > > have traversed so far in this dump call) > > > c) continue with next entries in the map. array is easy just get the > > > next valid key (starting on i+1), but hmap might be difficult since > > > starting on the next bucket could potentially skip some keys that were > > > concurrently added to the same bucket where key used to be, and > > > starting on the same bucket could lead us to return repeated elements. > > > > > > Or maybe we could support those 3 cases via flags and let the caller > > > decide which one to use? > > > > this type of indecision is the reason why I wasn't excited about > > batch dumping in the first place and gave 'soft yes' when Stan > > mentioned it during lsf/mm/bpf uconf. > > We probably shouldn't do it. > > It feels this map_dump makes api more complex and doesn't really > > give much benefit to the user other than large map dump becomes faster. > > I think we gotta solve this problem differently. > > Some users are working around the dumping problems with the existing > api by creating a bpf_map_get_next_key_and_delete userspace function > (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/) > which in my opinion is actually a good idea. The only problem with > that is that calling bpf_map_get_next_key(fd, key, next_key) and then > bpf_map_delete_elem(fd, key) from userspace is racing with kernel code > and it might lose some information when deleting. > We could then do map_dump_and_delete using that idea but in the kernel > where we could better handle the racing condition. In that scenario > even if we retrieve the same key it will contain different info ( the > delta between old and new value). Would that work? you mean get_next+lookup+delete at once? Sounds useful. 
Yonghong has been thinking about a batching API as well. I think if we cannot figure out how to make a batch of the two commands get_next + lookup work correctly, then we need to identify/invent one command and make batching more generic. Like make one jumbo/compound/atomic command to be get_next+lookup+delete. Define the semantics of this single compound command. And then let batching be a multiplier of such a command. In the sense that a multiplier of 1 or N should behave the same way. No extra flags to alter the batching. The high-level description of the batch would be: pls execute get_next,lookup,delete and repeat it N times. or pls execute get_next,lookup and repeat N times. where each command action is defined to be composable. Just a rough idea.
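A rough toy model of the compound-command idea above: one step performs get_next+lookup+delete (here, popping the first live element of a toy array map), and batching is literally just repeating that step N times, so a multiplier of 1 or N behaves the same way. All names and semantics are illustrative guesses, not a proposed kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy map for the sketch: parallel arrays of keys and values. */
#define NK 4
static const int mkey[NK] = { 1, 2, 3, 4 };
static long      mval[NK] = { 10, 20, 30, 40 };
static bool      mlive[NK] = { true, true, true, true };

/* One compound step: get_next + lookup + delete, done "atomically"
 * from the caller's point of view. With delete folded in, no prev_key
 * has to survive between steps, which sidesteps the deleted-prev_key
 * problem entirely. */
static int step_get_lookup_delete(int *key, long *val)
{
	for (int i = 0; i < NK; i++)
		if (mlive[i]) {
			*key = mkey[i];
			*val = mval[i];
			mlive[i] = false;	/* delete before returning */
			return 0;
		}
	return -1;				/* map is empty */
}

/* Batching is just a multiplier on the single composable command. */
static int batch_dump_and_delete(int *keys, long *vals, int n)
{
	int done = 0;

	while (done < n &&
	       step_get_lookup_delete(&keys[done], &vals[done]) == 0)
		done++;
	return done;			/* number of elements drained */
}
```

In a real kernel implementation each step would of course have to take the map/bucket locks rather than rely on array order, but the composability property is the same.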
On 7/25/19 6:47 PM, Alexei Starovoitov wrote: > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote: >> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key >>>>>>> again. This is pretty weird. >>>>>> >>>>>> Yes, I know. But note that the current scenario happens even for the >>>>>> old interface (imagine you are walking a map from userspace and you >>>>>> tried get_next_key the prev_key was removed, you will start again from >>>>>> the beginning without noticing it). >>>>>> I tried to sent a patch in the past but I was missing some context: >>>>>> before NULL was used to get the very first_key the interface relied in >>>>>> a random (non existent) key to retrieve the first_key in the map, and >>>>>> I was told what we still have to support that scenario. >>>>> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we >>>>> don't have to support legacy scenarios. >>>>> >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try >>>>> to look up previous keys. Would something down this direction work? >>>> >>>> I've been thinking about it and I think first we need a way to detect >>>> that since key was not present we got the first_key instead: >>>> >>>> - One solution I had in mind was to explicitly asked for the first key >>>> with map_get_next_key(map, NULL, first_key) and while walking the map >>>> check that map_get_next_key(map, prev_key, key) doesn't return the >>>> same key. This could be done using memcmp. >>>> - Discussing with Stan, he mentioned that another option is to support >>>> a flag in map_get_next_key to let it know that we want an error >>>> instead of the first_key. 
>>>> >>>> After detecting the problem we also need to define what we want to do, >>>> here some options: >>>> >>>> a) Return the error to the caller >>>> b) Try with previous keys if any (which be limited to the keys that we >>>> have traversed so far in this dump call) >>>> c) continue with next entries in the map. array is easy just get the >>>> next valid key (starting on i+1), but hmap might be difficult since >>>> starting on the next bucket could potentially skip some keys that were >>>> concurrently added to the same bucket where key used to be, and >>>> starting on the same bucket could lead us to return repeated elements. >>>> >>>> Or maybe we could support those 3 cases via flags and let the caller >>>> decide which one to use? >>> >>> this type of indecision is the reason why I wasn't excited about >>> batch dumping in the first place and gave 'soft yes' when Stan >>> mentioned it during lsf/mm/bpf uconf. >>> We probably shouldn't do it. >>> It feels this map_dump makes api more complex and doesn't really >>> give much benefit to the user other than large map dump becomes faster. >>> I think we gotta solve this problem differently. >> >> Some users are working around the dumping problems with the existing >> api by creating a bpf_map_get_next_key_and_delete userspace function >> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/) >> which in my opinion is actually a good idea.
In that scenario >> even if we retrieve the same key it will contain different info ( the >> delta between old and new value). Would that work? > > you mean get_next+lookup+delete at once? > Sounds useful. > Yonghong has been thinking about batching api as well. In bcc, we have many instances like this: get all (key, value) pairs, do some analysis and output, delete all keys. The implementation is typically like:

    /* to get all (key, value) pairs */
    while (bpf_get_next_key() == 0)
        bpf_map_lookup()
    /* do analysis and output */
    for (all keys)
        bpf_map_delete()

get_next+lookup+delete will definitely be useful. Batching will be even better, to save the number of syscalls. An alternative is to do batch get_next+lookup and batch delete to achieve a similar goal as the above code. There is a minor difference between this approach and the above get_next+lookup+delete. While scanning the hash map, get_next+lookup may get fewer elements than get_next+lookup+delete, as the latter may see more later-inserted hash elements after the operation starts. But both are inaccurate, so the difference is probably minor. > > I think if we cannot figure out how to make a batch of two commands > get_next + lookup to work correctly then we need to identify/invent one > command and make batching more generic. not 100% sure. It will be hard to define what "correctly" means. For a map that is not changing, looping over (get_next, lookup) and batch get_next+lookup should have the same results. For constantly changing maps, I'm not sure how to define which one is correct. If users have concerns, they may need to just pick the one which gives them more comfort. > Like make one jumbo/compound/atomic command to be get_next+lookup+delete. > Define the semantics of this single compound command. > And then let batching to be a multiplier of such command. > In a sense that multiplier 1 or N should be have the same way. > No extra flags to alter the batching.
> The high level description of the batch would be: > pls execute get_next,lookup,delete and repeat it N times. > or > pls execute get_next,lookup and repeat N times. > where each command action is defined to be composable. > > Just a rough idea. >
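To make the hazard in the bcc pattern above concrete, here is a toy model (no real bpf calls; all names are illustrative): a value updated between the scan phase and the delete-all phase is silently dropped, which is exactly the data loss the follow-up messages worry about.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the bcc pattern: scan all (key, value) pairs, do the
 * analysis, then delete all keys. An update that lands between the
 * scan and the delete is never observed. Purely illustrative. */
#define N 3
static long tval[N] = { 1, 2, 3 };
static bool tlive[N] = { true, true, true };

static long scan_then_delete_all(void)
{
	long seen = 0;
	int k;

	/* while (bpf_get_next_key() == 0) bpf_map_lookup(); */
	for (k = 0; k < N; k++)
		if (tlive[k])
			seen += tval[k];

	/* analysis/output happens here; meanwhile the datapath bumps
	 * key 1 -- this increment will never be reported anywhere */
	tval[1] += 10;

	/* for (all keys) bpf_map_delete(); */
	for (k = 0; k < N; k++)
		tlive[k] = false;

	return seen;
}
```

The window here spans the whole analysis phase, which is why the later messages look at interleaving the delete with the lookup instead.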
On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote: > > > > On 7/25/19 6:47 PM, Alexei Starovoitov wrote: > > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote: > >> > >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov > >> <alexei.starovoitov@gmail.com> wrote: > >>> > >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: > >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key > >>>>>>> again. This is pretty weird. > >>>>>> > >>>>>> Yes, I know. But note that the current scenario happens even for the > >>>>>> old interface (imagine you are walking a map from userspace and you > >>>>>> tried get_next_key the prev_key was removed, you will start again from > >>>>>> the beginning without noticing it). > >>>>>> I tried to sent a patch in the past but I was missing some context: > >>>>>> before NULL was used to get the very first_key the interface relied in > >>>>>> a random (non existent) key to retrieve the first_key in the map, and > >>>>>> I was told what we still have to support that scenario. > >>>>> > >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key > >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we > >>>>> don't have to support legacy scenarios. > >>>>> > >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try > >>>>> to look up previous keys. Would something down this direction work? > >>>> > >>>> I've been thinking about it and I think first we need a way to detect > >>>> that since key was not present we got the first_key instead: > >>>> > >>>> - One solution I had in mind was to explicitly asked for the first key > >>>> with map_get_next_key(map, NULL, first_key) and while walking the map > >>>> check that map_get_next_key(map, prev_key, key) doesn't return the > >>>> same key. This could be done using memcmp. 
> >>>> - Discussing with Stan, he mentioned that another option is to support > >>>> a flag in map_get_next_key to let it know that we want an error > >>>> instead of the first_key. > >>>> > >>>> After detecting the problem we also need to define what we want to do, > >>>> here some options: > >>>> > >>>> a) Return the error to the caller > >>>> b) Try with previous keys if any (which be limited to the keys that we > >>>> have traversed so far in this dump call) > >>>> c) continue with next entries in the map. array is easy just get the > >>>> next valid key (starting on i+1), but hmap might be difficult since > >>>> starting on the next bucket could potentially skip some keys that were > >>>> concurrently added to the same bucket where key used to be, and > >>>> starting on the same bucket could lead us to return repeated elements. > >>>> > >>>> Or maybe we could support those 3 cases via flags and let the caller > >>>> decide which one to use? > >>> > >>> this type of indecision is the reason why I wasn't excited about > >>> batch dumping in the first place and gave 'soft yes' when Stan > >>> mentioned it during lsf/mm/bpf uconf. > >>> We probably shouldn't do it. > >>> It feels this map_dump makes api more complex and doesn't really > >>> give much benefit to the user other than large map dump becomes faster. > >>> I think we gotta solve this problem differently. > >> > >> Some users are working around the dumping problems with the existing > >> api by creating a bpf_map_get_next_key_and_delete userspace function > >> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/) > >> which in my opinion is actually a good idea.
The only problem with > >> that is that calling bpf_map_get_next_key(fd, key, next_key) and then > >> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code > >> and it might lose some information when deleting. > >> We could then do map_dump_and_delete using that idea but in the kernel > >> where we could better handle the racing condition. In that scenario > >> even if we retrieve the same key it will contain different info ( the > >> delta between old and new value). Would that work? > > > > you mean get_next+lookup+delete at once? > > Sounds useful. > > Yonghong has been thinking about batching api as well. > > In bcc, we have many instances like this: > getting all (key value) pairs, do some analysis and output, > delete all keys > > The implementation typically like > /* to get all (key, value) pairs */ > while(bpf_get_next_key() == 0) > bpf_map_lookup() > /* do analysis and output */ > for (all keys) > bpf_map_delete() If you do that in a map that is being modified while you are doing the analysis and output, you will lose some new data by deleting the keys, right? > get_next+lookup+delete will be definitely useful. > batching will be even better to save the number of syscalls. > > An alternative is to do batch get_next+lookup and batch delete > to achieve similar goal as the above code. What I mentioned above makes me think that, with deletion involved, it'd be better to perform these 3 operations at once: get_next+lookup+delete in a jumbo/atomic command, and batch them later? > > There is a minor difference between this approach > and the above get_next+lookup+delete. > During scanning the hash map, get_next+lookup may get less number > of elements compared to get_next+lookup+delete as the latter > may have more later-inserted hash elements after the operation > start. But both are inaccurate, so probably the difference > is minor.
> > > > > I think if we cannot figure out how to make a batch of two commands > > get_next + lookup to work correctly then we need to identify/invent one > > command and make batching more generic. > > not 100% sure. It will be hard to define what is "correctly". I agree, it'll be hard to define what the right behavior is. > For not changing map, looping of (get_next, lookup) and batch > get_next+lookup should have the same results. This is true for the API I'm presenting; the only thing I was missing was what to do for changing maps to avoid the weird scenario (getting the first key due to a concurrent deletion). And, in my opinion, the way to go should be what Willem also supported: return the error to the caller and restart the dump. I could do this with the existing code just by detecting that we did provide a prev_key but got the first_key instead of the next_key, or even implement a new function if you want. > For constant changing loops, not sure how to define which one > is correct. If users have concerns, they may need to just pick one > which gives them more comfort. > > > Like make one jumbo/compound/atomic command to be get_next+lookup+delete. > > Define the semantics of this single compound command. > > And then let batching to be a multiplier of such command. > > In a sense that multiplier 1 or N should be have the same way. > > No extra flags to alter the batching. > > The high level description of the batch would be: > > pls execute get_next,lookup,delete and repeat it N times. > > or > > pls execute get_next,lookup and repeat N times. But any attempt to do get_next+lookup will have the same problem with deletions, right? I don't see how we could make it more consistent than what I'm proposing. Let's just support one case: report an error if the prev_key was not found, instead of retrieving the first_key. Would that work? > > where each command action is defined to be composable. > > > > Just a rough idea. > >
On Fri, 26 Jul 2019 16:36:19 -0700, Brian Vazquez wrote: > > In bcc, we have many instances like this: > > getting all (key value) pairs, do some analysis and output, > > delete all keys > > > > The implementation typically like > > /* to get all (key, value) pairs */ > > while(bpf_get_next_key() == 0) > > bpf_map_lookup() > > /* do analysis and output */ > > for (all keys) > > bpf_map_delete() > > If you do that in a map that is being modified while you are doing the > analysis and output, you will lose some new data by deleting the keys, > right? > > > get_next+lookup+delete will be definitely useful. > > batching will be even better to save the number of syscalls. > > > > An alternative is to do batch get_next+lookup and batch delete > > to achieve similar goal as the above code. > > What I mentioned above is what it makes me think that with the > deletion it'd be better if we perform these 3 operations at once: > get_next+lookup+delete in a jumbo/atomic command and batch them later? Hm. The lookup+delete is only "atomic" if we insert an RCU sync in between, right? The elements' lifetime is protected by RCU.

    CPU 1                      CPU 2
    val = lookup(map, key)     val = lookup(map, key)
    delete(map, key)
    dump(val)                  val->counter++

So we'd need to walk the hash table, disconnect all lists from the buckets and splice them onto one combined list, then synchronize_rcu(), and only after that can we dump the elements and free them. > > There is a minor difference between this approach > > and the above get_next+lookup+delete. > > During scanning the hash map, get_next+lookup may get less number > > of elements compared to get_next+lookup+delete as the latter > > may have more later-inserted hash elements after the operation > > start. But both are inaccurate, so probably the difference > > is minor. > > For not changing map, looping of (get_next, lookup) and batch > > get_next+lookup should have the same results.
> > This is true for the api I'm presenting the only think that I was > missing was what to do for changing maps to avoid the weird scenario > (getting the first key due a concurrent deletion). And, in my opinion > the way to go should be what also Willem supported: return the err to > the caller and restart the dumping. I could do this with existing code > just by detecting that we do provide a prev_key and got the first_key > instead of the next_key or even implement a new function if you want > to. My knee-jerk reaction to Willem's proposal was to nit-pick that sizing the dump space to the map's max entries doesn't guarantee all entries will fit if an entry can be removed and re-added in a different place while the dumper proceeds. If we keep elements disconnected from the hash array _without_ decrementing the element count until synchronize_rcu() completes, we have that guarantee, but OTOH it may not be possible to add any new entries from the datapath, so that doesn't sound great either :/
On 7/26/19 4:36 PM, Brian Vazquez wrote: > On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote: >> >> >> >> On 7/25/19 6:47 PM, Alexei Starovoitov wrote: >>> On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote: >>>> >>>> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov >>>> <alexei.starovoitov@gmail.com> wrote: >>>>> >>>>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote: >>>>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key >>>>>>>>> again. This is pretty weird. >>>>>>>> >>>>>>>> Yes, I know. But note that the current scenario happens even for the >>>>>>>> old interface (imagine you are walking a map from userspace and you >>>>>>>> tried get_next_key the prev_key was removed, you will start again from >>>>>>>> the beginning without noticing it). >>>>>>>> I tried to sent a patch in the past but I was missing some context: >>>>>>>> before NULL was used to get the very first_key the interface relied in >>>>>>>> a random (non existent) key to retrieve the first_key in the map, and >>>>>>>> I was told what we still have to support that scenario. >>>>>>> >>>>>>> BPF_MAP_DUMP is slightly different, as you may return the first key >>>>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we >>>>>>> don't have to support legacy scenarios. >>>>>>> >>>>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try >>>>>>> to look up previous keys. Would something down this direction work? >>>>>> >>>>>> I've been thinking about it and I think first we need a way to detect >>>>>> that since key was not present we got the first_key instead: >>>>>> >>>>>> - One solution I had in mind was to explicitly asked for the first key >>>>>> with map_get_next_key(map, NULL, first_key) and while walking the map >>>>>> check that map_get_next_key(map, prev_key, key) doesn't return the >>>>>> same key. This could be done using memcmp. 
>>>>>> - Discussing with Stan, he mentioned that another option is to support
>>>>>> a flag in map_get_next_key to let it know that we want an error
>>>>>> instead of the first_key.
>>>>>>
>>>>>> After detecting the problem we also need to define what we want to do;
>>>>>> here are some options:
>>>>>>
>>>>>> a) Return the error to the caller
>>>>>> b) Try with previous keys, if any (which would be limited to the keys
>>>>>> that we have traversed so far in this dump call)
>>>>>> c) Continue with the next entries in the map. Array is easy, just get
>>>>>> the next valid key (starting at i+1), but hmap might be difficult since
>>>>>> starting at the next bucket could potentially skip some keys that were
>>>>>> concurrently added to the same bucket where the key used to be, and
>>>>>> starting at the same bucket could lead us to return repeated elements.
>>>>>>
>>>>>> Or maybe we could support those 3 cases via flags and let the caller
>>>>>> decide which one to use?
>>>>>
>>>>> this type of indecision is the reason why I wasn't excited about
>>>>> batch dumping in the first place and gave a 'soft yes' when Stan
>>>>> mentioned it during the lsf/mm/bpf uconf.
>>>>> We probably shouldn't do it.
>>>>> It feels like this map_dump makes the api more complex and doesn't
>>>>> really give much benefit to the user other than making large map
>>>>> dumps faster.
>>>>> I think we gotta solve this problem differently.
>>>>
>>>> Some users are working around the dumping problems with the existing
>>>> api by creating a bpf_map_get_next_key_and_delete userspace function
>>>> (see https://www.bouncybouncy.net/blog/bpf_map_get_next_key-pitfalls/)
>>>> which in my opinion is actually a good idea.
>>>> The only problem with
>>>> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
>>>> bpf_map_delete_elem(fd, key) from userspace races with kernel code
>>>> and it might lose some information when deleting.
>>>> We could then do map_dump_and_delete using that idea but in the
>>>> kernel, where we could better handle the race condition. In that
>>>> scenario, even if we retrieve the same key it will contain different
>>>> info (the delta between old and new value). Would that work?
>>>
>>> you mean get_next+lookup+delete at once?
>>> Sounds useful.
>>> Yonghong has been thinking about a batching api as well.
>>
>> In bcc, we have many instances like this:
>>    getting all (key, value) pairs, do some analysis and output,
>>    delete all keys
>>
>> The implementation is typically like
>>    /* to get all (key, value) pairs */
>>    while (bpf_get_next_key() == 0)
>>        bpf_map_lookup()
>>    /* do analysis and output */
>>    for (all keys)
>>        bpf_map_delete()
>
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?

Agreed. It is possible that if the same keys are reused to generate data
during the analysis and output period, we will miss them by deleting
them. From that perspective, your above approach

    while (bpf_get_next_key())
        bpf_map_delete(prev_key)
        bpf_map_lookup()
        reset prev_key

should provide a better alternative.

>
>> get_next+lookup+delete will definitely be useful.
>> batching will be even better to save the number of syscalls.
>>
>> An alternative is to do batch get_next+lookup and batch delete
>> to achieve a similar goal as the above code.
>
> What I mentioned above is what makes me think that with the
> deletion it'd be better if we perform these 3 operations at once:
> get_next+lookup+delete in a jumbo/atomic command and batch them later?

Agree. This is indeed the one most useful for the bcc use case as well.
>
>>
>> There is a minor difference between this approach
>> and the above get_next+lookup+delete.
>> During scanning of the hash map, get_next+lookup may get fewer
>> elements compared to get_next+lookup+delete, as the latter
>> may see more later-inserted hash elements after the operation
>> starts. But both are inaccurate, so probably the difference
>> is minor.
>>
>>>
>>> I think if we cannot figure out how to make a batch of two commands
>>> get_next + lookup work correctly then we need to identify/invent one
>>> command and make batching more generic.
>>
>> not 100% sure. It will be hard to define what is "correctly".
>
> I agree, it'll be hard to define what the right behavior is.
>
>> For a non-changing map, looping of (get_next, lookup) and batch
>> get_next+lookup should have the same results.
>
> This is true for the api I'm presenting. The only thing that I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due to a concurrent deletion). And, in my opinion,
> the way to go should be what Willem also supported: return the err to
> the caller and restart the dumping. I could do this with existing code
> just by detecting that we do provide a prev_key and got the first_key
> instead of the next_key, or even implement a new function if you want
> me to.

Always starting from the first key has its drawbacks, as we keep getting
the new elements if they are constantly populated. This may skew the
results for a large hash table.

Maybe we can just do lookup+delete or batch lookup+delete?
The user giving NULL means the first key to lookup/delete.
Every (batch) lookup+delete deletes one or a set of keys.
The set of keys is retrieved using the internal get_next.
The (batch) lookup+delete will return the next available key, which
the user can use for the next (batch) lookup+delete.
If the user-provided key does not match, the user can provide a flag
to go to the first key, or get an error back.
>
>> For constantly changing maps, not sure how to define which one
>> is correct. If users have concerns, they may need to just pick one
>> which gives them more comfort.
>>
>>> Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
>>> Define the semantics of this single compound command.
>>> And then let batching be a multiplier of such a command.
>>> In the sense that multiplier 1 or N should behave the same way.
>>> No extra flags to alter the batching.
>>> The high level description of the batch would be:
>>> pls execute get_next,lookup,delete and repeat it N times.
>>> or
>>> pls execute get_next,lookup and repeat N times.
>
> But any attempt to do get_next+lookup will have the same problem with
> deletions, right?
>
> I don't see how we could do it more consistently than what I'm
> proposing. Let's just support one case: report an error if the
> prev_key was not found instead of retrieving the first_key. Would that
> work?
>
>>> where each command action is defined to be composable.
>>>
>>> Just a rough idea.
>>>
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc7..66dab5385170d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 86cdc2f7bb56e..0c35505aa219f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
 	return err;
 }
 
+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
+
+static int map_dump(union bpf_attr *attr)
+{
+	void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+	void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+	u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+	int ufd = attr->dump.map_fd;
+	struct bpf_map *map;
+	void *buf, *prev_key, *key, *value;
+	u32 value_size, elem_size, buf_len, cp_len;
+	struct fd f;
+	int err;
+	bool first_key = false;
+
+	if (CHECK_ATTR(BPF_MAP_DUMP))
+		return -EINVAL;
+
+	if (attr->dump.flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+		err = -EPERM;
+		goto err_put;
+	}
+
+	if ((attr->dump.flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+	    map->map_type == BPF_MAP_TYPE_STACK) {
+		err = -ENOTSUPP;
+		goto err_put;
+	}
+
+	value_size = bpf_map_value_size(map);
+
+	err = get_user(buf_len, ubuf_len);
+	if (err)
+		goto err_put;
+
+	elem_size = map->key_size + value_size;
+	if (buf_len < elem_size) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (ukey) {
+		prev_key = __bpf_copy_key(ukey, map->key_size);
+		if (IS_ERR(prev_key)) {
+			err = PTR_ERR(prev_key);
+			goto err_put;
+		}
+	} else {
+		prev_key = NULL;
+		first_key = true;
+	}
+
+	err = -ENOMEM;
+	buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+	if (!buf)
+		goto err_put;
+
+	key = buf;
+	value = key + map->key_size;
+	for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		rcu_read_lock();
+		err = map->ops->map_get_next_key(map, prev_key, key);
+		rcu_read_unlock();
+
+		if (err)
+			break;
+
+		err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+		if (err == -ENOENT)
+			continue;
+		if (err)
+			goto free_buf;
+
+		if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+			err = -EFAULT;
+			goto free_buf;
+		}
+
+		prev_key = key;
+		cp_len += elem_size;
+	}
+
+	if (err == -ENOENT && cp_len)
+		err = 0;
+	if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+		     (!first_key && copy_to_user(ukey, key, map->key_size))))
+		err = -EFAULT;
+free_buf:
+	kfree(buf);
+err_put:
+	fdput(f);
+	return err;
+}
+
 #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
 
 static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
 		err = map_lookup_and_delete_elem(&attr);
 		break;
+	case BPF_MAP_DUMP:
+		err = map_dump(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
This introduces a new command to retrieve multiple entries from a bpf
map, wrapping the existing bpf methods map_get_next_key and
map_lookup_elem.

To start dumping the map from the beginning you must specify NULL as
the prev_key.

The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (i.e. err == -ENOENT); in the latter scenario err is masked
to 0.

On a successful call, buf and buf_len contain correct data, and if
prev_key was provided (i.e. not on the first walk, where prev_key is
NULL) the last key copied is written back into prev_key, which
simplifies the next call.

Only when it can't find a single element does it return -ENOENT,
meaning that the map has been entirely walked. When an error is
returned, buf, buf_len and prev_key should not be read or used.

Because maps can be accessed from both userspace and kernel code, this
function can hit a scenario where the next_key was found but by the
time we try to retrieve the value the element is no longer there; in
this case the function continues and tries to get a new next_key,
skipping the deleted key. If at some point the function finds itself
trapped in a loop, it returns -EINTR.

The function tries to fit as much as possible in the provided buf and
returns -EINVAL if buf_len is smaller than elem_size.

QUEUE and STACK maps are not supported.

Note that map_dump doesn't guarantee that reading the entire table is
consistent, since this function is always racing with kernel and user
code, but the same behaviour is found when the entire table is walked
using the current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map, the lock is
grabbed for one entry at a time, meaning that the returned buf might or
might not be consistent.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 include/uapi/linux/bpf.h |   9 +++
 kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)