| Message ID | 20170331044543.4075183-2-ast@fb.com |
|---|---|
| State | Accepted, archived |
| Delegated to | David Miller |
On Thu, 30 Mar 2017 21:45:38 -0700 Alexei Starovoitov <ast@fb.com> wrote:

> static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
> +{
> +	u64 time_start, time_spent = 0;
> +	u32 ret = 0, i;
> +
> +	if (!repeat)
> +		repeat = 1;
> +	time_start = ktime_get_ns();

I've found it useful to record the CPU cycles, as they are more useful
for comparing between CPUs. The nanosec time measurement varies too
much between CPUs and GHz. I do use nanosec measurements myself a lot,
but that is mostly because it is easier to relate to pps rates. For
eBPF code execution I think it is more useful to get a cycles cost
count?

I've been using the TSC [1] (rdtsc) to get the CPU cycles; I believe
get_cycles() is the more generic call, which has arch-specific
implementations (but can return 0 if there is no arch support).

The best solution would be to use the perf infrastructure and PMU
counters to get both PMU cycles and instructions, as that also tells
you about pipeline efficiency, like instructions per cycle. I only got
this partly working in [1][2].

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/time_bench.h
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench.c

> +	for (i = 0; i < repeat; i++) {
> +		ret = bpf_test_run_one(prog, ctx);
> +		if (need_resched()) {
> +			if (signal_pending(current))
> +				break;
> +			time_spent += ktime_get_ns() - time_start;
> +			cond_resched();
> +			time_start = ktime_get_ns();
> +		}
> +	}
> +	time_spent += ktime_get_ns() - time_start;
> +	do_div(time_spent, repeat);
> +	*time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
> +
> +	return ret;
> +}
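For comparison, a minimal sketch of the cycles-based variant described above, reusing the patch's bpf_test_run_one() helper. The function name bpf_test_run_cycles() is hypothetical, and get_cycles() can return 0 on architectures without support, as noted in the thread:

```c
#include <linux/timex.h>	/* get_cycles(), cycles_t */
#include <linux/math64.h>	/* div_u64() */

/* Hypothetical cycles-based variant of bpf_test_run(). A result of 0
 * means "no cycle data" on arches where get_cycles() is unsupported,
 * not a real measurement.
 */
static u64 bpf_test_run_cycles(struct bpf_prog *prog, void *ctx, u32 repeat)
{
	cycles_t start, stop;
	u32 i;

	if (!repeat)
		repeat = 1;

	start = get_cycles();
	for (i = 0; i < repeat; i++)
		bpf_test_run_one(prog, ctx);
	stop = get_cycles();

	return div_u64((u64)(stop - start), repeat);	/* avg cycles per run */
}
```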
On 4/1/17 12:14 AM, Jesper Dangaard Brouer wrote:
> On Thu, 30 Mar 2017 21:45:38 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
>> +{
>> +	u64 time_start, time_spent = 0;
>> +	u32 ret = 0, i;
>> +
>> +	if (!repeat)
>> +		repeat = 1;
>> +	time_start = ktime_get_ns();
>
> I've found it useful to record the CPU cycles, as they are more
> useful for comparing between CPUs. The nanosec time measurement varies
> too much between CPUs and GHz. I do use nanosec measurements myself a
> lot, but that is mostly because it is easier to relate to pps rates.
> For eBPF code execution I think it is more useful to get a cycles cost
> count?

for micro-benchmarking of an instruction or small primitives
like spin_lock and irq_save/restore, yes. Cycles are more interesting
to look at. Here it's the whole program, which in the case of networking
likely does at least a few map lookups.
Also, this duration field is more of a sanity test than an actual metric.

> I've been using the TSC [1] (rdtsc) to get the CPU cycles; I believe
> get_cycles() is the more generic call, which has arch-specific
> implementations (but can return 0 if there is no arch support).
>
> The best solution would be to use the perf infrastructure and PMU
> counters to get both PMU cycles and instructions, as that also tells
> you about pipeline efficiency, like instructions per cycle. I only got
> this partly working in [1][2].

to use get_cycles() or perf_event_create_kernel_counter() the current
simple loop would become a kthread pinned to a cpu and so on.
imo it's overkill.
The only reason 'duration' is being reported is as a sanity check
against user space measurements.
What this command allows to do is:
$ time ./my_bpf_benchmark
The reported time should match the kernel-reported 'duration'.
The tiny difference will come from resched. That's the sanity part.
Now we can also do
$ perf record ./my_bpf_benchmark
and get all perf goodness for free without adding any kernel code.
I want this test_run command to stay execution only. All pmu and
performance metrics should stay on the perf side.
In the case of performance optimization of bpf programs, we're trying
to improve perf by changing the way the program is written, hence
we need perf to point out which line of C code is costly.
Second is improving performance by changing the JIT, map implementations
and so on. Here we also want full perf tool power.

Unfortunately there is an issue with perf today: as soon as
my_bpf_benchmark exits, the bpf prog is unloaded and its ksym is gone, so
'perf report' cannot associate addresses back to source code.
We discussed a solution with Arnaldo, so that's orthogonal work in
progress which is needed regardless of this test_run command.

User space can also pin itself to a cpu instead of asking the kernel to
do it, and run the same program on multiple cpus in parallel, testing
interaction between concurrent map accesses and so on.
So by keeping the test_run command as an execution-only primitive we allow
user space to do all the fancy tricks and measurements.
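The user-space side of this sanity check is straightforward to sketch against the uapi struct added by the patch. A minimal example, assuming headers that already carry BPF_PROG_TEST_RUN; run_prog_on_cpu() is a hypothetical helper name:

```c
#define _GNU_SOURCE		/* for CPU_SET()/sched_setaffinity() */
#include <sched.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>		/* needs a uapi header with BPF_PROG_TEST_RUN */

/* Hypothetical helper: pin the calling thread to @cpu, then run
 * @prog_fd over @pkt @repeat times via BPF_PROG_TEST_RUN.
 */
static int run_prog_on_cpu(int prog_fd, int cpu, void *pkt, __u32 pkt_len,
			   __u32 repeat, __u32 *retval, __u32 *duration)
{
	union bpf_attr attr;
	cpu_set_t set;
	int err;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))	/* pin from user space */
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.test.prog_fd = prog_fd;
	attr.test.data_in = (__u64)(unsigned long)pkt;
	attr.test.data_size_in = pkt_len;
	attr.test.repeat = repeat;

	err = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
	if (!err) {
		*retval = attr.test.retval;	/* program's return code */
		*duration = attr.test.duration;	/* avg ns per run */
	}
	return err;
}
```

Pinning via sched_setaffinity() keeps the kernel side execution-only, exactly as argued above; running this helper from several threads exercises concurrent map access without any extra kernel code.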
On Sat, 1 Apr 2017 08:45:01 -0700 Alexei Starovoitov <ast@fb.com> wrote:

> On 4/1/17 12:14 AM, Jesper Dangaard Brouer wrote:
> > On Thu, 30 Mar 2017 21:45:38 -0700
> > Alexei Starovoitov <ast@fb.com> wrote:
> >
> >> static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
> >> +{
> >> +	u64 time_start, time_spent = 0;
> >> +	u32 ret = 0, i;
> >> +
> >> +	if (!repeat)
> >> +		repeat = 1;
> >> +	time_start = ktime_get_ns();
> >
> > I've found it useful to record the CPU cycles, as they are more
> > useful for comparing between CPUs. The nanosec time measurement varies
> > too much between CPUs and GHz. I do use nanosec measurements myself a
> > lot, but that is mostly because it is easier to relate to pps rates.
> > For eBPF code execution I think it is more useful to get a cycles cost
> > count?
>
> for micro-benchmarking of an instruction or small primitives
> like spin_lock and irq_save/restore, yes. Cycles are more interesting
> to look at. Here it's the whole program, which in the case of networking
> likely does at least a few map lookups.
> Also, this duration field is more of a sanity test than an actual metric.

Okay, if it is only a sanity metric.

> > I've been using the TSC [1] (rdtsc) to get the CPU cycles; I believe
> > get_cycles() is the more generic call, which has arch-specific
> > implementations (but can return 0 if there is no arch support).
> >
> > The best solution would be to use the perf infrastructure and PMU
> > counters to get both PMU cycles and instructions, as that also tells
> > you about pipeline efficiency, like instructions per cycle. I only got
> > this partly working in [1][2].
>
> to use get_cycles() or perf_event_create_kernel_counter() the current
> simple loop would become a kthread pinned to a cpu and so on.
> imo it's overkill.
> The only reason 'duration' is being reported is as a sanity check
> against user space measurements.
> What this command allows to do is:
> $ time ./my_bpf_benchmark
> The reported time should match the kernel-reported 'duration'.
> The tiny difference will come from resched. That's the sanity part.
> Now we can also do
> $ perf record ./my_bpf_benchmark

Makes perfect sense to handle it this way.

> and get all perf goodness for free without adding any kernel code.
> I want this test_run command to stay execution only. All pmu and
> performance metrics should stay on the perf side.
> In the case of performance optimization of bpf programs, we're trying
> to improve perf by changing the way the program is written, hence
> we need perf to point out which line of C code is costly.
> Second is improving performance by changing the JIT, map implementations
> and so on. Here we also want full perf tool power.
>
> Unfortunately there is an issue with perf today: as soon as
> my_bpf_benchmark exits, the bpf prog is unloaded and its ksym is gone, so
> 'perf report' cannot associate addresses back to source code.
> We discussed a solution with Arnaldo, so that's orthogonal work in
> progress which is needed regardless of this test_run command.

Yes, that is rather unfortunate. Good to hear there is work in this
area. I've started using:

  sysctl net/core/bpf_jit_kallsyms=1

and adding --kallsyms=/proc/kallsyms to perf report, which is helpful.

> User space can also pin itself to a cpu instead of asking the kernel to
> do it, and run the same program on multiple cpus in parallel, testing
> interaction between concurrent map accesses and so on.
> So by keeping the test_run command as an execution-only primitive we allow
> user space to do all the fancy tricks and measurements.

Sounds good to me! :-)

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
```diff
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2ae39a3e9ead..bbb513da5075 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -169,6 +169,8 @@ struct bpf_verifier_ops {
 				  const struct bpf_insn *src,
 				  struct bpf_insn *dst,
 				  struct bpf_prog *prog);
+	int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr,
+			union bpf_attr __user *uattr);
 };
 
 struct bpf_prog_type_list {
@@ -233,6 +235,11 @@ typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 		     void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
 
+int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
+			  union bpf_attr __user *uattr);
+int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
+			  union bpf_attr __user *uattr);
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 28317a04c34d..a1d95386f562 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -81,6 +81,7 @@ enum bpf_cmd {
 	BPF_OBJ_GET,
 	BPF_PROG_ATTACH,
 	BPF_PROG_DETACH,
+	BPF_PROG_TEST_RUN,
 };
 
 enum bpf_map_type {
@@ -189,6 +190,17 @@ union bpf_attr {
 		__u32		attach_type;
 		__u32		attach_flags;
 	};
+
+	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
+		__u32		prog_fd;
+		__u32		retval;
+		__u32		data_size_in;
+		__u32		data_size_out;
+		__aligned_u64	data_in;
+		__aligned_u64	data_out;
+		__u32		repeat;
+		__u32		duration;
+	} test;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c35ebfe6d84d..ab0cf4c43690 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -973,6 +973,28 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 }
 #endif /* CONFIG_CGROUP_BPF */
 
+#define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
+
+static int bpf_prog_test_run(const union bpf_attr *attr,
+			     union bpf_attr __user *uattr)
+{
+	struct bpf_prog *prog;
+	int ret = -ENOTSUPP;
+
+	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
+		return -EINVAL;
+
+	prog = bpf_prog_get(attr->test.prog_fd);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	if (prog->aux->ops->test_run)
+		ret = prog->aux->ops->test_run(prog, attr, uattr);
+
+	bpf_prog_put(prog);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -1039,7 +1061,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
-
 #ifdef CONFIG_CGROUP_BPF
 	case BPF_PROG_ATTACH:
 		err = bpf_prog_attach(&attr);
 		break;
@@ -1048,7 +1069,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = bpf_prog_detach(&attr);
 		break;
 #endif
-
+	case BPF_PROG_TEST_RUN:
+		err = bpf_prog_test_run(&attr, uattr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/net/Makefile b/net/Makefile
index 9b681550e3a3..9086ffbb5085 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_NET)		+= $(tmp-y)
 
 # LLC has to be linked before the files in net/802/
 obj-$(CONFIG_LLC)		+= llc/
-obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/
+obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/ bpf/
 obj-$(CONFIG_NETFILTER)		+= netfilter/
 obj-$(CONFIG_INET)		+= ipv4/
 obj-$(CONFIG_XFRM)		+= xfrm/
diff --git a/net/bpf/Makefile b/net/bpf/Makefile
new file mode 100644
index 000000000000..27b2992a0692
--- /dev/null
+++ b/net/bpf/Makefile
@@ -0,0 +1 @@
+obj-y	:= test_run.o
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
new file mode 100644
index 000000000000..8a6d0a37c30c
--- /dev/null
+++ b/net/bpf/test_run.c
@@ -0,0 +1,172 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/etherdevice.h>
+#include <linux/filter.h>
+#include <linux/sched/signal.h>
+
+static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx)
+{
+	u32 ret;
+
+	preempt_disable();
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, ctx);
+	rcu_read_unlock();
+	preempt_enable();
+
+	return ret;
+}
+
+static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time)
+{
+	u64 time_start, time_spent = 0;
+	u32 ret = 0, i;
+
+	if (!repeat)
+		repeat = 1;
+	time_start = ktime_get_ns();
+	for (i = 0; i < repeat; i++) {
+		ret = bpf_test_run_one(prog, ctx);
+		if (need_resched()) {
+			if (signal_pending(current))
+				break;
+			time_spent += ktime_get_ns() - time_start;
+			cond_resched();
+			time_start = ktime_get_ns();
+		}
+	}
+	time_spent += ktime_get_ns() - time_start;
+	do_div(time_spent, repeat);
+	*time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
+
+	return ret;
+}
+
+static int bpf_test_finish(union bpf_attr __user *uattr, const void *data,
+			   u32 size, u32 retval, u32 duration)
+{
+	void __user *data_out = u64_to_user_ptr(uattr->test.data_out);
+	int err = -EFAULT;
+
+	if (data_out && copy_to_user(data_out, data, size))
+		goto out;
+	if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size)))
+		goto out;
+	if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval)))
+		goto out;
+	if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration)))
+		goto out;
+	err = 0;
+out:
+	return err;
+}
+
+static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
+			   u32 headroom, u32 tailroom)
+{
+	void __user *data_in = u64_to_user_ptr(kattr->test.data_in);
+	void *data;
+
+	if (size < ETH_HLEN || size > PAGE_SIZE - headroom - tailroom)
+		return ERR_PTR(-EINVAL);
+
+	data = kzalloc(size + headroom + tailroom, GFP_USER);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	if (copy_from_user(data + headroom, data_in, size)) {
+		kfree(data);
+		return ERR_PTR(-EFAULT);
+	}
+	return data;
+}
+
+int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
+			  union bpf_attr __user *uattr)
+{
+	bool is_l2 = false, is_direct_pkt_access = false;
+	u32 size = kattr->test.data_size_in;
+	u32 repeat = kattr->test.repeat;
+	u32 retval, duration;
+	struct sk_buff *skb;
+	void *data;
+	int ret;
+
+	data = bpf_test_init(kattr, size, NET_SKB_PAD,
+			     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	switch (prog->type) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+	case BPF_PROG_TYPE_SCHED_ACT:
+		is_l2 = true;
+		/* fall through */
+	case BPF_PROG_TYPE_LWT_IN:
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_XMIT:
+		is_direct_pkt_access = true;
+		break;
+	default:
+		break;
+	}
+
+	skb = build_skb(data, 0);
+	if (!skb) {
+		kfree(data);
+		return -ENOMEM;
+	}
+
+	skb_reserve(skb, NET_SKB_PAD);
+	__skb_put(skb, size);
+	skb->protocol = eth_type_trans(skb, current->nsproxy->net_ns->loopback_dev);
+	skb_reset_network_header(skb);
+
+	if (is_l2)
+		__skb_push(skb, ETH_HLEN);
+	if (is_direct_pkt_access)
+		bpf_compute_data_end(skb);
+	retval = bpf_test_run(prog, skb, repeat, &duration);
+	if (!is_l2)
+		__skb_push(skb, ETH_HLEN);
+	size = skb->len;
+	/* bpf program can never convert linear skb to non-linear */
+	if (WARN_ON_ONCE(skb_is_nonlinear(skb)))
+		size = skb_headlen(skb);
+	ret = bpf_test_finish(uattr, skb->data, size, retval, duration);
+	kfree_skb(skb);
+	return ret;
+}
+
+int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
+			  union bpf_attr __user *uattr)
+{
+	u32 size = kattr->test.data_size_in;
+	u32 repeat = kattr->test.repeat;
+	struct xdp_buff xdp = {};
+	u32 retval, duration;
+	void *data;
+	int ret;
+
+	data = bpf_test_init(kattr, size, XDP_PACKET_HEADROOM, 0);
+	if (IS_ERR(data))
+		return PTR_ERR(data);
+
+	xdp.data_hard_start = data;
+	xdp.data = data + XDP_PACKET_HEADROOM;
+	xdp.data_end = xdp.data + size;
+
+	retval = bpf_test_run(prog, &xdp, repeat, &duration);
+	if (xdp.data != data + XDP_PACKET_HEADROOM)
+		size = xdp.data_end - xdp.data;
+	ret = bpf_test_finish(uattr, xdp.data, size, retval, duration);
+	kfree(data);
+	return ret;
+}
diff --git a/net/core/filter.c b/net/core/filter.c
index dfb9f61a2fd5..15e9a81ffebe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3309,24 +3309,28 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.is_valid_access	= tc_cls_act_is_valid_access,
 	.convert_ctx_access	= tc_cls_act_convert_ctx_access,
 	.gen_prologue		= tc_cls_act_prologue,
+	.test_run		= bpf_prog_test_run_skb,
 };
 
 static const struct bpf_verifier_ops xdp_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
 	.convert_ctx_access	= xdp_convert_ctx_access,
+	.test_run		= bpf_prog_test_run_xdp,
 };
 
 static const struct bpf_verifier_ops cg_skb_ops = {
 	.get_func_proto		= cg_skb_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
+	.test_run		= bpf_prog_test_run_skb,
 };
 
 static const struct bpf_verifier_ops lwt_inout_ops = {
 	.get_func_proto		= lwt_inout_func_proto,
 	.is_valid_access	= lwt_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
+	.test_run		= bpf_prog_test_run_skb,
 };
 
 static const struct bpf_verifier_ops lwt_xmit_ops = {
@@ -3334,6 +3338,7 @@ static const struct bpf_verifier_ops lwt_xmit_ops = {
 	.is_valid_access	= lwt_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 	.gen_prologue		= tc_cls_act_prologue,
+	.test_run		= bpf_prog_test_run_skb,
 };
 
 static const struct bpf_verifier_ops cg_sock_ops = {
```
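To illustrate the end-to-end flow, here is a trivial XDP program of the sort this command can exercise. The section name "xdp_test" and the clang invocation are loader conventions assumed for illustration, not something the patch mandates; run through BPF_PROG_TEST_RUN, attr.test.retval comes back as the program's return code and attr.test.duration as the average nanoseconds per run:

```c
#include <linux/bpf.h>

/* Trivial XDP program: drop everything. When exercised via
 * BPF_PROG_TEST_RUN, the XDP_DROP return value is reported back
 * in attr.test.retval. Section name "xdp_test" is an illustrative
 * loader convention.
 */
__attribute__((section("xdp_test"), used))
int xdp_drop_all(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* bounds check: anything shorter than an Ethernet header aborts
	 * (bpf_test_init() already rejects inputs below ETH_HLEN)
	 */
	if (data + 14 > data_end)
		return XDP_ABORTED;

	return XDP_DROP;
}
```

Compiled with clang -O2 -target bpf and loaded via BPF_PROG_LOAD, this gives a self-contained target for the `time` and `perf record` sanity checks discussed in the thread.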