Message ID | 20200120092917.13949-1-bjorn.topel@gmail.com |
---|---|
State | Accepted |
Delegated to: | BPF Maintainers |
Headers | show |
Series | [bpf-next] xsk, net: make sock_def_readable() have external linkage | expand |
Björn Töpel wrote: > From: Björn Töpel <bjorn.topel@intel.com> > > XDP sockets use the default implementation of struct sock's > sk_data_ready callback, which is sock_def_readable(). This function is > called in the XDP socket fast-path, and involves a retpoline. By > letting sock_def_readable() have external linkage, and being called > directly, the retpoline can be avoided. > > Signed-off-by: Björn Töpel <bjorn.topel@intel.com> > --- > include/net/sock.h | 2 ++ > net/core/sock.c | 2 +- > net/xdp/xsk.c | 2 +- > 3 files changed, 4 insertions(+), 2 deletions(-) > I think this is fine but curious were you able to measure the difference with before/after pps or something?
On Tue, 21 Jan 2020 at 01:48, John Fastabend <john.fastabend@gmail.com> wrote: > > Björn Töpel wrote: > > From: Björn Töpel <bjorn.topel@intel.com> > > > > XDP sockets use the default implementation of struct sock's > > sk_data_ready callback, which is sock_def_readable(). This function is > > called in the XDP socket fast-path, and involves a retpoline. By > > letting sock_def_readable() have external linkage, and being called > > directly, the retpoline can be avoided. > > > > Signed-off-by: Björn Töpel <bjorn.topel@intel.com> > > --- > > include/net/sock.h | 2 ++ > > net/core/sock.c | 2 +- > > net/xdp/xsk.c | 2 +- > > 3 files changed, 4 insertions(+), 2 deletions(-) > > > > I think this is fine but curious were you able to measure the > difference with before/after pps or something? Ugh, yeah, of course I've should have added that. Sorry for that! Here goes; Benchmark is xdpsock rxdrop, NAPI running on core 20: **Pre-patch: xdpsock rxdrop: 22.8 Mpps Performance counter stats for 'CPU(s) 20': 10,000.58 msec cpu-clock # 1.000 CPUs utilized 12 context-switches # 0.001 K/sec 1 cpu-migrations # 0.000 K/sec 0 page-faults # 0.000 K/sec 29,931,407,416 cycles # 2.993 GHz 82,538,852,331 instructions # 2.76 insn per cycle 15,894,169,979 branches # 1589.324 M/sec 30,916,486 branch-misses # 0.19% of all branches 10.000636027 seconds time elapsed **Post-patch: xdpsock rxdrop: 23.2 Mpps 10,000.90 msec cpu-clock # 1.000 CPUs utilized 12 context-switches # 0.001 K/sec 1 cpu-migrations # 0.000 K/sec 0 page-faults # 0.000 K/sec 29,932,353,067 cycles # 2.993 GHz 84,299,636,827 instructions # 2.82 insn per cycle 16,228,795,437 branches # 1622.733 M/sec 28,113,847 branch-misses # 0.17% of all branches 10.000596454 seconds time elapsed This could fall into the category of noise. :-) PPS and IPC is up a bit. OTOH, maybe UDP can benefit from this as well? Björn
On 1/21/20 1:06 PM, Björn Töpel wrote: > On Tue, 21 Jan 2020 at 01:48, John Fastabend <john.fastabend@gmail.com> wrote: >> Björn Töpel wrote: >>> From: Björn Töpel <bjorn.topel@intel.com> >>> >>> XDP sockets use the default implementation of struct sock's >>> sk_data_ready callback, which is sock_def_readable(). This function is >>> called in the XDP socket fast-path, and involves a retpoline. By >>> letting sock_def_readable() have external linkage, and being called >>> directly, the retpoline can be avoided. >>> >>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> >>> --- >>> include/net/sock.h | 2 ++ >>> net/core/sock.c | 2 +- >>> net/xdp/xsk.c | 2 +- >>> 3 files changed, 4 insertions(+), 2 deletions(-) >>> >> >> I think this is fine but curious were you able to measure the >> difference with before/after pps or something? > > Ugh, yeah, of course I've should have added that. Sorry for that! Here > goes; Benchmark is xdpsock rxdrop, NAPI running on core 20: > > **Pre-patch: xdpsock rxdrop: 22.8 Mpps > Performance counter stats for 'CPU(s) 20': > > 10,000.58 msec cpu-clock # 1.000 CPUs > utilized > 12 context-switches # 0.001 K/sec > 1 cpu-migrations # 0.000 K/sec > 0 page-faults # 0.000 K/sec > 29,931,407,416 cycles # 2.993 GHz > 82,538,852,331 instructions # 2.76 insn per > cycle > 15,894,169,979 branches # 1589.324 M/sec > 30,916,486 branch-misses # 0.19% of all > branches > > 10.000636027 seconds time elapsed > > **Post-patch: xdpsock rxdrop: 23.2 Mpps > 10,000.90 msec cpu-clock # 1.000 CPUs > utilized > 12 context-switches # 0.001 K/sec > 1 cpu-migrations # 0.000 K/sec > 0 page-faults # 0.000 K/sec > 29,932,353,067 cycles # 2.993 GHz > 84,299,636,827 instructions # 2.82 insn per > cycle > 16,228,795,437 branches # 1622.733 M/sec > 28,113,847 branch-misses # 0.17% of all > branches > > 10.000596454 seconds time elapsed > > This could fall into the category of noise. :-) PPS and IPC is up a > bit. OTOH, maybe UDP can benefit from this as well? Yeah, might be within noise range though getting rid of retpolines in XDP [socket] fast-path is fine and patch is tiny, so applied, thanks!
diff --git a/include/net/sock.h b/include/net/sock.h index 8dff68b4c316..0891c55f1e82 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -2612,4 +2612,6 @@ static inline bool sk_dev_equal_l3scope(struct sock *sk, int dif) return false; } +void sock_def_readable(struct sock *sk); + #endif /* _SOCK_H */ diff --git a/net/core/sock.c b/net/core/sock.c index 8459ad579f73..a4c8fac781ff 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2786,7 +2786,7 @@ static void sock_def_error_report(struct sock *sk) rcu_read_unlock(); } -static void sock_def_readable(struct sock *sk) +void sock_def_readable(struct sock *sk) { struct socket_wq *wq; diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 02ada7ab8c6e..df600487a68d 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -217,7 +217,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) static void xsk_flush(struct xdp_sock *xs) { xskq_prod_submit(xs->rx); - xs->sk.sk_data_ready(&xs->sk); + sock_def_readable(&xs->sk); } int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)