[RFC,00/11] udp: full early demux for unconnected sockets

Message ID cover.1506114055.git.pabeni@redhat.com

Message

Paolo Abeni Sept. 22, 2017, 9:06 p.m. UTC
This series refactors the UDP early demux code so that:

* full socket lookup is performed for unicast packets
* a sk is grabbed even on an unconnected socket match
* a dst cache is used even in that scenario

To perform these tasks, a couple of facilities are added:

* noref socket references, scoped inside the current RCU section, to be
  explicitly cleared before leaving that section (see the sketch after
  this list)
* a dst cache inside the inet and inet6 local address tables, caching the
  related local dst entry
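
To illustrate the first item, here is a minimal, self-contained user
space model of a noref reference: the socket pointer is tagged so the
owner knows no refcount was taken, and it must be cleared before leaving
the (modeled) RCU section. The helper names below are made up for this
sketch and are not the ones introduced by the series.

/* noref_model.c - user space model of a refcount-less, RCU-scoped sk */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct sock { int refcnt; };
struct sk_buff { struct sock *sk; };

#define SKB_NOREF 1UL   /* low pointer bit: no refcount was taken */

static void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
{
        /* store sk without touching its refcount; only valid inside
         * the current (modeled) RCU section */
        skb->sk = (struct sock *)((uintptr_t)sk | SKB_NOREF);
}

static struct sock *skb_get_sk(const struct sk_buff *skb)
{
        return (struct sock *)((uintptr_t)skb->sk & ~SKB_NOREF);
}

static int skb_sk_is_noref(const struct sk_buff *skb)
{
        return ((uintptr_t)skb->sk & SKB_NOREF) != 0;
}

static void skb_clear_noref_sk(struct sk_buff *skb)
{
        /* must run before leaving the RCU section: a noref sk may be
         * freed as soon as the section ends */
        if (skb_sk_is_noref(skb))
                skb->sk = NULL;
}

int main(void)
{
        struct sock sk = { .refcnt = 1 };
        struct sk_buff skb = { .sk = NULL };

        skb_set_noref_sk(&skb, &sk);
        assert(skb_get_sk(&skb) == &sk);
        assert(sk.refcnt == 1);         /* refcount left untouched */
        skb_clear_noref_sk(&skb);
        assert(skb.sk == NULL);
        printf("noref sk model: ok\n");
        return 0;
}

The model only captures the lifetime rule; in the kernel the reference
is tied to a real RCU read-side critical section.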

The measured performance gain under small packet UDP flood is as follows:

ingress NIC	vanilla		patched		delta
rx queues	(kpps)		(kpps)		(%)
[ipv4]
1		2177		2414		10
2		2527		2892		14
3		3050		3733		22
4		3918		4643		18
5		5074		5699		12
6		5654		6869		21

[ipv6]
1		2002		2821		40
2		2087		3148		50
3		2583		4008		55
4		3072		4963		61
5		3719		5992		61
6		4314		6910		60

The number of user space processes in use is equal to the number of
NIC rx queues; when multiple user space processes are running, the
SO_REUSEPORT option is used, as described below:

# configure $n rx queues on the ingress NIC
ethtool -L em2 combined $n
MASK=1
# one reuseport sink per rx queue, all bound to UDP port 9;
# sink $I is pinned to CPU $I + $n via its affinity mask
for I in `seq 0 $((n - 1))`; do
        udp_sink --reuse-port --recvfrom --count 1000000000 --port 9 $1 &
        taskset -p $((MASK << ($I + $n) )) $!
done
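
With n=2, for instance, the loop pins sink 0 to CPU 2 (affinity mask
1 << (0 + 2) = 0x4) and sink 1 to CPU 3 (mask 0x8); assuming the NIC
irqs are steered to CPUs 0 and 1, the sinks never share a CPU with the
receiving softirq.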

Paolo Abeni (11):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux
  ip/route: factor out helper for local route creation
  ipv6/addrconf: add an helper for inet6 address lookup
  net: implement local route cache inside ifaddr
  route: add ipv4/6 helpers to do partial route lookup vs local dst
  IP: early demux can return an error code
  udp: dst lookup in early demux for unconnected sockets

 include/linux/inetdevice.h       |   4 ++
 include/linux/skbuff.h           |  31 +++++++++++
 include/linux/udp.h              |   2 +
 include/net/addrconf.h           |   3 ++
 include/net/dst.h                |  20 +++++++
 include/net/if_inet6.h           |   4 ++
 include/net/ip6_route.h          |   1 +
 include/net/protocol.h           |   4 +-
 include/net/route.h              |   4 ++
 include/net/tcp.h                |   2 +-
 include/net/udp.h                |   2 +-
 net/core/dst.c                   |  12 +++++
 net/core/sock.c                  |   7 +++
 net/ipv4/devinet.c               |  29 ++++++++++-
 net/ipv4/ip_input.c              |  33 ++++++++----
 net/ipv4/netfilter/nf_dup_ipv4.c |   3 ++
 net/ipv4/route.c                 |  73 +++++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c              |   9 ++--
 net/ipv4/udp.c                   |  95 +++++++++++++++-------------------
 net/ipv6/addrconf.c              | 109 +++++++++++++++++++++++++++------------
 net/ipv6/ip6_input.c             |   4 ++
 net/ipv6/netfilter/nf_dup_ipv6.c |   3 ++
 net/ipv6/route.c                 |  13 +++++
 net/ipv6/udp.c                   |  72 ++++++++++----------------
 net/netfilter/nf_queue.c         |   3 ++
 25 files changed, 383 insertions(+), 159 deletions(-)

Comments

Eric Dumazet Sept. 22, 2017, 9:58 p.m. UTC | #1
On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> This series refactor the UDP early demux code so that:
> 
> * full socket lookup is performed for unicast packets
> * a sk is grabbed even for unconnected socket match
> * a dst cache is used even in such scenario
> 
> To perform this tasks a couple of facilities are added:
> 
> * noref socket references, scoped inside the current RCU section, to be
>   explicitly cleared before leaving such section
> * a dst cache inside the inet and inet6 local addresses tables, caching the
>   related local dst entry
> 
> The measured performance gain under small packet UDP flood is as follow:
> 
> ingress NIC	vanilla		patched		delta
> rx queues	(kpps)		(kpps)		(%)
> [ipv4]
> 1		2177		2414		10
> 2		2527		2892		14
> 3		3050		3733		22


This is a clear sign your program is not using the latest SO_REUSEPORT +
[ec]BPF filter [1]

return socket[RX_QUEUE# | or CPU#];

If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
based on a lazy hash, meaning that you do not have proper siloing.

return socket[hash(skb)];

Multiple cpus can then:
 - compete on grabbing same socket refcount
 - compete on grabbing the receive queue lock
 - compete for releasing lock and socket refcount
 - skb freeing done on different cpus than where allocated.

You are adding complexity to the kernel because you are using a
sub-optimal user space program, favoring false sharing.

First solve the false sharing issue.

Performance with 2 rx queues should be almost twice the performance with
1 rx queue.

Then we can see if the gains you claim are still applicable.

Thanks

PS: Wei Wan is about to release the IPV6 changes so that the big
differences you showed are going to disappear soon.

Refs [1]

tools/testing/selftests/net/reuseport_bpf.c

6a5ef90c58daada158ba16ba330558efc3471491 Merge branch 'faster-soreuseport'
3ca8e4029969d40ab90e3f1ecd83ab1cadd60fbb soreuseport: BPF selection functional test
538950a1b7527a0a52ccd9337e3fcd304f027f13 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
e32ea7e747271a0abcd37e265005e97cc81d9df5 soreuseport: fast reuseport UDP socket selection
ef456144da8ef507c8cf504284b6042e9201a05c soreuseport: define reuseport groups
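
For reference, here is a minimal sketch of the CPU-based selection
described above, along the lines of the selftests referenced in [1]: a
classic BPF program returning the current CPU number is attached with
SO_ATTACH_REUSEPORT_CBPF, so incoming packets are steered to
socket[CPU#]. This is only an illustration, not the actual udp_sink
code; error handling is trimmed and the port number simply mirrors the
scripts in this thread.

/* reuseport_cpu.c - steer packets to the reuseport socket whose index
 * matches the CPU that handles them */
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/filter.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51    /* value on most architectures */
#endif

int main(void)
{
        /* A = current CPU number; return A as the socket index */
        struct sock_filter code[] = {
                { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
                { BPF_RET | BPF_A, 0, 0, 0 },
        };
        struct sock_fprog prog = { .len = 2, .filter = code };
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(9),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int one = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0 ||
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) ||
            bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ||
            setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                       &prog, sizeof(prog))) {
                perror("setup");
                return 1;
        }
        /* ... create the other group members and receive as usual ... */
        return 0;
}

The filter attached to a single member applies to the whole reuseport
group; with one socket per rx queue and matching irq affinity, each CPU
only delivers to "its" socket.
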
Paolo Abeni Sept. 25, 2017, 8:26 p.m. UTC | #2
On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform this tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follow:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packets to the reuseport socket with the same id as the ingress
CPU - we have a 1:1 mapping between the ingress receive queue and the
destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020            3663            21
3               4352            5179            19
4               5318            6194            16
5               6258            7583            21
6               7376            8558            16

[ipv6]
2               2446            3949            61
3               3099            5092            64
4               3698            6611            78
5               4382            7852            79
6               5116            8851            73

Some notes:

- figures obtained with: 

# configure $n rx queues on the ingress NIC
ethtool -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        # only the first sink passes --use_bpf: a reuseport BPF filter
        # applies to the whole group, so it only needs to be installed once
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        # pin sink $I to CPU $I + $n
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottleneck in
ip6_pol_route(): I see a lot of contention on a dst refcount, so
without early demux performance does not scale well there.

- for maximum performance the BH and the user space sink need to run on
different CPUs: we get some more cacheline misses and a little
contention on the receive queue spinlock, but far fewer icache misses
and more CPU cycles available, so the overall throughput is a lot
higher than when binding to the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo