
[RFC,net-next,7/8] net: ipv4: listified version of ip_rcv

Message ID 5716347D.3030808@solarflare.com
State RFC, archived
Delegated to: David Miller

Commit Message

Edward Cree April 19, 2016, 1:37 p.m. UTC
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
horrendous) we just let it call the regular okfn (in this case
ip_rcv_finish()) for any packets it steals, and have it give us back a list
of packets it's synchronously accepted (which normally NF_HOOK would
automatically call okfn() on, but we want to be able to potentially pass
the list to a listified version of okfn().)
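
The intended calling pattern (this mirrors ip_sublist_rcv() in the patch
below) is roughly:

	struct sk_buff_head accepted;
	struct sk_buff *skb;

	/* Run the hook once over the whole list; skbs it synchronously
	 * accepts come back on 'accepted' instead of having okfn() called
	 * on them directly.
	 */
	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
		     list, &accepted, dev, NULL, ip_rcv_finish);

	/* The caller then chooses how to run okfn(): one skb at a time for
	 * now, or a listified ip_rcv_finish() later.
	 */
	while ((skb = __skb_dequeue(&accepted)) != NULL)
		ip_rcv_finish(net, NULL, skb);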

There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any accepts
earlier in the list.  However, it was already possible for an asynchronous
accept to cause out-of-order receives, so hopefully I haven't broken
anything that wasn't broken already.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  3 ++
 include/linux/netfilter.h | 27 +++++++++++++++++
 include/net/ip.h          |  2 ++
 net/core/dev.c            | 11 +++++--
 net/ipv4/af_inet.c        |  1 +
 net/ipv4/ip_input.c       | 75 ++++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 110 insertions(+), 9 deletions(-)

Comments

Eric Dumazet April 19, 2016, 2:50 p.m. UTC | #1
On Tue, 2016-04-19 at 14:37 +0100, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
> Rather than attempting to make netfilter know about lists (which would be
> horrendous) we just let it call the regular okfn (in this case
> ip_rcv_finish()) for any packets it steals, and have it give us back a list
> of packets it's synchronously accepted (which normally NF_HOOK would
> automatically call okfn() on, but we want to be able to potentially pass
> the list to a listified version of okfn().)
> 
> There is potential for out-of-order receives if the netfilter hook ends up
> synchronously stealing packets, as they will be processed before any accepts
> earlier in the list.  However, it was already possible for an asynchronous
> accept to cause out-of-order receives, so hopefully I haven't broken
> anything that wasn't broken already.
> 
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---

We already have a hard time dealing with latencies and maintaining some
sanity in the stack(s).

This is not going to give us a 10x or even 2x improvement factor, so
what about working on something that would really lower cache line
misses and use pipelines to amortize the costs?

The main problem in the UDP stack today is having to lock the socket
because of the dumb forward allocation problem. Are you really going to
provide a list of skbs up to _one_ UDP socket?
Tom Herbert April 19, 2016, 3:46 p.m. UTC | #2
On Tue, Apr 19, 2016 at 7:50 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-04-19 at 14:37 +0100, Edward Cree wrote:
>> Also involved adding a way to run a netfilter hook over a list of packets.
>> Rather than attempting to make netfilter know about lists (which would be
>> horrendous) we just let it call the regular okfn (in this case
>> ip_rcv_finish()) for any packets it steals, and have it give us back a list
>> of packets it's synchronously accepted (which normally NF_HOOK would
>> automatically call okfn() on, but we want to be able to potentially pass
>> the list to a listified version of okfn().)
>>
>> There is potential for out-of-order receives if the netfilter hook ends up
>> synchronously stealing packets, as they will be processed before any accepts
>> earlier in the list.  However, it was already possible for an asynchronous
>> accept to cause out-of-order receives, so hopefully I haven't broken
>> anything that wasn't broken already.
>>
>> Signed-off-by: Edward Cree <ecree@solarflare.com>
>> ---
>
> We already have a hard time dealing with latencies and maintaining some
> sanity in the stack(s).
>
Right, this is significant complexity for a fairly narrow use case.
One alternative might be to move early-demux-like functionality to the
GRO layer. There's a lot of work done by GRO to parse and identify
packets of the same flow; even if we can't aggregate such packets, it
might be nice if we could at least provide a cached route so that we
avoid doing a full route lookup on each one later on.
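
For illustration only (an editorial sketch, not something in this patch;
'flow' is a hypothetical per-flow state found by the GRO-level match),
handing a cached route to the rest of the stack could look like:

	/* If the GRO layer matched a known flow that carries a cached dst,
	 * attach it; ip_rcv_finish() skips the routing lookup when the skb
	 * already has a dst.
	 */
	if (flow && flow->dst) {
		dst_hold(flow->dst);		/* reference for this skb */
		skb_dst_set(skb, flow->dst);
	}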

Tom

> This is not going to give us a 10x or even 2x improvement factor, so
> what about working on something that would really lower cache line
> misses and use pipelines to amortize the costs?
>
> The main problem in the UDP stack today is having to lock the socket
> because of the dumb forward allocation problem. Are you really going to
> provide a list of skbs up to _one_ UDP socket?
>
>
>
Edward Cree April 19, 2016, 4:50 p.m. UTC | #3
On 19/04/16 15:50, Eric Dumazet wrote:
> The main problem in the UDP stack today is having to lock the socket
> because of the dumb forward allocation problem.
I'm not quite sure what you're referring to here, care to educate me?

> Are you really going to provide
> a list of skbs up to _one_ UDP socket?
In principle we should be able to take it that far, yes.  AFAICT the
socket already has a receive queue that we end up appending the packet
to (and which I presume the recvmsg() syscall pulls from); I don't see
why we couldn't just splice a list of skbs onto the end rather than
appending them one by one, thus amortising the socket lookup.
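
A minimal sketch of that splice (editorial illustration, not part of this
patch; udp_sock_enqueue_list() is a made-up name, and memory accounting,
socket filters and error handling are all ignored):

static void udp_sock_enqueue_list(struct sock *sk, struct sk_buff_head *list)
{
	struct sk_buff_head *queue = &sk->sk_receive_queue;
	unsigned long flags;

	spin_lock_irqsave(&queue->lock, flags);
	skb_queue_splice_tail_init(list, queue);	/* one lock, many skbs */
	spin_unlock_irqrestore(&queue->lock, flags);

	sk->sk_data_ready(sk);				/* wake the reader once */
}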
Eric Dumazet April 19, 2016, 4:54 p.m. UTC | #4
On Tue, 2016-04-19 at 08:46 -0700, Tom Herbert wrote:

> Right, this is significant complexity for a fairly narrow use case.
> One alternative might be to move early-demux-like functionality to the
> GRO layer. There's a lot of work done by GRO to parse and identify
> packets of the same flow; even if we can't aggregate such packets, it
> might be nice if we could at least provide a cached route so that we
> avoid doing a full route lookup on each one later on.

Moving early demux earlier in the stack would also allow us to implement
RFS more efficiently, removing one hash lookup.
(no extra cache line miss in global RFS table)
Edward Cree April 19, 2016, 5:12 p.m. UTC | #5
On 19/04/16 16:46, Tom Herbert wrote:
> On Tue, Apr 19, 2016 at 7:50 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> We already have a hard time dealing with latencies and maintaining some
>> sanity in the stack(s).
> Right, this is significant complexity for a fairly narrow use case.
Why do you say the use case is narrow?  This approach should increase
packet rate for any (non-GROed) traffic, whether for local delivery or
forwarding.  If you're line-rate limited, it'll save CPU time instead.
The only reason I focused my testing on single-byte UDP is because the
benefits are more easily measured in that case.

If anything, the use case is broader than GRO, because GRO can't be used
for datagram protocols where packet boundaries must be maintained.
And because the listified processing is at least partly sharing code with
the regular stack, it's less complexity than GRO which has to have
essentially its own receive stack, _and_ code to coalesce the results
back into a superframe.

I think if we pushed bundled RX all the way up to the TCP layer, it might
potentially also be faster than GRO, because it avoids the work of
coalescing superframes; plus going through the GRO callbacks for each
packet could end up blowing icache in the same way the regular stack does.
If bundling did prove faster, we could then remove GRO, and overall
complexity would be _reduced_.

But I admit it may be a long shot.

-Ed
Eric Dumazet April 19, 2016, 5:54 p.m. UTC | #6
On Tue, 2016-04-19 at 18:12 +0100, Edward Cree wrote:

> I think if we pushed bundled RX all the way up to the TCP layer, it might
> potentially also be faster than GRO, because it avoids the work of
> coalescing superframes; plus going through the GRO callbacks for each
> packet could end up blowing icache in the same way the regular stack does.
> If bundling did prove faster, we could then remove GRO, and overall
> complexity would be _reduced_.

Oh well. You really are coming 8 years too late.

I receive ~40 Gbit/s on a _single_ TCP flow with GRO.  I have no idea
what you expect to get without GRO, but my educated guess is:

It will be horrible, and will add memory overhead. (remember GRO shares
a single sk_buff/head for all the segs)

And as a bonus, GRO also works on forwarding workloads, where the packet
is delivered to another NIC or a virtual device.

Are you claiming TSO is useless?

Just because a model works well in some other stack (userspace, I
guess) does not mean we have to copy it in the Linux kernel.

Feel free to experiment, but if your non-GRO solution is slower than
GRO, don't even mention it here.

Remember to test things with a realistic iptables setup.
Eric Dumazet April 19, 2016, 6:06 p.m. UTC | #7
On Tue, 2016-04-19 at 17:50 +0100, Edward Cree wrote:
> On 19/04/16 15:50, Eric Dumazet wrote:
> > The main problem in the UDP stack today is having to lock the socket
> > because of the dumb forward allocation problem.
> I'm not quite sure what you're referring to here, care to educate me?


This was added for memory accounting, commit
95766fff6b9a78d11fc2d3812dd035381690b55d

The TCP stack does not reclaim its forward allocations as aggressively
(it has a concept of 'memory pressure').
Tom Herbert April 19, 2016, 6:38 p.m. UTC | #8
On Tue, Apr 19, 2016 at 10:12 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 19/04/16 16:46, Tom Herbert wrote:
>> On Tue, Apr 19, 2016 at 7:50 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> We already have a hard time dealing with latencies and maintaining some
>>> sanity in the stack(s).
>> Right, this is significant complexity for a fairly narrow use case.
> Why do you say the use case is narrow?  This approach should increase
> packet rate for any (non-GROed) traffic, whether for local delivery or
> forwarding.  If you're line-rate limited, it'll save CPU time instead.
> The only reason I focused my testing on single-byte UDP is because the
> benefits are more easily measured in that case.
>
It's a narrow use case because of the stated intent of having
"multiple packets traverse the network stack together". Beyond queuing
to the backlog I don't understand what more processing can be done
without splitting the list up: we need to do a route lookup on each
packet, run each one through iptables, and deliver each packet
individually to the application. As for the queuing to the backlog,
that seems to me to be more of a localized bulk enqueue/dequeue
problem than a stack-level infrastructure problem.
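
As a rough sketch of that kind of localized bulk dequeue ('backlog' here
is just any lock-protected sk_buff_head, not the real per-CPU softnet
queue):

static void backlog_process_batch(struct sk_buff_head *backlog)
{
	struct sk_buff_head batch;
	struct sk_buff *skb;
	unsigned long flags;

	__skb_queue_head_init(&batch);

	/* Take the shared lock once for the whole batch... */
	spin_lock_irqsave(&backlog->lock, flags);
	skb_queue_splice_tail_init(backlog, &batch);
	spin_unlock_irqrestore(&backlog->lock, flags);

	/* ...then process the private batch without any locking. */
	while ((skb = __skb_dequeue(&batch)) != NULL)
		netif_receive_skb(skb);
}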

The general alternative to grouping packets together is to apply
cached values that were found in lookups for previous "similar"
packets.  Since nearly all traffic fits some profile of a flow, we can
leverage the fact that packets in a flow should have similar lookup
results.  So, for example, the first time we see a flow we can create a
flow state and save the results of any lookups done for packets in
that flow (route lookup, iptables, etc.).  For subsequent packets, if
we match the flow then we have the answers for all the lookups we would
need.  Maintaining temporal flow states and performing fixed 5-tuple
flow state lookups in a hash table is easy for a host (and we can often
throw a lot of memory at it, sizing the hash tables to avoid
collisions).  VLP matching, open-ended rule chains, multi-table
lookups, and crazy hashes over 35 header fields are the things we only
want to do when there is no other recourse.  This illustrates one
reason why a host is not a switch: we have no hardware to do complex
lookups.
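
A rough sketch of such a flow state (editorial illustration; the names,
the direct-mapped table, and the missing eviction/refcounting are all
simplifications):

/* A trivial 5-tuple flow cache: remember the dst found for the first
 * packet of a flow so that later packets can skip the full FIB lookup.
 * The caller must compare the stored 5-tuple before trusting ->dst.
 * Requires <linux/jhash.h> and <net/dst.h>.
 */
struct flow_cache_entry {
	__be32			saddr, daddr;
	__be16			sport, dport;
	u8			protocol;
	struct dst_entry	*dst;	/* cached route, may be NULL */
};

#define FLOW_CACHE_BITS	10
static struct flow_cache_entry flow_cache[1 << FLOW_CACHE_BITS];

static struct flow_cache_entry *flow_cache_slot(__be32 saddr, __be32 daddr,
						__be16 sport, __be16 dport,
						u8 protocol)
{
	u32 hash = jhash_3words((__force u32)saddr, (__force u32)daddr,
				((u32)ntohs(sport) << 16) | ntohs(dport),
				protocol);

	return &flow_cache[hash & ((1 << FLOW_CACHE_BITS) - 1)];
}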

Tom

> If anything, the use case is broader than GRO, because GRO can't be used
> for datagram protocols where packet boundaries must be maintained.
> And because the listified processing is at least partly sharing code with
> the regular stack, it's less complexity than GRO which has to have
> essentially its own receive stack, _and_ code to coalesce the results
> back into a superframe.
>
> I think if we pushed bundled RX all the way up to the TCP layer, it might
> potentially also be faster than GRO, because it avoids the work of
> coalescing superframes; plus going through the GRO callbacks for each
> packet could end up blowing icache in the same way the regular stack does.
> If bundling did prove faster, we could then remove GRO, and overall
> complexity would be _reduced_.
>
> But I admit it may be a long shot.
>
> -Ed
Edward Cree April 21, 2016, 5:24 p.m. UTC | #9
On 19/04/16 14:37, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
Turns out this breaks the build if netfilter is *disabled*, because I forgot to
add a stub for that case.  The next version of the patch (if there is one) will
have the following under #ifndef CONFIG_NETFILTER:

static inline void
NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
	     struct sk_buff_head *list, struct sk_buff_head *sublist,
	     struct net_device *in, struct net_device *out,
	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
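	/* Netfilter is compiled out: behave as though every packet were
	 * synchronously accepted.
	 */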
	__skb_queue_head_init(sublist);
	/* Move everything to the sublist */
	skb_queue_splice_init(list, sublist);
}
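
(That way callers such as ip_sublist_rcv() need no #ifdef of their own:
with netfilter compiled out, the whole list simply comes back as
accepted.)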

-Ed

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 682d0ad..292f2d5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2143,6 +2143,9 @@  struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct sk_buff_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9a..e18e91b 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -220,6 +220,24 @@  NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return ret;
 }
 
+static inline void
+NF_HOOK_LIST_THRESH(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+		    struct sk_buff_head *list, struct sk_buff_head *sublist,
+		    struct net_device *in, struct net_device *out,
+		    int (*okfn)(struct net *, struct sock *, struct sk_buff *),
+		    int thresh)
+{
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(sublist); /* list of synchronously ACCEPTed skbs */
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		int ret = nf_hook_thresh(pf, hook, net, sk, skb, in, out, okfn,
+					 thresh);
+		if (ret == 1)
+			__skb_queue_tail(sublist, skb);
+	}
+}
+
 static inline int
 NF_HOOK_COND(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	     struct sk_buff *skb, struct net_device *in, struct net_device *out,
@@ -242,6 +260,15 @@  NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return NF_HOOK_THRESH(pf, hook, net, sk, skb, in, out, okfn, INT_MIN);
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct sk_buff_head *list, struct sk_buff_head *sublist,
+	     struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	NF_HOOK_LIST_THRESH(pf, hook, net, sk, list, sublist, in, out, okfn, INT_MIN);
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
diff --git a/include/net/ip.h b/include/net/ip.h
index 93725e5..c994c44 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -106,6 +106,8 @@  int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct sk_buff_head *list, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index db1d16a..da768e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4230,8 +4230,15 @@  static inline void __netif_receive_skb_list_ptype(struct sk_buff_head *list,
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(list)) != NULL)
-		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	if (!pt_prev)
+		return;
+	if (skb_queue_empty(list))
+		return;
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(list, pt_prev, orig_dev);
+	else
+		while ((skb = __skb_dequeue(list)) != NULL)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct sk_buff_head *list, bool pfmemalloc)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2e6e65f..1424147 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1757,6 +1757,7 @@  fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e3d7827..e7d0d85 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -395,10 +395,9 @@  drop:
 /*
  * 	Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -408,7 +407,6 @@  int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	net = dev_net(dev);
 	IP_UPD_PO_STATS_BH(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -475,9 +473,7 @@  int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
 
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
+	return skb;
 
 csum_error:
 	IP_INC_STATS_BH(net, IPSTATS_MIB_CSUMERRORS);
@@ -486,5 +482,70 @@  inhdr_error:
 drop:
 	kfree_skb(skb);
 out:
-	return NET_RX_DROP;
+	return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
+		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
+			   struct net *net)
+{
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     list, &sublist, dev, NULL, ip_rcv_finish);
+	while ((skb = __skb_dequeue(&sublist)) != NULL)
+		ip_rcv_finish(net, NULL, skb);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct sk_buff_head *list, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+
+		if (skb_queue_empty(&sublist)) {
+			curr_dev = dev;
+			curr_net = net;
+		} else if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			__skb_queue_head_init(&sublist);
+			curr_dev = dev;
+			curr_net = net;
+		}
+		/* add to current sublist */
+		__skb_queue_tail(&sublist, skb);
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(&sublist, curr_dev, curr_net);
 }