[v4,net-next,7/9] net: ipv4: listified version of ip_rcv

Message ID 68230d9e-e93c-ef91-4b54-e7aebedf1f84@solarflare.com
State Accepted, archived
Delegated to: David Miller
Series: Handle multiple received packets at each stage

Commit Message

Edward Cree July 2, 2018, 3:14 p.m. UTC
Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself), we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of the packets it has synchronously accepted (which NF_HOOK would
 normally call okfn() on automatically, but we want to be able to
 potentially pass the list to a listified version of okfn()).
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.
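For reference, the per-packet indirect call is made through this helper
 (shown roughly as it stands in include/linux/netfilter.h; the hook
 function is reached via the entry's function pointer):

	static inline int
	nf_hook_entry_hookfn(const struct nf_hook_entry *entry,
			     struct sk_buff *skb,
			     struct nf_hook_state *state)
	{
		/* one indirect call per packet, per hook entry */
		return entry->hook(entry->priv, skb, state);
	}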

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/netfilter.h | 22 +++++++++++++++
 include/net/ip.h          |  2 ++
 net/core/dev.c            |  8 +++---
 net/ipv4/af_inet.c        |  1 +
 net/ipv4/ip_input.c       | 68 ++++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 94 insertions(+), 10 deletions(-)

Comments

Pablo Neira Ayuso July 3, 2018, 10:50 a.m. UTC | #1
On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself), we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of the packets it has synchronously accepted (which NF_HOOK would
>  normally call okfn() on automatically, but we want to be able to
>  potentially pass the list to a listified version of okfn()).
> The netfilter hooks themselves are indirect calls that still happen per-
>  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
>  for future work.
> 
> There is potential for out-of-order receives if the netfilter hook ends up
>  synchronously stealing packets, as they will be processed before any
>  accepts earlier in the list.  However, it was already possible for an
>  asynchronous accept to cause out-of-order receives, so presumably this is
>  considered OK.

I think we can simplify things if these chained packets don't follow
the standard forwarding path. Making the standard path handle these new
chained packets would require revisiting many subsystems - potentially a
lot of work, and likely to break many things - and I would expect that
we (and other subsystems too) would not get much benefit from these
chained packets.

In general I like this infrastructure, but I think we can get
something simpler if we combine it with the flowtable idea, so chained
packets follow the non-standard flowtable forwarding path as described
in [1].

We could generalize and place the flowtable code in the core if
needed, and make it not netfilter dependent if that's a problem.

Thanks.

[1] https://marc.info/?l=netfilter-devel&m=152898601419841&w=2
Florian Westphal July 3, 2018, 12:13 p.m. UTC | #2
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> > Also involved adding a way to run a netfilter hook over a list of packets.
> >  Rather than attempting to make netfilter know about lists (which would be
> >  a major project in itself), we just let it call the regular okfn (in this
> >  case ip_rcv_finish()) for any packets it steals, and have it give us back
> >  a list of the packets it has synchronously accepted (which NF_HOOK would
> >  normally call okfn() on automatically, but we want to be able to
> >  potentially pass the list to a listified version of okfn()).
> > The netfilter hooks themselves are indirect calls that still happen per-
> >  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
> >  for future work.
> > 
> > There is potential for out-of-order receives if the netfilter hook ends up
> >  synchronously stealing packets, as they will be processed before any
> >  accepts earlier in the list.  However, it was already possible for an
> >  asynchronous accept to cause out-of-order receives, so presumably this is
> >  considered OK.
> 
> I think we can simplify things if these chained packets don't follow
> the standard forwarding path. Making the standard path handle these new
> chained packets would require revisiting many subsystems - potentially a
> lot of work, and likely to break many things - and I would expect that
> we (and other subsystems too) would not get much benefit from these
> chained packets.

I guess it depends on what type of 'bundles' we expect on the wire.
For an average netfilter ruleset, more unbundling will be required,
because we might have

ipv4/udp -> ipv4/tcp -> ipv6/udp -> ipv4/tcp -> ipv4/udp

With Ed's patch set, this would be rebundled to

ipv4/udp -> ipv4/tcp -> ipv4/tcp -> ipv4/udp
and
ipv6/udp

nf ipv4 rule eval then has to untangle this again, into
ipv4/udp -> ipv4/udp
ipv4/tcp -> ipv4/tcp
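
(A hypothetical helper for that kind of untangling might look like the
sketch below - nf_extract_l4proto() is a made-up name, not an existing
kernel API; it pulls all packets of one L4 protocol out of the bundle
while keeping per-flow order:

	/* Move every IPv4 packet with the given L4 protocol from
	 * @head onto @sublist, preserving their relative order, so
	 * protocol-specific rules can be evaluated per sub-bundle.
	 */
	static void nf_extract_l4proto(struct list_head *head,
				       struct list_head *sublist, u8 proto)
	{
		struct sk_buff *skb, *next;

		INIT_LIST_HEAD(sublist);
		list_for_each_entry_safe(skb, next, head, list) {
			if (ip_hdr(skb)->protocol != proto)
				continue;
			list_del(&skb->list);
			list_add_tail(&skb->list, sublist);
		}
	}

Reordering across protocols is fine as long as each individual flow
keeps its own order.)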

HOWEVER, there are hooks that are L4 agnostic, such as defragmentation,
so we might still be able to take advantage.

Defrag is extra silly at the moment: in 99.99% of all cases we eat the
indirect call cost only to return NF_ACCEPT without doing anything.
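
(Roughly, the common path of that hook is just the following - a
simplified sketch, not the exact upstream ipv4_conntrack_defrag():

	static unsigned int defrag_hook(void *priv, struct sk_buff *skb,
					const struct nf_hook_state *state)
	{
		/* the 99.99% case: not a fragment, nothing to do */
		if (!ip_is_fragment(ip_hdr(skb)))
			return NF_ACCEPT;

		/* only real fragments go on to reassembly */
		return defrag_gather(skb);	/* placeholder */
	}

i.e. we pay a full indirect call for a single header test.)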

So I still think there is value in exploring passing the bundle
into the nf core, via a new nf_hook_slow_list() that can be used
from the forward path (ingress, prerouting, input, forward, postrouting).

We can do this step by step, first splitting everything in
nf_hook_slow_list() and just calling the hook functions for each skb in
the list.
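
A minimal sketch of that first step, assuming we reuse the existing
nf_hook_slow() entry point (sketch only - the real thing would need to
sort out per-packet hook state):

	void nf_hook_slow_list(struct list_head *head,
			       struct nf_hook_state *state,
			       const struct nf_hook_entries *e)
	{
		struct sk_buff *skb, *next;
		struct list_head sublist;

		INIT_LIST_HEAD(&sublist);
		list_for_each_entry_safe(skb, next, head, list) {
			list_del(&skb->list);
			/* run the whole hook chain for this one skb */
			if (nf_hook_slow(skb, state, e, 0) == 1)
				list_add_tail(&skb->list, &sublist);
			/* otherwise nf_hook_slow() consumed the skb */
		}
		/* hand the accepted packets back to the caller */
		list_splice(&sublist, head);
	}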

> In general I like this infrastructure, but I think we can get
> something simpler if we combine it with the flowtable idea, so chained
> packets follow the non-standard flowtable forwarding path as described
> in [1].

Yes, in case all packets are in the flow table (the software fallback
for offload) we might be able to keep the list as-is, provided
everything has the same nexthop.

However, I don't see any need to make changes to this series for that
now; it can be added on top.

Did I miss anything?

I do agree that, from a netfilter p.o.v., the ingress hook looks like a
good initial candidate for listification.
Edward Cree July 4, 2018, 4:01 p.m. UTC | #3
On 02/07/18 16:14, Edward Cree wrote:
> +/* Receive a list of IP packets */
> +void ip_list_rcv(struct list_head *head, struct packet_type *pt,
> +		 struct net_device *orig_dev)
> +{
> +	struct net_device *curr_dev = NULL;
> +	struct net *curr_net = NULL;
> +	struct sk_buff *skb, *next;
> +	struct list_head sublist;
> +
> +	list_for_each_entry_safe(skb, next, head, list) {
> +		struct net_device *dev = skb->dev;
> +		struct net *net = dev_net(dev);
> +
> +		skb = ip_rcv_core(skb, net);
> +		if (skb == NULL)
> +			continue;
I've spotted a bug here: if ip_rcv_core() eats the skb (e.g. by freeing
 it), it won't list_del() it, so when we process the sublist we'll end
 up trying to process this skb anyway.
Thus, places where an skb could get freed (possibly remotely, as in nf
 hooks that can steal packets) need to use a dequeue/enqueue model
 rather than the list_cut_before() approach.
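For concreteness, the dequeue/enqueue version of the loop would look
 roughly like this (a sketch of the intended fix, not the final patch):

	INIT_LIST_HEAD(&sublist);
	list_for_each_entry_safe(skb, next, head, list) {
		struct net_device *dev = skb->dev;
		struct net *net = dev_net(dev);

		/* dequeue first, so an skb freed inside ip_rcv_core()
		 * can never be left dangling on the list
		 */
		list_del(&skb->list);
		skb = ip_rcv_core(skb, net);
		if (skb == NULL)
			continue;

		if (curr_dev != dev || curr_net != net) {
			/* dispatch old sublist */
			if (!list_empty(&sublist))
				ip_sublist_rcv(&sublist, curr_dev, curr_net);
			/* start new sublist */
			INIT_LIST_HEAD(&sublist);
			curr_dev = dev;
			curr_net = net;
		}
		/* enqueue onto the current sublist */
		list_add_tail(&skb->list, &sublist);
	}
	/* dispatch final sublist */
	if (!list_empty(&sublist))
		ip_sublist_rcv(&sublist, curr_dev, curr_net);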
Followup patches soon.

-Ed

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e104b2e4a735..fe81a2bfcd08 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2291,6 +2291,9 @@  struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct list_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index dd2052f0efb7..5a5e0a2ab2a3 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -288,6 +288,20 @@  NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return ret;
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+		if (ret != 1)
+			list_del(&skb->list);
+	}
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
@@ -369,6 +383,14 @@  NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return okfn(net, sk, skb);
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	/* nothing to do */
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 			  struct sock *sk, struct sk_buff *skb,
 			  struct net_device *indev, struct net_device *outdev,
diff --git a/include/net/ip.h b/include/net/ip.h
index 0d2281b4b27a..1de72f9cb23c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -138,6 +138,8 @@  int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index edd67b1f1e12..4c5ebfab9bc8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4692,9 +4692,11 @@  static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 		return;
 	if (list_empty(head))
 		return;
-
-	list_for_each_entry_safe(skb, next, head, list)
-		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(head, pt_prev, orig_dev);
+	else
+		list_for_each_entry_safe(skb, next, head, list)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 06b218a2870f..3ff7659c9afd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1882,6 +1882,7 @@  fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713dd18f..914240830bdf 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -408,10 +408,9 @@  static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 /*
  * 	Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -421,7 +420,6 @@  int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	net = dev_net(dev);
 	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -489,9 +487,7 @@  int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
 
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
+	return skb;
 
 csum_error:
 	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
@@ -500,5 +496,63 @@  int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 drop:
 	kfree_skb(skb);
 out:
-	return NET_RX_DROP;
+	return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
+		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
+	struct sk_buff *skb, *next;
+
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     head, dev, NULL, ip_rcv_finish);
+	list_for_each_entry_safe(skb, next, head, list)
+		ip_rcv_finish(net, NULL, skb);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+
+		if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			curr_dev = dev;
+			curr_net = net;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(head, curr_dev, curr_net);
 }