net: Cache route in IP tunnels

Message ID: alpine.DEB.2.02.1312112207400.27914@tomh.mtv.corp.google.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Tom Herbert Dec. 12, 2013, 6:17 a.m. UTC
Avoid doing a route lookup on every packet being tunneled.

In ip_tunnel.c, cache the route returned from ip_route_output if
the tunnel is "connected", that is, if all the routing parameters
for a packet are taken from the tunnel parms. Specifically, the
tunnel must not be NBMA and the tos must come from the tunnel parms
(not from the inner packet).

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/net/ip_tunnels.h |   3 ++
 net/ipv4/ip_tunnel.c     | 106 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 86 insertions(+), 23 deletions(-)
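
In outline, the transmit fast path after this change becomes roughly
the following (condensed from the ip_tunnel_xmit() hunk in the patch
below; not a literal excerpt):

	/* Only a "connected" tunnel (daddr, key, oif and tos all fixed
	 * by the tunnel parms) may use the cached route; anything
	 * derived from the inner packet forces a full lookup. */
	if (connected)
		rt = (struct rtable *)tunnel_dst_check(tunnel, 0);
	else
		rt = NULL;

	if (!rt) {
		rt = ip_route_output_key(tunnel->net, &fl4);	/* slow path */
		if (IS_ERR(rt)) {
			dev->stats.tx_carrier_errors++;
			goto tx_error;
		}
		if (connected)
			tunnel_dst_set(tunnel, dst_clone(&rt->dst));
	}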

Comments

Eric Dumazet Dec. 12, 2013, 7:37 a.m. UTC | #1
On Wed, 2013-12-11 at 22:17 -0800, Tom Herbert wrote:
> Avoid doing a route lookup on every packet being tunneled.
> 
> In ip_tunnel.c, cache the route returned from ip_route_output if
> the tunnel is "connected", that is, if all the routing parameters
> for a packet are taken from the tunnel parms. Specifically, the
> tunnel must not be NBMA and the tos must come from the tunnel parms
> (not from the inner packet).
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---

Replacing the per-cpu dst (nh_pcpu_rth_output) with a single shared dst
is not going to be better on SMP.

Do you have any performance data using a multiqueue NIC, with, say,
32 CPUs and 32 TX queues?

I suspect high false sharing on the dst refcount.
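
For reference, the per-packet reference on the shared entry comes from
tunnel_dst_get() in this patch:

	rcu_read_lock();
	dst = rcu_dereference(t->dst_cache);
	if (dst)
		dst_hold(dst);	/* atomic op on a cacheline shared by
				 * every transmitting CPU */
	rcu_read_unlock();

With the per-cpu nh_pcpu_rth_output entries, each CPU increments the
refcount of its own dst, so that cacheline stays CPU-local.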

Thanks


David Miller Dec. 12, 2013, 6:54 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 11 Dec 2013 23:37:18 -0800

> On Wed, 2013-12-11 at 22:17 -0800, Tom Herbert wrote:
>> Avoid doing a route lookup on every packet being tunneled.
>> 
>> In ip_tunnel.c, cache the route returned from ip_route_output if
>> the tunnel is "connected", that is, if all the routing parameters
>> for a packet are taken from the tunnel parms. Specifically, the
>> tunnel must not be NBMA and the tos must come from the tunnel parms
>> (not from the inner packet).
>> 
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> ---
> 
> Replacing the per-cpu dst (nh_pcpu_rth_output) with a single shared dst
> is not going to be better on SMP.
> 
> Do you have any performance data using a multiqueue NIC, with, say,
> 32 CPUs and 32 TX queues?
> 
> I suspect high false sharing on the dst refcount.

First off I'm glad someone looked into this, I've been meaning to play
with this for a while.

Secondly, Eric's concern is valid, but we should keep in mind that IPv6
tunnels already do this kind of caching, so if it's undesirable we
should perhaps undo the caching there too.

But honestly, I suspect that doing a lookup on every packet is more
expensive than false sharing on a cache entry, though I'm happy to be
proven wrong :)
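
For comparison, the IPv6 side (net/ipv6/ip6_tunnel.c in kernels of
this era) caches a single dst per tunnel, roughly like this
(simplified sketch, not a verbatim excerpt):

	static struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
	{
		struct dst_entry *dst = t->dst_cache;

		if (dst && dst->obsolete &&
		    dst->ops->check(dst, t->dst_cookie) == NULL) {
			t->dst_cache = NULL;
			dst_release(dst);
			return NULL;
		}
		return dst;
	}

	static void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
	{
		struct rt6_info *rt = (struct rt6_info *)dst;

		/* Remember the fib6 serial number so a routing change
		 * invalidates the cached entry on the next check. */
		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
		dst_release(t->dst_cache);
		t->dst_cache = dst;
	}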
Tom Herbert Dec. 12, 2013, 8:19 p.m. UTC | #3
On Thu, Dec 12, 2013 at 10:54 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 11 Dec 2013 23:37:18 -0800
>
>> On Wed, 2013-12-11 at 22:17 -0800, Tom Herbert wrote:
>>> Avoid doing a route lookup on every packet being tunneled.
>>>
>>> In ip_tunnel.c, cache the route returned from ip_route_output if
>>> the tunnel is "connected", that is, if all the routing parameters
>>> for a packet are taken from the tunnel parms. Specifically, the
>>> tunnel must not be NBMA and the tos must come from the tunnel parms
>>> (not from the inner packet).
>>>
>>> Signed-off-by: Tom Herbert <therbert@google.com>
>>> ---
>>
>> Replacing the per-cpu dst (nh_pcpu_rth_output) with a single shared dst
>> is not going to be better on SMP.
>>
>> Do you have any performance data using a multiqueue NIC, with, say,
>> 32 CPUs and 32 TX queues?
>>
>> I suspect high false sharing on the dst refcount.
>
> First off I'm glad someone looked into this, I've been meaning to play
> with this for a while.
>
> Secondly, Eric's concern is valid, but we should keep in mind that IPv6
> tunnels already do this kind of caching, so if it's undesirable we
> should perhaps undo the caching there too.
>
We could also use a per-CPU dst cache in the tunnel. That might be
reasonable if people aren't configuring thousands of tunnels.
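
Something along these lines (a hypothetical sketch, not part of this
patch; the struct and helper below are made-up illustrations):

	/* One cached dst per CPU: no refcount cacheline is shared
	 * between CPUs, at the cost of one slot per CPU per tunnel. */
	struct ip_tunnel_dst {
		struct dst_entry __rcu *dst;
	};

	/* in struct ip_tunnel: */
	struct ip_tunnel_dst __percpu *dst_cache;

	static void tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
	{
		struct ip_tunnel_dst *idst = this_cpu_ptr(t->dst_cache);
		struct dst_entry *old;

		/* xchg() makes the update atomic, so no spinlock is
		 * needed; a reset from process context would walk
		 * for_each_possible_cpu() and clear every slot. */
		old = xchg((__force struct dst_entry **)&idst->dst, dst);
		dst_release(old);
	}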

> But honestly, I suspect that doing a lookup on every packet is more
> expensive than false sharing on a cache entry, though I'm happy to be
> proven wrong :)

My performance runs were inconclusive, but I assumed that caching in
the tunnel would scale better with larger numbers of routes.

Btw, I needed to disable GRO on the tunnel interface; with it enabled,
TCP_RR performance is abysmal. I'm looking into that.
Julian Anastasov Dec. 13, 2013, 7:53 a.m. UTC | #4
Hello,

On Wed, 11 Dec 2013, Tom Herbert wrote:

> +static inline void
> +tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
> +{
> +	struct dst_entry *old_dst;
> +
> +	spin_lock(&t->dst_lock);
> +	old_dst = rcu_dereference_raw(t->dst_cache);
> +	rcu_assign_pointer(t->dst_cache, dst);
> +	dst_release(old_dst);
> +	spin_unlock(&t->dst_lock);

	It seems the function is also called by ip_tunnel_update(); do
we need _bh locks in that case?
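
	That is, something like (sketch):

	spin_lock_bh(&t->dst_lock);
	old_dst = rcu_dereference_protected(t->dst_cache,
					    lockdep_is_held(&t->dst_lock));
	rcu_assign_pointer(t->dst_cache, dst);
	dst_release(old_dst);
	spin_unlock_bh(&t->dst_lock);

	ip_tunnel_xmit() calls tunnel_dst_set() with BHs disabled, while
ip_tunnel_update() runs in process context, so a plain spin_lock there
could deadlock against the transmit path.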

> +}
> +
> +static inline void
> +tunnel_dst_reset(struct ip_tunnel *t)
> +{
> +	tunnel_dst_set(t, NULL);
> +}
> +

...

> @@ -696,6 +750,7 @@ static void ip_tunnel_update(struct ip_tunnel_net *itn,
>  		if (set_mtu)
>  			dev->mtu = mtu;
>  	}
> +	tunnel_dst_reset(t);
>  	netdev_state_change(dev);
>  }

Regards

--
Julian Anastasov <ja@ssi.bg>

Patch

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 732f8c6..bde50fc 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -54,6 +54,9 @@ struct ip_tunnel {
 	int		hlen;		/* Precalculated header length */
 	int		mlink;
 
+	struct		dst_entry __rcu *dst_cache;
+	spinlock_t	dst_lock;
+
 	struct ip_tunnel_parm parms;
 
 	/* for SIT */
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 90ff957..c4db0cc 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -68,6 +68,50 @@ static unsigned int ip_tunnel_hash(struct ip_tunnel_net *itn,
 			 IP_TNL_HASH_BITS);
 }
 
+static inline void
+tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
+{
+	struct dst_entry *old_dst;
+
+	spin_lock(&t->dst_lock);
+	old_dst = rcu_dereference_raw(t->dst_cache);
+	rcu_assign_pointer(t->dst_cache, dst);
+	dst_release(old_dst);
+	spin_unlock(&t->dst_lock);
+}
+
+static inline void
+tunnel_dst_reset(struct ip_tunnel *t)
+{
+	tunnel_dst_set(t, NULL);
+}
+
+static inline struct dst_entry *
+tunnel_dst_get(struct ip_tunnel *t)
+{
+	struct dst_entry *dst;
+
+	rcu_read_lock();
+	dst = rcu_dereference(t->dst_cache);
+	if (dst)
+		dst_hold(dst);
+	rcu_read_unlock();
+	return dst;
+}
+
+struct dst_entry *tunnel_dst_check(struct ip_tunnel *t, u32 cookie)
+{
+	struct dst_entry *dst = tunnel_dst_get(t);
+
+	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		tunnel_dst_reset(t);
+		dst_release(dst);
+		return NULL;
+	}
+
+	return dst;
+}
+
 /* Often modified stats are per cpu, other are shared (netdev->stats) */
 struct rtnl_link_stats64 *ip_tunnel_get_stats64(struct net_device *dev,
 						struct rtnl_link_stats64 *tot)
@@ -318,11 +362,9 @@ failed:
 	return ERR_PTR(err);
 }
 
-static inline struct rtable *ip_route_output_tunnel(struct net *net,
-						    struct flowi4 *fl4,
-						    int proto,
-						    __be32 daddr, __be32 saddr,
-						    __be32 key, __u8 tos, int oif)
+static inline void init_tunnel_flow(struct flowi4 *fl4, int proto,
+				    __be32 daddr, __be32 saddr,
+				    __be32 key, __u8 tos, int oif)
 {
 	memset(fl4, 0, sizeof(*fl4));
 	fl4->flowi4_oif = oif;
@@ -331,7 +373,6 @@ static inline struct rtable *ip_route_output_tunnel(struct net *net,
 	fl4->flowi4_tos = tos;
 	fl4->flowi4_proto = proto;
 	fl4->fl4_gre_key = key;
-	return ip_route_output_key(net, fl4);
 }
 
 static int ip_tunnel_bind_dev(struct net_device *dev)
@@ -350,12 +391,11 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
 		struct flowi4 fl4;
 		struct rtable *rt;
 
-		rt = ip_route_output_tunnel(tunnel->net, &fl4,
-					    tunnel->parms.iph.protocol,
-					    iph->daddr, iph->saddr,
-					    tunnel->parms.o_key,
-					    RT_TOS(iph->tos),
-					    tunnel->parms.link);
+		init_tunnel_flow(&fl4, iph->protocol, iph->daddr,
+				 iph->saddr, tunnel->parms.o_key,
+				 RT_TOS(iph->tos), tunnel->parms.link);
+		rt = ip_route_output_key(tunnel->net, &fl4);
+
 		if (!IS_ERR(rt)) {
 			tdev = rt->dst.dev;
 			ip_rt_put(rt);
@@ -532,6 +572,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 	unsigned int max_headroom;	/* The extra header space needed */
 	__be32 dst;
 	int err;
+	bool connected = true;
 
 	inner_iph = (const struct iphdr *)skb_inner_network_header(skb);
 
@@ -581,26 +622,39 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 #endif
 		else
 			goto tx_error;
+
+		connected = false;
 	}
 
 	tos = tnl_params->tos;
 	if (tos & 0x1) {
 		tos &= ~0x1;
-		if (skb->protocol == htons(ETH_P_IP))
+		if (skb->protocol == htons(ETH_P_IP)) {
 			tos = inner_iph->tos;
-		else if (skb->protocol == htons(ETH_P_IPV6))
+			connected = false;
+		} else if (skb->protocol == htons(ETH_P_IPV6)) {
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
+			connected = false;
+		}
 	}
 
-	rt = ip_route_output_tunnel(tunnel->net, &fl4,
-				    protocol,
-				    dst, tnl_params->saddr,
-				    tunnel->parms.o_key,
-				    RT_TOS(tos),
-				    tunnel->parms.link);
-	if (IS_ERR(rt)) {
-		dev->stats.tx_carrier_errors++;
-		goto tx_error;
+	init_tunnel_flow(&fl4, protocol, dst, tnl_params->saddr,
+			 tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link);
+
+	if (connected)
+		rt = (struct rtable *)tunnel_dst_check(tunnel, 0);
+	else
+		rt = NULL;
+
+	if (!rt) {
+		rt = ip_route_output_key(tunnel->net, &fl4);
+
+		if (IS_ERR(rt)) {
+			dev->stats.tx_carrier_errors++;
+			goto tx_error;
+		}
+		if (connected)
+			tunnel_dst_set(tunnel, dst_clone(&rt->dst));
 	}
 	if (rt->dst.dev == dev) {
 		ip_rt_put(rt);
@@ -696,6 +750,7 @@ static void ip_tunnel_update(struct ip_tunnel_net *itn,
 		if (set_mtu)
 			dev->mtu = mtu;
 	}
+	tunnel_dst_reset(t);
 	netdev_state_change(dev);
 }
 
@@ -1001,6 +1056,9 @@ int ip_tunnel_init(struct net_device *dev)
 	iph->version		= 4;
 	iph->ihl		= 5;
 
+	tunnel->dst_cache = NULL;
+	spin_lock_init(&tunnel->dst_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_init);
@@ -1015,6 +1073,8 @@ void ip_tunnel_uninit(struct net_device *dev)
 	/* fb_tunnel_dev will be unregisted in net-exit call. */
 	if (itn->fb_tunnel_dev != dev)
 		ip_tunnel_del(netdev_priv(dev));
+
+	tunnel_dst_reset(tunnel);
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_uninit);