diff mbox

[net-next-2.6] ip_gre: lockless xmit

Message ID 1285664747.2607.48.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 28, 2010, 9:05 a.m. UTC
GRE tunnels can benefit from lockless xmits, using NETIF_F_LLTX

Note: If tunnels are created with the "oseq" option, LLTX is not
enabled :

Even using an atomic_t o_seq, we would increase chance for packets being
out of order at receiver.

Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
10000000 UDP frames via one gre tunnel (size:200 bytes per frame)

Before patch : 
real	3m0.094s
user	0m9.365s
sys	47m50.103s

After patch:
real	0m29.756s
user	0m11.097s
sys	7m33.012s

Last problem to solve is the contention on dst :


38660.00 21.4% __ip_route_output_key          vmlinux             
20786.00 11.5% dst_release                    vmlinux             
14191.00  7.8% __xfrm_lookup                  vmlinux             
12410.00  6.9% ip_finish_output               vmlinux             
 4540.00  2.5% ip_push_pending_frames         vmlinux             
 4427.00  2.4% ip_append_data                 vmlinux             
 4265.00  2.4% __alloc_skb                    vmlinux             
 4140.00  2.3% __ip_local_out                 vmlinux             
 3991.00  2.2% dev_queue_xmit                 vmlinux     

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_gre.c |    4 ++++
 1 files changed, 4 insertions(+)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Nicolas Dichtel Sept. 29, 2010, 8:10 a.m. UTC | #1
NETIF_F_LLTX is marked as deprecated:

include/linux/netdevice.h:
#define NETIF_F_LLTX            4096    /* LockLess TX - deprecated. 
Please */
                                         /* do not use LLTX in new 
drivers */

Is it right to use it?

Regards,
Nicolas

Eric Dumazet wrote:
> GRE tunnels can benefit from lockless xmits, using NETIF_F_LLTX
> 
> Note: If tunnels are created with the "oseq" option, LLTX is not
> enabled :
> 
> Even using an atomic_t o_seq, we would increase chance for packets being
> out of order at receiver.
> 
> Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
> 10000000 UDP frames via one gre tunnel (size:200 bytes per frame)
> 
> Before patch : 
> real	3m0.094s
> user	0m9.365s
> sys	47m50.103s
> 
> After patch:
> real	0m29.756s
> user	0m11.097s
> sys	7m33.012s
> 
> Last problem to solve is the contention on dst :
> 
> 
> 38660.00 21.4% __ip_route_output_key          vmlinux             
> 20786.00 11.5% dst_release                    vmlinux             
> 14191.00  7.8% __xfrm_lookup                  vmlinux             
> 12410.00  6.9% ip_finish_output               vmlinux             
>  4540.00  2.5% ip_push_pending_frames         vmlinux             
>  4427.00  2.4% ip_append_data                 vmlinux             
>  4265.00  2.4% __alloc_skb                    vmlinux             
>  4140.00  2.3% __ip_local_out                 vmlinux             
>  3991.00  2.2% dev_queue_xmit                 vmlinux     
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/ip_gre.c |    4 ++++
>  1 files changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index a1b5d5e..035db63 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -1557,6 +1557,10 @@ static int ipgre_newlink(struct net *src_net, struct net_device *dev, struct nla
>  	if (!tb[IFLA_MTU])
>  		dev->mtu = mtu;
>  
> +	/* Can use a lockless transmit, unless we generate output sequences */
> +	if (!(nt->parms.o_flags & GRE_SEQ))
> +		dev->features |= NETIF_F_LLTX;
> +
>  	err = register_netdevice(dev);
>  	if (err)
>  		goto out;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 29, 2010, 8:18 a.m. UTC | #2
Le mercredi 29 septembre 2010 à 10:10 +0200, Nicolas Dichtel a écrit :
> NETIF_F_LLTX is marked as deprecated:
> 
> include/linux/netdevice.h:
> #define NETIF_F_LLTX            4096    /* LockLess TX - deprecated. 
> Please */
>                                          /* do not use LLTX in new 
> drivers */
> 
> Is it right to use it?
> 

In this particular case (and drivers/net/loopback.c), yes.

This is the only way to avoid the locking in core network
(net/core/dev.c)

What is deprecated is to assert NETIF_F_LLTX and yet, use a lock in the
ndo_xmit() driver method.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 29, 2010, 8:21 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 29 Sep 2010 10:18:08 +0200

> What is deprecated is to assert NETIF_F_LLTX and yet, use a lock in the
> ndo_xmit() driver method.

Also we've discussed recently to do away with the NETIF_F_LLTX
flag entirely for queue-less devices such as loopback and
sw tunnels.

Then all will be left are the truly "deprecated" cases Eric
mentions.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross Sept. 29, 2010, 5:33 p.m. UTC | #4
On Tue, Sep 28, 2010 at 2:05 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> GRE tunnels can benefit from lockless xmits, using NETIF_F_LLTX
>
> Note: If tunnels are created with the "oseq" option, LLTX is not
> enabled :
>
> Even using an atomic_t o_seq, we would increase chance for packets being
> out of order at receiver.
>
> Bench on a 16 cpus machine (dual E5540 cpus), 16 threads sending
> 10000000 UDP frames via one gre tunnel (size:200 bytes per frame)
>
> Before patch :
> real    3m0.094s
> user    0m9.365s
> sys     47m50.103s
>
> After patch:
> real    0m29.756s
> user    0m11.097s
> sys     7m33.012s
>
> Last problem to solve is the contention on dst :
>
>
> 38660.00 21.4% __ip_route_output_key          vmlinux
> 20786.00 11.5% dst_release                    vmlinux
> 14191.00  7.8% __xfrm_lookup                  vmlinux
> 12410.00  6.9% ip_finish_output               vmlinux
>  4540.00  2.5% ip_push_pending_frames         vmlinux
>  4427.00  2.4% ip_append_data                 vmlinux
>  4265.00  2.4% __alloc_skb                    vmlinux
>  4140.00  2.3% __ip_local_out                 vmlinux
>  3991.00  2.2% dev_queue_xmit                 vmlinux
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

The tx lock has another use here: to break local loops.  With this
change, a misconfigured tunnel can bring down the machine with a stack
overflow.  There are clearly other ways to fix this that don't require
a lock that restricts parallelism, such as a loop counter, but that's
the way it is now.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 29, 2010, 6:43 p.m. UTC | #5
Le mercredi 29 septembre 2010 à 10:33 -0700, Jesse Gross a écrit :

> The tx lock has another use here: to break local loops.  With this
> change, a misconfigured tunnel can bring down the machine with a stack
> overflow.  There are clearly other ways to fix this that don't require
> a lock that restricts parallelism, such as a loop counter, but that's
> the way it is now.

Thats a very good point !

We could use a loop counter in the skb, but this use a bit of ram,
or percpu counters in tunnel drivers, to avoid a given level of
recursion.

/* this should be shared by all tunnels */
DEFINE_PER_CPU(tunnel_xmit_count);



tunnel_xmit()
{
if (__this_cpu_read(tunnel_xmit_count) >= LIMIT)
	goto tx_error;
__this_cpu_inc(tunnel_xmit_count);

....

__IPTUNNEL_XMIT(tstats, &dev->stats);

__this_cpu_dec(tunnel_xmit_count),
return NETDEV_TX_OK;

}



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross Sept. 29, 2010, 7:24 p.m. UTC | #6
On Wed, Sep 29, 2010 at 11:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 29 septembre 2010 à 10:33 -0700, Jesse Gross a écrit :
>
>> The tx lock has another use here: to break local loops.  With this
>> change, a misconfigured tunnel can bring down the machine with a stack
>> overflow.  There are clearly other ways to fix this that don't require
>> a lock that restricts parallelism, such as a loop counter, but that's
>> the way it is now.
>
> Thats a very good point !
>
> We could use a loop counter in the skb, but this use a bit of ram,
> or percpu counters in tunnel drivers, to avoid a given level of
> recursion.

I agree, a percpu loop counter is the way to go.  I implemented
something nearly identical to deal with this problem in Open vSwitch.

There are actually a number of optimizations in the Open vSwitch
tunneling stack that you may be interested in taking a look at as well
(for example, it has had this sort of lockless transmit for a while):

http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/tunnel.c;hb=HEAD

The plan is to upstream all of the kernel code (or at least offer it),
just trying to get the userspace interfaces settled down first.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a1b5d5e..035db63 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1557,6 +1557,10 @@  static int ipgre_newlink(struct net *src_net, struct net_device *dev, struct nla
 	if (!tb[IFLA_MTU])
 		dev->mtu = mtu;
 
+	/* Can use a lockless transmit, unless we generate output sequences */
+	if (!(nt->parms.o_flags & GRE_SEQ))
+		dev->features |= NETIF_F_LLTX;
+
 	err = register_netdevice(dev);
 	if (err)
 		goto out;