tun: performance regression in 2.6.30-rc1

Message ID 20090416213122.GB5894@dhcp-1-124.tlv.redhat.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Michael S. Tsirkin April 16, 2009, 9:31 p.m. UTC
Hi,
I have a simple test that sends 10K packets out of a tap device.  Average time
needed to send a packet has gone up from 2.6.29 to 2.6.30-rc1.

2.6.30-rc1:

#sh runsend
time per packet:       7570 ns

2.6.29:

#git checkout v2.6.29 -- drivers/net/tun.c
#make modules modules_install
#rmmod tun
#sh runsend
time per packet:       6337 ns

I note that before 2.6.29, all tun skbs would typically be linear,
while in 2.6.30-rc1, skbs for packet size > 1 page would be paged.
And I found this comment by Rusty (it appears in the comment for
commit f42157cb568c1eb02eca7df4da67553a9edae24a):

    My original version of this patch always allocate paged skbs for big
    packets.  But that made performance drop from 8.4 seconds to 8.8
    seconds on 1G lguest->Host TCP xmit.  So now we only do that as a
    fallback.

So just for fun, I did this (see the diff in the Patch section below):


This makes all skbs linear in tun. And now:

2.6.30-rc1 made linear:
#sh runsend
time per packet:       6611 ns
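
To put the three measurements side by side, the relative figures can be worked out as in the sketch below. The function names are made up for this illustration; the inputs are the ns-per-packet values reported above.

```c
/* Percent slowdown of `measured` relative to `baseline`. */
double slowdown_pct(double baseline, double measured)
{
	return 100.0 * (measured - baseline) / baseline;
}

/* Percent of the regression (baseline -> regressed) that `fixed` wins back. */
double recovered_pct(double baseline, double regressed, double fixed)
{
	return 100.0 * (regressed - fixed) / (regressed - baseline);
}
```

With these, slowdown_pct(6337, 7570) is about 19.5, recovered_pct(6337, 7570, 6611) is about 77.8, and slowdown_pct(6337, 6611) is about 4.3. In other words, 2.6.30-rc1 is roughly 19.5% slower per packet, forcing linear skbs wins back about 78% of that, and about 4.3% remains.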

Two points of interest here:
- It seems that linear skbs are generally faster.
  Would it make sense to make tun try to use linear skbs again,
  as it did before 2.6.29?

- The new code seems to introduce some measurable overhead.
  My understanding is that its main motivation is memory
  accounting - would it make sense to create a faster code path
  for the default case where accounting is disabled?

Thanks,

Comments

Michael S. Tsirkin April 16, 2009, 10:15 p.m. UTC | #1
On Fri, Apr 17, 2009 at 12:31:22AM +0300, Michael S. Tsirkin wrote:
> Hi,
> I have a simple test that sends 10K packets out of a tap device.  Average time
> needed to send a packet has gone up from 2.6.29 to 2.6.30-rc1.
> 
> 2.6.30-rc1:
> 
> #sh runsend
> time per packet:       7570 ns
> 
> 2.6.29:
> 
> #git checkout v2.6.29 -- drivers/net/tun.c
> #make modules modules_install
> #rmmod tun
> #sh runsend
> time per packet:       6337 ns
> 
> I note that before 2.6.29, all tun skbs would typically be linear,
> while in 2.6.30-rc1, skbs for packet size > 1 page would be paged.
> And I found this comment by Rusty (it appears in the comment for
> commit f42157cb568c1eb02eca7df4da67553a9edae24a):
> 
>     My original version of this patch always allocate paged skbs for big
>     packets.  But that made performance drop from 8.4 seconds to 8.8
>     seconds on 1G lguest->Host TCP xmit.  So now we only do that as a
>     fallback.
> 
> So just for fun, I did this:
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 37a5a04..1234d6b 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -520,7 +518,6 @@ static inline struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
>         int err;
> 
>         /* Under a page?  Don't bother with paged skb. */
> -       if (prepad + len < PAGE_SIZE)
>                 linear = len;
> 
>         skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> 
> This makes all skbs linear in tun. And now:
> 
> 2.6.30-rc1 made linear:
> #sh runsend
> time per packet:       6611 ns
> 
> Two points of interest here:
> - It seems that linear skbs are generally faster.
>   Would it make sense to make tun try to use linear skbs again,
>   as it did before 2.6.29?
> 
> - The new code seems to introduce some measurable overhead.
>   My understanding is that its main motivation is memory
>   accounting - would it make sense to create a faster code path
>   for the default case where accounting is disabled?

Continuing with the investigation, commenting out
atomic_inc_not_zero and atomic_dec_and_test in tun_get/tun_put
gets us back most of the rest of the performance:

# sh runsend
time per packet:       6461 ns

I was wondering whether the socket reference counting,
which is done anyway, can be reused in some way.
Ideas?
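
For readers unfamiliar with the two primitives named above, here is a userspace analogue using C11 atomics. It is a sketch of the semantics only, not the kernel's implementation, and the names ref_get_not_zero and ref_put_and_test are hypothetical. Each call is an atomic read-modify-write on a shared counter, which is presumably where the per-packet cost comes from.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference unless the count has already dropped to zero. */
bool ref_get_not_zero(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	while (old != 0) {
		/* On failure, `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; true means the caller released the last one. */
bool ref_put_and_test(atomic_int *cnt)
{
	return atomic_fetch_sub(cnt, 1) == 1;
}
```

A get/put pair around every packet therefore costs two atomic operations on the same counter, even when nothing else is contending for it.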

Herbert Xu April 16, 2009, 11:51 p.m. UTC | #2
On Fri, Apr 17, 2009 at 12:31:22AM +0300, Michael S. Tsirkin wrote:
> Hi,
> I have a simple test that sends 10K packets out of a tap device.  Average time
> needed to send a packet has gone up from 2.6.29 to 2.6.30-rc1.
> 
> 2.6.30-rc1:
> 
> #sh runsend
> time per packet:       7570 ns
> 
> 2.6.29:
> 
> #git checkout v2.6.29 -- drivers/net/tun.c
> #make modules modules_install
> #rmmod tun
> #sh runsend
> time per packet:       6337 ns
> 
> I note that before 2.6.29, all tun skbs would typically be linear,
> while in 2.6.30-rc1, skbs for packet size > 1 page would be paged.
> And I found this comment by Rusty (it appears in the comment for
> commit f42157cb568c1eb02eca7df4da67553a9edae24a):

Again this should already be fixed in the latest net-2.6.

Thanks,

Herbert Xu April 16, 2009, 11:57 p.m. UTC | #3
On Fri, Apr 17, 2009 at 01:15:05AM +0300, Michael S. Tsirkin wrote:
>
> Continuing with the investigation, commenting out
> atomic_inc_not_zero and atomic_dec_and_test in tun_get/tun_put
> gets us back most of the rest of the performance:

I'll try to think of a way to kill these ref counts, once I get
my other patch fixed :)

Cheers,

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 37a5a04..1234d6b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -520,7 +518,6 @@  static inline struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
        int err;

        /* Under a page?  Don't bother with paged skb. */
-       if (prepad + len < PAGE_SIZE)
                linear = len;

        skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
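
The effect of the hunk on the two sizes passed to sock_alloc_send_pskb can be modelled in userspace as below. This is a sketch under assumptions: struct skb_split, tun_split and the force_linear flag are names invented for the illustration, page_size stands in for PAGE_SIZE, and the incoming value of linear (not visible in the hunk) is taken as an input that does not exceed len.

```c
#include <stddef.h>

struct skb_split {
	size_t header_len;	/* bytes requested for the linear area */
	size_t data_len;	/* bytes left for page fragments */
};

/*
 * force_linear == 0 models the code before the patch,
 * force_linear != 0 models it with the `if` removed.
 * Assumes linear <= len.
 */
struct skb_split tun_split(size_t prepad, size_t len, size_t linear,
			   size_t page_size, int force_linear)
{
	struct skb_split s;

	/* Under a page?  Don't bother with paged skb. */
	if (force_linear || prepad + len < page_size)
		linear = len;
	s.header_len = prepad + linear;
	s.data_len = len - linear;
	return s;
}
```

For example, with prepad 10, linear 0 and a 4096-byte page, an 8000-byte packet gives header_len 10 and data_len 8000 before the patch, and header_len 8010 and data_len 0 with it. A 1000-byte packet is linear either way.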