Patchwork Patch for tbench regression.

Submitter Evgeniy Polyakov
Date Sept. 28, 2008, 9:15 p.m.
Message ID <20080928211530.GA9341@2ka.mipt.ru>
Permalink /patch/1843/
State RFC
Delegated to: David Miller

Comments

Evgeniy Polyakov - Sept. 28, 2008, 9:15 p.m.
Hi.

Attached patch fixes (at least partially) the tbench regressions reported
recently. I ran it on a 4-way machine and noticed more than 20%
performance degradation compared to the 2.6.22 kernel. Unfortunately all my
remote machines are now dead, stuck at various (apparently unbootable)
bisections, so I switched to a Xen domain, which only has 256 MB of
RAM and is generally very slow.

Because of that I was not able to run the 2.6.22 tree (compilation and git
operations take a really long time on this 'machine' and it is the middle of
the night in Moscow), but I tested it on 2.6.27-rc7 and was able to reach
performance higher than 2.6.26. According to my tests there were no
noticeable regressions in 2.6.24-2.6.26, so this patch should at least
fix the 2.6.26->2.6.27 one.

Idea is rather trivial: disable TSO and GSO on loopback. The latter was
actually enabled by the bisected commit e5a4a72d4f8, which enables GSO
unconditionally if the device supports scatter/gather and checksumming.
Apparently GSO packet construction has a bigger overhead than processing
smaller packets. I did not bisect the TSO-in-loopback patch, but concluded
it from the above (actually its gain is even bigger than that of GSO on an
SG device).
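
To make the feature-flag logic concrete, here is a minimal sketch of the check that the patch removes from register_netdevice(). The DEMO_F_* values and the demo_register_features() helper are hypothetical and exist only for illustration; the real NETIF_F_* constants are defined in the kernel headers and have different values.

```c
/* Hypothetical bit values for illustration only; the real NETIF_F_*
 * constants in the kernel headers differ from these. */
#define DEMO_F_SG  0x1u	/* scatter/gather */
#define DEMO_F_TSO 0x2u	/* TCP segmentation offload */
#define DEMO_F_GSO 0x4u	/* generic (software) segmentation offload */

/*
 * Models the block removed from register_netdevice(): when auto_gso is
 * non-zero, a device that advertises SG gets GSO switched on as well.
 * With auto_gso == 0 (the patched behaviour) the flags pass through
 * unchanged.
 */
unsigned int demo_register_features(unsigned int features, int auto_gso)
{
	if (auto_gso && (features & DEMO_F_SG))
		features |= DEMO_F_GSO;
	return features;
}
```

Calling demo_register_features(DEMO_F_SG, 1) returns the mask with DEMO_F_GSO added, while passing 0 as the second argument leaves the mask as the driver set it, which is what the patch does for every device.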

I will try to bring my test machines back tomorrow and run it there,
but the patch does fix the same regression when tested in a small Xen domain.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Herbert Xu - Sept. 29, 2008, 3:12 a.m.
On Mon, Sep 29, 2008 at 01:15:30AM +0400, Evgeniy Polyakov wrote:
>
> Idea is rather trivial: disable TSO and GSO on loopback. The latter was

Weird.  That's the very first thing I tried but for me it goes
the other way.

With TSO/GSO:

Throughput 169.19 MB/sec 1 procs

Without:

Throughput 24.0079 MB/sec 1 procs

Note that I only disabled TSO/GSO using ethtool:

etch1:~# ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
etch1:~# 

Can you see if reverting the patch but using ethtool gives you the
same results?

PS I'm using FV Xen.

Thanks,
Evgeniy Polyakov - Sept. 29, 2008, 5:36 a.m.
On Mon, Sep 29, 2008 at 11:12:44AM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Can you see if reverting the patch but using ethtool gives you the
> same results?

Yes, I disabled GSO and TSO on vanilla 2.6.27-rc7 via ethtool and
got the same results:

$ sudo ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

1 proc: 195 MB/sec
4 proc: 195 MB/sec

$ sudo ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on

1 proc: 190 MB/sec
4 proc: 189 MB/sec
Herbert Xu - Sept. 29, 2008, 5:40 a.m.
On Mon, Sep 29, 2008 at 09:36:06AM +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 29, 2008 at 11:12:44AM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > Can you see if reverting the patch but using ethtool gives you the
> > same results?
> 
> Yes, I disabled GSO and TSO on vanilla 2.6.27-rc7 via ethtool and
> got the same results:

Are you using Xen FV or PV? What processor?

Thanks,
Evgeniy Polyakov - Sept. 29, 2008, 5:52 a.m.
On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Are you using Xen FV or PV? What processor?

Can I determine that from within the guest?
I do not think so, except by looking at the cpu flags?

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz
stepping	: 11
cpu MHz		: 2333.406
cache size	: 4096 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca
cmov pat clflush dts mmx fxsr sse sse2 ss pbe syscall lm constant_tsc up
pebs bts pni ds_cpl ssse3 cx16 xtpr lahf_lm
bogomips	: 4666.81
clflush size	: 64
power management:

Actually it is a bit strange, if full or paravirtualization affects how
loopback network works, I thought it should only affect communication
between domains?
Herbert Xu - Sept. 29, 2008, 6:40 a.m.
On Mon, Sep 29, 2008 at 09:52:03AM +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > Are you using Xen FV or PV? What processor?
> 
> Can I determine that within guest?

cat /proc/interrupts

should do the trick

> Actually it is a bit strange, if full or paravirtualization affects how
> loopback network works, I thought it should only affect communication
> between domains?

It also affects the MMU and other things.

Cheers,
Evgeniy Polyakov - Sept. 29, 2008, 6:45 a.m.
On Mon, Sep 29, 2008 at 02:40:06PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> On Mon, Sep 29, 2008 at 09:52:03AM +0400, Evgeniy Polyakov wrote:
> > On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > Are you using Xen FV or PV? What processor?
> > 
> > Can I determine that within guest?
> 
> cat /proc/interrupts
> 
> should do the trick

That's what I have:

$ cat /proc/interrupts 
CPU0       
0:         71    XT-PIC-XT        timer
1:          8    XT-PIC-XT        i8042
2:          0    XT-PIC-XT        cascade
5:      17606    XT-PIC-XT        eth0
12:          5    XT-PIC-XT        i8042
14:       6925    XT-PIC-XT        ide0
15:        141    XT-PIC-XT        ide1
NMI:          0   Non-maskable interrupts
LOC:    1190668   Local timer interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
SPU:          0   Spurious interrupts
ERR:          0
MIS:          0


> > Actually it is a bit strange, if full or paravirtualization affects how
> > loopback network works, I thought it should only affect communication
> > between domains?
> 
> It also affects the MMU and other things.

Shouldn't tests over loopback be like lots of memcpy in the userspace
process? Usually its performance is close enough to the kernel's range,
despite very different sizes of TLB entries.
Herbert Xu - Sept. 29, 2008, 7:02 a.m.
On Mon, Sep 29, 2008 at 10:45:18AM +0400, Evgeniy Polyakov wrote:
>
> $ cat /proc/interrupts 
> CPU0       
> 0:         71    XT-PIC-XT        timer
> 1:          8    XT-PIC-XT        i8042
> 2:          0    XT-PIC-XT        cascade
> 5:      17606    XT-PIC-XT        eth0
> 12:          5    XT-PIC-XT        i8042
> 14:       6925    XT-PIC-XT        ide0
> 15:        141    XT-PIC-XT        ide1
> NMI:          0   Non-maskable interrupts
> LOC:    1190668   Local timer interrupts
> RES:          0   Rescheduling interrupts
> CAL:          0   function call interrupts
> TLB:          0   TLB shootdowns
> TRM:          0   Thermal event interrupts
> SPU:          0   Spurious interrupts
> ERR:          0
> MIS:          0

OK you're on FV as well.  I'll try it on my laptop next.

> Shouldn't tests over loopback be like lots of memcpy in the userspace
> process? Usually its performance is close enough to the kernel's range,
> despite very different sizes of TLB entries.

Where it may differ is when you have context switches.

Cheers,
Evgeniy Polyakov - Sept. 29, 2008, 7:11 a.m.
On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > $ cat /proc/interrupts 
> > CPU0       
> > 0:         71    XT-PIC-XT        timer
> > 1:          8    XT-PIC-XT        i8042
> > 2:          0    XT-PIC-XT        cascade
> > 5:      17606    XT-PIC-XT        eth0
> > 12:          5    XT-PIC-XT        i8042
> > 14:       6925    XT-PIC-XT        ide0
> > 15:        141    XT-PIC-XT        ide1
> > NMI:          0   Non-maskable interrupts
> > LOC:    1190668   Local timer interrupts
> > RES:          0   Rescheduling interrupts
> > CAL:          0   function call interrupts
> > TLB:          0   TLB shootdowns
> > TRM:          0   Thermal event interrupts
> > SPU:          0   Spurious interrupts
> > ERR:          0
> > MIS:          0
> 
> OK you're on FV as well.  I'll try it on my laptop next.

How did you find that? :)

> > Shouldn't tests over loopback be like lots of memcpy in the userspace
> > process? Usually its performance is close enough to the kernel's range,
> > despite very different sizes of TLB entries.
> 
> Where it may differ is when you have context switches.

Yes, of course, even a single empty syscall may potentially force the process
to be scheduled away, but still the performance will not differ by a 24/190
ratio... Weird.
Herbert Xu - Sept. 29, 2008, 7:20 a.m.
On Mon, Sep 29, 2008 at 11:11:19AM +0400, Evgeniy Polyakov wrote:
>
> How did you find that? :)

Because your PIC looks normal.  You'll know when you're in PV
Xen because it looks like this:

           CPU0              CPU1              
  1:          2          0        Phys-irq  i8042
  7:          0          0        Phys-irq  parport0
  8:          0          0        Phys-irq  rtc
  9:          0          0        Phys-irq  acpi
 12:          4          0        Phys-irq  i8042
 14:   14710832   47935559        Phys-irq  ide0
 15:   11484815   42035535        Phys-irq  ide1
 16:          0          0        Phys-irq  uhci_hcd:usb1
 17:          0          0        Phys-irq  libata
 18:          0          0        Phys-irq  uhci_hcd:usb5, ehci_hcd:usb6
 19:        212         90        Phys-irq  uhci_hcd:usb4, libata
 20:          0          0        Phys-irq  uhci_hcd:usb2
 21:          2          0        Phys-irq  uhci_hcd:usb3, ehci_hcd:usb7
 22: 1038949859          0        Phys-irq  peth0
 23:        216          0        Phys-irq  HDA Intel
256: 1006488099          0     Dynamic-irq  timer0
257:   64033545          0     Dynamic-irq  resched0
258:         98          0     Dynamic-irq  callfunc0
259:          0   85739450     Dynamic-irq  resched1
260:          0        183     Dynamic-irq  callfunc1
261:          0  271431605     Dynamic-irq  timer1
262:       6778      16816     Dynamic-irq  xenbus
263:          0          0     Dynamic-irq  console
NMI:          0          0 
LOC:          0          0 
ERR:          0
MIS:          0
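
The check described above can be sketched as a small heuristic. It assumes that the "Phys-irq" and "Dynamic-irq" controller names seen in this listing are what distinguishes a PV guest, which is only what this thread shows and may not hold for other Xen versions. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Heuristic from this thread only: PV Xen listings show "Phys-irq" and
 * "Dynamic-irq" as interrupt controller names, while an FV guest shows
 * ordinary ones such as XT-PIC-XT. Returns 1 for "looks like PV", else 0.
 */
int looks_like_pv_xen(const char *text)
{
	assert(text != NULL);
	return strstr(text, "Phys-irq") != NULL ||
	       strstr(text, "Dynamic-irq") != NULL;
}

/*
 * Applies the heuristic line by line to a file such as /proc/interrupts.
 * Returns 1 or 0, or -1 if the file cannot be opened.
 */
int file_looks_like_pv_xen(const char *path)
{
	char line[512];
	int found = 0;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	while (!found && fgets(line, sizeof(line), f) != NULL)
		found = looks_like_pv_xen(line);
	fclose(f);
	return found;
}
```

Inside a guest, file_looks_like_pv_xen("/proc/interrupts") would give the same answer as reading the listing by eye.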

> Yes, of course, even a single empty syscall may potentially force the process
> to be scheduled away, but still the performance will not differ by a 24/190
> ratio... Weird.

I don't think PV/FV is the issue anyway since we're both on FV :)

Cheers,
Herbert Xu - Sept. 29, 2008, 9:43 a.m.
On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu wrote:
>
> OK you're on FV as well.  I'll try it on my laptop next.

Interesting.  On my laptop I'm seeing 113MB/s with TSO on and
119MB/s without.  However, netperf shows TSO on is slightly
better than TSO off (4716Mb/s vs. 4680Mb/s).

A packet dump on tbench indicates it's sending lots of small
packets so TSO wouldn't be contributing anything positive at
all.  Hmm, it seems that even netperf isn't producing anything
larger than MTU so something seems amiss with the TCP stack.

BTW what were the actual numbers on your machine with tbench?
And what about netperf?

Cheers,
Herbert Xu - Sept. 29, 2008, 9:51 a.m.
On Mon, Sep 29, 2008 at 05:43:52PM +0800, Herbert Xu wrote:
> 
> A packet dump on tbench indicates it's sending lots of small
> packets so TSO wouldn't be contributing anything positive at
> all.  Hmm, it seems that even netperf isn't producing anything
> larger than MTU so something seems amiss with the TCP stack.

It seems that netperf is issuing 16384-byte writes and as such
we're sending the packets out immediately so TSO doesn't get a
chance to merge the data.  Running netperf with -m 65536 makes
TSO beat non-TSO by 6293Mb/s to 4761Mb/s.
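
A rough way to see why the write size matters is to count how many MSS-sized segments one write can fill. The sketch below is arithmetic only and does not model what the TCP stack actually decides; the 16384-byte MSS used in the note after it is an assumption (a 16436-byte loopback MTU minus 40 bytes of IP and TCP headers and 12 bytes of timestamp option), not a number taken from this thread, and segments_per_write() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of MSS-sized segments needed to carry one write of write_size
 * bytes (rounded up). When this is 1, a single write gives segmentation
 * offload nothing to merge.
 */
size_t segments_per_write(size_t write_size, size_t mss)
{
	assert(mss > 0);
	return (write_size + mss - 1) / mss;
}
```

Under that assumed MSS, a 16384-byte write fills exactly one segment, whereas a 65536-byte write (the -m 65536 case) covers four, so only the larger write leaves room for one big TSO packet.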

Cheers,
Evgeniy Polyakov - Sept. 29, 2008, 10:34 a.m.
On Mon, Sep 29, 2008 at 05:43:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> BTW what were the actual numbers on your machine with tbench?

Do you mean real machine and not xen domain?
Here are the results:
==> /tmp/tbench/tbench-2.6.22-mainline-slab <==
Throughput 479.82 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.23-mainline <==
Throughput 454.36 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.24 <==
Throughput 399.912 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.25 <==
Throughput 391.59 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.26-mainline-slub <==
Throughput 398.508 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.27-rc7-mainline-slab <==
Throughput 366.046 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.27-rc7-mainline-slub <==
Throughput 360.78 MB/sec 8 procs
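
For comparing these figures, the relative drop can be computed as below. This is a minimal sketch; pct_drop() is a hypothetical helper and not part of tbench.

```c
#include <assert.h>

/*
 * Percentage by which current falls short of baseline.
 * A positive result means a regression, a negative one an improvement.
 */
double pct_drop(double baseline, double current)
{
	assert(baseline > 0.0);
	return (baseline - current) / baseline * 100.0;
}
```

With the numbers above, pct_drop(479.82, 360.78) comes to roughly 24.8% for 2.6.22 (slab) against 2.6.27-rc7 (slub), in line with the "more than 20%" degradation mentioned at the start of the thread.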

> And what about netperf?

Machines are dead right now, since apparently all bisections I tried are
unbootable (I gathered only one dump, in
e1000_watchdog+0x25/__netif_schedule+0xa), so I cannot test anything
right now, but I can run netperf when they are alive again.
Christoph Lameter - Sept. 29, 2008, 1:34 p.m.
I did some graphs of the time that each pass through the network stack takes,
and from what I see we still have measurements where 2.6.27 reaches the same
optimal performance that 2.6.22 does. However, the time differentials vary
much more with 2.6.27. The 2.6.22 timing results cluster around the minimum.
2.6.27 has periods where the time increases by around 30usec, and it dips down
repeatedly to 2.6.22 performance.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - Sept. 29, 2008, 5:01 p.m.
Herbert Xu wrote:
> On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu wrote:
> 
>>OK you're on FV as well.  I'll try it on my laptop next.
> 
> 
> Interesting.  On my laptop I'm seeing 113MB/s with TSO on and
> 119MB/s without.  However, netperf shows TSO on is slightly
> better than TSO off (4716Mb/s vs. 4680Mb/s).
> 
> A packet dump on tbench indicates it's sending lots of small
> packets so TSO wouldn't be contributing anything positive at
> all.  Hmm, it seems that even netperf isn't producing anything
> larger than MTU so something seems amiss with the TCP stack.

What is netperf reporting as its send "message" (misnomer) size? (I'm 
assuming this is a TCP_STREAM test rather than a TCP_RR or other test?)

rick jones
Rick Jones - Sept. 29, 2008, 5:07 p.m.
Sigh - email reading timing....  anyway

> It seems that netperf is issuing 16384-byte writes and as such
> we're sending the packets out immediately so TSO doesn't get a
> chance to merge the data.  Running netperf with -m 65536 makes
> TSO beat non-TSO by 6293Mb/s to 4761Mb/s.

By default, netperf's TCP_STREAM test will use whatever 
getsockopt(SO_SNDBUF) reports just after the data socket is created. 
The choice was completely arbitrary and is buried deep in the history of 
netperf.
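
To see what that default would be on a given system, the value can be queried directly. This is a minimal sketch using the standard sockets API; default_tcp_sndbuf() is a hypothetical helper and is not how netperf itself is structured.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Returns the SO_SNDBUF value reported for a freshly created TCP socket,
 * or -1 if the socket cannot be created or queried.
 */
int default_tcp_sndbuf(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int size = 0;
	socklen_t len = sizeof(size);

	if (fd < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
		size = -1;
	close(fd);
	return size;
}
```

On a system where this returns 16384, the default send size would match the 16384-byte writes quoted above; other systems may well report something different.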

For evaluating changes, it would probably be a good idea to test a 
number of settings for the test-specific -m option.  Of course I have no 
good idea what those values should be.  There is the tcp_range_script 
(might be a bit dusty today) but those values are again pretty arbitrary.

It would probably be a good idea to include the TCP_RR test.

happy benchmarking,

rick jones

As an aside - I would be interested in hearing people's opinions 
(offline) on a future version of netperf possibly violating the 
principle of least surprise and automatically including CPU utilization 
if the code is running on a system which does not require calibration...

Patch

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 3b43bfd..a22ae35 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -169,7 +169,6 @@  static void loopback_setup(struct net_device *dev)
 	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
 	dev->flags		= IFF_LOOPBACK;
 	dev->features 		= NETIF_F_SG | NETIF_F_FRAGLIST
-		| NETIF_F_TSO
 		| NETIF_F_NO_CSUM
 		| NETIF_F_HIGHDMA
 		| NETIF_F_LLTX
diff --git a/net/core/dev.c b/net/core/dev.c
index e8eb2b4..dddb5c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4003,10 +4003,6 @@  int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	/* Enable software GSO if SG is supported. */
-	if (dev->features & NETIF_F_SG)
-		dev->features |= NETIF_F_GSO;
-
 	netdev_initialize_kobject(dev);
 	ret = netdev_register_kobject(dev);
 	if (ret)