Message ID | 20080928211530.GA9341@2ka.mipt.ru
---|---
State | RFC, archived
Delegated to | David Miller
On Mon, Sep 29, 2008 at 01:15:30AM +0400, Evgeniy Polyakov wrote:
>
> Idea is rather trivial: disable TSO and GSO on loopback. The latter was

Weird. That's the very first thing I tried, but for me it goes the
other way.

With TSO/GSO:

Throughput 169.19 MB/sec 1 procs

Without:

Throughput 24.0079 MB/sec 1 procs

Note that I only disabled TSO/GSO using ethtool:

etch1:~# ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
etch1:~#

Can you see if reverting the patch but using ethtool gives you the
same results?

PS I'm using FV Xen.

Thanks,
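For reference, ethtool's offload toggles boil down to the SIOCETHTOOL
ioctl. Below is a minimal sketch of disabling TSO and GSO on lo
programmatically, assuming the 2.6-era ETHTOOL_STSO/ETHTOOL_SGSO
commands; it is an illustration of the mechanism, not part of the
test setup used in this thread:

/* Sketch: the programmatic equivalent of "ethtool -K lo tso off gso off",
 * assuming the 2.6-era ethtool_value interface. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_offload(int fd, struct ifreq *ifr, __u32 cmd, __u32 on)
{
	/* ethtool_value carries one on/off setting per SIOCETHTOOL call */
	struct ethtool_value ev = { .cmd = cmd, .data = on };

	ifr->ifr_data = (void *)&ev;
	return ioctl(fd, SIOCETHTOOL, ifr);
}

int main(void)
{
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);

	if (set_offload(fd, &ifr, ETHTOOL_STSO, 0) ||
	    set_offload(fd, &ifr, ETHTOOL_SGSO, 0))
		perror("SIOCETHTOOL");
	return 0;
}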
On Mon, Sep 29, 2008 at 11:12:44AM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Can you see if reverting the patch but using ethtool gives you the
> same results?

Yes, I disabled GSO and TSO on vanilla 2.6.27-rc7 via ethtool and
got the same results:

$ sudo ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

1 proc: 195 MB/sec
4 procs: 195 MB/sec

$ sudo ethtool -k lo
Offload parameters for lo:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on

1 proc: 190 MB/sec
4 procs: 189 MB/sec
On Mon, Sep 29, 2008 at 09:36:06AM +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 29, 2008 at 11:12:44AM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > Can you see if reverting the patch but using ethtool gives you the
> > same results?
>
> Yes, I disabled GSO and TSO on vanilla 2.6.27-rc7 via ethtool and
> got the same results:

Are you using Xen FV or PV? What processor?

Thanks,
On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Are you using Xen FV or PV? What processor?
Can I determine that from within the guest?
I do not think so, except by looking at the CPU flags:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 11
cpu MHz : 2333.406
cache size : 4096 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca cmov pat clflush dts mmx fxsr sse sse2 ss pbe syscall lm constant_tsc up pebs bts pni ds_cpl ssse3 cx16 xtpr lahf_lm
bogomips : 4666.81
clflush size : 64
power management:
Actually it is a bit strange if full or paravirtualization affects
how loopback networking works; I thought it should only affect
communication between domains?
On Mon, Sep 29, 2008 at 09:52:03AM +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > Are you using Xen FV or PV? What processor?
>
> Can I determine that from within the guest?

cat /proc/interrupts

should do the trick.

> Actually it is a bit strange if full or paravirtualization affects
> how loopback networking works; I thought it should only affect
> communication between domains?

It also affects the MMU and other things.

Cheers,
On Mon, Sep 29, 2008 at 02:40:06PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> On Mon, Sep 29, 2008 at 09:52:03AM +0400, Evgeniy Polyakov wrote:
> > On Mon, Sep 29, 2008 at 01:40:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > Are you using Xen FV or PV? What processor?
> >
> > Can I determine that from within the guest?
>
> cat /proc/interrupts
>
> should do the trick.

That's what I have:

$ cat /proc/interrupts
           CPU0
  0:         71    XT-PIC-XT    timer
  1:          8    XT-PIC-XT    i8042
  2:          0    XT-PIC-XT    cascade
  5:      17606    XT-PIC-XT    eth0
 12:          5    XT-PIC-XT    i8042
 14:       6925    XT-PIC-XT    ide0
 15:        141    XT-PIC-XT    ide1
NMI:          0    Non-maskable interrupts
LOC:    1190668    Local timer interrupts
RES:          0    Rescheduling interrupts
CAL:          0    function call interrupts
TLB:          0    TLB shootdowns
TRM:          0    Thermal event interrupts
SPU:          0    Spurious interrupts
ERR:          0
MIS:          0

> > Actually it is a bit strange if full or paravirtualization affects
> > how loopback networking works; I thought it should only affect
> > communication between domains?
>
> It also affects the MMU and other things.

Shouldn't tests over loopback be essentially lots of memcpy in the
userspace process? Usually its performance is close enough to the
kernel's range, despite the very different sizes of TLB entries.
On Mon, Sep 29, 2008 at 10:45:18AM +0400, Evgeniy Polyakov wrote:
>
> $ cat /proc/interrupts
>            CPU0
>   0:         71    XT-PIC-XT    timer
>   1:          8    XT-PIC-XT    i8042
>   2:          0    XT-PIC-XT    cascade
>   5:      17606    XT-PIC-XT    eth0
>  12:          5    XT-PIC-XT    i8042
>  14:       6925    XT-PIC-XT    ide0
>  15:        141    XT-PIC-XT    ide1
> NMI:          0    Non-maskable interrupts
> LOC:    1190668    Local timer interrupts
> RES:          0    Rescheduling interrupts
> CAL:          0    function call interrupts
> TLB:          0    TLB shootdowns
> TRM:          0    Thermal event interrupts
> SPU:          0    Spurious interrupts
> ERR:          0
> MIS:          0

OK, you're on FV as well. I'll try it on my laptop next.

> Shouldn't tests over loopback be essentially lots of memcpy in the
> userspace process? Usually its performance is close enough to the
> kernel's range, despite the very different sizes of TLB entries.

Where it may differ is when you have context switches.

Cheers,
On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > $ cat /proc/interrupts
> >            CPU0
> >   0:         71    XT-PIC-XT    timer
> >   1:          8    XT-PIC-XT    i8042
> >   2:          0    XT-PIC-XT    cascade
> >   5:      17606    XT-PIC-XT    eth0
> >  12:          5    XT-PIC-XT    i8042
> >  14:       6925    XT-PIC-XT    ide0
> >  15:        141    XT-PIC-XT    ide1
> > NMI:          0    Non-maskable interrupts
> > LOC:    1190668    Local timer interrupts
> > RES:          0    Rescheduling interrupts
> > CAL:          0    function call interrupts
> > TLB:          0    TLB shootdowns
> > TRM:          0    Thermal event interrupts
> > SPU:          0    Spurious interrupts
> > ERR:          0
> > MIS:          0
>
> OK, you're on FV as well. I'll try it on my laptop next.

How did you find that? :)

> > Shouldn't tests over loopback be essentially lots of memcpy in the
> > userspace process? Usually its performance is close enough to the
> > kernel's range, despite the very different sizes of TLB entries.
>
> Where it may differ is when you have context switches.

Yes, of course, even a single empty syscall may potentially force the
process to be scheduled away, but performance still would not differ
by a 24/190 ratio... Weird.
On Mon, Sep 29, 2008 at 11:11:19AM +0400, Evgeniy Polyakov wrote:
>
> How did you find that? :)

Because your PIC looks normal. You'll know when you're in PV Xen
because it looks like this:

           CPU0         CPU1
  1:          2            0    Phys-irq     i8042
  7:          0            0    Phys-irq     parport0
  8:          0            0    Phys-irq     rtc
  9:          0            0    Phys-irq     acpi
 12:          4            0    Phys-irq     i8042
 14:   14710832     47935559    Phys-irq     ide0
 15:   11484815     42035535    Phys-irq     ide1
 16:          0            0    Phys-irq     uhci_hcd:usb1
 17:          0            0    Phys-irq     libata
 18:          0            0    Phys-irq     uhci_hcd:usb5, ehci_hcd:usb6
 19:        212           90    Phys-irq     uhci_hcd:usb4, libata
 20:          0            0    Phys-irq     uhci_hcd:usb2
 21:          2            0    Phys-irq     uhci_hcd:usb3, ehci_hcd:usb7
 22: 1038949859            0    Phys-irq     peth0
 23:        216            0    Phys-irq     HDA Intel
256: 1006488099            0    Dynamic-irq  timer0
257:   64033545            0    Dynamic-irq  resched0
258:         98            0    Dynamic-irq  callfunc0
259:          0     85739450    Dynamic-irq  resched1
260:          0          183    Dynamic-irq  callfunc1
261:          0    271431605    Dynamic-irq  timer1
262:       6778        16816    Dynamic-irq  xenbus
263:          0            0    Dynamic-irq  console
NMI:          0            0
LOC:          0            0
ERR:          0
MIS:          0

> Yes, of course, even a single empty syscall may potentially force the
> process to be scheduled away, but performance still would not differ
> by a 24/190 ratio... Weird.

I don't think PV/FV is the issue anyway, since we're both on FV :)

Cheers,
On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu wrote:
>
> OK, you're on FV as well. I'll try it on my laptop next.

Interesting. On my laptop I'm seeing 113MB/s with TSO on and 119MB/s
without. However, netperf shows TSO on is slightly better than TSO
off (4716Mb/s vs. 4680Mb/s).

A packet dump on tbench indicates it's sending lots of small packets,
so TSO wouldn't be contributing anything positive at all. Hmm, it
seems that even netperf isn't producing anything larger than the MTU,
so something seems amiss with the TCP stack.

BTW, what were the actual numbers on your machine with tbench? And
what about netperf?

Cheers,
On Mon, Sep 29, 2008 at 05:43:52PM +0800, Herbert Xu wrote:
>
> A packet dump on tbench indicates it's sending lots of small packets,
> so TSO wouldn't be contributing anything positive at all. Hmm, it
> seems that even netperf isn't producing anything larger than the MTU,
> so something seems amiss with the TCP stack.

It seems that netperf is issuing 16384-byte writes, and as such we're
sending the packets out immediately, so TSO doesn't get a chance to
merge the data. Running netperf with -m 65536 makes TSO beat non-TSO
by 6293Mb/s to 4761Mb/s.

Cheers,
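To make the write-size effect concrete, here is a minimal bulk-sender
sketch (not netperf itself) where the per-write size plays the role of
netperf's -m option; the 127.0.0.1:9999 listener, iteration count, and
default size are illustrative assumptions:

/* Sketch: a bulk TCP sender whose write size decides how much data the
 * stack can coalesce into one TSO super-packet per pass.  Small writes
 * go out immediately; 64KB writes give TSO something to merge. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t msg = argc > 1 ? atoi(argv[1]) : 16384;	/* -m analogue */
	char *buf = calloc(1, msg);
	struct sockaddr_in sa = { .sin_family = AF_INET,
				  .sin_port = htons(9999) };
	int i, s = socket(AF_INET, SOCK_STREAM, 0);

	inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
	if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("connect");
		return 1;
	}
	for (i = 0; i < 100000; i++)
		if (write(s, buf, msg) < 0) {
			perror("write");
			break;
		}
	close(s);
	return 0;
}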
On Mon, Sep 29, 2008 at 05:43:52PM +0800, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> BTW, what were the actual numbers on your machine with tbench?

Do you mean a real machine and not a Xen domain? Here are the results:

==> /tmp/tbench/tbench-2.6.22-mainline-slab <==
Throughput 479.82 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.23-mainline <==
Throughput 454.36 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.24 <==
Throughput 399.912 MB/sec 8 procs

==> /tmp/tbench/tbench-2.6.25 <==
Throughput 391.59 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.26-mainline-slub <==
Throughput 398.508 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.27-rc7-mainline-slab <==
Throughput 366.046 MB/sec 8 procs

==> /tmp/tbench/tbench-8-2.6.27-rc7-mainline-slub <==
Throughput 360.78 MB/sec 8 procs

> And what about netperf?

The machines are dead right now, since apparently all the bisections
I tried are unbootable (I gathered only one dump, in
e1000_watchdog+0x25/__netif_schedule+0xa), so I cannot test anything
right now, but I can run netperf when they are alive again.
I did some graphs of the time that each pass through the network
stack takes, and from what I see we still have measurements where
2.6.27 gets the same optimal performance that 2.6.22 does. However,
the time differentials vary much more with 2.6.27: the 2.6.22 times
cluster around the minimum, while 2.6.27 has periods where the time
increases by around 30 usec, repeatedly dipping back down to 2.6.22
performance.
Herbert Xu wrote:
> On Mon, Sep 29, 2008 at 03:02:13PM +0800, Herbert Xu wrote:
>
> > OK, you're on FV as well. I'll try it on my laptop next.
>
> Interesting. On my laptop I'm seeing 113MB/s with TSO on and 119MB/s
> without. However, netperf shows TSO on is slightly better than TSO
> off (4716Mb/s vs. 4680Mb/s).
>
> A packet dump on tbench indicates it's sending lots of small packets,
> so TSO wouldn't be contributing anything positive at all. Hmm, it
> seems that even netperf isn't producing anything larger than the MTU,
> so something seems amiss with the TCP stack.

What is netperf reporting as its send "message" (misnomer) size?
(I'm assuming this is a TCP_STREAM test rather than a TCP_RR or other
test?)

rick jones
Sigh - email reading timing.... anyway

> It seems that netperf is issuing 16384-byte writes, and as such we're
> sending the packets out immediately, so TSO doesn't get a chance to
> merge the data. Running netperf with -m 65536 makes TSO beat non-TSO
> by 6293Mb/s to 4761Mb/s.

By default, netperf's TCP_STREAM test will use whatever
getsockopt(SO_SNDBUF) reports just after the data socket is created.
The choice was completely arbitrary and buried deep in the history of
netperf.

For evaluating changes, it would probably be a good idea to test a
number of settings for the test-specific -m option. Of course I have
no good idea what those values should be. There is the
tcp_range_script (might be a bit dusty today), but those values are
again pretty arbitrary. It would probably be a good idea to include
the TCP_RR test.

happy benchmarking,

rick jones

As an aside - I would be interested in hearing peoples' opinions
(offline) on a future version of netperf possibly violating the
principle of least surprise and automatically including CPU
utilization if the code is running on a system which does not require
calibration...
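The default described above is easy to inspect: the value netperf
picks up is just what getsockopt(SO_SNDBUF) returns on a freshly
created socket. A minimal sketch (standalone, not netperf code):

/* Sketch: print the default SO_SNDBUF that a TCP_STREAM test would
 * inherit as its send size when no -m option is given. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int sndbuf = 0;
	socklen_t len = sizeof(sndbuf);
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
		printf("default SO_SNDBUF: %d bytes\n", sndbuf);
	return 0;
}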
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 3b43bfd..a22ae35 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -169,7 +169,6 @@ static void loopback_setup(struct net_device *dev)
 	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
 	dev->flags		= IFF_LOOPBACK;
 	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
-				  | NETIF_F_TSO
 				  | NETIF_F_NO_CSUM
 				  | NETIF_F_HIGHDMA
 				  | NETIF_F_LLTX
diff --git a/net/core/dev.c b/net/core/dev.c
index e8eb2b4..dddb5c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4003,10 +4003,6 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	/* Enable software GSO if SG is supported. */
-	if (dev->features & NETIF_F_SG)
-		dev->features |= NETIF_F_GSO;
-
 	netdev_initialize_kobject(dev);
 	ret = netdev_register_kobject(dev);
 	if (ret)
Hi.

The attached patch fixes (at least partially) the tbench regressions
reported recently. I ran it on a 4-way machine and noticed more than
20% performance degradation compared to the 2.6.22 kernel.

Unfortunately all my remote machines are now stuck dead at various
(apparently unbootable) bisections, so I switched to the Xen domain,
which only has 256 MB of RAM and is generally very slow. Because of
that I was not able to run a 2.6.22 tree (compilation and git
operations take a really long time on this 'machine', and it is the
middle of the night in Moscow), but I tested it on 2.6.27-rc7 and was
able to reach performance higher than 2.6.26's. According to my tests
there were no noticeable regressions in 2.6.24-2.6.26, so this patch
should at least fix the 2.6.26->2.6.27 one.

The idea is rather trivial: disable TSO and GSO on loopback. The
latter was actually enabled by the bisected commit e5a4a72d4f8, which
enables GSO unconditionally if the device supports scatter/gather and
checksumming. Apparently GSO packet construction has bigger overhead
than the gain from processing fewer packets. I did not bisect the
TSO-on-loopback change, but concluded it from the above (actually its
gain is even bigger than that of GSO on an SG device).

I will try to bring my test machines back tomorrow and run the patch
there, but it does fix the same regression tested in the small Xen
domain.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>