Message ID:   20090427080501.GA21433@gondor.apana.org.au
State:        Accepted, archived
Delegated to: David Miller
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 27 Apr 2009 16:05:01 +0800

> gro: Fix handling of headers that extend over the tail
>
> The skb_gro_* code fails to handle the case where a header starts
> in the linear area but ends in the frags area. Since the goal
> of skb_gro_* is to optimise the case of completely non-linear
> packets, we can simply bail out if we have anything in the linear
> area.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
> >
> > These results are indeed quite close, so the performance problem seems
> > isolated to AMD CPUs, and perhaps due to the smaller caches.
> > Do you have any AMD you can use as a receiver?
>
> I now have an AMD with 512K cache to test this. Unfortunately
> I'd just locked it up before I got a chance to do any serious
> testing. So it might take a while.

OK, that's been fixed up. Indeed the AMD can't do wire speed.
But still the performance seems comparable. Both of them sit
between 6600Mb/s and 7100Mb/s. The sender is running at about
66% idle in either case.

Cheers,
Herbert Xu wrote:
> On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>>> These results are indeed quite close, so the performance problem seems
>>> isolated to AMD CPUs, and perhaps due to the smaller caches.
>>> Do you have any AMD you can use as a receiver?
>> I now have an AMD with 512K cache to test this. Unfortunately
>> I'd just locked it up before I got a chance to do any serious
>> testing. So it might take a while.
>
> OK, that's been fixed up. Indeed the AMD can't do wire speed.
> But still the performance seems comparable. Both of them sit
> between 6600Mb/s and 7100Mb/s. The sender is running at about
> 66% idle in either case.

It's strange: I still consistently see about 1Gb/s better performance
from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
when binding everything to the same CPU. mpstat -P 0 shows roughly
10% more time spent in "soft" when using GRO vs LRO:

GRO:
10:17:45  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:17:46    0   0.00   0.00    54.00     0.00  0.00  46.00   0.00  11754.00
10:17:47    0   0.00   0.00    54.00     0.00  1.00  45.00   0.00  11718.00
10:17:48    0   0.00   0.00    47.00     0.00  2.00  51.00   0.00  11639.00

LRO:
10:21:55  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:21:56    0   0.00   0.00    66.00     0.00  1.00  33.00   0.00  13228.00
10:21:57    0   0.00   0.00    65.35     0.00  1.98  32.67   0.00  13118.81
10:21:58    0   0.00   0.00    63.00     0.00  1.00  36.00   0.00  13238.00

According to oprofile, the top 20 samples running GRO are:

CPU: AMD64 processors, speed 2050.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4382     30.5408  vmlinux      vmlinux    copy_user_generic_string
534       3.7218  myri10ge.ko  myri10ge   myri10ge_poll
463       3.2269  vmlinux      vmlinux    _raw_spin_lock
394       2.7460  vmlinux      vmlinux    rb_get_reader_page
382       2.6624  vmlinux      vmlinux    acpi_pm_read
356       2.4812  vmlinux      vmlinux    inet_gro_receive
293       2.0421  oprofiled    oprofiled  (no symbols)
268       1.8679  vmlinux      vmlinux    find_next_bit
268       1.8679  vmlinux      vmlinux    tg_shares_up
257       1.7912  vmlinux      vmlinux    ring_buffer_consume
247       1.7215  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
247       1.7215  vmlinux      vmlinux    tcp_gro_receive
228       1.5891  vmlinux      vmlinux    __free_pages_ok
219       1.5263  vmlinux      vmlinux    skb_gro_receive
167       1.1639  vmlinux      vmlinux    skb_gro_header
149       1.0385  bash         bash       (no symbols)
141       0.9827  vmlinux      vmlinux    skb_copy_datagram_iovec
132       0.9200  vmlinux      vmlinux    rb_buffer_peek
129       0.8991  vmlinux      vmlinux    _raw_spin_unlock
123       0.8573  vmlinux      vmlinux    delay_tsc

Nothing really stands out for me. Here is LRO:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name   symbol name
4884     33.1164  vmlinux      vmlinux    copy_user_generic_string
721       4.8888  myri10ge.ko  myri10ge   myri10ge_poll
580       3.9327  vmlinux      vmlinux    _raw_spin_lock
409       2.7733  vmlinux      vmlinux    acpi_pm_read
306       2.0749  vmlinux      vmlinux    rb_get_reader_page
293       1.9867  oprofiled    oprofiled  (no symbols)
286       1.9392  myri10ge.ko  myri10ge   myri10ge_get_frag_header
253       1.7155  vmlinux      vmlinux    __lro_proc_segment
250       1.6951  vmlinux      vmlinux    rb_buffer_peek
247       1.6748  vmlinux      vmlinux    ring_buffer_consume
232       1.5731  vmlinux      vmlinux    __free_pages_ok
211       1.4307  myri10ge.ko  myri10ge   myri10ge_alloc_rx_pages
206       1.3968  vmlinux      vmlinux    tg_shares_up
175       1.1866  vmlinux      vmlinux    skb_copy_datagram_iovec
158       1.0713  vmlinux      vmlinux    find_next_bit
146       0.9900  vmlinux      vmlinux    lro_tcp_ip_check
131       0.8883  oprofile.ko  oprofile   op_cpu_buffer_read_entry
127       0.8611  vmlinux      vmlinux    delay_tsc
125       0.8476  bash         bash       (no symbols)
125       0.8476  vmlinux      vmlinux    _raw_spin_unlock

If I can't figure out why LRO is so much faster in some cases, then I
think maybe I'll just put together a patch which keeps LRO, and does
GRO only if LRO is disabled. Kind of ugly, but better than losing
15% performance on some machines.

Drew
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 28 Apr 2009 11:00:16 -0400

> If I can't figure out why LRO is so much faster in some cases, then I
> think maybe I'll just put together a patch which keeps LRO, and does
> GRO only if LRO is disabled. Kind of ugly, but better than losing
> 15% performance on some machines.

I refuse to apply such a patch. Figure out this performance problem,
don't work around it.
On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
>
> It's strange: I still consistently see about 1Gb/s better performance
> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
> when binding everything to the same CPU. mpstat -P 0 shows roughly
> 10% more time spent in "soft" when using GRO vs LRO:

Did you check the utilisation of all the cores on the sender?

Cheers,
Herbert Xu wrote:
> On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
>> It's strange: I still consistently see about 1Gb/s better performance
>> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
>> when binding everything to the same CPU. mpstat -P 0 shows roughly
>> 10% more time spent in "soft" when using GRO vs LRO:
>
> Did you check the utilisation of all the cores on the sender?

Yes. It is about the same, +/- 2%. The utilization when sending to GRO
is a bit lower, but it's going slower.

Here is what might be more interesting: I'm trying to isolate the
softirq path in oprofile. So in this test, I bound the IRQ to CPU1
and the netserver to CPU0. In these tests, I see near line rate from
both LRO and GRO. Here is oprofile output separated by CPU (CPU0
samples and %, then CPU1 samples and %) and sorted on CPU1. (Sorry
about binding to CPU1 and making the output more confusing; I could
not get oprofile to emit samples when the irq was bound to CPU0.)
I've included the top 20 entries:

GRO:
  0  0       1414  15.8485  myri10ge.ko  myri10ge  myri10ge_poll
  0  0        932  10.4461  vmlinux      vmlinux   inet_gro_receive
  0  0        705   7.9018  vmlinux      vmlinux   tcp_gro_receive
  0  0        681   7.6328  vmlinux      vmlinux   skb_gro_receive
  0  0        652   7.3078  vmlinux      vmlinux   skb_gro_header
  0  0        517   5.7947  vmlinux      vmlinux   __napi_gro_receive
  0  0        316   3.5418  vmlinux      vmlinux   dev_gro_receive
  0  0        309   3.4633  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
415  3.1243   251   2.8133  vmlinux      vmlinux   _raw_spin_lock
  0  0        233   2.6115  vmlinux      vmlinux   napi_frags_skb
  0  0        178   1.9951  vmlinux      vmlinux   tcp4_gro_receive
306  2.3037   152   1.7037  vmlinux      vmlinux   rb_get_reader_page
  0  0        150   1.6812  vmlinux      vmlinux   napi_get_frags
188  1.4153   131   1.4683  vmlinux      vmlinux   rb_buffer_peek
195  1.4680   101   1.1320  vmlinux      vmlinux   ring_buffer_consume
  0  0         96   1.0760  vmlinux      vmlinux   ip_rcv_finish
  0  0         94   1.0536  vmlinux      vmlinux   napi_gro_frags
  0  0         92   1.0312  vmlinux      vmlinux   skb_copy_bits
  0  0         86   0.9639  vmlinux      vmlinux   napi_frags_finish
225  1.6939    85   0.9527  oprofile.ko  oprofile  op_cpu_buffer_read_entry

LRO:
  0  0       1937  15.1281  myri10ge.ko  myri10ge  myri10ge_poll
  0  0       1876  14.6517  myri10ge.ko  myri10ge  myri10ge_get_frag_header
  0  0        943   7.3649  vmlinux      vmlinux   __lro_proc_segment
  0  0        723   5.6467  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
  0  0        392   3.0615  vmlinux      vmlinux   lro_gen_skb
  0  0        369   2.8819  vmlinux      vmlinux   lro_tcp_ip_check
353  2.7435   357   2.7882  vmlinux      vmlinux   _raw_spin_lock
290  2.2538   328   2.5617  vmlinux      vmlinux   rb_get_reader_page
  4  0.0311   270   2.1087  vmlinux      vmlinux   csum_partial
 26  0.2021   214   1.6714  vmlinux      vmlinux   memset_c
  0  0        202   1.5776  vmlinux      vmlinux   lro_add_common
  8  0.0622   191   1.4917  vmlinux      vmlinux   __slab_alloc
  0  0        188   1.4683  vmlinux      vmlinux   ip_rcv_finish
 84  0.6528   183   1.4292  vmlinux      vmlinux   _raw_spin_unlock
  0  0        180   1.4058  vmlinux      vmlinux   lro_tcp_data_csum
  0  0        180   1.4058  vmlinux      vmlinux   lro_get_desc
167  1.2979   178   1.3902  vmlinux      vmlinux   ring_buffer_consume
  0  0        167   1.3043  vmlinux      vmlinux   netif_receive_skb
  0  0        143   1.1168  vmlinux      vmlinux   ip_route_input
  0  0        125   0.9763  vmlinux      vmlinux   __inet_lookup_established

Does anything strike you as being inordinately expensive for GRO?

Drew
For variety, I grabbed a different "slow" receiver. This is another
2 CPU machine, but a dual-socket, single-core Opteron (Tyan S2895):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 37
model name      : AMD Opteron(tm) Processor 252
stepping        : 1
cpu MHz         : 2611.738
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm
bogomips        : 5223.47
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

The sender was an identical machine running an ancient RHEL4 kernel
(2.6.9-42.ELsmp) and our downloadable (backported) driver
(http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz).
I disabled LRO on the sender.

Binding the IRQ to CPU0 and the netserver to CPU1, I see 8.1Gb/s with
LRO and 8.0Gb/s with GRO. Binding the IRQ to CPU0 and the netserver to
CPU0, I see 6.9Gb/s with LRO and 5.5Gb/s with GRO.

Monitoring the packet/byte counts on the interface once per second,
LRO looks like this:

 Ipkts     IBytes  Opkts  Obytes
588992  891733888   9758  644028
589610  892669540   9771  644886
589079  891865606   9754  643764

And GRO looks like this:

480309  727187826   7949  524634
480032  726768448   7947  524502
480000  726720000   7943  524238

Similarly, in this same scenario (app and irq bound to the same CPU),
mpstat -P 0 1 shows about 60% sys and 40% irq+softirq for LRO, while
GRO shows about 45% sys and 55% irq+softirq.

I can't put my finger on it, but something about GRO is certainly more
expensive on these types of machines. I wish there were some way you
could see it, since it happens on every older AMD I try it on. If you
haven't been able to reproduce it, I'll see if I can make it happen on
a newer "slow" amd64 box I have tomorrow.

Drew
Andrew Gallatin wrote:
> For variety, I grabbed a different "slow" receiver. This is another
> 2 CPU machine, but a dual-socket single-core Opteron (Tyan S2895)
>
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 37
> model name      : AMD Opteron(tm) Processor 252
<...>
> The sender was an identical machine running an ancient RHEL4 kernel
> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
> I disabled LRO on the sender.
>
> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
> LRO and 8.0Gb/s with GRO.

With the recent patch to fix idle CPU time accounting from LKML applied,
it is again possible to trust netperf's service demand (based on %CPU).
So here is raw netperf output for LRO and GRO, bound as above.

TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind

Recv   Send   Send                      Utilization    Service Demand
Socket Socket Message  Elapsed          Send   Recv    Send    Recv
Size   Size   Size     Time Throughput  local  remote  local   remote
bytes  bytes  bytes    secs. 10^6bits/s  % S    % S    us/KB   us/KB

LRO:
87380  65536  65536    60.00  8279.36   8.10   77.55   0.160   1.535
GRO:
87380  65536  65536    60.00  8053.19   7.86   85.47   0.160   1.739

The difference is bigger if you disable TCP timestamps (and thus shrink
the packet headers down so they require fewer cachelines):

LRO:
87380  65536  65536    60.02  7753.55   8.01   74.06   0.169   1.565
GRO:
87380  65536  65536    60.02  7535.12   7.27   84.57   0.158   1.839

As you can see, even though the raw bandwidth is very close, the
service demand makes it clear that GRO is more expensive than LRO.
I just wish I understood why.

Drew
Andrew Gallatin wrote:
> Andrew Gallatin wrote:
>> For variety, I grabbed a different "slow" receiver. This is another
>> 2 CPU machine, but a dual-socket single-core Opteron (Tyan S2895)
<...>
>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>> LRO and 8.0Gb/s with GRO.
>
> With the recent patch to fix idle CPU time accounting from LKML applied,
> it is again possible to trust netperf's service demand (based on %CPU).
> So here is raw netperf output for LRO and GRO, bound as above.
>
> LRO:
> 87380  65536  65536  60.00  8279.36  8.10  77.55  0.160  1.535
> GRO:
> 87380  65536  65536  60.00  8053.19  7.86  85.47  0.160  1.739
<...>
> As you can see, even though the raw bandwidth is very close, the
> service demand makes it clear that GRO is more expensive
> than LRO. I just wish I understood why.

What are "vmstat 1" outputs on both tests? Any difference on, say,
context switches?
Eric Dumazet wrote:
> Andrew Gallatin wrote:
<...>
>> As you can see, even though the raw bandwidth is very close, the
>> service demand makes it clear that GRO is more expensive
>> than LRO. I just wish I understood why.
>
> What are "vmstat 1" outputs on both tests? Any difference on, say,
> context switches?

Not much difference is apparent from vmstat, except for a lower load
and slightly higher IRQ rate from LRO:

LRO:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd    free   buff  cache  si  so  bi  bo     in  cs  us  sy  id  wa  st
 1  0     0  676960  19280 209812   0   0   0   0  14817  24   0  73  27   0   0
 1  0     0  677084  19280 209812   0   0   0   0  14834  20   0  73  27   0   0
 1  0     0  676916  19280 209812   0   0   0   0  14833  16   0  74  26   0   0

GRO:
 r  b  swpd    free   buff  cache  si  so  bi  bo     in  cs  us  sy  id  wa  st
 1  0     0  678244  18008 209784   0   0   0  24  14288  32   0  84  16   0   0
 1  0     0  678268  18008 209788   0   0   0   0  14403  22   0  85  15   0   0
 1  0     0  677956  18008 209788   0   0   0   0  14331  20   0  84  16   0   0

The real difference is visible mainly from mpstat on the CPU handling
the interrupts, where you see softirq is much higher:

LRO:
07:15:16  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
07:15:17    0   0.00   0.00  0.00     0.00  0.00  45.00    0.00  55.00  12907.92
07:15:18    0   0.00   0.00  1.00     0.00  2.00  43.00    0.00  54.00  12707.92
07:15:19    0   0.00   0.00  1.00     0.00  0.00  46.00    0.00  53.00  12825.00

GRO:
07:11:59  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
07:12:00    0   0.00   0.00  0.00     0.00  0.99  66.34    0.00  32.67  12242.57
07:12:01    0   0.00   0.00  0.00     0.00  1.01  66.67    0.00  32.32  12220.00
07:12:02    0   0.00   0.00  0.99     0.00  0.99  65.35    0.00  32.67  12336.00

So it is like "something" GRO is doing in the softirq context is more
expensive than what LRO is doing.

Drew
Andrew Gallatin wrote:
> Eric Dumazet wrote:
<...>
>> What are "vmstat 1" outputs on both tests? Any difference on, say,
>> context switches?
<...>
> The real difference is visible mainly from mpstat on the CPU handling
> the interrupts, where you see softirq is much higher:
<...>
> So it is like "something" GRO is doing in the softirq context is more
> expensive than what LRO is doing.

Sure, probably more cache misses or something...

You could try a longer oprofile session (with at least one million
samples) and:

opannotate -a vmlinux >/tmp/FILE

And select 3 or 4 suspect functions: inet_gro_receive(),
tcp_gro_receive(), skb_gro_receive(), skb_gro_header()
Eric Dumazet wrote:
>
> Sure, probably more cache misses or something...

Yes, that's what I thought. The code is much more complex, and more
spread out, than LRO, and seems to open itself to cache misses.

> You could try a longer oprofile session (with at least one million
> samples) and:
>
> opannotate -a vmlinux >/tmp/FILE
>
> And select 3 or 4 suspect functions: inet_gro_receive(),
> tcp_gro_receive(), skb_gro_receive(), skb_gro_header()

Here is the opreport -l output from this machine for GRO for a
25 minute profiling run:

samples   %        image name   app name  symbol name
3742674   32.2793  vmlinux      vmlinux   copy_user_generic_string
890179     7.6775  myri10ge.ko  myri10ge  myri10ge_poll
547572     4.7226  vmlinux      vmlinux   inet_gro_receive
477479     4.1181  vmlinux      vmlinux   skb_gro_receive
406562     3.5065  vmlinux      vmlinux   free_hot_cold_page
396796     3.4222  vmlinux      vmlinux   tcp_gro_receive
332364     2.8665  vmlinux      vmlinux   __rmqueue_smallest
319455     2.7552  vmlinux      vmlinux   skb_gro_header
269040     2.3204  vmlinux      vmlinux   dev_gro_receive
252885     2.1810  vmlinux      vmlinux   free_pages_bulk
247832     2.1375  vmlinux      vmlinux   get_pageblock_flags_group
211592     1.8249  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
208867     1.8014  vmlinux      vmlinux   __list_add
201491     1.7378  vmlinux      vmlinux   tcp4_gro_receive
187591     1.6179  vmlinux      vmlinux   __napi_gro_receive
170156     1.4675  vmlinux      vmlinux   get_page_from_freelist
116321     1.0032  vmlinux      vmlinux   list_del
107994     0.9314  vmlinux      vmlinux   kfree
106434     0.9180  vmlinux      vmlinux   skb_copy_datagram_iovec
100675     0.8683  vmlinux      vmlinux   put_page

And here is the opannotate -a output for a few GRO functions. BTW,
did you mean -s rather than -a? I'd naively think source might be
more helpful.
But here is what you asked for: ffffffff80479f20 <inet_gro_receive>: /* inet_gro_receive total: 547572 5.2554 */ 12187 0.1170 :ffffffff80479f20: push %r13 2611 0.0251 :ffffffff80479f22: mov %rdi,%r13 :ffffffff80479f25: push %r12 :ffffffff80479f27: push %rbp 4031 0.0387 :ffffffff80479f28: push %rbx :ffffffff80479f29: mov %rsi,%rbx :ffffffff80479f2c: mov $0x14,%esi 6303 0.0605 :ffffffff80479f31: mov %rbx,%rdi :ffffffff80479f34: sub $0x8,%rsp :ffffffff80479f38: callq ffffffff804357a1 <skb_gro_header> :ffffffff80479f3d: test %rax,%rax 2494 0.0239 :ffffffff80479f40: mov %rax,%r8 :ffffffff80479f43: je ffffffff8047a0a4 <inet_gro_receive+0x184> :ffffffff80479f49: movzbl 0x9(%rax),%eax 2541 0.0244 :ffffffff80479f4d: mov 0xffffffff80d06280(,%rax,8),%r11 33 3.2e-04 :ffffffff80479f55: test %r11,%r11 5 4.8e-05 :ffffffff80479f58: je ffffffff8047a0a4 <inet_gro_receive+0x184> 11016 0.1057 :ffffffff80479f5e: cmpq $0x0,0x20(%r11) 292 0.0028 :ffffffff80479f63: je ffffffff8047a0a4 <inet_gro_receive+0x184> 1 9.6e-06 :ffffffff80479f69: cmpb $0x45,(%r8) 4297 0.0412 :ffffffff80479f6d: jne ffffffff8047a0a4 <inet_gro_receive+0x184> 6086 0.0584 :ffffffff80479f73: mov $0x5,%eax :ffffffff80479f78: mov %r8,%rcx 18706 0.1795 :ffffffff80479f7b: mov (%rcx),%edx 341 0.0033 :ffffffff80479f7d: sub $0x4,%eax :ffffffff80479f80: jbe ffffffff80479fa6 <inet_gro_receive+0x86> 4609 0.0442 :ffffffff80479f82: add 0x4(%rcx),%edx 398 0.0038 :ffffffff80479f85: adc 0x8(%rcx),%edx :ffffffff80479f88: adc 0xc(%rcx),%edx 4310 0.0414 :ffffffff80479f8b: adc 0x10(%rcx),%edx 790 0.0076 :ffffffff80479f8e: lea 0x4(%rcx),%rcx :ffffffff80479f92: dec %eax 9097 0.0873 :ffffffff80479f94: jne ffffffff80479f8b <inet_gro_receive+0x6b> 541 0.0052 :ffffffff80479f96: adc $0x0,%edx :ffffffff80479f99: mov %edx,%eax 1919 0.0184 :ffffffff80479f9b: shr $0x10,%edx 535 0.0051 :ffffffff80479f9e: add %ax,%dx :ffffffff80479fa1: adc $0x0,%edx 3633 0.0349 :ffffffff80479fa4: not %edx 683 0.0066 :ffffffff80479fa6: test %dx,%dx 1 9.6e-06 
:ffffffff80479fa9: jne ffffffff8047a0a4 <inet_gro_receive+0x184> 4725 0.0453 :ffffffff80479faf: movzwl 0x2(%r8),%eax 9728 0.0934 :ffffffff80479fb4: mov 0x68(%rbx),%edx 8 7.7e-05 :ffffffff80479fb7: mov $0x1,%ebp 43000 0.4127 :ffffffff80479fbc: sub 0x38(%rbx),%edx 11149 0.1070 :ffffffff80479fbf: mov %eax,%ecx :ffffffff80479fc1: shl $0x8,%eax 66497 0.6382 :ffffffff80479fc4: shr $0x8,%ecx 735 0.0071 :ffffffff80479fc7: or %ecx,%eax :ffffffff80479fc9: movzwl %ax,%eax 5459 0.0524 :ffffffff80479fcc: cmp %edx,%eax 522 0.0050 :ffffffff80479fce: jne ffffffff80479fdc <inet_gro_receive+0xbc> :ffffffff80479fd0: xor %ebp,%ebp 5373 0.0516 :ffffffff80479fd2: cmpw $0x40,0x6(%r8) 345 0.0033 :ffffffff80479fd8: setne %bpl :ffffffff80479fdc: movzwl 0x4(%r8),%eax 2384 0.0229 :ffffffff80479fe1: mov 0x0(%r13),%r10 631 0.0061 :ffffffff80479fe5: mov %eax,%edx :ffffffff80479fe7: shl $0x8,%eax 3044 0.0292 :ffffffff80479fea: shr $0x8,%edx 303 0.0029 :ffffffff80479fed: or %edx,%eax :ffffffff80479fef: movzwl %ax,%r12d 2747 0.0264 :ffffffff80479ff3: jmp ffffffff8047a071 <inet_gro_receive+0x151> 2109 0.0202 :ffffffff80479ff5: lea 0x38(%r10),%r9 12 1.2e-04 :ffffffff80479ff9: cmpl $0x0,0x4(%r9) 23 2.2e-04 :ffffffff80479ffe: je ffffffff8047a06e <inet_gro_receive+0x14e> 2104 0.0202 :ffffffff8047a000: mov 0xac(%r10),%edi 2 1.9e-05 :ffffffff8047a007: add 0xc0(%r10),%rdi :ffffffff8047a00e: mov 0x9(%rdi),%sil 2391 0.0229 :ffffffff8047a012: mov 0x1(%rdi),%al 2 1.9e-05 :ffffffff8047a015: xor 0x9(%r8),%sil 7 6.7e-05 :ffffffff8047a019: xor 0x1(%r8),%al 2101 0.0202 :ffffffff8047a01d: mov 0xc(%rdi),%edx 1 9.6e-06 :ffffffff8047a020: mov 0x10(%rdi),%ecx :ffffffff8047a023: xor 0xc(%r8),%edx 2775 0.0266 :ffffffff8047a027: xor 0x10(%r8),%ecx :ffffffff8047a02b: or %esi,%eax :ffffffff8047a02d: movzbl %al,%eax 62734 0.6021 :ffffffff8047a030: or %edx,%ecx :ffffffff8047a032: or %eax,%ecx :ffffffff8047a034: je ffffffff8047a040 <inet_gro_receive+0x120> :ffffffff8047a036: movl $0x0,0x4(%r9) :ffffffff8047a03e: jmp 
ffffffff8047a06e <inet_gro_receive+0x14e> 2106 0.0202 :ffffffff8047a040: movzwl 0x4(%rdi),%edx :ffffffff8047a044: mov 0x8(%rdi),%al :ffffffff8047a047: xor 0x8(%r8),%eax 64244 0.6166 :ffffffff8047a04b: mov %edx,%ecx :ffffffff8047a04d: shl $0x8,%edx :ffffffff8047a050: shr $0x8,%ecx 2072 0.0199 :ffffffff8047a053: movzbl %al,%eax :ffffffff8047a056: or 0x8(%r9),%eax :ffffffff8047a05a: or %ecx,%edx 2629 0.0252 :ffffffff8047a05c: add 0xc(%r9),%edx 2 1.9e-05 :ffffffff8047a060: movzwl %dx,%edx :ffffffff8047a063: xor %r12d,%edx 58223 0.5588 :ffffffff8047a066: or %edx,%eax 3 2.9e-05 :ffffffff8047a068: or %ebp,%eax :ffffffff8047a06a: mov %eax,0x8(%r9) 21878 0.2100 :ffffffff8047a06e: mov (%r10),%r10 2156 0.0207 :ffffffff8047a071: test %r10,%r10 :ffffffff8047a074: jne ffffffff80479ff5 <inet_gro_receive+0xd5> 3007 0.0289 :ffffffff8047a07a: mov 0x38(%rbx),%eax 61 5.9e-04 :ffffffff8047a07d: or %ebp,0x40(%rbx) 3 2.9e-05 :ffffffff8047a080: mov %rbx,%rsi 3091 0.0297 :ffffffff8047a083: mov %r13,%rdi 41 3.9e-04 :ffffffff8047a086: add $0x14,%eax :ffffffff8047a089: mov %eax,0x38(%rbx) 3704 0.0355 :ffffffff8047a08c: sub 0xc0(%rbx),%eax 33 3.2e-04 :ffffffff8047a092: add 0xc8(%rbx),%eax :ffffffff8047a098: mov %eax,0xa8(%rbx) 2468 0.0237 :ffffffff8047a09e: callq *0x20(%r11) 20011 0.1921 :ffffffff8047a0a2: jmp ffffffff8047a0ab <inet_gro_receive+0x18b> :ffffffff8047a0a4: xor %eax,%eax :ffffffff8047a0a6: mov $0x1,%ebp 24082 0.2311 :ffffffff8047a0ab: or %ebp,0x40(%rbx) 626 0.0060 :ffffffff8047a0ae: pop %r10 1718 0.0165 :ffffffff8047a0b0: pop %rbx 446 0.0043 :ffffffff8047a0b1: pop %rbp 4074 0.0391 :ffffffff8047a0b2: pop %r12 2089 0.0200 :ffffffff8047a0b4: pop %r13 434 0.0042 :ffffffff8047a0b6: retq ffffffff80430ea9 <skb_gro_receive>: /* skb_gro_receive total: 477479 4.5827 */ 2158 0.0207 :ffffffff80430ea9: push %r15 2492 0.0239 :ffffffff80430eab: mov %rdi,%r15 :ffffffff80430eae: push %r14 :ffffffff80430eb0: push %r13 2432 0.0233 :ffffffff80430eb2: push %r12 1 9.6e-06 :ffffffff80430eb4: push %rbp 1 
        9.6e-06 :ffffffff80430eb5: mov    %rsi,%rbp
  2430  0.0233  :ffffffff80430eb8: push   %rbx
                :ffffffff80430eb9: sub    $0x8,%rsp
                :ffffffff80430ebd: mov    0x68(%rsi),%ecx
  2420  0.0232  :ffffffff80430ec0: mov    (%rdi),%r12
     1  9.6e-06 :ffffffff80430ec3: mov    %ecx,%r14d
     1  9.6e-06 :ffffffff80430ec6: sub    0x38(%rsi),%r14d
  2317  0.0222  :ffffffff80430eca: mov    %r14d,%eax
     1  9.6e-06 :ffffffff80430ecd: add    0x68(%r12),%eax
     1  9.6e-06 :ffffffff80430ed2: cmp    $0xffff,%eax
  3865  0.0371  :ffffffff80430ed7: ja     ffffffff80431261 <skb_gro_receive+0x3b8>
                :ffffffff80430edd: mov    0xb8(%r12),%eax
                :ffffffff80430ee5: mov    0xc0(%r12),%rdx
  8082  0.0776  :ffffffff80430eed: lea    (%rdx,%rax,1),%rsi
                :ffffffff80430ef1: cmpq   $0x0,0x18(%rsi)
     2  1.9e-05 :ffffffff80430ef6: jne    ffffffff804311ab <skb_gro_receive+0x302>
  9249  0.0888  :ffffffff80430efc: mov    %ecx,%edi
                :ffffffff80430efe: sub    0x6c(%rbp),%edi
     6  5.8e-05 :ffffffff80430f01: cmp    0x38(%rbp),%edi
  3104  0.0298  :ffffffff80430f04: ja     ffffffff80430fe2 <skb_gro_receive+0x139>
     2  1.9e-05 :ffffffff80430f0a: mov    0xb8(%rbp),%ecx
                :ffffffff80430f10: movzwl 0x4(%rsi),%edx
  8825  0.0847  :ffffffff80430f14: add    0xc0(%rbp),%rcx
                :ffffffff80430f1b: movzwl 0x4(%rcx),%eax
    21  2.0e-04 :ffffffff80430f1f: add    %edx,%eax
 19668  0.1888  :ffffffff80430f21: cmp    $0x12,%eax
     1  9.6e-06 :ffffffff80430f24: ja     ffffffff80431261 <skb_gro_receive+0x3b8>
                :ffffffff80430f2a: mov    0x38(%rcx),%eax
  1974  0.0189  :ffffffff80430f2d: add    0x38(%rbp),%eax
                :ffffffff80430f30: cld
                :ffffffff80430f31: sub    %edi,%eax
  7666  0.0736  :ffffffff80430f33: mov    %eax,0x38(%rcx)
     2  1.9e-05 :ffffffff80430f36: mov    0xb8(%rbp),%edx
                :ffffffff80430f3c: add    0xc0(%rbp),%rdx
 52468  0.5036  :ffffffff80430f43: mov    0x3c(%rdx),%eax
     2  1.9e-05 :ffffffff80430f46: add    0x68(%rbp),%eax
     1  9.6e-06 :ffffffff80430f49: sub    0x6c(%rbp),%eax
  6592  0.0633  :ffffffff80430f4c: sub    0x38(%rbp),%eax
                :ffffffff80430f4f: mov    %eax,0x3c(%rdx)
                :ffffffff80430f52: mov    0xb8(%r12),%eax
 23018  0.2209  :ffffffff80430f5a: add    0xc0(%r12),%rax
     1  9.6e-06 :ffffffff80430f62: mov    0xb8(%rbp),%esi
                :ffffffff80430f68: add    0xc0(%rbp),%rsi
  8477  0.0814  :ffffffff80430f6f: movzwl 0x4(%rax),%edi
     6  5.8e-05 :ffffffff80430f73: movzwl 0x4(%rsi),%ecx
                :ffffffff80430f77: add    $0x30,%rsi
 21338  0.2048  :ffffffff80430f7b: shl    $0x4,%rdi
     3  2.9e-05 :ffffffff80430f7f: lea    0x30(%rdi,%rax,1),%rdi
     1  9.6e-06 :ffffffff80430f84: shl    $0x4,%rcx
150632  1.4457  :ffffffff80430f88: rep movsb %ds:(%rsi),%es:(%rdi)
  3988  0.0383  :ffffffff80430f8a: mov    0xb8(%r12),%eax
  2015  0.0193  :ffffffff80430f92: mov    0xb8(%rbp),%ecx
    11  1.1e-04 :ffffffff80430f98: add    0xc0(%r12),%rax
     8  7.7e-05 :ffffffff80430fa0: mov    0xc0(%rbp),%rdx
  3295  0.0316  :ffffffff80430fa7: mov    0x4(%rdx,%rcx,1),%edx
                :ffffffff80430fab: add    %dx,0x4(%rax)
     8  7.7e-05 :ffffffff80430faf: mov    0xb8(%rbp),%edx
  2507  0.0241  :ffffffff80430fb5: mov    0xc0(%rbp),%rax
                :ffffffff80430fbc: movw   $0x0,0x4(%rax,%rdx,1)
  3233  0.0310  :ffffffff80430fc3: mov    0x6c(%rbp),%eax
     1  9.6e-06 :ffffffff80430fc6: sub    %eax,0xd0(%rbp)
                :ffffffff80430fcc: sub    %eax,0x68(%rbp)
 41540  0.3987  :ffffffff80430fcf: movl   $0x0,0x6c(%rbp)
                :ffffffff80430fd6: movl   $0x1,0x48(%rbp)
                :ffffffff80430fdd: jmpq   ffffffff8043123f <skb_gro_receive+0x396>
                :ffffffff80430fe2: mov    0xc8(%r12),%rax
                :ffffffff80430fea: mov    0x20(%r12),%rdi
                :ffffffff80430fef: mov    %eax,%r13d
                :ffffffff80430ff2: sub    %edx,%r13d
                :ffffffff80430ff5: mov    $0x20,%edx
                :ffffffff80430ffa: mov    %r13d,%esi
                :ffffffff80430ffd: add    0x38(%r12),%esi
                :ffffffff80431002: callq  ffffffff8042ffe0 <__netdev_alloc_skb>
                :ffffffff80431007: mov    %rax,%rbx
                :ffffffff8043100a: mov    $0xfffffff4,%eax
                :ffffffff8043100f: test   %rbx,%rbx
                :ffffffff80431012: je     ffffffff80431266 <skb_gro_receive+0x3bd>
                :ffffffff80431018: mov    %r12,%rsi
                :ffffffff8043101b: mov    %rbx,%rdi
                :ffffffff8043101e: callq  ffffffff8042e2c0 <__copy_skb_header>
                :ffffffff80431023: mov    0x70(%r12),%eax
                :ffffffff80431028: add    %r13d,0xb4(%rbx)
                :ffffffff8043102f: mov    %ax,0x70(%rbx)
                :ffffffff80431033: movslq %r13d,%rax
                :ffffffff80431036: add    %rax,0xc8(%rbx)
                :ffffffff8043103d: cmpl   $0x0,0x6c(%rbx)
                :ffffffff80431041: mov    0x38(%r12),%edx
                :ffffffff80431046: mov    0xb4(%rbx),%eax
                :ffffffff8043104c: je     ffffffff80431052 <skb_gro_receive+0x1a9>
                :ffffffff8043104e: ud2a
                :ffffffff80431050: jmp    ffffffff80431050 <skb_gro_receive+0x1a7>
                :ffffffff80431052: lea    (%rdx,%rax,1),%eax
                :ffffffff80431055: add    %edx,0x68(%rbx)
                :ffffffff80431058: mov    0xc8(%r12),%rcx
                :ffffffff80431060: mov    0xc8(%rbx),%rdx
                :ffffffff80431067: sub    0xc0(%rbx),%edx
                :ffffffff8043106d: mov    %eax,0xb4(%rbx)
                :ffffffff80431073: mov    0xb0(%r12),%eax
                :ffffffff8043107b: add    0xc0(%r12),%rax
                :ffffffff80431083: sub    %ecx,%eax
                :ffffffff80431085: add    %edx,%eax
                :ffffffff80431087: mov    %eax,0xb0(%rbx)
                :ffffffff8043108d: mov    0xac(%r12),%eax
                :ffffffff80431095: add    0xc0(%r12),%rax
                :ffffffff8043109d: sub    %ecx,%eax
                :ffffffff8043109f: add    %edx,%eax
                :ffffffff804310a1: mov    %eax,0xac(%rbx)
                :ffffffff804310a7: mov    0xa8(%r12),%eax
                :ffffffff804310af: add    0xc0(%r12),%rax
                :ffffffff804310b7: sub    %ecx,%eax
                :ffffffff804310b9: add    %edx,%eax
                :ffffffff804310bb: mov    %eax,0xa8(%rbx)
                :ffffffff804310c1: mov    0x68(%r12),%eax
                :ffffffff804310c6: mov    0x38(%r12),%edx
                :ffffffff804310cb: sub    %edx,%eax
                :ffffffff804310cd: cmp    0x6c(%r12),%eax
                :ffffffff804310d2: mov    %eax,0x68(%r12)
                :ffffffff804310d7: jae    ffffffff804310dd <skb_gro_receive+0x234>
                :ffffffff804310d9: ud2a
                :ffffffff804310db: jmp    ffffffff804310db <skb_gro_receive+0x232>
                :ffffffff804310dd: mov    0xb0(%r12),%esi
                :ffffffff804310e5: mov    %edx,%ecx
                :ffffffff804310e7: add    0xc8(%r12),%rcx
                :ffffffff804310ef: add    0xc0(%r12),%rsi
                :ffffffff804310f7: mov    0xb0(%rbx),%edi
                :ffffffff804310fd: add    0xc0(%rbx),%rdi
                :ffffffff80431104: cld
                :ffffffff80431105: mov    %rcx,0xc8(%r12)
                :ffffffff8043110d: sub    %rsi,%rcx
                :ffffffff80431110: rep movsb %ds:(%rsi),%es:(%rdi)
                :ffffffff80431112: lea    0x38(%rbx),%rdi
                :ffffffff80431116: lea    0x38(%r12),%rsi
                :ffffffff8043111b: mov    $0x5,%cl
                :ffffffff8043111d: rep movsl %ds:(%rsi),%es:(%rdi)
                :ffffffff8043111f: mov    0xb8(%rbx),%edx
                :ffffffff80431125: mov    0xc0(%rbx),%rax
                :ffffffff8043112c: mov    %r12,0x18(%rax,%rdx,1)
                :ffffffff80431131: mov    0xb8(%r12),%edx
                :ffffffff80431139: mov    0xc0(%r12),%rax
                :ffffffff80431141: mov    0xb8(%rbx),%esi
                :ffffffff80431147: mov    0xc0(%rbx),%rcx
                :ffffffff8043114e: mov    0x6(%rax,%rdx,1),%ax
                :ffffffff80431153: mov    %ax,0x6(%rcx,%rsi,1)
                :ffffffff80431158: testb  $0x10,0x7c(%r12)
                :ffffffff8043115e: je     ffffffff80431164 <skb_gro_receive+0x2bb>
                :ffffffff80431160: ud2a
                :ffffffff80431162: jmp    ffffffff80431162 <skb_gro_receive+0x2b9>
                :ffffffff80431164: mov    0xb8(%r12),%eax
                :ffffffff8043116c: orb    $0x10,0x7c(%r12)
                :ffffffff80431172: add    0xc0(%r12),%rax
                :ffffffff8043117a: lock addl $0x10000,(%rax)
                :ffffffff80431181: mov    0x68(%r12),%eax
                :ffffffff80431186: mov    %r12,0x8(%rbx)
                :ffffffff8043118a: add    %eax,0x6c(%rbx)
                :ffffffff8043118d: add    %eax,0xd0(%rbx)
                :ffffffff80431193: add    %eax,0x68(%rbx)
                :ffffffff80431196: mov    %rbx,(%r15)
                :ffffffff80431199: mov    (%r12),%rax
                :ffffffff8043119d: mov    %rax,(%rbx)
                :ffffffff804311a0: movq   $0x0,(%r12)
                :ffffffff804311a8: mov    %rbx,%r12
                :ffffffff804311ab: mov    0x68(%rbp),%ecx
                :ffffffff804311ae: sub    0x6c(%rbp),%ecx
                :ffffffff804311b1: cmp    %ecx,0x38(%rbp)
                :ffffffff804311b4: jbe    ffffffff804311f3 <skb_gro_receive+0x34a>
                :ffffffff804311b6: mov    0xb8(%rbp),%edx
                :ffffffff804311bc: add    0xc0(%rbp),%rdx
                :ffffffff804311c3: mov    0x38(%rdx),%eax
                :ffffffff804311c6: add    0x38(%rbp),%eax
                :ffffffff804311c9: sub    %ecx,%eax
                :ffffffff804311cb: mov    %eax,0x38(%rdx)
                :ffffffff804311ce: mov    0xb8(%rbp),%edx
                :ffffffff804311d4: add    0xc0(%rbp),%rdx
                :ffffffff804311db: mov    0x3c(%rdx),%eax
                :ffffffff804311de: add    0x68(%rbp),%eax
                :ffffffff804311e1: sub    0x6c(%rbp),%eax
                :ffffffff804311e4: sub    0x38(%rbp),%eax
                :ffffffff804311e7: mov    %eax,0x3c(%rdx)
                :ffffffff804311ea: mov    0x68(%rbp),%eax
                :ffffffff804311ed: sub    0x6c(%rbp),%eax
                :ffffffff804311f0: mov    %eax,0x38(%rbp)
                :ffffffff804311f3: mov    0x68(%rbp),%eax
                :ffffffff804311f6: mov    0x38(%rbp),%edx
                :ffffffff804311f9: sub    %edx,%eax
                :ffffffff804311fb: cmp    0x6c(%rbp),%eax
                :ffffffff804311fe: mov    %eax,0x68(%rbp)
                :ffffffff80431201: jae    ffffffff80431207 <skb_gro_receive+0x35e>
                :ffffffff80431203: ud2a
                :ffffffff80431205: jmp    ffffffff80431205 <skb_gro_receive+0x35c>
                :ffffffff80431207: mov    %edx,%eax
                :ffffffff80431209: add    %rax,0xc8(%rbp)
                :ffffffff80431210: mov    0x8(%r12),%rax
                :ffffffff80431215: mov    %rbp,0x8(%r12)
                :ffffffff8043121a: mov    %rbp,(%rax)
                :ffffffff8043121d: testb  $0x10,0x7c(%rbp)
                :ffffffff80431221: je     ffffffff80431227 <skb_gro_receive+0x37e>
                :ffffffff80431223: ud2a
                :ffffffff80431225: jmp    ffffffff80431225 <skb_gro_receive+0x37c>
                :ffffffff80431227: mov    0xb8(%rbp),%eax
                :ffffffff8043122d: orb    $0x10,0x7c(%rbp)
                :ffffffff80431231: add    0xc0(%rbp),%rax
                :ffffffff80431238: lock addl $0x10000,(%rax)
 34919  0.3351  :ffffffff8043123f: add    %r14d,0x6c(%r12)
  1989  0.0191  :ffffffff80431244: add    %r14d,0xd0(%r12)
     1  9.6e-06 :ffffffff8043124c: xor    %eax,%eax
                :ffffffff8043124e: add    %r14d,0x68(%r12)
 20605  0.1978  :ffffffff80431253: incl   0x44(%r12)
                :ffffffff80431258: movl   $0x1,0x3c(%rbp)
                :ffffffff8043125f: jmp    ffffffff80431266 <skb_gro_receive+0x3bd>
                :ffffffff80431261: mov    $0xfffffff9,%eax
 13260  0.1273  :ffffffff80431266: pop    %r11
  1946  0.0187  :ffffffff80431268: pop    %rbx
  2010  0.0193  :ffffffff80431269: pop    %rbp
    64  6.1e-04 :ffffffff8043126a: pop    %r12
  1948  0.0187  :ffffffff8043126c: pop    %r13
  2746  0.0264  :ffffffff8043126e: pop    %r14
    57  5.5e-04 :ffffffff80431270: pop    %r15
  2067  0.0198  :ffffffff80431272: retq

ffffffff80460663 <tcp_gro_receive>: /* tcp_gro_receive total: 396796 3.8083 */
  4433  0.0425  :ffffffff80460663: push   %r15
  2204  0.0212  :ffffffff80460665: push   %r14
                :ffffffff80460667: mov    %rdi,%r14
                :ffffffff8046066a: push   %r13
  2275  0.0218  :ffffffff8046066c: push   %r12
                :ffffffff8046066e: mov    %rsi,%r12
                :ffffffff80460671: mov    $0x14,%esi
  5933  0.0569  :ffffffff80460676: mov    %r12,%rdi
                :ffffffff80460679: push   %rbp
                :ffffffff8046067a: push   %rbx
  2180  0.0209  :ffffffff8046067b: sub    $0x8,%rsp
                :ffffffff8046067f: callq  ffffffff804357a1 <skb_gro_header>
                :ffffffff80460684: test   %rax,%rax
  3218  0.0309  :ffffffff80460687: je     ffffffff804607ed <tcp_gro_receive+0x18a>
                :ffffffff8046068d: mov    0xc(%rax),%al
     1  9.6e-06 :ffffffff80460690: shr    $0x4,%al
  3528  0.0339  :ffffffff80460693: movzbl %al,%eax
                :ffffffff80460696: lea    0x0(,%rax,4),%r13d
     1  9.6e-06 :ffffffff8046069e: cmp    $0x13,%r13d
  2773  0.0266  :ffffffff804606a2: jbe    ffffffff804607ed <tcp_gro_receive+0x18a>
                :ffffffff804606a8: mov    %r13d,%esi
                :ffffffff804606ab: mov    %r12,%rdi
  3327  0.0319  :ffffffff804606ae: callq  ffffffff804357a1 <skb_gro_header>
                :ffffffff804606b3: test   %rax,%rax
  2094  0.0201  :ffffffff804606b6: mov    %rax,%r8
                :ffffffff804606b9: je     ffffffff804607ed <tcp_gro_receive+0x18a>
                :ffffffff804606bf: lea    0x38(%r12),%r15
  2245  0.0215  :ffffffff804606c4: add    %r13d,(%r15)
                :ffffffff804606c7: mov    0x68(%r12),%ebp
                :ffffffff804606cc: sub    0x38(%r12),%ebp
  2394  0.0230  :ffffffff804606d1: mov    0xc(%rax),%ebx
                :ffffffff804606d4: jmp    ffffffff80460710 <tcp_gro_receive+0xad>
  2111  0.0203  :ffffffff804606d6: lea    0x38(%rdi),%r9
     3  2.9e-05 :ffffffff804606da: cmpl   $0x0,0x4(%r9)
    21  2.0e-04 :ffffffff804606df: je     ffffffff8046070d <tcp_gro_receive+0xaa>
  2592  0.0249  :ffffffff804606e1: mov    0xa8(%rdi),%eax
                :ffffffff804606e7: mov    0xc0(%rdi),%r10
                :ffffffff804606ee: mov    0x2(%r8),%dx
  2440  0.0234  :ffffffff804606f3: lea    (%r10,%rax,1),%rcx
                :ffffffff804606f7: mov    (%r8),%eax
     1  9.6e-06 :ffffffff804606fa: xor    0x2(%rcx),%dx
  6275  0.0602  :ffffffff804606fe: xor    (%rcx),%eax
     3  2.9e-05 :ffffffff80460700: or     %ax,%dx
                :ffffffff80460703: je     ffffffff8046071d <tcp_gro_receive+0xba>
                :ffffffff80460705: movl   $0x0,0x4(%r9)
                :ffffffff8046070d: mov    %rdi,%r14
  2920  0.0280  :ffffffff80460710: mov    (%r14),%rdi
    18  1.7e-04 :ffffffff80460713: test   %rdi,%rdi
     2  1.9e-05 :ffffffff80460716: jne    ffffffff804606d6 <tcp_gro_receive+0x73>
    33  3.2e-04 :ffffffff80460718: jmpq   ffffffff80460807 <tcp_gro_receive+0x1a4>
  4253  0.0408  :ffffffff8046071d: mov    0xe(%r8),%ax
  2125  0.0204  :ffffffff80460722: xor    0xe(%rcx),%ax
     2  1.9e-05 :ffffffff80460726: mov    %ebx,%edx
                :ffffffff80460728: and    $0x8000,%edx
  8066  0.0774  :ffffffff8046072e: or     0x8(%r9),%edx
                :ffffffff80460732: movzwl %ax,%esi
                :ffffffff80460735: mov    0x8(%r8),%eax
 64740  0.6214  :ffffffff80460739: xor    0x8(%rcx),%eax
                :ffffffff8046073c: or     %eax,%esi
                :ffffffff8046073e: mov    %ebx,%eax
  2084  0.0200  :ffffffff80460740: xor    0xc(%rcx),%eax
                :ffffffff80460743: and    $0x76,%ah
                :ffffffff80460746: or     %eax,%edx
  2132  0.0205  :ffffffff80460748: or     %edx,%esi
                :ffffffff8046074a: mov    $0x14,%edx
                :ffffffff8046074f: jmp    ffffffff8046075e <tcp_gro_receive+0xfb>
                :ffffffff80460751: movslq %edx,%rax
                :ffffffff80460754: add    $0x4,%edx
                :ffffffff80460757: mov    (%r8,%rax,1),%esi
                :ffffffff8046075b: xor    (%rcx,%rax,1),%esi
  3670  0.0352  :ffffffff8046075e: test   %esi,%esi
  2162  0.0208  :ffffffff80460760: jne    ffffffff80460767 <tcp_gro_receive+0x104>
                :ffffffff80460762: cmp    %r13d,%edx
     1  9.6e-06 :ffffffff80460765: jb     ffffffff80460751 <tcp_gro_receive+0xee>
 50209  0.4819  :ffffffff80460767: mov    0xb8(%rdi),%eax
  4473  0.0429  :ffffffff8046076d: mov    0x4(%rcx),%edx
                :ffffffff80460770: bswap  %edx
  9554  0.0917  :ffffffff80460772: mov    0x4(%r8),%ecx
                :ffffffff80460776: bswap  %ecx
                :ffffffff80460778: movzwl 0x6(%r10,%rax,1),%r13d
  7572  0.0727  :ffffffff8046077e: mov    0x68(%rdi),%eax
                :ffffffff80460781: sub    0x38(%rdi),%eax
                :ffffffff80460784: add    %edx,%eax
  9803  0.0941  :ffffffff80460786: xor    %eax,%ecx
                :ffffffff80460788: cmp    %r13d,%ebp
                :ffffffff8046078b: seta   %al
 50608  0.4857  :ffffffff8046078e: test   %ebp,%ebp
                :ffffffff80460790: sete   %dl
                :ffffffff80460793: or     %edx,%eax
  3161  0.0303  :ffffffff80460795: movzbl %al,%eax
                :ffffffff80460798: or     %eax,%esi
                :ffffffff8046079a: or     %esi,%ecx
  3278  0.0315  :ffffffff8046079c: jne    ffffffff804607f6 <tcp_gro_receive+0x193>
                :ffffffff8046079e: mov    %r12,%rsi
     2  1.9e-05 :ffffffff804607a1: mov    %r14,%rdi
  2579  0.0248  :ffffffff804607a4: callq  ffffffff80430ea9 <skb_gro_receive>
  2059  0.0198  :ffffffff804607a9: test   %eax,%eax
    49  4.7e-04 :ffffffff804607ab: jne    ffffffff804607f6 <tcp_gro_receive+0x193>
                :ffffffff804607ad: mov    (%r14),%rcx
  1945  0.0187  :ffffffff804607b0: mov    %ebx,%edx
     3  2.9e-05 :ffffffff804607b2: and    $0x900,%edx
                :ffffffff804607b8: mov    0xa8(%rcx),%eax
  2530  0.0243  :ffffffff804607be: add    0xc0(%rcx),%rax
     3  2.9e-05 :ffffffff804607c5: or     %edx,0xc(%rax)
    13  1.2e-04 :ffffffff804607c8: xor    %eax,%eax
  4881  0.0468  :ffffffff804607ca: cmp    %r13d,%ebp
                :ffffffff804607cd: setb   %al
                :ffffffff804607d0: and    $0x2f00,%ebx
  1912  0.0184  :ffffffff804607d6: or     %ebx,%eax
                :ffffffff804607d8: test   %rcx,%rcx
                :ffffffff804607db: je     ffffffff80460816 <tcp_gro_receive+0x1b3>
  2163  0.0208  :ffffffff804607dd: cmpl   $0x0,0x4(%r15)
   136  0.0013  :ffffffff804607e2: je     ffffffff804607e8 <tcp_gro_receive+0x185>
  2455  0.0236  :ffffffff804607e4: test   %eax,%eax
    57  5.5e-04 :ffffffff804607e6: je     ffffffff80460816 <tcp_gro_receive+0x1b3>
   148  0.0014  :ffffffff804607e8: mov    %r14,%rdi
   735  0.0071  :ffffffff804607eb: jmp    ffffffff80460818 <tcp_gro_receive+0x1b5>
                :ffffffff804607ed: xor    %edi,%edi
                :ffffffff804607ef: mov    $0x1,%eax
                :ffffffff804607f4: jmp    ffffffff80460818 <tcp_gro_receive+0x1b5>
    68  6.5e-04 :ffffffff804607f6: xor    %eax,%eax
     1  9.6e-06 :ffffffff804607f8: test   %ebp,%ebp
    67  6.4e-04 :ffffffff804607fa: sete   %al
    47  4.5e-04 :ffffffff804607fd: and    $0x2f00,%ebx
                :ffffffff80460803: or     %ebx,%eax
    58  5.6e-04 :ffffffff80460805: jmp    ffffffff804607dd <tcp_gro_receive+0x17a>
   122  0.0012  :ffffffff80460807: xor    %eax,%eax
     9  8.6e-05 :ffffffff80460809: test   %ebp,%ebp
                :ffffffff8046080b: sete   %al
    67  6.4e-04 :ffffffff8046080e: and    $0x2f00,%ebx
     6  5.8e-05 :ffffffff80460814: or     %ebx,%eax
  1995  0.0191  :ffffffff80460816: xor    %edi,%edi
    68  6.5e-04 :ffffffff80460818: or     %eax,0x40(%r12)
   275  0.0026  :ffffffff8046081d: mov    %rdi,%rax
  2037  0.0196  :ffffffff80460820: pop    %r11
   191  0.0018  :ffffffff80460822: pop    %rbx
  4346  0.0417  :ffffffff80460823: pop    %rbp
  4739  0.0455  :ffffffff80460824: pop    %r12
   167  0.0016  :ffffffff80460826: pop    %r13
 23735  0.2278  :ffffffff80460828: pop    %r14
 56070  0.5381  :ffffffff8046082a: pop    %r15
   140  0.0013  :ffffffff8046082c: retq

ffffffff804357a1 <skb_gro_header>: /* skb_gro_header total: 319455 3.0660 */
 13604  0.1306  :ffffffff804357a1: push   %rbp
 14938  0.1434  :ffffffff804357a2: push   %rbx
                :ffffffff804357a3: mov    %rdi,%rbx
                :ffffffff804357a6: sub    $0x8,%rsp
 18392  0.1765  :ffffffff804357aa: mov    0x38(%rdi),%ebp
                :ffffffff804357ad: mov    0x68(%rdi),%edx
     1  9.6e-06 :ffffffff804357b0: add    %ebp,%esi
 20559  0.1973  :ffffffff804357b2: mov    %edx,%edi
                :ffffffff804357b4: sub    0x6c(%rbx),%edi
                :ffffffff804357b7: jne    ffffffff804357cc <skb_gro_header+0x2b>
 36626  0.3515  :ffffffff804357b9: mov    0xb8(%rbx),%ecx
     2  1.9e-05 :ffffffff804357bf: mov    0xc0(%rbx),%rax
     3  2.9e-05 :ffffffff804357c6: cmp    %esi,0x3c(%rax,%rcx,1)
 18577  0.1783  :ffffffff804357ca: jae    ffffffff804357ee <skb_gro_header+0x4d>
                :ffffffff804357cc: cmp    %edi,%esi
                :ffffffff804357ce: jbe    ffffffff804357e3 <skb_gro_header+0x42>
                :ffffffff804357d0: cmp    %edx,%esi
                :ffffffff804357d2: ja     ffffffff80435833 <skb_gro_header+0x92>
                :ffffffff804357d4: sub    %edi,%esi
                :ffffffff804357d6: mov    %rbx,%rdi
                :ffffffff804357d9: callq  ffffffff8042f6ee <__pskb_pull_tail>
                :ffffffff804357de: test   %rax,%rax
                :ffffffff804357e1: je     ffffffff80435833 <skb_gro_header+0x92>
                :ffffffff804357e3: mov    %ebp,%eax
                :ffffffff804357e5: add    0xc8(%rbx),%rax
                :ffffffff804357ec: jmp    ffffffff80435835 <skb_gro_header+0x94>
     3  2.9e-05 :ffffffff804357ee: add    0xc0(%rbx),%rcx
 25999  0.2495  :ffffffff804357f5: mov    $0x1e0000000000,%rax
                :ffffffff804357ff: mov    $0x6db6db6db6db6db7,%rdx
 44557  0.4276  :ffffffff80435809: add    0x30(%rcx),%rax
                :ffffffff8043580d: sar    $0x3,%rax
 12588  0.1208  :ffffffff80435811: imul   %rdx,%rax
 10104  0.0970  :ffffffff80435815: mov    $0xffff880000000000,%rdx
                :ffffffff8043581f: shl    $0xc,%rax
                :ffffffff80435823: add    %rdx,%rax
 16404  0.1574  :ffffffff80435826: mov    0x38(%rcx),%edx
                :ffffffff80435829: add    %rdx,%rax
                :ffffffff8043582c: mov    %ebp,%edx
 15264  0.1465  :ffffffff8043582e: add    %rdx,%rax
                :ffffffff80435831: jmp    ffffffff80435835 <skb_gro_header+0x94>
                :ffffffff80435833: xor    %eax,%eax
 45844  0.4400  :ffffffff80435835: pop    %r10
     2  1.9e-05 :ffffffff80435837: pop    %rbx
 12844  0.1233  :ffffffff80435838: pop    %rbp
 13144  0.1262  :ffffffff80435839: retq

Thanks for your help,

Drew
Hi:

Unfortunately the myricom card I was using is now refusing to work:

myri10ge: Version 1.4.4-1.412
myri10ge 0000:04:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33
myri10ge 0000:04:00.0: setting latency timer to 64
myri10ge 0000:04:00.0: PCIE x4 Link
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:04:00.0: command 1 failed, result = 14
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: myri10ge_probe() failed: MAC=00:60:dd:47:80:7d, SN=312225
myri10ge 0000:04:00.0: PCI INT A disabled

So I won't be able to test this until I locate another myri10ge card or get this one back up and running again.

Cheers,
On Thu, Apr 30, 2009 at 04:10:51PM +0800, Herbert Xu wrote:
>
> So I won't be able to test this until I locate another myri10ge
> card or get this one back up and running again.

Another reboot seems to have fixed it. So all is good.

Cheers,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..0396447 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1145,7 +1145,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb)
 
 static inline void *skb_gro_mac_header(struct sk_buff *skb)
 {
-	return skb_mac_header(skb) < skb->data ? skb_mac_header(skb) :
+	return skb_headlen(skb) ? skb_mac_header(skb) :
 	       page_address(skb_shinfo(skb)->frags[0].page) +
 	       skb_shinfo(skb)->frags[0].page_offset;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..ef38e4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2378,18 +2378,13 @@ void *skb_gro_header(struct sk_buff *skb, unsigned int hlen)
 	unsigned int offset = skb_gro_offset(skb);
 
 	hlen += offset;
-	if (hlen <= skb_headlen(skb))
-		return skb->data + offset;
-
-	if (unlikely(!skb_shinfo(skb)->nr_frags ||
-		     skb_shinfo(skb)->frags[0].size <=
-		       hlen - skb_headlen(skb) ||
+	if (unlikely(skb_headlen(skb) ||
+		     skb_shinfo(skb)->frags[0].size < hlen ||
 		     PageHighMem(skb_shinfo(skb)->frags[0].page)))
 		return pskb_may_pull(skb, hlen) ? skb->data + offset : NULL;
 
 	return page_address(skb_shinfo(skb)->frags[0].page) +
-	       skb_shinfo(skb)->frags[0].page_offset +
-	       offset - skb_headlen(skb);
+	       skb_shinfo(skb)->frags[0].page_offset + offset;
 }
 EXPORT_SYMBOL(skb_gro_header);