diff mbox

myr10ge: again fix lro_gen_skb() alignment

Message ID 20090427080501.GA21433@gondor.apana.org.au
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Herbert Xu April 27, 2009, 8:05 a.m. UTC
On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>
> These results are indeed quite close, so the performance problem seems
> isolated to AMD CPUS, and perhaps due to the smaller caches.
> Do you have any AMD you can use as a receiver?

I now have an AMD with 512K cache to test this.  Unfortunately
I'd just locked it up before I got a chance to do any serious
testing.  So it might take a while.

> Note that the GRO results were still obtained by (bogusly) setting
> CHECKSUM_UNNECESSARY.  I tried to use your patch, and I see
> terrible performance. Netperf shows between 1Gb/s to 2Gb/s (compared
> to 5Gb/s with GRO disabled).  I don't see bad checksums in netstat
> on the receiver, but it *feels* like something like that.

OK, I'd created a silly bug with the skb_gro_* optimisation.
Here are two patches to fix them.

gro: Fix handling of headers that extend over the tail

The skb_gro_* code fails to handle the case where a header starts
in the linear area but ends in the frags area.  Since the goal
of skb_gro_* is to optimise the case of completely non-linear
packets, we can simply bail out if we have anything in the linear
area.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,

Comments

David Miller April 27, 2009, 12:45 p.m. UTC | #1
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 27 Apr 2009 16:05:01 +0800

> gro: Fix handling of headers that extend over the tail
> 
> The skb_gro_* code fails to handle the case where a header starts
> in the linear area but ends in the frags area.  Since the goal
> of skb_gro_* is to optimise the case of completely non-linear
> packets, we can simply bail out if we have anything in the linear
> area.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 28, 2009, 6:12 a.m. UTC | #2
On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
> >
> > These results are indeed quite close, so the performance problem seems
> > isolated to AMD CPUS, and perhaps due to the smaller caches.
> > Do you have any AMD you can use as a receiver?
> 
> I now have an AMD with 512K cache to test this.  Unfortunately
> I'd just locked it up before I got a chance to do any serious
> testing.  So it might take a while.

OK that's been fixed up.  Indeed the AMD can't do wire speed.
But still the performance seems comparable.  Both of them sit
between 6600Mb/s and 7100Mb/s.  The sender is running at about
66% idle in either case.

Cheers,
Andrew Gallatin April 28, 2009, 3 p.m. UTC | #3
Herbert Xu wrote:
 > On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
 >> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
 >>> These results are indeed quite close, so the performance problem seems
 >>> isolated to AMD CPUS, and perhaps due to the smaller caches.
 >>> Do you have any AMD you can use as a receiver?
 >> I now have an AMD with 512K cache to test this.  Unfortunately
 >> I'd just locked it up before I got a chance to do any serious
 >> testing.  So it might take a while.
 >
 > OK that's been fixed up.  Indeed the AMD can't do wire speed.
 > But still the performance seems comparable.  Both of them sit
 > between 6600Mb/s and 7100Mb/s.  The sender is running at about
 > 66% idle in either case.

Its strange, I still consistently see about 1Gb/s better performance
from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
when binding everything to the same CPU. Mpstat -P 0 shows roughly
10% more time spent in "soft" when using GRO vs LRO:

GRO:
  10:17:45     CPU   %user   %nice %system %iowait    %irq   %soft 
%idle    intr/s
10:17:46       0    0.00    0.00   54.00    0.00    0.00   46.00    0.00 
  11754.00
10:17:47       0    0.00    0.00   54.00    0.00    1.00   45.00    0.00 
  11718.00
10:17:48       0    0.00    0.00   47.00    0.00    2.00   51.00    0.00 
  11639.00


LRO:
10:21:55     CPU   %user   %nice %system %iowait    %irq   %soft   %idle 
    intr/s
10:21:56       0    0.00    0.00   66.00    0.00    1.00   33.00    0.00 
  13228.00
10:21:57       0    0.00    0.00   65.35    0.00    1.98   32.67    0.00 
  13118.81
10:21:58       0    0.00    0.00   63.00    0.00    1.00   36.00    0.00 
  13238.00


According to oprofile, the top 20 samples running GRO are:
CPU: AMD64 processors, speed 2050.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a 
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name 
symbol name
4382     30.5408  vmlinux                  vmlinux 
copy_user_generic_string
534       3.7218  myri10ge.ko              myri10ge 
myri10ge_poll
463       3.2269  vmlinux                  vmlinux 
_raw_spin_lock
394       2.7460  vmlinux                  vmlinux 
rb_get_reader_page
382       2.6624  vmlinux                  vmlinux 
acpi_pm_read
356       2.4812  vmlinux                  vmlinux 
inet_gro_receive
293       2.0421  oprofiled                oprofiled                (no 
symbols)
268       1.8679  vmlinux                  vmlinux 
find_next_bit
268       1.8679  vmlinux                  vmlinux 
tg_shares_up
257       1.7912  vmlinux                  vmlinux 
ring_buffer_consume
247       1.7215  myri10ge.ko              myri10ge 
myri10ge_alloc_rx_pages
247       1.7215  vmlinux                  vmlinux 
tcp_gro_receive
228       1.5891  vmlinux                  vmlinux 
__free_pages_ok
219       1.5263  vmlinux                  vmlinux 
skb_gro_receive
167       1.1639  vmlinux                  vmlinux 
skb_gro_header
149       1.0385  bash                     bash                     (no 
symbols)
141       0.9827  vmlinux                  vmlinux 
skb_copy_datagram_iovec
132       0.9200  vmlinux                  vmlinux 
rb_buffer_peek
129       0.8991  vmlinux                  vmlinux 
_raw_spin_unlock
123       0.8573  vmlinux                  vmlinux 
delay_tsc

Nothing really stands out for me.  Here is LRO:


Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a 
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name 
symbol name
4884     33.1164  vmlinux                  vmlinux 
copy_user_generic_string
721       4.8888  myri10ge.ko              myri10ge 
myri10ge_poll
580       3.9327  vmlinux                  vmlinux 
_raw_spin_lock
409       2.7733  vmlinux                  vmlinux 
acpi_pm_read
306       2.0749  vmlinux                  vmlinux 
rb_get_reader_page
293       1.9867  oprofiled                oprofiled                (no 
symbols)
286       1.9392  myri10ge.ko              myri10ge 
myri10ge_get_frag_header
253       1.7155  vmlinux                  vmlinux 
__lro_proc_segment
250       1.6951  vmlinux                  vmlinux 
rb_buffer_peek
247       1.6748  vmlinux                  vmlinux 
ring_buffer_consume
232       1.5731  vmlinux                  vmlinux 
__free_pages_ok
211       1.4307  myri10ge.ko              myri10ge 
myri10ge_alloc_rx_pages
206       1.3968  vmlinux                  vmlinux 
tg_shares_up
175       1.1866  vmlinux                  vmlinux 
skb_copy_datagram_iovec
158       1.0713  vmlinux                  vmlinux 
find_next_bit
146       0.9900  vmlinux                  vmlinux 
lro_tcp_ip_check
131       0.8883  oprofile.ko              oprofile 
op_cpu_buffer_read_entry
127       0.8611  vmlinux                  vmlinux 
delay_tsc
125       0.8476  bash                     bash                     (no 
symbols)
125       0.8476  vmlinux                  vmlinux 
_raw_spin_unlock


If I can't figure out why LRO is so much faster in some cases, then I
think maybe I'll just put together a patch which keeps LRO, and does
GRO only if LRO is disabled.  Kind of ugly, but better than loosing
15% performance on some machines.

Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 28, 2009, 3:02 p.m. UTC | #4
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 28 Apr 2009 11:00:16 -0400

> If I can't figure out why LRO is so much faster in some cases, then I
> think maybe I'll just put together a patch which keeps LRO, and does
> GRO only if LRO is disabled.  Kind of ugly, but better than loosing
> 15% performance on some machines.

I refuse to apply such a patch.

Figure out this performance problem, don't work around it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 28, 2009, 3:20 p.m. UTC | #5
On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
>
> Its strange, I still consistently see about 1Gb/s better performance
> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
> when binding everything to the same CPU. Mpstat -P 0 shows roughly
> 10% more time spent in "soft" when using GRO vs LRO:

Did you check the utilisation of the all the cores on the sender?

Cheers,
Andrew Gallatin April 28, 2009, 3:44 p.m. UTC | #6
Herbert Xu wrote:
 > On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
 >> Its strange, I still consistently see about 1Gb/s better performance
 >> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
 >> when binding everything to the same CPU. Mpstat -P 0 shows roughly
 >> 10% more time spent in "soft" when using GRO vs LRO:
 >
 > Did you check the utilisation of the all the cores on the sender?

Yes.  It is about the same +/- 2%.  The utilization when sending
to GRO is a bit lower, but its going slower.

Here is what might be more interesting.. I'm trying to isolate the
softirq path in oprofile.  So in this test, I bound the IRQ to CPU1,
and the netserver to CPU0.  In these tests, I see near line rate from
both LRO and GRO.  Here is oprofile output separated by CPU, and
sorted on CPU1.  (Sorry about binding to CPU1 and making the output
more confusing; I could not get oprofile to emit samples when the irq
was bound to CPU0).  I've included the top 20 entries:

GRO:
0              0  1414     15.8485  myri10ge.ko              myri10ge 
               myri10ge_poll
0              0  932      10.4461  vmlinux                  vmlinux 
               inet_gro_receive
0              0  705       7.9018  vmlinux                  vmlinux 
               tcp_gro_receive
0              0  681       7.6328  vmlinux                  vmlinux 
               skb_gro_receive
0              0  652       7.3078  vmlinux                  vmlinux 
               skb_gro_header
0              0  517       5.7947  vmlinux                  vmlinux 
               __napi_gro_receive
0              0  316       3.5418  vmlinux                  vmlinux 
               dev_gro_receive
0              0  309       3.4633  myri10ge.ko              myri10ge 
               myri10ge_alloc_rx_pages
415       3.1243  251       2.8133  vmlinux                  vmlinux 
               _raw_spin_lock
0              0  233       2.6115  vmlinux                  vmlinux 
               napi_frags_skb
0              0  178       1.9951  vmlinux                  vmlinux 
               tcp4_gro_receive
306       2.3037  152       1.7037  vmlinux                  vmlinux 
               rb_get_reader_page
0              0  150       1.6812  vmlinux                  vmlinux 
               napi_get_frags
188       1.4153  131       1.4683  vmlinux                  vmlinux 
               rb_buffer_peek
195       1.4680  101       1.1320  vmlinux                  vmlinux 
               ring_buffer_consume
0              0  96        1.0760  vmlinux                  vmlinux 
               ip_rcv_finish
0              0  94        1.0536  vmlinux                  vmlinux 
               napi_gro_frags
0              0  92        1.0312  vmlinux                  vmlinux 
               skb_copy_bits
0              0  86        0.9639  vmlinux                  vmlinux 
               napi_frags_finish
225       1.6939  85        0.9527  oprofile.ko              oprofile 
               op_cpu_buffer_read_entry

LRO:
0              0  1937     15.1281  myri10ge.ko              myri10ge 
               myri10ge_poll
0              0  1876     14.6517  myri10ge.ko              myri10ge 
               myri10ge_get_frag_header
0              0  943       7.3649  vmlinux                  vmlinux 
               __lro_proc_segment
0              0  723       5.6467  myri10ge.ko              myri10ge 
               myri10ge_alloc_rx_pages
0              0  392       3.0615  vmlinux                  vmlinux 
               lro_gen_skb
0              0  369       2.8819  vmlinux                  vmlinux 
               lro_tcp_ip_check
353       2.7435  357       2.7882  vmlinux                  vmlinux 
               _raw_spin_lock
290       2.2538  328       2.5617  vmlinux                  vmlinux 
               rb_get_reader_page
4         0.0311  270       2.1087  vmlinux                  vmlinux 
               csum_partial
26        0.2021  214       1.6714  vmlinux                  vmlinux 
               memset_c
0              0  202       1.5776  vmlinux                  vmlinux 
               lro_add_common
8         0.0622  191       1.4917  vmlinux                  vmlinux 
               __slab_alloc
0              0  188       1.4683  vmlinux                  vmlinux 
               ip_rcv_finish
84        0.6528  183       1.4292  vmlinux                  vmlinux 
               _raw_spin_unlock
0              0  180       1.4058  vmlinux                  vmlinux 
               lro_tcp_data_csum
0              0  180       1.4058  vmlinux                  vmlinux 
               lro_get_desc
167       1.2979  178       1.3902  vmlinux                  vmlinux 
               ring_buffer_consume
0              0  167       1.3043  vmlinux                  vmlinux 
               netif_receive_skb
0              0  143       1.1168  vmlinux                  vmlinux 
               ip_route_input
0              0  125       0.9763  vmlinux                  vmlinux 
               __inet_lookup_established



Does anything strike you as being inordinately expensive for GRO?

Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Gallatin April 28, 2009, 9:12 p.m. UTC | #7
For variety, I grabbed a different "slow" receiver.  This is another
2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 37
model name      : AMD Opteron(tm) Processor 252
stepping        : 1
cpu MHz         : 2611.738
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt 
lm 3dnowext 3dnow rep_good pni lahf_lm
bogomips        : 5223.47
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

The sender was an identical machine running an ancient RHEL4 kernel
(2.6.9-42.ELsmp) and our downloadable (backported) driver.
(http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
I disabled LRO, on the sender.

Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
LRO and 8.0Gb/s with GRO.

Binding the IRQ to CPU0, and the netserver to CPU0, I see 6.9Gb/s
with LRO and 5.5 Gb/s with GRO.  Monitoring the packet/byte counts
on the interface once per second, LRO looks like this:

        Ipkts       IBytes        Opkts       Obytes
       588992    891733888         9758       644028
       589610    892669540         9771       644886
       589079    891865606         9754       643764

And GRO looks like this:

       480309    727187826         7949       524634
       480032    726768448         7947       524502
       480000    726720000         7943       524238


Similarly, in this same scenario, binding the app/irq to the same
CPU and running mpstat -P 0 1 shows about 60%sys and 40% irq+softirq
while GRO shows about 45% sys and 55% irq+softirq.

I can't put my finger on it, but something about GRO is certainly
more expensive on these types of machines.  I wish there was some
way you could see it, since it happens on every older AMD I try
it on.  If you haven't been able to reproduce it, I'll see if I
can make it happen on a newer "slow" amd64 box I have tomorrow.


Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Gallatin April 29, 2009, 1:42 p.m. UTC | #8
Andrew Gallatin wrote:
> For variety, I grabbed a different "slow" receiver.  This is another
> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
> 
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 37
> model name      : AMD Opteron(tm) Processor 252
<...>
> The sender was an identical machine running an ancient RHEL4 kernel
> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
> I disabled LRO, on the sender.
> 
> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
> LRO and 8.0Gb/s with GRO.

With the recent patch to fix idle CPU time accounting from LKML applied,
it is again possible to trust netperf's service demand (based on %CPU).
So here is raw netperf output for LRO and GRO, bound as above.

TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service 
Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local 
remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

LRO:
  87380  65536  65536    60.00      8279.36   8.10     77.55    0.160 
1.535
GRO:
  87380  65536  65536    60.00      8053.19   7.86     85.47    0.160 
1.739

The difference is bigger if you disable TCP timestamps (and thus shrink
the packets headers down so they require fewer cachelines):
LRO:
  87380  65536  65536    60.02      7753.55   8.01     74.06    0.169 
1.565
GRO:
  87380  65536  65536    60.02      7535.12   7.27     84.57    0.158 
1.839


As you can see, even though the raw bandwidth is very close, the
service demand makes it clear that GRO is more expensive
than LRO.  I just wish I understood why.

Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 29, 2009, 1:53 p.m. UTC | #9
Andrew Gallatin a écrit :
> Andrew Gallatin wrote:
>> For variety, I grabbed a different "slow" receiver.  This is another
>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>>
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 15
>> model           : 37
>> model name      : AMD Opteron(tm) Processor 252
> <...>
>> The sender was an identical machine running an ancient RHEL4 kernel
>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>> I disabled LRO, on the sender.
>>
>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>> LRO and 8.0Gb/s with GRO.
> 
> With the recent patch to fix idle CPU time accounting from LKML applied,
> it is again possible to trust netperf's service demand (based on %CPU).
> So here is raw netperf output for LRO and GRO, bound as above.
> 
> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
> Recv   Send    Send                          Utilization       Service
> Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB  
> us/KB
> 
> LRO:
>  87380  65536  65536    60.00      8279.36   8.10     77.55    0.160 1.535
> GRO:
>  87380  65536  65536    60.00      8053.19   7.86     85.47    0.160 1.739
> 
> The difference is bigger if you disable TCP timestamps (and thus shrink
> the packets headers down so they require fewer cachelines):
> LRO:
>  87380  65536  65536    60.02      7753.55   8.01     74.06    0.169 1.565
> GRO:
>  87380  65536  65536    60.02      7535.12   7.27     84.57    0.158 1.839
> 
> 
> As you can see, even though the raw bandwidth is very close, the
> service demand makes it clear that GRO is more expensive
> than LRO.  I just wish I understood why.
> 

What are "vmstat 1" ouputs on both tests ? Any difference on say... context switches ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Gallatin April 29, 2009, 2:18 p.m. UTC | #10
Eric Dumazet wrote:
 > Andrew Gallatin a écrit :
 >> Andrew Gallatin wrote:
 >>> For variety, I grabbed a different "slow" receiver.  This is another
 >>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
 >>>
 >>> processor       : 0
 >>> vendor_id       : AuthenticAMD
 >>> cpu family      : 15
 >>> model           : 37
 >>> model name      : AMD Opteron(tm) Processor 252
 >> <...>
 >>> The sender was an identical machine running an ancient RHEL4 kernel
 >>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
 >>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
 >>> I disabled LRO, on the sender.
 >>>
 >>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
 >>> LRO and 8.0Gb/s with GRO.
 >> With the recent patch to fix idle CPU time accounting from LKML applied,
 >> it is again possible to trust netperf's service demand (based on %CPU).
 >> So here is raw netperf output for LRO and GRO, bound as above.
 >>
 >> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 >> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
 >> Recv   Send    Send                          Utilization       Service
 >> Demand
 >> Socket Socket  Message  Elapsed              Send     Recv     Send 
    Recv
 >> Size   Size    Size     Time     Throughput  local    remote   local 
remote
 >> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB
 >> us/KB
 >>
 >> LRO:
 >>  87380  65536  65536    60.00      8279.36   8.10     77.55    0.160 
1.535
 >> GRO:
 >>  87380  65536  65536    60.00      8053.19   7.86     85.47    0.160 
1.739
 >>
 >> The difference is bigger if you disable TCP timestamps (and thus shrink
 >> the packets headers down so they require fewer cachelines):
 >> LRO:
 >>  87380  65536  65536    60.02      7753.55   8.01     74.06    0.169 
1.565
 >> GRO:
 >>  87380  65536  65536    60.02      7535.12   7.27     84.57    0.158 
1.839
 >>
 >>
 >> As you can see, even though the raw bandwidth is very close, the
 >> service demand makes it clear that GRO is more expensive
 >> than LRO.  I just wish I understood why.
 >>
 >
 > What are "vmstat 1" ouputs on both tests ? Any difference on say... 
context switches ?

Not much difference is apparent from vmstat, except for a
lower load and slightly higher IRQ rate from LRO:

LRO:
procs -----------memory---------- ---swap-- -----io---- --system-- 
-----cpu------
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy 
id wa st
  1  0      0 676960  19280 209812    0    0     0     0 14817   24  0 
73 27  0  0
  1  0      0 677084  19280 209812    0    0     0     0 14834   20  0 
73 27  0  0
  1  0      0 676916  19280 209812    0    0     0     0 14833   16  0 
74 26  0  0


GRO:
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy 
id wa st
  1  0      0 678244  18008 209784    0    0     0    24 14288   32  0 
84 16  0  0
  1  0      0 678268  18008 209788    0    0     0     0 14403   22  0 
85 15  0  0
  1  0      0 677956  18008 209788    0    0     0     0 14331   20  0 
84 16  0  0




The real difference is visible mainly from mpstat on the CPU handing the
interrupts where you see softirq is much higher:

LRO:
07:15:16     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal 
   %idle    intr/s
07:15:17       0    0.00    0.00    0.00    0.00    0.00   45.00    0.00 
   55.00  12907.92
07:15:18       0    0.00    0.00    1.00    0.00    2.00   43.00    0.00 
   54.00  12707.92
07:15:19       0    0.00    0.00    1.00    0.00    0.00   46.00    0.00 
   53.00  12825.00


GRO
07:11:59     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal 
   %idle    intr/s
07:12:00       0    0.00    0.00    0.00    0.00    0.99   66.34    0.00 
   32.67  12242.57
07:12:01       0    0.00    0.00    0.00    0.00    1.01   66.67    0.00 
   32.32  12220.00
07:12:02       0    0.00    0.00    0.99    0.00    0.99   65.35    0.00 
   32.67  12336.00


So it is like "something" GRO is doing in the softirq context is more
expensive than what LRO is doing.

Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 29, 2009, 3:26 p.m. UTC | #11
Andrew Gallatin a écrit :
> Eric Dumazet wrote:
>> Andrew Gallatin a écrit :
>>> Andrew Gallatin wrote:
>>>> For variety, I grabbed a different "slow" receiver.  This is another
>>>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>>>>
>>>> processor       : 0
>>>> vendor_id       : AuthenticAMD
>>>> cpu family      : 15
>>>> model           : 37
>>>> model name      : AMD Opteron(tm) Processor 252
>>> <...>
>>>> The sender was an identical machine running an ancient RHEL4 kernel
>>>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>>>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>>>> I disabled LRO, on the sender.
>>>>
>>>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>>>> LRO and 8.0Gb/s with GRO.
>>> With the recent patch to fix idle CPU time accounting from LKML applied,
>>> it is again possible to trust netperf's service demand (based on %CPU).
>>> So here is raw netperf output for LRO and GRO, bound as above.
>>>
>>> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>>> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
>>> Recv   Send    Send                          Utilization       Service
>>> Demand
>>> Socket Socket  Message  Elapsed              Send     Recv     Send
>    Recv
>>> Size   Size    Size     Time     Throughput  local    remote   local
> remote
>>> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB
>>> us/KB
>>>
>>> LRO:
>>>  87380  65536  65536    60.00      8279.36   8.10     77.55    0.160
> 1.535
>>> GRO:
>>>  87380  65536  65536    60.00      8053.19   7.86     85.47    0.160
> 1.739
>>>
>>> The difference is bigger if you disable TCP timestamps (and thus shrink
>>> the packets headers down so they require fewer cachelines):
>>> LRO:
>>>  87380  65536  65536    60.02      7753.55   8.01     74.06    0.169
> 1.565
>>> GRO:
>>>  87380  65536  65536    60.02      7535.12   7.27     84.57    0.158
> 1.839
>>>
>>>
>>> As you can see, even though the raw bandwidth is very close, the
>>> service demand makes it clear that GRO is more expensive
>>> than LRO.  I just wish I understood why.
>>>
>>
>> What are "vmstat 1" ouputs on both tests ? Any difference on say...
> context switches ?
> 
> Not much difference is apparent from vmstat, except for a
> lower load and slightly higher IRQ rate from LRO:
> 
> LRO:
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy
> id wa st
>  1  0      0 676960  19280 209812    0    0     0     0 14817   24  0 73
> 27  0  0
>  1  0      0 677084  19280 209812    0    0     0     0 14834   20  0 73
> 27  0  0
>  1  0      0 676916  19280 209812    0    0     0     0 14833   16  0 74
> 26  0  0
> 
> 
> GRO:
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy
> id wa st
>  1  0      0 678244  18008 209784    0    0     0    24 14288   32  0 84
> 16  0  0
>  1  0      0 678268  18008 209788    0    0     0     0 14403   22  0 85
> 15  0  0
>  1  0      0 677956  18008 209788    0    0     0     0 14331   20  0 84
> 16  0  0
> 
> 
> 
> 
> The real difference is visible mainly from mpstat on the CPU handing the
> interrupts where you see softirq is much higher:
> 
> LRO:
> 07:15:16     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal
>   %idle    intr/s
> 07:15:17       0    0.00    0.00    0.00    0.00    0.00   45.00    0.00
>   55.00  12907.92
> 07:15:18       0    0.00    0.00    1.00    0.00    2.00   43.00    0.00
>   54.00  12707.92
> 07:15:19       0    0.00    0.00    1.00    0.00    0.00   46.00    0.00
>   53.00  12825.00
> 
> 
> GRO
> 07:11:59     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal
>   %idle    intr/s
> 07:12:00       0    0.00    0.00    0.00    0.00    0.99   66.34    0.00
>   32.67  12242.57
> 07:12:01       0    0.00    0.00    0.00    0.00    1.01   66.67    0.00
>   32.32  12220.00
> 07:12:02       0    0.00    0.00    0.99    0.00    0.99   65.35    0.00
>   32.67  12336.00
> 
> 
> So it is like "something" GRO is doing in the softirq context is more
> expensive than what LRO is doing.


Sure, probably more cache misses or something...

You could try a longer oprofile session (with at least one million samples)
and :

opannotate -a vmlinux >/tmp/FILE

And select 3 or 4 suspect functions : inet_gro_receive() tcp_gro_receive(),
skb_gro_receive(), skb_gro_header()


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Gallatin April 29, 2009, 5:28 p.m. UTC | #12
Eric Dumazet wrote:

 >
 > Sure, probably more cache misses or something...

Yes, that's what I thought.  The code is much more complete,
and spread out than LRO, and seems to open itself to cache
misses.

 > You could try a longer oprofile session (with at least one million 
samples)
 > and :
 >
 > opannotate -a vmlinux >/tmp/FILE
 >
 > And select 3 or 4 suspect functions : inet_gro_receive() 
tcp_gro_receive(),
 > skb_gro_receive(), skb_gro_header()

Here is the opreport -l output from this machine for GRO for a 25 minute
profiling run:


samples  %        image name               app name 
symbol name
3742674  32.2793  vmlinux                  vmlinux 
copy_user_generic_string
890179    7.6775  myri10ge.ko              myri10ge 
myri10ge_poll
547572    4.7226  vmlinux                  vmlinux 
inet_gro_receive
477479    4.1181  vmlinux                  vmlinux 
skb_gro_receive
406562    3.5065  vmlinux                  vmlinux 
free_hot_cold_page
396796    3.4222  vmlinux                  vmlinux 
tcp_gro_receive
332364    2.8665  vmlinux                  vmlinux 
__rmqueue_smallest
319455    2.7552  vmlinux                  vmlinux 
skb_gro_header
269040    2.3204  vmlinux                  vmlinux 
dev_gro_receive
252885    2.1810  vmlinux                  vmlinux 
free_pages_bulk
247832    2.1375  vmlinux                  vmlinux 
get_pageblock_flags_group
211592    1.8249  myri10ge.ko              myri10ge 
myri10ge_alloc_rx_pages
208867    1.8014  vmlinux                  vmlinux 
__list_add
201491    1.7378  vmlinux                  vmlinux 
tcp4_gro_receive
187591    1.6179  vmlinux                  vmlinux 
__napi_gro_receive
170156    1.4675  vmlinux                  vmlinux 
get_page_from_freelist
116321    1.0032  vmlinux                  vmlinux                  list_del
107994    0.9314  vmlinux                  vmlinux                  kfree
106434    0.9180  vmlinux                  vmlinux 
skb_copy_datagram_iovec
100675    0.8683  vmlinux                  vmlinux                  put_page

And is here is the opannotate -a output for a few GRO functions.  BTW, 
did you mean -s
rather than -a?   I'd naively think source might be more helpful.  But 
here is
what you asked for:

ffffffff80479f20 <inet_gro_receive>: /* inet_gro_receive total: 547572 
5.2554 */
  12187  0.1170 :ffffffff80479f20:       push   %r13
   2611  0.0251 :ffffffff80479f22:       mov    %rdi,%r13
                :ffffffff80479f25:       push   %r12
                :ffffffff80479f27:       push   %rbp
   4031  0.0387 :ffffffff80479f28:       push   %rbx
                :ffffffff80479f29:       mov    %rsi,%rbx
                :ffffffff80479f2c:       mov    $0x14,%esi
   6303  0.0605 :ffffffff80479f31:       mov    %rbx,%rdi
                :ffffffff80479f34:       sub    $0x8,%rsp
                :ffffffff80479f38:       callq  ffffffff804357a1 
<skb_gro_header>
                :ffffffff80479f3d:       test   %rax,%rax
   2494  0.0239 :ffffffff80479f40:       mov    %rax,%r8
                :ffffffff80479f43:       je     ffffffff8047a0a4 
<inet_gro_receive+0x184>
                :ffffffff80479f49:       movzbl 0x9(%rax),%eax
   2541  0.0244 :ffffffff80479f4d:       mov 
0xffffffff80d06280(,%rax,8),%r11
     33 3.2e-04 :ffffffff80479f55:       test   %r11,%r11
      5 4.8e-05 :ffffffff80479f58:       je     ffffffff8047a0a4 
<inet_gro_receive+0x184>
  11016  0.1057 :ffffffff80479f5e:       cmpq   $0x0,0x20(%r11)
    292  0.0028 :ffffffff80479f63:       je     ffffffff8047a0a4 
<inet_gro_receive+0x184>
      1 9.6e-06 :ffffffff80479f69:       cmpb   $0x45,(%r8)
   4297  0.0412 :ffffffff80479f6d:       jne    ffffffff8047a0a4 
<inet_gro_receive+0x184>
   6086  0.0584 :ffffffff80479f73:       mov    $0x5,%eax
                :ffffffff80479f78:       mov    %r8,%rcx
  18706  0.1795 :ffffffff80479f7b:       mov    (%rcx),%edx
    341  0.0033 :ffffffff80479f7d:       sub    $0x4,%eax
                :ffffffff80479f80:       jbe    ffffffff80479fa6 
<inet_gro_receive+0x86>
   4609  0.0442 :ffffffff80479f82:       add    0x4(%rcx),%edx
    398  0.0038 :ffffffff80479f85:       adc    0x8(%rcx),%edx
                :ffffffff80479f88:       adc    0xc(%rcx),%edx
   4310  0.0414 :ffffffff80479f8b:       adc    0x10(%rcx),%edx
    790  0.0076 :ffffffff80479f8e:       lea    0x4(%rcx),%rcx
                :ffffffff80479f92:       dec    %eax
   9097  0.0873 :ffffffff80479f94:       jne    ffffffff80479f8b 
<inet_gro_receive+0x6b>
    541  0.0052 :ffffffff80479f96:       adc    $0x0,%edx
                :ffffffff80479f99:       mov    %edx,%eax
   1919  0.0184 :ffffffff80479f9b:       shr    $0x10,%edx
    535  0.0051 :ffffffff80479f9e:       add    %ax,%dx
                :ffffffff80479fa1:       adc    $0x0,%edx
   3633  0.0349 :ffffffff80479fa4:       not    %edx
    683  0.0066 :ffffffff80479fa6:       test   %dx,%dx
      1 9.6e-06 :ffffffff80479fa9:       jne    ffffffff8047a0a4 
<inet_gro_receive+0x184>
   4725  0.0453 :ffffffff80479faf:       movzwl 0x2(%r8),%eax
   9728  0.0934 :ffffffff80479fb4:       mov    0x68(%rbx),%edx
      8 7.7e-05 :ffffffff80479fb7:       mov    $0x1,%ebp
  43000  0.4127 :ffffffff80479fbc:       sub    0x38(%rbx),%edx
  11149  0.1070 :ffffffff80479fbf:       mov    %eax,%ecx
                :ffffffff80479fc1:       shl    $0x8,%eax
  66497  0.6382 :ffffffff80479fc4:       shr    $0x8,%ecx
    735  0.0071 :ffffffff80479fc7:       or     %ecx,%eax
                :ffffffff80479fc9:       movzwl %ax,%eax
   5459  0.0524 :ffffffff80479fcc:       cmp    %edx,%eax
    522  0.0050 :ffffffff80479fce:       jne    ffffffff80479fdc 
<inet_gro_receive+0xbc>
                :ffffffff80479fd0:       xor    %ebp,%ebp
   5373  0.0516 :ffffffff80479fd2:       cmpw   $0x40,0x6(%r8)
    345  0.0033 :ffffffff80479fd8:       setne  %bpl
                :ffffffff80479fdc:       movzwl 0x4(%r8),%eax
   2384  0.0229 :ffffffff80479fe1:       mov    0x0(%r13),%r10
    631  0.0061 :ffffffff80479fe5:       mov    %eax,%edx
                :ffffffff80479fe7:       shl    $0x8,%eax
   3044  0.0292 :ffffffff80479fea:       shr    $0x8,%edx
    303  0.0029 :ffffffff80479fed:       or     %edx,%eax
                :ffffffff80479fef:       movzwl %ax,%r12d
   2747  0.0264 :ffffffff80479ff3:       jmp    ffffffff8047a071 
<inet_gro_receive+0x151>
   2109  0.0202 :ffffffff80479ff5:       lea    0x38(%r10),%r9
     12 1.2e-04 :ffffffff80479ff9:       cmpl   $0x0,0x4(%r9)
     23 2.2e-04 :ffffffff80479ffe:       je     ffffffff8047a06e 
<inet_gro_receive+0x14e>
   2104  0.0202 :ffffffff8047a000:       mov    0xac(%r10),%edi
      2 1.9e-05 :ffffffff8047a007:       add    0xc0(%r10),%rdi
                :ffffffff8047a00e:       mov    0x9(%rdi),%sil
   2391  0.0229 :ffffffff8047a012:       mov    0x1(%rdi),%al
      2 1.9e-05 :ffffffff8047a015:       xor    0x9(%r8),%sil
      7 6.7e-05 :ffffffff8047a019:       xor    0x1(%r8),%al
   2101  0.0202 :ffffffff8047a01d:       mov    0xc(%rdi),%edx
      1 9.6e-06 :ffffffff8047a020:       mov    0x10(%rdi),%ecx
                :ffffffff8047a023:       xor    0xc(%r8),%edx
   2775  0.0266 :ffffffff8047a027:       xor    0x10(%r8),%ecx
                :ffffffff8047a02b:       or     %esi,%eax
                :ffffffff8047a02d:       movzbl %al,%eax
  62734  0.6021 :ffffffff8047a030:       or     %edx,%ecx
                :ffffffff8047a032:       or     %eax,%ecx
                :ffffffff8047a034:       je     ffffffff8047a040 
<inet_gro_receive+0x120>
                :ffffffff8047a036:       movl   $0x0,0x4(%r9)
                :ffffffff8047a03e:       jmp    ffffffff8047a06e 
<inet_gro_receive+0x14e>
   2106  0.0202 :ffffffff8047a040:       movzwl 0x4(%rdi),%edx
                :ffffffff8047a044:       mov    0x8(%rdi),%al
                :ffffffff8047a047:       xor    0x8(%r8),%eax
  64244  0.6166 :ffffffff8047a04b:       mov    %edx,%ecx
                :ffffffff8047a04d:       shl    $0x8,%edx
                :ffffffff8047a050:       shr    $0x8,%ecx
   2072  0.0199 :ffffffff8047a053:       movzbl %al,%eax
                :ffffffff8047a056:       or     0x8(%r9),%eax
                :ffffffff8047a05a:       or     %ecx,%edx
   2629  0.0252 :ffffffff8047a05c:       add    0xc(%r9),%edx
      2 1.9e-05 :ffffffff8047a060:       movzwl %dx,%edx
                :ffffffff8047a063:       xor    %r12d,%edx
  58223  0.5588 :ffffffff8047a066:       or     %edx,%eax
      3 2.9e-05 :ffffffff8047a068:       or     %ebp,%eax
                :ffffffff8047a06a:       mov    %eax,0x8(%r9)
  21878  0.2100 :ffffffff8047a06e:       mov    (%r10),%r10
   2156  0.0207 :ffffffff8047a071:       test   %r10,%r10
                :ffffffff8047a074:       jne    ffffffff80479ff5 
<inet_gro_receive+0xd5>
   3007  0.0289 :ffffffff8047a07a:       mov    0x38(%rbx),%eax
     61 5.9e-04 :ffffffff8047a07d:       or     %ebp,0x40(%rbx)
      3 2.9e-05 :ffffffff8047a080:       mov    %rbx,%rsi
   3091  0.0297 :ffffffff8047a083:       mov    %r13,%rdi
     41 3.9e-04 :ffffffff8047a086:       add    $0x14,%eax
                :ffffffff8047a089:       mov    %eax,0x38(%rbx)
   3704  0.0355 :ffffffff8047a08c:       sub    0xc0(%rbx),%eax
     33 3.2e-04 :ffffffff8047a092:       add    0xc8(%rbx),%eax
                :ffffffff8047a098:       mov    %eax,0xa8(%rbx)
   2468  0.0237 :ffffffff8047a09e:       callq  *0x20(%r11)
  20011  0.1921 :ffffffff8047a0a2:       jmp    ffffffff8047a0ab 
<inet_gro_receive+0x18b>
                :ffffffff8047a0a4:       xor    %eax,%eax
                :ffffffff8047a0a6:       mov    $0x1,%ebp
  24082  0.2311 :ffffffff8047a0ab:       or     %ebp,0x40(%rbx)
    626  0.0060 :ffffffff8047a0ae:       pop    %r10
   1718  0.0165 :ffffffff8047a0b0:       pop    %rbx
    446  0.0043 :ffffffff8047a0b1:       pop    %rbp
   4074  0.0391 :ffffffff8047a0b2:       pop    %r12
   2089  0.0200 :ffffffff8047a0b4:       pop    %r13
    434  0.0042 :ffffffff8047a0b6:       retq




ffffffff80430ea9 <skb_gro_receive>: /* skb_gro_receive total: 477479 
4.5827 */
   2158  0.0207 :ffffffff80430ea9:       push   %r15
   2492  0.0239 :ffffffff80430eab:       mov    %rdi,%r15
                :ffffffff80430eae:       push   %r14
                :ffffffff80430eb0:       push   %r13
   2432  0.0233 :ffffffff80430eb2:       push   %r12
      1 9.6e-06 :ffffffff80430eb4:       push   %rbp
      1 9.6e-06 :ffffffff80430eb5:       mov    %rsi,%rbp
   2430  0.0233 :ffffffff80430eb8:       push   %rbx
                :ffffffff80430eb9:       sub    $0x8,%rsp
                :ffffffff80430ebd:       mov    0x68(%rsi),%ecx
   2420  0.0232 :ffffffff80430ec0:       mov    (%rdi),%r12
      1 9.6e-06 :ffffffff80430ec3:       mov    %ecx,%r14d
      1 9.6e-06 :ffffffff80430ec6:       sub    0x38(%rsi),%r14d
   2317  0.0222 :ffffffff80430eca:       mov    %r14d,%eax
      1 9.6e-06 :ffffffff80430ecd:       add    0x68(%r12),%eax
      1 9.6e-06 :ffffffff80430ed2:       cmp    $0xffff,%eax
   3865  0.0371 :ffffffff80430ed7:       ja     ffffffff80431261 
<skb_gro_receive+0x3b8>
                :ffffffff80430edd:       mov    0xb8(%r12),%eax
                :ffffffff80430ee5:       mov    0xc0(%r12),%rdx
   8082  0.0776 :ffffffff80430eed:       lea    (%rdx,%rax,1),%rsi
                :ffffffff80430ef1:       cmpq   $0x0,0x18(%rsi)
      2 1.9e-05 :ffffffff80430ef6:       jne    ffffffff804311ab 
<skb_gro_receive+0x302>
   9249  0.0888 :ffffffff80430efc:       mov    %ecx,%edi
                :ffffffff80430efe:       sub    0x6c(%rbp),%edi
      6 5.8e-05 :ffffffff80430f01:       cmp    0x38(%rbp),%edi
   3104  0.0298 :ffffffff80430f04:       ja     ffffffff80430fe2 
<skb_gro_receive+0x139>
      2 1.9e-05 :ffffffff80430f0a:       mov    0xb8(%rbp),%ecx
                :ffffffff80430f10:       movzwl 0x4(%rsi),%edx
   8825  0.0847 :ffffffff80430f14:       add    0xc0(%rbp),%rcx
                :ffffffff80430f1b:       movzwl 0x4(%rcx),%eax
     21 2.0e-04 :ffffffff80430f1f:       add    %edx,%eax
  19668  0.1888 :ffffffff80430f21:       cmp    $0x12,%eax
      1 9.6e-06 :ffffffff80430f24:       ja     ffffffff80431261 
<skb_gro_receive+0x3b8>
                :ffffffff80430f2a:       mov    0x38(%rcx),%eax
   1974  0.0189 :ffffffff80430f2d:       add    0x38(%rbp),%eax
                :ffffffff80430f30:       cld
                :ffffffff80430f31:       sub    %edi,%eax
   7666  0.0736 :ffffffff80430f33:       mov    %eax,0x38(%rcx)
      2 1.9e-05 :ffffffff80430f36:       mov    0xb8(%rbp),%edx
                :ffffffff80430f3c:       add    0xc0(%rbp),%rdx
  52468  0.5036 :ffffffff80430f43:       mov    0x3c(%rdx),%eax
      2 1.9e-05 :ffffffff80430f46:       add    0x68(%rbp),%eax
      1 9.6e-06 :ffffffff80430f49:       sub    0x6c(%rbp),%eax
   6592  0.0633 :ffffffff80430f4c:       sub    0x38(%rbp),%eax
                :ffffffff80430f4f:       mov    %eax,0x3c(%rdx)
                :ffffffff80430f52:       mov    0xb8(%r12),%eax
  23018  0.2209 :ffffffff80430f5a:       add    0xc0(%r12),%rax
      1 9.6e-06 :ffffffff80430f62:       mov    0xb8(%rbp),%esi
                :ffffffff80430f68:       add    0xc0(%rbp),%rsi
   8477  0.0814 :ffffffff80430f6f:       movzwl 0x4(%rax),%edi
      6 5.8e-05 :ffffffff80430f73:       movzwl 0x4(%rsi),%ecx
                :ffffffff80430f77:       add    $0x30,%rsi
  21338  0.2048 :ffffffff80430f7b:       shl    $0x4,%rdi
      3 2.9e-05 :ffffffff80430f7f:       lea    0x30(%rdi,%rax,1),%rdi
      1 9.6e-06 :ffffffff80430f84:       shl    $0x4,%rcx
150632  1.4457 :ffffffff80430f88:       rep movsb %ds:(%rsi),%es:(%rdi)
   3988  0.0383 :ffffffff80430f8a:       mov    0xb8(%r12),%eax
   2015  0.0193 :ffffffff80430f92:       mov    0xb8(%rbp),%ecx
     11 1.1e-04 :ffffffff80430f98:       add    0xc0(%r12),%rax
      8 7.7e-05 :ffffffff80430fa0:       mov    0xc0(%rbp),%rdx
   3295  0.0316 :ffffffff80430fa7:       mov    0x4(%rdx,%rcx,1),%edx
                :ffffffff80430fab:       add    %dx,0x4(%rax)
      8 7.7e-05 :ffffffff80430faf:       mov    0xb8(%rbp),%edx
   2507  0.0241 :ffffffff80430fb5:       mov    0xc0(%rbp),%rax
                :ffffffff80430fbc:       movw   $0x0,0x4(%rax,%rdx,1)
   3233  0.0310 :ffffffff80430fc3:       mov    0x6c(%rbp),%eax
      1 9.6e-06 :ffffffff80430fc6:       sub    %eax,0xd0(%rbp)
                :ffffffff80430fcc:       sub    %eax,0x68(%rbp)
  41540  0.3987 :ffffffff80430fcf:       movl   $0x0,0x6c(%rbp)
                :ffffffff80430fd6:       movl   $0x1,0x48(%rbp)
                :ffffffff80430fdd:       jmpq   ffffffff8043123f 
<skb_gro_receive+0x396>
                :ffffffff80430fe2:       mov    0xc8(%r12),%rax
                :ffffffff80430fea:       mov    0x20(%r12),%rdi
                :ffffffff80430fef:       mov    %eax,%r13d
                :ffffffff80430ff2:       sub    %edx,%r13d
                :ffffffff80430ff5:       mov    $0x20,%edx
                :ffffffff80430ffa:       mov    %r13d,%esi
                :ffffffff80430ffd:       add    0x38(%r12),%esi
                :ffffffff80431002:       callq  ffffffff8042ffe0 
<__netdev_alloc_skb>
                :ffffffff80431007:       mov    %rax,%rbx
                :ffffffff8043100a:       mov    $0xfffffff4,%eax
                :ffffffff8043100f:       test   %rbx,%rbx
                :ffffffff80431012:       je     ffffffff80431266 
<skb_gro_receive+0x3bd>
                :ffffffff80431018:       mov    %r12,%rsi
                :ffffffff8043101b:       mov    %rbx,%rdi
                :ffffffff8043101e:       callq  ffffffff8042e2c0 
<__copy_skb_header>
                :ffffffff80431023:       mov    0x70(%r12),%eax
                :ffffffff80431028:       add    %r13d,0xb4(%rbx)
                :ffffffff8043102f:       mov    %ax,0x70(%rbx)
                :ffffffff80431033:       movslq %r13d,%rax
                :ffffffff80431036:       add    %rax,0xc8(%rbx)
                :ffffffff8043103d:       cmpl   $0x0,0x6c(%rbx)
                :ffffffff80431041:       mov    0x38(%r12),%edx
                :ffffffff80431046:       mov    0xb4(%rbx),%eax
                :ffffffff8043104c:       je     ffffffff80431052 
<skb_gro_receive+0x1a9>
                :ffffffff8043104e:       ud2a
                :ffffffff80431050:       jmp    ffffffff80431050 
<skb_gro_receive+0x1a7>
                :ffffffff80431052:       lea    (%rdx,%rax,1),%eax
                :ffffffff80431055:       add    %edx,0x68(%rbx)
                :ffffffff80431058:       mov    0xc8(%r12),%rcx
                :ffffffff80431060:       mov    0xc8(%rbx),%rdx
                :ffffffff80431067:       sub    0xc0(%rbx),%edx
                :ffffffff8043106d:       mov    %eax,0xb4(%rbx)
                :ffffffff80431073:       mov    0xb0(%r12),%eax
                :ffffffff8043107b:       add    0xc0(%r12),%rax
                :ffffffff80431083:       sub    %ecx,%eax
                :ffffffff80431085:       add    %edx,%eax
                :ffffffff80431087:       mov    %eax,0xb0(%rbx)
                :ffffffff8043108d:       mov    0xac(%r12),%eax
                :ffffffff80431095:       add    0xc0(%r12),%rax
                :ffffffff8043109d:       sub    %ecx,%eax
                :ffffffff8043109f:       add    %edx,%eax
                :ffffffff804310a1:       mov    %eax,0xac(%rbx)
                :ffffffff804310a7:       mov    0xa8(%r12),%eax
                :ffffffff804310af:       add    0xc0(%r12),%rax
                :ffffffff804310b7:       sub    %ecx,%eax
                :ffffffff804310b9:       add    %edx,%eax
                :ffffffff804310bb:       mov    %eax,0xa8(%rbx)
                :ffffffff804310c1:       mov    0x68(%r12),%eax
                :ffffffff804310c6:       mov    0x38(%r12),%edx
                :ffffffff804310cb:       sub    %edx,%eax
                :ffffffff804310cd:       cmp    0x6c(%r12),%eax
                :ffffffff804310d2:       mov    %eax,0x68(%r12)
                :ffffffff804310d7:       jae    ffffffff804310dd 
<skb_gro_receive+0x234>
                :ffffffff804310d9:       ud2a
                :ffffffff804310db:       jmp    ffffffff804310db 
<skb_gro_receive+0x232>
                :ffffffff804310dd:       mov    0xb0(%r12),%esi
                :ffffffff804310e5:       mov    %edx,%ecx
                :ffffffff804310e7:       add    0xc8(%r12),%rcx
                :ffffffff804310ef:       add    0xc0(%r12),%rsi
                :ffffffff804310f7:       mov    0xb0(%rbx),%edi
                :ffffffff804310fd:       add    0xc0(%rbx),%rdi
                :ffffffff80431104:       cld
                :ffffffff80431105:       mov    %rcx,0xc8(%r12)
                :ffffffff8043110d:       sub    %rsi,%rcx
                :ffffffff80431110:       rep movsb %ds:(%rsi),%es:(%rdi)
                :ffffffff80431112:       lea    0x38(%rbx),%rdi
                :ffffffff80431116:       lea    0x38(%r12),%rsi
                :ffffffff8043111b:       mov    $0x5,%cl
                :ffffffff8043111d:       rep movsl %ds:(%rsi),%es:(%rdi)
                :ffffffff8043111f:       mov    0xb8(%rbx),%edx
                :ffffffff80431125:       mov    0xc0(%rbx),%rax
                :ffffffff8043112c:       mov    %r12,0x18(%rax,%rdx,1)
                :ffffffff80431131:       mov    0xb8(%r12),%edx
                :ffffffff80431139:       mov    0xc0(%r12),%rax
                :ffffffff80431141:       mov    0xb8(%rbx),%esi
                :ffffffff80431147:       mov    0xc0(%rbx),%rcx
                :ffffffff8043114e:       mov    0x6(%rax,%rdx,1),%ax
                :ffffffff80431153:       mov    %ax,0x6(%rcx,%rsi,1)
                :ffffffff80431158:       testb  $0x10,0x7c(%r12)
                :ffffffff8043115e:       je     ffffffff80431164 
<skb_gro_receive+0x2bb>
                :ffffffff80431160:       ud2a
                :ffffffff80431162:       jmp    ffffffff80431162 
<skb_gro_receive+0x2b9>
                :ffffffff80431164:       mov    0xb8(%r12),%eax
                :ffffffff8043116c:       orb    $0x10,0x7c(%r12)
                :ffffffff80431172:       add    0xc0(%r12),%rax
                :ffffffff8043117a:       lock addl $0x10000,(%rax)
                :ffffffff80431181:       mov    0x68(%r12),%eax
                :ffffffff80431186:       mov    %r12,0x8(%rbx)
                :ffffffff8043118a:       add    %eax,0x6c(%rbx)
                :ffffffff8043118d:       add    %eax,0xd0(%rbx)
                :ffffffff80431193:       add    %eax,0x68(%rbx)
                :ffffffff80431196:       mov    %rbx,(%r15)
                :ffffffff80431199:       mov    (%r12),%rax
                :ffffffff8043119d:       mov    %rax,(%rbx)
                :ffffffff804311a0:       movq   $0x0,(%r12)
                :ffffffff804311a8:       mov    %rbx,%r12
                :ffffffff804311ab:       mov    0x68(%rbp),%ecx
                :ffffffff804311ae:       sub    0x6c(%rbp),%ecx
                :ffffffff804311b1:       cmp    %ecx,0x38(%rbp)
                :ffffffff804311b4:       jbe    ffffffff804311f3 
<skb_gro_receive+0x34a>
                :ffffffff804311b6:       mov    0xb8(%rbp),%edx
                :ffffffff804311bc:       add    0xc0(%rbp),%rdx
                :ffffffff804311c3:       mov    0x38(%rdx),%eax
                :ffffffff804311c6:       add    0x38(%rbp),%eax
                :ffffffff804311c9:       sub    %ecx,%eax
                :ffffffff804311cb:       mov    %eax,0x38(%rdx)
                :ffffffff804311ce:       mov    0xb8(%rbp),%edx
                :ffffffff804311d4:       add    0xc0(%rbp),%rdx
                :ffffffff804311db:       mov    0x3c(%rdx),%eax
                :ffffffff804311de:       add    0x68(%rbp),%eax
                :ffffffff804311e1:       sub    0x6c(%rbp),%eax
                :ffffffff804311e4:       sub    0x38(%rbp),%eax
                :ffffffff804311e7:       mov    %eax,0x3c(%rdx)
                :ffffffff804311ea:       mov    0x68(%rbp),%eax
                :ffffffff804311ed:       sub    0x6c(%rbp),%eax
                :ffffffff804311f0:       mov    %eax,0x38(%rbp)
                :ffffffff804311f3:       mov    0x68(%rbp),%eax
                :ffffffff804311f6:       mov    0x38(%rbp),%edx
                :ffffffff804311f9:       sub    %edx,%eax
                :ffffffff804311fb:       cmp    0x6c(%rbp),%eax
                :ffffffff804311fe:       mov    %eax,0x68(%rbp)
                :ffffffff80431201:       jae    ffffffff80431207 
<skb_gro_receive+0x35e>
                :ffffffff80431203:       ud2a
                :ffffffff80431205:       jmp    ffffffff80431205 
<skb_gro_receive+0x35c>
                :ffffffff80431207:       mov    %edx,%eax
                :ffffffff80431209:       add    %rax,0xc8(%rbp)
                :ffffffff80431210:       mov    0x8(%r12),%rax
                :ffffffff80431215:       mov    %rbp,0x8(%r12)
                :ffffffff8043121a:       mov    %rbp,(%rax)
                :ffffffff8043121d:       testb  $0x10,0x7c(%rbp)
                :ffffffff80431221:       je     ffffffff80431227 
<skb_gro_receive+0x37e>
                :ffffffff80431223:       ud2a
                :ffffffff80431225:       jmp    ffffffff80431225 
<skb_gro_receive+0x37c>
                :ffffffff80431227:       mov    0xb8(%rbp),%eax
                :ffffffff8043122d:       orb    $0x10,0x7c(%rbp)
                :ffffffff80431231:       add    0xc0(%rbp),%rax
                :ffffffff80431238:       lock addl $0x10000,(%rax)
  34919  0.3351 :ffffffff8043123f:       add    %r14d,0x6c(%r12)
   1989  0.0191 :ffffffff80431244:       add    %r14d,0xd0(%r12)
      1 9.6e-06 :ffffffff8043124c:       xor    %eax,%eax
                :ffffffff8043124e:       add    %r14d,0x68(%r12)
  20605  0.1978 :ffffffff80431253:       incl   0x44(%r12)
                :ffffffff80431258:       movl   $0x1,0x3c(%rbp)
                :ffffffff8043125f:       jmp    ffffffff80431266 
<skb_gro_receive+0x3bd>
                :ffffffff80431261:       mov    $0xfffffff9,%eax
  13260  0.1273 :ffffffff80431266:       pop    %r11
   1946  0.0187 :ffffffff80431268:       pop    %rbx
   2010  0.0193 :ffffffff80431269:       pop    %rbp
     64 6.1e-04 :ffffffff8043126a:       pop    %r12
   1948  0.0187 :ffffffff8043126c:       pop    %r13
   2746  0.0264 :ffffffff8043126e:       pop    %r14
     57 5.5e-04 :ffffffff80431270:       pop    %r15
   2067  0.0198 :ffffffff80431272:       retq

ffffffff80460663 <tcp_gro_receive>: /* tcp_gro_receive total: 396796 
3.8083 */
   4433  0.0425 :ffffffff80460663:       push   %r15
   2204  0.0212 :ffffffff80460665:       push   %r14
                :ffffffff80460667:       mov    %rdi,%r14
                :ffffffff8046066a:       push   %r13
   2275  0.0218 :ffffffff8046066c:       push   %r12
                :ffffffff8046066e:       mov    %rsi,%r12
                :ffffffff80460671:       mov    $0x14,%esi
   5933  0.0569 :ffffffff80460676:       mov    %r12,%rdi
                :ffffffff80460679:       push   %rbp
                :ffffffff8046067a:       push   %rbx
   2180  0.0209 :ffffffff8046067b:       sub    $0x8,%rsp
                :ffffffff8046067f:       callq  ffffffff804357a1 
<skb_gro_header>
                :ffffffff80460684:       test   %rax,%rax
   3218  0.0309 :ffffffff80460687:       je     ffffffff804607ed 
<tcp_gro_receive+0x18a>
                :ffffffff8046068d:       mov    0xc(%rax),%al
      1 9.6e-06 :ffffffff80460690:       shr    $0x4,%al
   3528  0.0339 :ffffffff80460693:       movzbl %al,%eax
                :ffffffff80460696:       lea    0x0(,%rax,4),%r13d
      1 9.6e-06 :ffffffff8046069e:       cmp    $0x13,%r13d
   2773  0.0266 :ffffffff804606a2:       jbe    ffffffff804607ed 
<tcp_gro_receive+0x18a>
                :ffffffff804606a8:       mov    %r13d,%esi
                :ffffffff804606ab:       mov    %r12,%rdi
   3327  0.0319 :ffffffff804606ae:       callq  ffffffff804357a1 
<skb_gro_header>
                :ffffffff804606b3:       test   %rax,%rax
   2094  0.0201 :ffffffff804606b6:       mov    %rax,%r8
                :ffffffff804606b9:       je     ffffffff804607ed 
<tcp_gro_receive+0x18a>
                :ffffffff804606bf:       lea    0x38(%r12),%r15
   2245  0.0215 :ffffffff804606c4:       add    %r13d,(%r15)
                :ffffffff804606c7:       mov    0x68(%r12),%ebp
                :ffffffff804606cc:       sub    0x38(%r12),%ebp
   2394  0.0230 :ffffffff804606d1:       mov    0xc(%rax),%ebx
                :ffffffff804606d4:       jmp    ffffffff80460710 
<tcp_gro_receive+0xad>
   2111  0.0203 :ffffffff804606d6:       lea    0x38(%rdi),%r9
      3 2.9e-05 :ffffffff804606da:       cmpl   $0x0,0x4(%r9)
     21 2.0e-04 :ffffffff804606df:       je     ffffffff8046070d 
<tcp_gro_receive+0xaa>
   2592  0.0249 :ffffffff804606e1:       mov    0xa8(%rdi),%eax
                :ffffffff804606e7:       mov    0xc0(%rdi),%r10
                :ffffffff804606ee:       mov    0x2(%r8),%dx
   2440  0.0234 :ffffffff804606f3:       lea    (%r10,%rax,1),%rcx
                :ffffffff804606f7:       mov    (%r8),%eax
      1 9.6e-06 :ffffffff804606fa:       xor    0x2(%rcx),%dx
   6275  0.0602 :ffffffff804606fe:       xor    (%rcx),%eax
      3 2.9e-05 :ffffffff80460700:       or     %ax,%dx
                :ffffffff80460703:       je     ffffffff8046071d 
<tcp_gro_receive+0xba>
                :ffffffff80460705:       movl   $0x0,0x4(%r9)
                :ffffffff8046070d:       mov    %rdi,%r14
   2920  0.0280 :ffffffff80460710:       mov    (%r14),%rdi
     18 1.7e-04 :ffffffff80460713:       test   %rdi,%rdi
      2 1.9e-05 :ffffffff80460716:       jne    ffffffff804606d6 
<tcp_gro_receive+0x73>
     33 3.2e-04 :ffffffff80460718:       jmpq   ffffffff80460807 
<tcp_gro_receive+0x1a4>
   4253  0.0408 :ffffffff8046071d:       mov    0xe(%r8),%ax
   2125  0.0204 :ffffffff80460722:       xor    0xe(%rcx),%ax
      2 1.9e-05 :ffffffff80460726:       mov    %ebx,%edx
                :ffffffff80460728:       and    $0x8000,%edx
   8066  0.0774 :ffffffff8046072e:       or     0x8(%r9),%edx
                :ffffffff80460732:       movzwl %ax,%esi
                :ffffffff80460735:       mov    0x8(%r8),%eax
  64740  0.6214 :ffffffff80460739:       xor    0x8(%rcx),%eax
                :ffffffff8046073c:       or     %eax,%esi
                :ffffffff8046073e:       mov    %ebx,%eax
   2084  0.0200 :ffffffff80460740:       xor    0xc(%rcx),%eax
                :ffffffff80460743:       and    $0x76,%ah
                :ffffffff80460746:       or     %eax,%edx
   2132  0.0205 :ffffffff80460748:       or     %edx,%esi
                :ffffffff8046074a:       mov    $0x14,%edx
                :ffffffff8046074f:       jmp    ffffffff8046075e 
<tcp_gro_receive+0xfb>
                :ffffffff80460751:       movslq %edx,%rax
                :ffffffff80460754:       add    $0x4,%edx
                :ffffffff80460757:       mov    (%r8,%rax,1),%esi
                :ffffffff8046075b:       xor    (%rcx,%rax,1),%esi
   3670  0.0352 :ffffffff8046075e:       test   %esi,%esi
   2162  0.0208 :ffffffff80460760:       jne    ffffffff80460767 
<tcp_gro_receive+0x104>
                :ffffffff80460762:       cmp    %r13d,%edx
      1 9.6e-06 :ffffffff80460765:       jb     ffffffff80460751 
<tcp_gro_receive+0xee>
  50209  0.4819 :ffffffff80460767:       mov    0xb8(%rdi),%eax
   4473  0.0429 :ffffffff8046076d:       mov    0x4(%rcx),%edx
                :ffffffff80460770:       bswap  %edx
   9554  0.0917 :ffffffff80460772:       mov    0x4(%r8),%ecx
                :ffffffff80460776:       bswap  %ecx
                :ffffffff80460778:       movzwl 0x6(%r10,%rax,1),%r13d
   7572  0.0727 :ffffffff8046077e:       mov    0x68(%rdi),%eax
                :ffffffff80460781:       sub    0x38(%rdi),%eax
                :ffffffff80460784:       add    %edx,%eax
   9803  0.0941 :ffffffff80460786:       xor    %eax,%ecx
                :ffffffff80460788:       cmp    %r13d,%ebp
                :ffffffff8046078b:       seta   %al
  50608  0.4857 :ffffffff8046078e:       test   %ebp,%ebp
                :ffffffff80460790:       sete   %dl
                :ffffffff80460793:       or     %edx,%eax
   3161  0.0303 :ffffffff80460795:       movzbl %al,%eax
                :ffffffff80460798:       or     %eax,%esi
                :ffffffff8046079a:       or     %esi,%ecx
   3278  0.0315 :ffffffff8046079c:       jne    ffffffff804607f6 
<tcp_gro_receive+0x193>
                :ffffffff8046079e:       mov    %r12,%rsi
      2 1.9e-05 :ffffffff804607a1:       mov    %r14,%rdi
   2579  0.0248 :ffffffff804607a4:       callq  ffffffff80430ea9 
<skb_gro_receive>
   2059  0.0198 :ffffffff804607a9:       test   %eax,%eax
     49 4.7e-04 :ffffffff804607ab:       jne    ffffffff804607f6 
<tcp_gro_receive+0x193>
                :ffffffff804607ad:       mov    (%r14),%rcx
   1945  0.0187 :ffffffff804607b0:       mov    %ebx,%edx
      3 2.9e-05 :ffffffff804607b2:       and    $0x900,%edx
                :ffffffff804607b8:       mov    0xa8(%rcx),%eax
   2530  0.0243 :ffffffff804607be:       add    0xc0(%rcx),%rax
      3 2.9e-05 :ffffffff804607c5:       or     %edx,0xc(%rax)
     13 1.2e-04 :ffffffff804607c8:       xor    %eax,%eax
   4881  0.0468 :ffffffff804607ca:       cmp    %r13d,%ebp
                :ffffffff804607cd:       setb   %al
                :ffffffff804607d0:       and    $0x2f00,%ebx
   1912  0.0184 :ffffffff804607d6:       or     %ebx,%eax
                :ffffffff804607d8:       test   %rcx,%rcx
                :ffffffff804607db:       je     ffffffff80460816 
<tcp_gro_receive+0x1b3>
   2163  0.0208 :ffffffff804607dd:       cmpl   $0x0,0x4(%r15)
    136  0.0013 :ffffffff804607e2:       je     ffffffff804607e8 
<tcp_gro_receive+0x185>
   2455  0.0236 :ffffffff804607e4:       test   %eax,%eax
     57 5.5e-04 :ffffffff804607e6:       je     ffffffff80460816 
<tcp_gro_receive+0x1b3>
    148  0.0014 :ffffffff804607e8:       mov    %r14,%rdi
    735  0.0071 :ffffffff804607eb:       jmp    ffffffff80460818 
<tcp_gro_receive+0x1b5>
                :ffffffff804607ed:       xor    %edi,%edi
                :ffffffff804607ef:       mov    $0x1,%eax
                :ffffffff804607f4:       jmp    ffffffff80460818 
<tcp_gro_receive+0x1b5>
     68 6.5e-04 :ffffffff804607f6:       xor    %eax,%eax
      1 9.6e-06 :ffffffff804607f8:       test   %ebp,%ebp
     67 6.4e-04 :ffffffff804607fa:       sete   %al
     47 4.5e-04 :ffffffff804607fd:       and    $0x2f00,%ebx
                :ffffffff80460803:       or     %ebx,%eax
     58 5.6e-04 :ffffffff80460805:       jmp    ffffffff804607dd 
<tcp_gro_receive+0x17a>
    122  0.0012 :ffffffff80460807:       xor    %eax,%eax
      9 8.6e-05 :ffffffff80460809:       test   %ebp,%ebp
                :ffffffff8046080b:       sete   %al
     67 6.4e-04 :ffffffff8046080e:       and    $0x2f00,%ebx
      6 5.8e-05 :ffffffff80460814:       or     %ebx,%eax
   1995  0.0191 :ffffffff80460816:       xor    %edi,%edi
     68 6.5e-04 :ffffffff80460818:       or     %eax,0x40(%r12)
    275  0.0026 :ffffffff8046081d:       mov    %rdi,%rax
   2037  0.0196 :ffffffff80460820:       pop    %r11
    191  0.0018 :ffffffff80460822:       pop    %rbx
   4346  0.0417 :ffffffff80460823:       pop    %rbp
   4739  0.0455 :ffffffff80460824:       pop    %r12
    167  0.0016 :ffffffff80460826:       pop    %r13
  23735  0.2278 :ffffffff80460828:       pop    %r14
  56070  0.5381 :ffffffff8046082a:       pop    %r15
    140  0.0013 :ffffffff8046082c:       retq

ffffffff804357a1 <skb_gro_header>: /* skb_gro_header total: 319455 
3.0660 */
  13604  0.1306 :ffffffff804357a1:       push   %rbp
  14938  0.1434 :ffffffff804357a2:       push   %rbx
                :ffffffff804357a3:       mov    %rdi,%rbx
                :ffffffff804357a6:       sub    $0x8,%rsp
  18392  0.1765 :ffffffff804357aa:       mov    0x38(%rdi),%ebp
                :ffffffff804357ad:       mov    0x68(%rdi),%edx
      1 9.6e-06 :ffffffff804357b0:       add    %ebp,%esi
  20559  0.1973 :ffffffff804357b2:       mov    %edx,%edi
                :ffffffff804357b4:       sub    0x6c(%rbx),%edi
                :ffffffff804357b7:       jne    ffffffff804357cc 
<skb_gro_header+0x2b>
  36626  0.3515 :ffffffff804357b9:       mov    0xb8(%rbx),%ecx
      2 1.9e-05 :ffffffff804357bf:       mov    0xc0(%rbx),%rax
      3 2.9e-05 :ffffffff804357c6:       cmp    %esi,0x3c(%rax,%rcx,1)
  18577  0.1783 :ffffffff804357ca:       jae    ffffffff804357ee 
<skb_gro_header+0x4d>
                :ffffffff804357cc:       cmp    %edi,%esi
                :ffffffff804357ce:       jbe    ffffffff804357e3 
<skb_gro_header+0x42>
                :ffffffff804357d0:       cmp    %edx,%esi
                :ffffffff804357d2:       ja     ffffffff80435833 
<skb_gro_header+0x92>
                :ffffffff804357d4:       sub    %edi,%esi
                :ffffffff804357d6:       mov    %rbx,%rdi
                :ffffffff804357d9:       callq  ffffffff8042f6ee 
<__pskb_pull_tail>
                :ffffffff804357de:       test   %rax,%rax
                :ffffffff804357e1:       je     ffffffff80435833 
<skb_gro_header+0x92>
                :ffffffff804357e3:       mov    %ebp,%eax
                :ffffffff804357e5:       add    0xc8(%rbx),%rax
                :ffffffff804357ec:       jmp    ffffffff80435835 
<skb_gro_header+0x94>
      3 2.9e-05 :ffffffff804357ee:       add    0xc0(%rbx),%rcx
  25999  0.2495 :ffffffff804357f5:       mov    $0x1e0000000000,%rax
                :ffffffff804357ff:       mov    $0x6db6db6db6db6db7,%rdx
  44557  0.4276 :ffffffff80435809:       add    0x30(%rcx),%rax
                :ffffffff8043580d:       sar    $0x3,%rax
  12588  0.1208 :ffffffff80435811:       imul   %rdx,%rax
  10104  0.0970 :ffffffff80435815:       mov    $0xffff880000000000,%rdx
                :ffffffff8043581f:       shl    $0xc,%rax
                :ffffffff80435823:       add    %rdx,%rax
  16404  0.1574 :ffffffff80435826:       mov    0x38(%rcx),%edx
                :ffffffff80435829:       add    %rdx,%rax
                :ffffffff8043582c:       mov    %ebp,%edx
  15264  0.1465 :ffffffff8043582e:       add    %rdx,%rax
                :ffffffff80435831:       jmp    ffffffff80435835 
<skb_gro_header+0x94>
                :ffffffff80435833:       xor    %eax,%eax
  45844  0.4400 :ffffffff80435835:       pop    %r10
      2 1.9e-05 :ffffffff80435837:       pop    %rbx
  12844  0.1233 :ffffffff80435838:       pop    %rbp
  13144  0.1262 :ffffffff80435839:       retq

Thanks for your help,

Drew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 30, 2009, 8:10 a.m. UTC | #13
Hi:

Unfortunately the myricom card I was using is now refusing to work:

myri10ge: Version 1.4.4-1.412
myri10ge 0000:04:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33
myri10ge 0000:04:00.0: setting latency timer to 64
myri10ge 0000:04:00.0: PCIE x4 Link
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:04:00.0: command 1 failed, result = 14
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: myri10ge_probe() failed: MAC=00:60:dd:47:80:7d, SN=312225
myri10ge 0000:04:00.0: PCI INT A disabled

So I won't be able to test this until I locate another myri10ge
card or get this one back up and running again.

Cheers,
Herbert Xu April 30, 2009, 8:14 a.m. UTC | #14
On Thu, Apr 30, 2009 at 04:10:51PM +0800, Herbert Xu wrote:
>
> So I won't be able to test this until I locate another myri10ge
> card or get this one back up and running again.

Another reboot seems to have fixed it.  So all is good.

Cheers,
diff mbox

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..0396447 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1145,7 +1145,7 @@  static inline void skb_gro_reset_offset(struct sk_buff *skb)
 
 static inline void *skb_gro_mac_header(struct sk_buff *skb)
 {
-	return skb_mac_header(skb) < skb->data ? skb_mac_header(skb) :
+	return skb_headlen(skb) ? skb_mac_header(skb) :
 	       page_address(skb_shinfo(skb)->frags[0].page) +
 	       skb_shinfo(skb)->frags[0].page_offset;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..ef38e4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2378,18 +2378,13 @@  void *skb_gro_header(struct sk_buff *skb, unsigned int hlen)
 	unsigned int offset = skb_gro_offset(skb);
 
 	hlen += offset;
-	if (hlen <= skb_headlen(skb))
-		return skb->data + offset;
-
-	if (unlikely(!skb_shinfo(skb)->nr_frags ||
-		     skb_shinfo(skb)->frags[0].size <=
-		     hlen - skb_headlen(skb) ||
+	if (unlikely(skb_headlen(skb) ||
+		     skb_shinfo(skb)->frags[0].size < hlen ||
 		     PageHighMem(skb_shinfo(skb)->frags[0].page)))
 		return pskb_may_pull(skb, hlen) ? skb->data + offset : NULL;
 
 	return page_address(skb_shinfo(skb)->frags[0].page) +
-	       skb_shinfo(skb)->frags[0].page_offset +
-	       offset - skb_headlen(skb);
+	       skb_shinfo(skb)->frags[0].page_offset + offset;
 }
 EXPORT_SYMBOL(skb_gro_header);