
myri10ge: again fix lro_gen_skb() alignment

Message ID 20090424054557.GA24575@gondor.apana.org.au
State RFC, archived
Delegated to: David Miller

Commit Message

Herbert Xu April 24, 2009, 5:45 a.m. UTC
On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>
> I booted the sender into a kernel.org 2.6.18.2 so as to try to have  
> results as close to yours as possible (I was running 2.6.22 on the
> sender before).

OK, I've got my hands on a Myricom card.  I've tested it using the
same 2.6.18 sender that I used against the earlier cxgb3 test.
I wasn't able to discern any significant deviations between LRO
and GRO.

Unfortunately, it seems that this machine is a little too fast,
so even with the IRQ bound to a single CPU it's way overspecced
for 10GbE:

        Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
GRO     43-45          14700      13300             7933
LRO     43-45          14700      13300             7943

But even with the soaker running they seem to be neck and neck.

Here's the patch I used BTW.  I got the checksums to work by
just setting skb->csum.
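
In other words, stripped to its essentials (a sketch condensed from the
patch below, not the complete function), the per-packet receive path
becomes:

	skb = napi_get_frags(napi);
	if (unlikely(!skb))
		return 0;	/* no skb: recycle the pages and drop */

	/* ... attach the receive pages as frags ... */
	skb_shinfo(skb)->nr_frags = i;
	skb->len = len;
	skb->data_len = len;
	skb->truesize += len;

	/* hand the NIC's full packet sum straight to the stack */
	skb->csum = csum;
	skb->ip_summed = CHECKSUM_COMPLETE;

	napi_gro_frags(napi);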


Cheers,

Comments

Andrew Gallatin April 24, 2009, 12:45 p.m. UTC | #1
Herbert Xu wrote:
> On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>> I booted the sender into a kernel.org 2.6.18.2 so as to try to have  
>> results as close to yours as possible (I was running 2.6.22 on the
>> sender before).
> 
> OK, I've got my hands on a Myricom card.  I've tested it using the
> same 2.6.18 sender that I used against the earlier cxgb3 test.
> I wasn't able to discern any significant deviations between LRO
> and GRO.
> 
> Unfortunately, it seems that this machine is a little too fast,
> so even with the IRQ bound to a single CPU it's way overspecced
> for 10GbE:
> 
>         Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
> GRO     43-45          14700      13300             7933
> LRO     43-45          14700      13300             7943
> 
> But even with the soaker running they seem to be neck and neck.

This is strange.  I wonder if it might be a cache footprint issue?
My intentionally weak receiver is an athlon64 x2 "Toledo", and
has only 512KB L2 cache.  I can re-test with a core-2 based Xeon.

But can you describe your setup in more detail?  What CPU does the
receiver have? You say the sender is running 2.6.18.  Is this
a RHEL5 kernel, or a kernel.org kernel?

> Here's the patch I used BTW.  I got the checksums to work by
> just setting skb->csum.

Yes, sorry about that stupidity.

Drew
Herbert Xu April 24, 2009, 12:51 p.m. UTC | #2
On Fri, Apr 24, 2009 at 08:45:28AM -0400, Andrew Gallatin wrote:
>
> This is strange.  I wonder if it might be a cache footprint issue?
> My intentionally weak receiver is an athlon64 x2 "Toledo", and
> has only 512KB L2 cache.  I can re-test with a core-2 based Xeon.
>
> But can you describe your setup in more detail?  What CPU does the
> receiver have? You say the sender is running 2.6.18.  Is this
> a RHEL5 kernel, or a kernel.org kernel?

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
stepping        : 6
cpu MHz         : 2833.155
cache size      : 6144 KB
physical id     : 1
siblings        : 1
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 5665.85
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

It's a RHEL-5 kernel, -128 to be exact.  But I rebooted into
2.6.30-rc1 and it made no difference.  The previously observed
problem with 2.6.30-rc1 seems to affect only cxgb3.

Cheers,
Andrew Gallatin April 24, 2009, 4:16 p.m. UTC | #3
Herbert Xu wrote:
> On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>> I booted the sender into a kernel.org 2.6.18.2 so as to try to have  
>> results as close to yours as possible (I was running 2.6.22 on the
>> sender before).
> 
> OK, I've got my hands on a Myricom card.  I've tested it using the
> same 2.6.18 sender that I used against the earlier cxgb3 test.
> I wasn't able to discern any significant deviations between LRO
> and GRO.
> 
> Unfortunately, it seems that this machine is a little too fast,
> so even with the IRQ bound to a single CPU it's way overspecced
> for 10GbE:
> 
>         Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
> GRO     43-45          14700      13300             7933
> LRO     43-45          14700      13300             7943
> 
> But even with the soaker running they seem to be neck and neck.

From what I can tell, the reported CPU utilization is only broken when a
CPU is otherwise idle, so it should be accurate when you bind the IRQ and
the netserver to the same CPU.  Here are results from an older, slower
Core 2 Xeon with a 4MB L2 cache:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
cpu MHz         : 2659.916
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips        : 5319.83
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

The Xeon was running net-next, with DCA enabled and ioatdma disabled for
TCP (CONFIG_NET_DMA is not set).  The sender was the weak Athlon64,
running 2.6.22.

LRO, no soaker: (13,200 intrs/sec)
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
  87380  65536  65536    60.02      9469.74   17.44    13.31    0.302   0.461

LRO, soaker: (6,500 intrs/sec)

  87380  65536  65536    60.06      3955.74   7.11     25.02    0.294   2.072

GRO, no soaker: (13,200 intrs/sec)
  87380  65536  65536    60.02      9467.90   16.76    14.16    0.290   0.490

GRO, soaker: (6,500 intrs/sec)
  87380  65536  65536    60.02      3774.88   6.20     25.01    0.269   2.171


These results are indeed quite close, so the performance problem seems
isolated to AMD CPUs, and is perhaps due to the smaller caches.
Do you have an AMD machine you can use as a receiver?

Note that the GRO results were still obtained by (bogusly) setting
CHECKSUM_UNNECESSARY.  When I tried your patch instead, I saw terrible
performance: netperf shows between 1Gb/s and 2Gb/s (compared to 5Gb/s
with GRO disabled).  I don't see bad checksums in netstat on the
receiver, but it *feels* like something like that.
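
To be explicit, the only driver-side difference between the two runs is
how the hardware checksum is advertised to the stack; roughly (a sketch,
not the exact driver code):

	/* "bogus" run: claim the hardware already verified the checksum */
	skb->ip_summed = CHECKSUM_UNNECESSARY;

	/* your patch: pass the raw hardware sum and let the stack verify it */
	skb->csum = csum;
	skb->ip_summed = CHECKSUM_COMPLETE;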

Here's a diff of netstat -st taken on the sender before and after
a 5 second netperf:
2c2
<     157 active connections openings
---
 >     159 active connections openings
7,9c7,9
<     31465934 segments received
<     72887021 segments send out
<     679 segments retransmited
---
 >     32184827 segments received
 >     73473546 segments send out
 >     698 segments retransmited
16c16
<     4596 packets directly queued to recvmsg prequeue.
---
 >     4603 packets directly queued to recvmsg prequeue.
18,21c18,21
<     15928 packets header predicted
<     18100148 acknowledgments not containing data received
<     13351873 predicted acknowledgments
<     343 times recovered from packet loss due to SACK data
---
 >     15930 packets header predicted
 >     18464095 acknowledgments not containing data received
 >     13706813 predicted acknowledgments
 >     365 times recovered from packet loss due to SACK data
23,25c23,25
<     53 congestion windows fully recovered
<     221 congestion windows partially recovered using Hoe heuristic
<     TCPDSACKUndo: 268
---
 >     60 congestion windows fully recovered
 >     228 congestion windows partially recovered using Hoe heuristic
 >     TCPDSACKUndo: 281
27,28c27,28
<     584 fast retransmits
<     93 forward retransmits
---
 >     597 fast retransmits
 >     99 forward retransmits
30c30
<     674 DSACKs received
---
 >     693 DSACKs received

And on the receiver (whose netstat is confused, and cannot read ext 
stats in a net-next kernel):
diff /tmp/a /tmp/b
3c3
<     12 passive connection openings
---
 >     14 passive connection openings
7,8c7,8
<     3776478 segments received
<     3775846 segments send out
---
 >     4495385 segments received
 >     4494747 segments send out



This was using a net-next pulled 1/2 hour ago.  The only patch was your 
GRO patch applied to myri10ge.  Do you have some other local patch
which might be helping you?

Drew
Herbert Xu April 24, 2009, 4:30 p.m. UTC | #4
On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>
> Note that the GRO results were still obtained by (bogusly) setting
> CHECKSUM_UNNECESSARY.  When I tried your patch instead, I saw terrible
> performance: netperf shows between 1Gb/s and 2Gb/s (compared to 5Gb/s
> with GRO disabled).  I don't see bad checksums in netstat on the
> receiver, but it *feels* like something like that.

Well if the hardware checksum ends up being wrong then we'll always
fall back to using software checksums.  So somehow I doubt it's
causing what you're seeing.

> Here's a diff of netstat -st taken on the sender before and after
> a 5 second netperf:
> 2c2
> <     157 active connections openings
> ---
> >     159 active connections openings
> 7,9c7,9
> <     31465934 segments received
> <     72887021 segments send out
> <     679 segments retransmited
> ---
> >     32184827 segments received
> >     73473546 segments send out
> >     698 segments retransmited

So you're losing packets.  This is indeed something that I didn't
see here at all.  I'll see if I can get the card moved to an AMD
machine.

> This was using a net-next pulled 1/2 hour ago.  The only patch was your  
> GRO patch applied to myri10ge.  Do you have some other local patch
> which might be helping you?

I was using Linus's tree + the GRO patches from net-next.  I do
have two further optimisation patches applied but they don't
actually make much difference (I made them while trying to figure
out why cxgb3's GRO became slow again, which turned out to be sender
related).

Cheers,
Herbert Xu April 24, 2009, 4:31 p.m. UTC | #5
On Sat, Apr 25, 2009 at 12:30:09AM +0800, Herbert Xu wrote:
>
> > This was using a net-next pulled 1/2 hour ago.  The only patch was your  
> > GRO patch applied to myri10ge.  Do you have some other local patch
> > which might be helping you?
> 
> I was using Linus's tree + the GRO patches from net-next.  I do
> have two further optimisation patches applied but they don't
> actually make much difference (I made them while trying to figure
> out why cxgb3's GRO became slow again, which turned out to be sender
> related).

I'll also retest using net-next.

Thanks,
Rick Jones April 24, 2009, 5:13 p.m. UTC | #6
> This is strange.  I wonder if it might be a cache footprint issue?
> My intentionally weak receiver is an athlon64 x2 "Toledo", and
> has only 512KB L2 cache.  I can re-test with a core-2 based Xeon.

A point about netperf :)  By default, it will use one more buffer than the
initial size of the socket buffer divided by the send/recv buffer size; this
goes back to the days of copy-avoidance, a flavor of which is described in:

ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf

particularly section 3.2.  It can be overridden with the global -W option:

     -W send,recv      Set the number of send,recv buffers

Of course, this will interact with other things such as:

The default send/recv size will be the send/recv socket buffer size. That can be 
overridden with the test-specific -m/-M options.

The default socket buffer size will be whatever the system gives it.  That can be 
overridden with the test-specific -s/-S options.

So, the various options can have a non-trivial effect on the cache footprint of 
the data netperf is shoving around.
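
For example (a made-up invocation, purely to show where the options go):

     netperf -H remotehost -W 4,4 -- -m 65536 -M 65536 -s 262144 -S 262144

would send 64KB messages from a ring of four send buffers into explicitly
requested 256KB socket buffers on both ends, instead of letting netperf
derive all of those sizes from whatever the system hands it.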

rick jones

Patch

diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index f2c4a66..91de78f 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -48,7 +48,6 @@ 
 #include <linux/etherdevice.h>
 #include <linux/if_ether.h>
 #include <linux/if_vlan.h>
-#include <linux/inet_lro.h>
 #include <linux/dca.h>
 #include <linux/ip.h>
 #include <linux/inet.h>
@@ -92,7 +91,6 @@  MODULE_LICENSE("Dual BSD/GPL");
 
 #define MYRI10GE_EEPROM_STRINGS_SIZE 256
 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
-#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
 #define MYRI10GE_LRO_MAX_PKTS 64
 
 #define MYRI10GE_NO_CONFIRM_DATA htonl(0xffffffff)
@@ -161,8 +159,6 @@  struct myri10ge_rx_done {
 	dma_addr_t bus;
 	int cnt;
 	int idx;
-	struct net_lro_mgr lro_mgr;
-	struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
 };
 
 struct myri10ge_slice_netstats {
@@ -1264,18 +1260,31 @@  static inline int
 myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
 		 int bytes, int len, __wsum csum)
 {
+	struct napi_struct *napi = &ss->napi;
 	struct myri10ge_priv *mgp = ss->mgp;
 	struct sk_buff *skb;
-	struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
-	int i, idx, hlen, remainder;
+	struct skb_frag_struct *rx_frags;
+	int i, idx, remainder;
 	struct pci_dev *pdev = mgp->pdev;
-	struct net_device *dev = mgp->dev;
 	u8 *va;
 
 	len += MXGEFW_PAD;
 	idx = rx->cnt & rx->mask;
 	va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
 	prefetch(va);
+
+	skb = napi_get_frags(napi);
+	if (unlikely(!skb)) {
+		/* no skb available: unmap and free the pages, drop the frame */
+		ss->stats.rx_dropped++;
+		for (i = 0, remainder = len; remainder > 0; i++) {
+			myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
+			put_page(rx->info[idx].page);
+			rx->cnt++;
+			idx = rx->cnt & rx->mask;
+			remainder -= MYRI10GE_ALLOC_SIZE;
+		}
+		return 0;
+	}
+	rx_frags = skb_shinfo(skb)->frags;
+
 	/* Fill skb_frag_struct(s) with data from our receive */
 	for (i = 0, remainder = len; remainder > 0; i++) {
 		myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
@@ -1290,52 +1299,18 @@  myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
 		remainder -= MYRI10GE_ALLOC_SIZE;
 	}
 
-	if (mgp->csum_flag && myri10ge_lro) {
-		rx_frags[0].page_offset += MXGEFW_PAD;
-		rx_frags[0].size -= MXGEFW_PAD;
-		len -= MXGEFW_PAD;
-		lro_receive_frags(&ss->rx_done.lro_mgr, rx_frags,
-				  /* opaque, will come back in get_frag_header */
-				  len, len,
-				  (void *)(__force unsigned long)csum, csum);
-
-		return 1;
-	}
-
-	hlen = MYRI10GE_HLEN > len ? len : MYRI10GE_HLEN;
-
-	/* allocate an skb to attach the page(s) to. This is done
-	 * after trying LRO, so as to avoid skb allocation overheads */
-
-	skb = netdev_alloc_skb(dev, MYRI10GE_HLEN + 16);
-	if (unlikely(skb == NULL)) {
-		ss->stats.rx_dropped++;
-		do {
-			i--;
-			put_page(rx_frags[i].page);
-		} while (i != 0);
-		return 0;
-	}
-
-	/* Attach the pages to the skb, and trim off any padding */
-	myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
-	if (skb_shinfo(skb)->frags[0].size <= 0) {
-		put_page(skb_shinfo(skb)->frags[0].page);
-		skb_shinfo(skb)->nr_frags = 0;
-	}
-	skb->protocol = eth_type_trans(skb, dev);
-	skb_record_rx_queue(skb, ss - &mgp->ss[0]);
-
-	if (mgp->csum_flag) {
-		if ((skb->protocol == htons(ETH_P_IP)) ||
-		    (skb->protocol == htons(ETH_P_IPV6))) {
-			skb->csum = csum;
-			skb->ip_summed = CHECKSUM_COMPLETE;
-		} else
-			myri10ge_vlan_ip_csum(skb, csum);
-	}
-	netif_receive_skb(skb);
+	rx_frags[0].page_offset += MXGEFW_PAD;
+	rx_frags[0].size -= MXGEFW_PAD;
+	len -= MXGEFW_PAD;
+	skb_shinfo(skb)->nr_frags = i;
+	skb->len = len;
+	skb->data_len = len;
+	skb->truesize += len;
+	skb->csum = csum;
+	skb->ip_summed = CHECKSUM_COMPLETE;
+	napi_gro_frags(napi);
 	return 1;
+
 }
 
 static inline void
@@ -1445,9 +1420,6 @@  myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
 	ss->stats.rx_packets += rx_packets;
 	ss->stats.rx_bytes += rx_bytes;
 
-	if (myri10ge_lro)
-		lro_flush_all(&rx_done->lro_mgr);
-
 	/* restock receive rings if needed */
 	if (ss->rx_small.fill_cnt - ss->rx_small.cnt < myri10ge_fill_thresh)
 		myri10ge_alloc_rx_pages(mgp, &ss->rx_small,
@@ -1752,9 +1724,7 @@  static const char myri10ge_gstrings_slice_stats[][ETH_GSTRING_LEN] = {
 	"----------- slice ---------",
 	"tx_pkt_start", "tx_pkt_done", "tx_req", "tx_done",
 	"rx_small_cnt", "rx_big_cnt",
-	"wake_queue", "stop_queue", "tx_linearized", "LRO aggregated",
-	    "LRO flushed",
-	"LRO avg aggr", "LRO no_desc"
+	"wake_queue", "stop_queue", "tx_linearized"
 };
 
 #define MYRI10GE_NET_STATS_LEN      21
@@ -1851,14 +1821,6 @@  myri10ge_get_ethtool_stats(struct net_device *netdev,
 		data[i++] = (unsigned int)ss->tx.wake_queue;
 		data[i++] = (unsigned int)ss->tx.stop_queue;
 		data[i++] = (unsigned int)ss->tx.linearized;
-		data[i++] = ss->rx_done.lro_mgr.stats.aggregated;
-		data[i++] = ss->rx_done.lro_mgr.stats.flushed;
-		if (ss->rx_done.lro_mgr.stats.flushed)
-			data[i++] = ss->rx_done.lro_mgr.stats.aggregated /
-			    ss->rx_done.lro_mgr.stats.flushed;
-		else
-			data[i++] = 0;
-		data[i++] = ss->rx_done.lro_mgr.stats.no_desc;
 	}
 }
 
@@ -2186,67 +2148,6 @@  static void myri10ge_free_irq(struct myri10ge_priv *mgp)
 		pci_disable_msix(pdev);
 }
 
-static int
-myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
-			 void **ip_hdr, void **tcpudp_hdr,
-			 u64 * hdr_flags, void *priv)
-{
-	struct ethhdr *eh;
-	struct vlan_ethhdr *veh;
-	struct iphdr *iph;
-	u8 *va = page_address(frag->page) + frag->page_offset;
-	unsigned long ll_hlen;
-	/* passed opaque through lro_receive_frags() */
-	__wsum csum = (__force __wsum) (unsigned long)priv;
-
-	/* find the mac header, aborting if not IPv4 */
-
-	eh = (struct ethhdr *)va;
-	*mac_hdr = eh;
-	ll_hlen = ETH_HLEN;
-	if (eh->h_proto != htons(ETH_P_IP)) {
-		if (eh->h_proto == htons(ETH_P_8021Q)) {
-			veh = (struct vlan_ethhdr *)va;
-			if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
-				return -1;
-
-			ll_hlen += VLAN_HLEN;
-
-			/*
-			 *  HW checksum starts ETH_HLEN bytes into
-			 *  frame, so we must subtract off the VLAN
-			 *  header's checksum before csum can be used
-			 */
-			csum = csum_sub(csum, csum_partial(va + ETH_HLEN,
-							   VLAN_HLEN, 0));
-		} else {
-			return -1;
-		}
-	}
-	*hdr_flags = LRO_IPV4;
-
-	iph = (struct iphdr *)(va + ll_hlen);
-	*ip_hdr = iph;
-	if (iph->protocol != IPPROTO_TCP)
-		return -1;
-	if (iph->frag_off & htons(IP_MF | IP_OFFSET))
-		return -1;
-	*hdr_flags |= LRO_TCP;
-	*tcpudp_hdr = (u8 *) (*ip_hdr) + (iph->ihl << 2);
-
-	/* verify the IP checksum */
-	if (unlikely(ip_fast_csum((u8 *) iph, iph->ihl)))
-		return -1;
-
-	/* verify the  checksum */
-	if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
-				       ntohs(iph->tot_len) - (iph->ihl << 2),
-				       IPPROTO_TCP, csum)))
-		return -1;
-
-	return 0;
-}
-
 static int myri10ge_get_txrx(struct myri10ge_priv *mgp, int slice)
 {
 	struct myri10ge_cmd cmd;
@@ -2317,7 +2218,6 @@  static int myri10ge_open(struct net_device *dev)
 	struct myri10ge_cmd cmd;
 	int i, status, big_pow2, slice;
 	u8 *itable;
-	struct net_lro_mgr *lro_mgr;
 
 	if (mgp->running != MYRI10GE_ETH_STOPPED)
 		return -EBUSY;
@@ -2438,19 +2338,6 @@  static int myri10ge_open(struct net_device *dev)
 			goto abort_with_rings;
 		}
 
-		lro_mgr = &ss->rx_done.lro_mgr;
-		lro_mgr->dev = dev;
-		lro_mgr->features = LRO_F_NAPI;
-		lro_mgr->ip_summed = CHECKSUM_COMPLETE;
-		lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;
-		lro_mgr->max_desc = MYRI10GE_MAX_LRO_DESCRIPTORS;
-		lro_mgr->lro_arr = ss->rx_done.lro_desc;
-		lro_mgr->get_frag_header = myri10ge_get_frag_header;
-		lro_mgr->max_aggr = myri10ge_lro_max_pkts;
-		lro_mgr->frag_align_pad = 2;
-		if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
-			lro_mgr->max_aggr = MAX_SKB_FRAGS;
-
 		/* must happen prior to any irq */
 		napi_enable(&(ss)->napi);
 	}
@@ -3884,7 +3771,7 @@  static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	if (dac_enabled)
 		netdev->features |= NETIF_F_HIGHDMA;
-
+	netdev->features |= NETIF_F_GRO;
 	/* make sure we can get an irq, and that MSI can be
 	 * setup (if available).  Also ensure netdev->irq
 	 * is set to correct value if MSI is enabled */