atl1c issues on 3.8.2

Message ID 1363154258.13690.40.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet March 13, 2013, 5:57 a.m.
On Tue, 2013-03-12 at 18:09 +0100, Michael Büsch wrote:
> On Tue, 12 Mar 2013 16:45:44 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2013-03-12 at 16:17 +0100, Michael Büsch wrote:
> > > Hi,
> > > 
> > > Starting with 3.8.x scp stalls the atl1c based interface on my Asus Eeepc 1011px.
> > > iperf (for example) does not do that. But after scp stalled the interface,
> > > iperf transfers fail, too.
> > 
> > I am pretty sure David stable list contains the needed fix 
> > 
> > http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
> 
> No this didn't fix it.
> 
> However, I tried to revert 69b08f62e17439ee3d436faf0b9a7ca6fffb78db again,
> which already caused trouble for me in 3.7
> and this fixed the issue.
> 
> So it seems that this still is the same or a related issue that I reported
> for 3.7. I just wrongly stated that the problem was fixed in 3.8, because my
> simple ping test doesn't catch it on 3.8.
> 



kmalloc(2000) has never guaranteed that the returned buffer does not
span two 4K pages.
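
To make the condition concrete, here is a minimal userspace sketch of the
check "does a buffer cross a 4K boundary?". It is an illustration only: the
helper name spans_two_pages() and the HW_PAGE_SIZE constant are hypothetical
and do not come from the atl1c driver or the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the hardware-relevant page size is 4K, as discussed above. */
#define HW_PAGE_SIZE 4096u

/*
 * Return true if the buffer [addr, addr + len) touches more than one
 * 4K page. A zero-length buffer never spans anything.
 */
bool spans_two_pages(uintptr_t addr, size_t len)
{
	if (len == 0)
		return false;

	/* Compare the page index of the first and the last byte. */
	return (addr / HW_PAGE_SIZE) != ((addr + len - 1) / HW_PAGE_SIZE);
}
```

For example, a 2000-byte buffer starting at offset 2048 within a page stays
inside that page, while the same buffer starting at offset 2304 crosses into
the next one.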

Apparently the NIC doesn't allow an rx descriptor spanning two 4K pages,
or it has a particular hardware bug that I cannot possibly find myself.
(I have neither atl1c hardware nor any documentation for it.)

atl1c driver authors will need to find the bug and fix the driver.

Drivers that deal with this kind of hardware limitation allocate pages
themselves and provide skbs with a fragment to the upper stack, or use
build_skb() once the frame is received.

drivers/net/ethernet/intel/igb/igb_main.c is an example.

Could you try (on the net-next tree) different values for the
NETDEV_FRAG_PAGE_MAX_ORDER constant? It might give Atheros some
hints.

(8192 and 16384)
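
For reference, a small userspace sketch of how such sizes map to page orders
and to the resulting fragment page size. It assumes 4K pages and only
approximates the kernel's get_order(); the names sketch_get_order(),
sketch_frag_page_size() and SKETCH_PAGE_SHIFT are hypothetical.

```c
#include <stddef.h>

/* Assumption: 4K pages (PAGE_SHIFT == 12), as on the x86 netbook here. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Userspace approximation of get_order(): the smallest order such that
 * (page size << order) is at least 'size' bytes.
 */
unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = (size_t)1 << SKETCH_PAGE_SHIFT;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Mirrors the shape of NETDEV_FRAG_PAGE_MAX_SIZE: PAGE_SIZE << order. */
size_t sketch_frag_page_size(size_t max_size)
{
	return ((size_t)1 << SKETCH_PAGE_SHIFT) << sketch_get_order(max_size);
}
```

Under these assumptions, 8192 gives order 1, 16384 gives order 2, and the
current 32768 gives order 3.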



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet March 14, 2013, 2:31 p.m. | #1
And it seems a possible fix is here:

http://patchwork.ozlabs.org/patch/227666/



Michael Büsch March 14, 2013, 10:17 p.m. | #2
On Thu, 14 Mar 2013 15:31:00 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> And it seems the possible fix is here :
> 
> http://patchwork.ozlabs.org/patch/227666/

I can still reproduce the issue with this fix applied.

However, I noticed that I cannot reproduce it if the wireless interface (ath9k) of
the netbook is down while testing the ethernet. The wireless does not carry any
test traffic. It's just idle.
I do not know whether this has always been the case, because wireless was always up (and mostly
idle) in my previous ethernet tests.
Eric Dumazet March 14, 2013, 11:06 p.m. | #3
On Thu, 2013-03-14 at 23:17 +0100, Michael Büsch wrote:

> I can still reproduce with this fix applied.
> 
> However, I noticed that I cannot reproduce, if the wireless interface (ath9k) of
> the netbook is down while testing the ethernet. The wireless does not carry any
> test traffic. It's just idle.
> I do not know if this always had been the case, because wireless was always up (and mostly
> idle) in my previous ethernet tests.
> 

OK, then it must be some kind of corruption issue in ath9k, or something similar?

You could try various debugging options, like CONFIG_DEBUG_PAGEALLOC and
CONFIG_SLUB_DEBUG_ON



Michael Büsch March 15, 2013, 7:44 p.m. | #4
On Fri, 15 Mar 2013 00:06:02 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> You could try various DEBUGing stuff, like CONFIG_DEBUG_PAGEALLOC and
> CONFIG_SLUB_DEBUG_ON

This bug is so weird that I did some double-checking,
just to minimize mistakes on my side.
I compiled a kernel without the revert of the original commit
and without the skb fix you suggested.
It turns out that I am only able to reproduce the issue if the ath9k interface is
up while testing the atl1c ethernet.
I also double-checked that reverting the original commit fixes the issue:
there are no stalls then, whether ath9k is up or down.
So that confirms my previous results.

I tried to enable pagealloc debug and slub debug on a kernel with the suggested skb
fix, but without the revert of the commit. Nothing special appeared
in the logs. I'm currently building a kernel with almost all debugging options
turned on. I will test that tomorrow.

Thanks for your help.
Michael Büsch March 22, 2013, 11:28 a.m. | #5
On Fri, 15 Mar 2013 20:44:57 +0100
Michael Büsch <m@bues.ch> wrote:

> I'm currently building a kernel with almost all debugging options
> turned on. I will test that tomorrow.

It took me a little bit longer than expected, but running the tests on
a kernel with almost all debugging options enabled shows no additional kernel messages. :/
Michael Büsch May 27, 2013, 4:43 p.m. | #6
Any news on this?

Am I still the only one with this issue?
It's still 100% reproducible and I can work around it by reverting
69b08f62e17439ee3d436faf0b9a7ca6fffb78db

It can't possibly be that I'm the only one on this planet seeing this...

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 821c7f4..769fdac 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1844,7 +1844,7 @@  static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(32768)
+#define NETDEV_FRAG_PAGE_MAX_ORDER get_order(8192)
 #define NETDEV_FRAG_PAGE_MAX_SIZE  (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)
 #define NETDEV_PAGECNT_MAX_BIAS	   NETDEV_FRAG_PAGE_MAX_SIZE