diff mbox

b44: Reset due to FIFO overflow.

Message ID 1277716394.4235.235.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet June 28, 2010, 9:13 a.m. UTC
Le lundi 28 juin 2010 à 08:41 +0100, James Courtier-Dutton a écrit :
> Hi,
> 
> Reference:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/279102
> 
> I can see this bug and can reproduce it 100% on demand.
> The problem seems to be that when the b44 has a incoming FIFO buffer
> overflow, it resets the entire card, dis-associates with the access
> point and therefore takes some time before it can pass traffic again.
> Can anyone point me to some code that would just recover the FIFO
> instead of reset the entire card?
> 
> I am a kernel developer, but I don't have any data sheets on this card
> so was hoping someone with more knowledge of its workings, could help
> me.
> 
> I can then test it, and see if it is a good fix or not.
> 

Hi

Problem is we dont know if a Receive Fifo overflow is a minor or major
indication from b44 chip.

A minor indication would be : Chip tells us one or more frame were lost.
No special action needed from driver.

A major indication (as of current implemented in b44 driver) is :
I am completely out of order and need a reset. Please do it.

Patch to switch from major to minor indication is easy, but we dont know
if its valid or not.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Mitchell Erblich June 28, 2010, 9:33 a.m. UTC | #1
On Jun 28, 2010, at 2:13 AM, Eric Dumazet wrote:

> Le lundi 28 juin 2010 à 08:41 +0100, James Courtier-Dutton a écrit :
>> Hi,
>> 
>> Reference:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/279102
>> 
>> I can see this bug and can reproduce it 100% on demand.
>> The problem seems to be that when the b44 has a incoming FIFO buffer
>> overflow, it resets the entire card, dis-associates with the access
>> point and therefore takes some time before it can pass traffic again.
>> Can anyone point me to some code that would just recover the FIFO
>> instead of reset the entire card?
>> 
>> I am a kernel developer, but I don't have any data sheets on this card
>> so was hoping someone with more knowledge of its workings, could help
>> me.
>> 
>> I can then test it, and see if it is a good fix or not.
>> 
> 
> Hi
> 
> Problem is we dont know if a Receive Fifo overflow is a minor or major
> indication from b44 chip.
> 
> A minor indication would be : Chip tells us one or more frame were lost.
> No special action needed from driver.
> 
> A major indication (as of current implemented in b44 driver) is :
> I am completely out of order and need a reset. Please do it.
> 
> Patch to switch from major to minor indication is easy, but we dont know
> if its valid or not.
> 
> diff --git a/drivers/net/b44.h b/drivers/net/b44.h
> index e1905a4..514dc3a 100644
> --- a/drivers/net/b44.h
> +++ b/drivers/net/b44.h
> @@ -42,7 +42,7 @@
> #define  ISTAT_EMAC		0x04000000 /* EMAC Interrupt */
> #define  ISTAT_MII_WRITE	0x08000000 /* MII Write Interrupt */
> #define  ISTAT_MII_READ		0x10000000 /* MII Read Interrupt */
> -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
> +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
> #define B44_IMASK	0x0024UL /* Interrupt Mask */
> #define  IMASK_DEF		(ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
> #define B44_GPTIMER	0x0028UL /* General Purpose Timer */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Why wouldn't the ability to recv frames after a Recv FIFO overflow
 indicate that a reset is NOT required?

Thus,  should't it be an indication of congestion if associated with a single
flow and either speed up (reduce latency to service) the recv side or 
slow down the xmit side?

Mitchell Erblich--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 28, 2010, 10 a.m. UTC | #2
Le lundi 28 juin 2010 à 02:33 -0700, Mitchell Erblich a écrit :

> 
> Why wouldn't the ability to recv frames after a Recv FIFO overflow
>  indicate that a reset is NOT required?
> 

Do you have datasheets of all b44 chips ? I dont.

> Thus,  should't it be an indication of congestion if associated with a single
> flow and either speed up (reduce latency to service) the recv side or 
> slow down the xmit side?

I dont understand what you are saying. xmit is not the problem here.
And driver is flow agnostic. Its well before network stack.

Problem is we receive a spike of RX network frames (possibly UDP or some
other RX only trafic), and chip raises an RX fifo overflow _error_
indication.

Some hardware are buggy enough that such error indication is fatal and
_require_ hardware reset. Thats life. I suspect b44 driver doing a full
reset is not a random guess from driver author, but to avoid a complete
NIC lockup.

Refs:

http://bugzilla.kernel.org/show_bug.cgi?id=7696
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216338

commit 5fc7d61aee1a7f7d3448f8fbccaa93371ebeecb0



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 28, 2010, 10:17 a.m. UTC | #3
On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Problem is we receive a spike of RX network frames (possibly UDP or some
> other RX only trafic), and chip raises an RX fifo overflow _error_
> indication.
>

The cause of the RX overflow is in my case is TCP.
It is reproducible in mythtv.
While watching LiveTV, press "s" for the program guide.
The program guide is implemented into mythtv by a SQL query that
results in a large response.
The kernel is probably not servicing the RX FIFO quickly enough due to
it being busy doing something else. In this case, probably a video
mode switch.

> Some hardware are buggy enough that such error indication is fatal and
> _require_ hardware reset. Thats life. I suspect b44 driver doing a full
> reset is not a random guess from driver author, but to avoid a complete
> NIC lockup.
>

Interesting, which hardware, apart from the b44, is it that "requires"
a hardware reset after a RX FIFO overflow.
Now, it is true that the driver for the b44 is just as bad on windows,
but there the recovery requires a windows reboot!

Kind Regards

James
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 28, 2010, 10:24 a.m. UTC | #4
On 28 June 2010 10:13, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Hi
>
> Problem is we dont know if a Receive Fifo overflow is a minor or major
> indication from b44 chip.
>
> A minor indication would be : Chip tells us one or more frame were lost.
> No special action needed from driver.
>
> A major indication (as of current implemented in b44 driver) is :
> I am completely out of order and need a reset. Please do it.
>
> Patch to switch from major to minor indication is easy, but we dont know
> if its valid or not.
>
> diff --git a/drivers/net/b44.h b/drivers/net/b44.h
> index e1905a4..514dc3a 100644
> --- a/drivers/net/b44.h
> +++ b/drivers/net/b44.h
> @@ -42,7 +42,7 @@
>  #define  ISTAT_EMAC            0x04000000 /* EMAC Interrupt */
>  #define  ISTAT_MII_WRITE       0x08000000 /* MII Write Interrupt */
>  #define  ISTAT_MII_READ                0x10000000 /* MII Read Interrupt */
> -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
> +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
>  #define B44_IMASK      0x0024UL /* Interrupt Mask */
>  #define  IMASK_DEF             (ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
>  #define B44_GPTIMER    0x0028UL /* General Purpose Timer */
>
>
>

Ok, are you saying that all I have to do is apply this patch,
reproduce the problem condition, and if it recovers OK, then we can go
with this fix?
If so, I will try it out after work.

I will probably add a printk in before the ISTAT_ERRORS test, to
inform me when that ISTAT_RFO has actually happened.

But is doing nothing the right thing?
I would have thought that one would have to at least start and stop
the FIFO in order for the write/read pointers to be in the correct
positions or at least change the read pointer to do the equivalent of
flush the buffer.
Is there any of this sort of control over the FIFO possible?

Kind Regards

James
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 28, 2010, 11:09 a.m. UTC | #5
Le lundi 28 juin 2010 à 11:17 +0100, James Courtier-Dutton a écrit :
> On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Problem is we receive a spike of RX network frames (possibly UDP or some
> > other RX only trafic), and chip raises an RX fifo overflow _error_
> > indication.
> >
> 
> The cause of the RX overflow is in my case is TCP.
> It is reproducible in mythtv.
> While watching LiveTV, press "s" for the program guide.
> The program guide is implemented into mythtv by a SQL query that
> results in a large response.
> The kernel is probably not servicing the RX FIFO quickly enough due to
> it being busy doing something else. In this case, probably a video
> mode switch.
> 

Thats strange, b44 has a big RX ring... and tcp sender should wait for
ACK...

> > Some hardware are buggy enough that such error indication is fatal and
> > _require_ hardware reset. Thats life. I suspect b44 driver doing a full
> > reset is not a random guess from driver author, but to avoid a complete
> > NIC lockup.
> >
> 
> Interesting, which hardware, apart from the b44, is it that "requires"
> a hardware reset after a RX FIFO overflow.

Just take a look at some net drivers and you'll see some of them have
this requirement.

rtl8169_rx_interrupt()
...
	if (status & RxFOVF) {
		rtl8169_schedule_work(dev, rtl8169_reset_task);
		dev->stats.rx_fifo_errors++;
	}




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 28, 2010, 11:11 a.m. UTC | #6
Le lundi 28 juin 2010 à 11:24 +0100, James Courtier-Dutton a écrit :
> On 28 June 2010 10:13, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Hi
> >
> > Problem is we dont know if a Receive Fifo overflow is a minor or major
> > indication from b44 chip.
> >
> > A minor indication would be : Chip tells us one or more frame were lost.
> > No special action needed from driver.
> >
> > A major indication (as of current implemented in b44 driver) is :
> > I am completely out of order and need a reset. Please do it.
> >
> > Patch to switch from major to minor indication is easy, but we dont know
> > if its valid or not.
> >
> > diff --git a/drivers/net/b44.h b/drivers/net/b44.h
> > index e1905a4..514dc3a 100644
> > --- a/drivers/net/b44.h
> > +++ b/drivers/net/b44.h
> > @@ -42,7 +42,7 @@
> >  #define  ISTAT_EMAC            0x04000000 /* EMAC Interrupt */
> >  #define  ISTAT_MII_WRITE       0x08000000 /* MII Write Interrupt */
> >  #define  ISTAT_MII_READ                0x10000000 /* MII Read Interrupt */
> > -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
> > +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
> >  #define B44_IMASK      0x0024UL /* Interrupt Mask */
> >  #define  IMASK_DEF             (ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
> >  #define B44_GPTIMER    0x0028UL /* General Purpose Timer */
> >
> >
> >
> 
> Ok, are you saying that all I have to do is apply this patch,
> reproduce the problem condition, and if it recovers OK, then we can go
> with this fix?
> If so, I will try it out after work.
> 

Yes, please try the patch and tell us what happens.

Note : It can be better, it can be worse.

It can work on your b44 chip, and freeze another computer with another
b44 chip. Use at your own risk.


> I will probably add a printk in before the ISTAT_ERRORS test, to
> inform me when that ISTAT_RFO has actually happened.
> 

In this case, you also need to add ISTAT_RFO to IMASK_DEF. My patch
completely ignores ISTAT_RFO and NIC wont generate an interrupt for this
event.

> But is doing nothing the right thing?

If chip only signals an overflow and can revover, its pretty like losing
a frame on a busy network. It happens and you dont have a report for
this.

> I would have thought that one would have to at least start and stop
> the FIFO in order for the write/read pointers to be in the correct
> positions or at least change the read pointer to do the equivalent of
> flush the buffer.

All this depends on hardware bits I dont have access to.

> Is there any of this sort of control over the FIFO possible?
> 


Probably, but I guess if broadcam guys did not implement it, they
probably have a good reason :)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitchell Erblich June 28, 2010, 9:21 p.m. UTC | #7
On Jun 28, 2010, at 4:09 AM, Eric Dumazet wrote:

> Le lundi 28 juin 2010 à 11:17 +0100, James Courtier-Dutton a écrit :
>> On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>> Problem is we receive a spike of RX network frames (possibly UDP or some
>>> other RX only trafic), and chip raises an RX fifo overflow _error_
>>> indication.
>>> 

IMO, spikes are a normal behaviour.

>> 
>> The cause of the RX overflow is in my case is TCP.
>> It is reproducible in mythtv.
>> While watching LiveTV, press "s" for the program guide.
>> The program guide is implemented into mythtv by a SQL query that
>> results in a large response.
>> The kernel is probably not servicing the RX FIFO quickly enough due to
>> it being busy doing something else. In this case, probably a video
>> mode switch.
>> 
> 
> Thats strange, b44 has a big RX ring... and tcp sender should wait for
> ACK...
> 

Slow start, etc SHOULD/CAN  double the number of in-flight segments in each
next round-trip, placing them back to back.

IMO,  a stress test, would be a large number/wirespeed set of pings?

>>> Some hardware are buggy enough that such error indication is fatal and
>>> _require_ hardware reset. Thats life. I suspect b44 driver doing a full
>>> reset is not a random guess from driver author, but to avoid a complete
>>> NIC lockup.
>>> 
>> 
>> Interesting, which hardware, apart from the b44, is it that "requires"
>> a hardware reset after a RX FIFO overflow.
> 
> Just take a look at some net drivers and you'll see some of them have
> this requirement.
> 
> rtl8169_rx_interrupt()
> ...
> 	if (status & RxFOVF) {
> 		rtl8169_schedule_work(dev, rtl8169_reset_task);
> 		dev->stats.rx_fifo_errors++;
> 	}
> 
> 
> 
> 


If they can reset in say X frame loss units, then why not reset if
X is an acceptable number?

And a hammer may fix the dent, while I may be more
interested in preventing the dent in the first place.

Mitchell Erblich





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 28, 2010, 9:37 p.m. UTC | #8
On 28 June 2010 12:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 28 juin 2010 à 11:24 +0100, James Courtier-Dutton a écrit :
>> On 28 June 2010 10:13, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > Problem is we dont know if a Receive Fifo overflow is a minor or major
>> > indication from b44 chip.
>> >
>> > A minor indication would be : Chip tells us one or more frame were lost.
>> > No special action needed from driver.
>> >
>> > A major indication (as of current implemented in b44 driver) is :
>> > I am completely out of order and need a reset. Please do it.
>> >
>> > Patch to switch from major to minor indication is easy, but we dont know
>> > if its valid or not.
>> >
>> > diff --git a/drivers/net/b44.h b/drivers/net/b44.h
>> > index e1905a4..514dc3a 100644
>> > --- a/drivers/net/b44.h
>> > +++ b/drivers/net/b44.h
>> > @@ -42,7 +42,7 @@
>> >  #define  ISTAT_EMAC            0x04000000 /* EMAC Interrupt */
>> >  #define  ISTAT_MII_WRITE       0x08000000 /* MII Write Interrupt */
>> >  #define  ISTAT_MII_READ                0x10000000 /* MII Read Interrupt */
>> > -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
>> > +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
>> >  #define B44_IMASK      0x0024UL /* Interrupt Mask */
>> >  #define  IMASK_DEF             (ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
>> >  #define B44_GPTIMER    0x0028UL /* General Purpose Timer */
>> >
>> >
>> >
>>
>> Ok, are you saying that all I have to do is apply this patch,
>> reproduce the problem condition, and if it recovers OK, then we can go
>> with this fix?
>> If so, I will try it out after work.
>>
>
> Yes, please try the patch and tell us what happens.
>
> Note : It can be better, it can be worse.
>
> It can work on your b44 chip, and freeze another computer with another
> b44 chip. Use at your own risk.
>

I tried the patch.
I also tried without the patch, but bypassed the hw reset in the RFO case.

In both cases, the hardware did not recover from the overflow.
An "ifconfig eth0 down" then "ifconfig eth0 up" was required to bring
it back to life, I.e. A manual hw reset.

What I did find is that once the RFO state is reached, it is not cleared.
I think we need to find a way to clear the RFO state.
The RFO state is cleared after a HW reset.

Kind Regards

James
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 29, 2010, 5:17 a.m. UTC | #9
Le lundi 28 juin 2010 à 14:21 -0700, Mitchell Erblich a écrit :
> On Jun 28, 2010, at 4:09 AM, Eric Dumazet wrote:
> 
> > Le lundi 28 juin 2010 à 11:17 +0100, James Courtier-Dutton a écrit :
> >> On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>> 
> >>> Problem is we receive a spike of RX network frames (possibly UDP or some
> >>> other RX only trafic), and chip raises an RX fifo overflow _error_
> >>> indication.
> >>> 
> 
> IMO, spikes are a normal behaviour.

Yes, this is why I said NIC is buggy, if it requires a reset (lasting a
_very_ long time) on a normal condition.

> 
> >> 
> >> The cause of the RX overflow is in my case is TCP.
> >> It is reproducible in mythtv.
> >> While watching LiveTV, press "s" for the program guide.
> >> The program guide is implemented into mythtv by a SQL query that
> >> results in a large response.
> >> The kernel is probably not servicing the RX FIFO quickly enough due to
> >> it being busy doing something else. In this case, probably a video
> >> mode switch.
> >> 
> > 
> > Thats strange, b44 has a big RX ring... and tcp sender should wait for
> > ACK...
> > 
> 
> Slow start, etc SHOULD/CAN  double the number of in-flight segments in each
> next round-trip, placing them back to back.
> 

rx ring buffer is about 200 frames on b44. One single tcp flow should
fit.

Limit is 511. James, did you try to increase rx ring ?

ethtool -G eth0 rx 511

> IMO,  a stress test, would be a large number/wirespeed set of pings?
> 

Better is to use frames that are going to slow down receiver.
Say multicast trafic with 100 receivers on same multicast group.
Send 1000 consecutive frames, last ones will trigger RX overflow,
because softirq handler cannot be fast enough.

Ping is answered by kernel, its pretty fast.

> >>> Some hardware are buggy enough that such error indication is fatal and
> >>> _require_ hardware reset. Thats life. I suspect b44 driver doing a full
> >>> reset is not a random guess from driver author, but to avoid a complete
> >>> NIC lockup.
> >>> 
> >> 
> >> Interesting, which hardware, apart from the b44, is it that "requires"
> >> a hardware reset after a RX FIFO overflow.
> > 
> > Just take a look at some net drivers and you'll see some of them have
> > this requirement.
> > 
> > rtl8169_rx_interrupt()
> > ...
> > 	if (status & RxFOVF) {
> > 		rtl8169_schedule_work(dev, rtl8169_reset_task);
> > 		dev->stats.rx_fifo_errors++;
> > 	}
> > 
> > 
> > 
> > 
> 
> 
> If they can reset in say X frame loss units, then why not reset if
> X is an acceptable number?
> 

Because a reset is an exception. While card is reset, we lose many tx
and rx frames and this should be the very last thing to consider.

Why not a complete reboot of the host while we are at it ?

> And a hammer may fix the dent, while I may be more
> interested in preventing the dent in the first place.

So ? Please submit an alternative firmware for this NIC, or provide
another NIC on thousand of machines that are stuck with it.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 29, 2010, 8:42 a.m. UTC | #10
On 28 June 2010 22:37, James Courtier-Dutton <james.dutton@gmail.com> wrote:
>
> I tried the patch.
> I also tried without the patch, but bypassed the hw reset in the RFO case.
>
> In both cases, the hardware did not recover from the overflow.
> An "ifconfig eth0 down" then "ifconfig eth0 up" was required to bring
> it back to life, I.e. A manual hw reset.
>
> What I did find is that once the RFO state is reached, it is not cleared.
> I think we need to find a way to clear the RFO state.
> The RFO state is cleared after a HW reset.
>
> Kind Regards
>
> James
>

Under further analysis, I have found that RFO is not cleared by a
write to bw32(bp, B44_ISTAT, istat);
whereas most other conditions should be cleared by this.

So, I went searching in the hardware reset functions for when the RFO
was cleared.

I found it:
A call to this:
ssb_device_enable(bp->sdev, 0)
in the b44_chip_reset function is what actually clears the RFO.
So, does anyone have any data sheets on the ssb ?
The ssb looks to me like the DMA engine.

On a more positive note, if we can get the ssb to reset without the
phy resetting, we could have our smooth recovery achieved.

Kind Regards

James
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 30, 2010, 7:35 p.m. UTC | #11
On 29 June 2010 06:17, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> rx ring buffer is about 200 frames on b44. One single tcp flow should
> fit.
>
> Limit is 511. James, did you try to increase rx ring ?
>
> ethtool -G eth0 rx 511
>

Any value given to that command results in a non operating NIC.
rmmod b44
modprobe b44 fixes that.

I have narrowed down the problem now.
I can now reset the RX fifo after the RFO in less than 20 ms without
the Ethernet Link going down.
When I get a moment, I will post a patch so other people can test it.

Kind Regards

James
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 30, 2010, 8:22 p.m. UTC | #12
From: James Courtier-Dutton <james.dutton@gmail.com>
Date: Mon, 28 Jun 2010 11:17:59 +0100

> Interesting, which hardware, apart from the b44, is it that "requires"
> a hardware reset after a RX FIFO overflow.

This problem is quite common, actually.

Even though it shouldn't be, this is seemingly one of the least tested
paths of a networking chip.

You'd think the recovery would be easy, flush the fifos and drop the
packet, then rewind the RX descriptor pointer.

But it's not and I've seen everything from RX descriptor corruption
to random DMA splats elsewhere corrupting memory entirely, as a result
of a networking card taking a RX fifo overflow.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Courtier-Dutton June 30, 2010, 10:20 p.m. UTC | #13
On 30 June 2010 21:22, David Miller <davem@davemloft.net> wrote:
> From: James Courtier-Dutton <james.dutton@gmail.com>
> Date: Mon, 28 Jun 2010 11:17:59 +0100
>
>> Interesting, which hardware, apart from the b44, is it that "requires"
>> a hardware reset after a RX FIFO overflow.
>
> This problem is quite common, actually.
>
> Even though it shouldn't be, this is seemingly one of the least tested
> paths of a networking chip.
>
> You'd think the recovery would be easy, flush the fifos and drop the
> packet, then rewind the RX descriptor pointer.
>
> But it's not and I've seen everything from RX descriptor corruption
> to random DMA splats elsewhere corrupting memory entirely, as a result
> of a networking card taking a RX fifo overflow.
>

Well, I have just written a patch (see other thread) to try and reset
the FIFO instead of a complete HW reset.
How do I know if I have RX descriptor corruption, or random DMA splats?
I have not detected any problems so far.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/b44.h b/drivers/net/b44.h
index e1905a4..514dc3a 100644
--- a/drivers/net/b44.h
+++ b/drivers/net/b44.h
@@ -42,7 +42,7 @@ 
 #define  ISTAT_EMAC		0x04000000 /* EMAC Interrupt */
 #define  ISTAT_MII_WRITE	0x08000000 /* MII Write Interrupt */
 #define  ISTAT_MII_READ		0x10000000 /* MII Read Interrupt */
-#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
+#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
 #define B44_IMASK	0x0024UL /* Interrupt Mask */
 #define  IMASK_DEF		(ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
 #define B44_GPTIMER	0x0028UL /* General Purpose Timer */