diff mbox

[Bug,#14925] sky2 panic under load

Message ID 4B4A729E.9060805@gmail.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Berck Nash Jan. 11, 2010, 12:36 a.m. UTC
Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
> 
> The following bug entry is on the current list of known regressions
> from 2.6.32.  Please verify if it still should be listed and let me know
> (either way).
> 
> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14925
> Subject		: sky2 panic under load
> Submitter	: Berck E. Nash <flyboy@gmail.com>
> Date		: 2009-12-21 23:52 (21 days old)
> References	: http://marc.info/?l=linux-kernel&m=126143955730347&w=4
> 		  http://marc.info/?l=linux-kernel&m=126160893126548&w=4
> Handled-By	: Stephen Hemminger <shemminger@vyatta.com>

The patch attached to the bug report did not fix the problem, but I'm
fairly certain that this one from Jarek P. did:

During TX timeout procedure dev could be awoken too early, e.g. by
sky2_complete_tx() called from sky2_down(). Then sky2_xmit_frame()
can run while buffers are freed causing an oops. This patch fixes it
by adding netif_device_present() test in sky2_tx_complete().

Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=14925

With debugging by: Mike McCormack <mikem@ring3k.org>

Reported-by: Berck E. Nash <flyboy@gmail.com>
Tested-by: Berck E. Nash <flyboy@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

---

 drivers/net/sky2.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jarek Poplawski Jan. 11, 2010, 1:26 p.m. UTC | #1
On Sun, Jan 10, 2010 at 05:36:46PM -0700, Berck E. Nash wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.32.  Please verify if it still should be listed and let me know
> > (either way).

BTW, I don't know why Berck didn't experience such a panic before
2.6.32, but seems not a regression to me. There might be new/more sky2
TX timeouts which trigger this panic and would make a real regression.

Cheers,
Jarek P.

> > 
> > 
> > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14925
> > Subject		: sky2 panic under load
> > Submitter	: Berck E. Nash <flyboy@gmail.com>
> > Date		: 2009-12-21 23:52 (21 days old)
> > References	: http://marc.info/?l=linux-kernel&m=126143955730347&w=4
> > 		  http://marc.info/?l=linux-kernel&m=126160893126548&w=4
> > Handled-By	: Stephen Hemminger <shemminger@vyatta.com>
> 
> The patch attached to the bug report did not fix the problem, but I'm
> fairly certain that this one from Jarek P. did:
> 
> During TX timeout procedure dev could be awoken too early, e.g. by
> sky2_complete_tx() called from sky2_down(). Then sky2_xmit_frame()
> can run while buffers are freed causing an oops. This patch fixes it
> by adding netif_device_present() test in sky2_tx_complete().
> 
> Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=14925
> 
> With debugging by: Mike McCormack <mikem@ring3k.org>
> 
> Reported-by: Berck E. Nash <flyboy@gmail.com>
> Tested-by: Berck E. Nash <flyboy@gmail.com>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> 
> ---
> 
>  drivers/net/sky2.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
> index 1c01b96..7650f73 100644
> --- a/drivers/net/sky2.c
> +++ b/drivers/net/sky2.c
> @@ -1844,7 +1844,8 @@ static void sky2_tx_complete(struct sky2_port
> *sky2, u16 done)
>  	sky2->tx_cons = idx;
>  	smp_mb();
> 
> -	if (tx_avail(sky2) > MAX_SKB_TX_LE + 4)
> +	/* Wake unless it's detached, and called e.g. from sky2_down() */
> +	if (tx_avail(sky2) > MAX_SKB_TX_LE + 4 && netif_device_present(dev))
>  		netif_wake_queue(dev);
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 11, 2010, 7:32 p.m. UTC | #2
On Monday 11 January 2010, Jarek Poplawski wrote:
> On Sun, Jan 10, 2010 at 05:36:46PM -0700, Berck E. Nash wrote:
> > Rafael J. Wysocki wrote:
> > > This message has been generated automatically as a part of a report
> > > of recent regressions.
> > > 
> > > The following bug entry is on the current list of known regressions
> > > from 2.6.32.  Please verify if it still should be listed and let me know
> > > (either way).
> 
> BTW, I don't know why Berck didn't experience such a panic before
> 2.6.32, but seems not a regression to me. There might be new/more sky2
> TX timeouts which trigger this panic and would make a real regression.

Even if the code has always been broken, but it's only become visible after
2.6.32, that still counts as a regression IMO, because now the users are
affected who weren't before.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Jan. 11, 2010, 8:31 p.m. UTC | #3
On Mon, Jan 11, 2010 at 08:32:24PM +0100, Rafael J. Wysocki wrote:
> On Monday 11 January 2010, Jarek Poplawski wrote:
> > On Sun, Jan 10, 2010 at 05:36:46PM -0700, Berck E. Nash wrote:
> > > Rafael J. Wysocki wrote:
> > > > This message has been generated automatically as a part of a report
> > > > of recent regressions.
> > > > 
> > > > The following bug entry is on the current list of known regressions
> > > > from 2.6.32.  Please verify if it still should be listed and let me know
> > > > (either way).
> > 
> > BTW, I don't know why Berck didn't experience such a panic before
> > 2.6.32, but seems not a regression to me. There might be new/more sky2
> > TX timeouts which trigger this panic and would make a real regression.
> 
> Even if the code has always been broken, but it's only become visible after
> 2.6.32, that still counts as a regression IMO, because now the users are
> affected who weren't before.

Right, but:
1) someone with a similar but older problem might be mislead a fix is
   not for them;
2) someone with exactly this one problem (i.e. Berck ;-) might be
   mislead "no oops" is enough, while their linux might be still worse
   than before. (So I intended Berck to re-consider or even re-check
   this problem wrt. 2.6.31, and maybe even reporting another
   regression.)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Jan. 11, 2010, 8:50 p.m. UTC | #4
On Monday 11 January 2010, Jarek Poplawski wrote:
> On Mon, Jan 11, 2010 at 08:32:24PM +0100, Rafael J. Wysocki wrote:
> > On Monday 11 January 2010, Jarek Poplawski wrote:
> > > On Sun, Jan 10, 2010 at 05:36:46PM -0700, Berck E. Nash wrote:
> > > > Rafael J. Wysocki wrote:
> > > > > This message has been generated automatically as a part of a report
> > > > > of recent regressions.
> > > > > 
> > > > > The following bug entry is on the current list of known regressions
> > > > > from 2.6.32.  Please verify if it still should be listed and let me know
> > > > > (either way).
> > > 
> > > BTW, I don't know why Berck didn't experience such a panic before
> > > 2.6.32, but seems not a regression to me. There might be new/more sky2
> > > TX timeouts which trigger this panic and would make a real regression.
> > 
> > Even if the code has always been broken, but it's only become visible after
> > 2.6.32, that still counts as a regression IMO, because now the users are
> > affected who weren't before.
> 
> Right, but:
> 1) someone with a similar but older problem might be mislead a fix is
>    not for them;

Not if the fix changelog mentions the older kernels explicitly.

> 2) someone with exactly this one problem (i.e. Berck ;-) might be
>    mislead "no oops" is enough, while their linux might be still worse
>    than before. (So I intended Berck to re-consider or even re-check
>    this problem wrt. 2.6.31, and maybe even reporting another
>    regression.)

Yes, rechecking the problem with 2.6.31 would be useful.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Berck Nash Jan. 11, 2010, 9:02 p.m. UTC | #5
Jarek Poplawski wrote:
> On Mon, Jan 11, 2010 at 08:32:24PM +0100, Rafael J. Wysocki wrote:
>> On Monday 11 January 2010, Jarek Poplawski wrote:
>>> On Sun, Jan 10, 2010 at 05:36:46PM -0700, Berck E. Nash wrote:
>>>> Rafael J. Wysocki wrote:
>>>>> This message has been generated automatically as a part of a report
>>>>> of recent regressions.
>>>>>
>>>>> The following bug entry is on the current list of known regressions
>>>>> from 2.6.32.  Please verify if it still should be listed and let me know
>>>>> (either way).
>>> BTW, I don't know why Berck didn't experience such a panic before
>>> 2.6.32, but seems not a regression to me. There might be new/more sky2
>>> TX timeouts which trigger this panic and would make a real regression.
>> Even if the code has always been broken, but it's only become visible after
>> 2.6.32, that still counts as a regression IMO, because now the users are
>> affected who weren't before.
> 
> Right, but:
> 1) someone with a similar but older problem might be mislead a fix is
>    not for them;
> 2) someone with exactly this one problem (i.e. Berck ;-) might be
>    mislead "no oops" is enough, while their linux might be still worse
>    than before. (So I intended Berck to re-consider or even re-check
>    this problem wrt. 2.6.31, and maybe even reporting another
>    regression.)

Well, the problem with this bug is how hard it is for me to reproduce.
I'm willing to admit that just because I never got it before 2.6.32
isn't proof that it wasn't there in 2.6.31.  But it's a regression
somewhere along the line, since I've been using this hardware for over 3
years now.  There were lots of bugs in the sky2 driver years ago, but
for the last 2+ years or so, I haven't had any trouble at all until now.

The bug only shows up for me with bittorrent traffic.  I also use the
same adapter to transfer backups over the network from several
computers, and that doesn't trigger it...

I used 2.6.31 for however long it was the current stable, and I never
got a crash with it.  After I got several crashes in 2.6.32, I reverted
to 2.6.31 until Jarek sent this patch that seems to have fixed it.  I've
never gotten it to crash in 2.6.31, so I'm pretty sure it's a 2.6.32
regression, but I can't prove it.

I would love to do more testing, but since I can't reproduce the bug at
will, I'm not really sure what to offer?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Jan. 11, 2010, 9:47 p.m. UTC | #6
On Mon, Jan 11, 2010 at 02:02:40PM -0700, Berck E. Nash wrote:
> Jarek Poplawski wrote:
> > 2) someone with exactly this one problem (i.e. Berck ;-) might be
> >    mislead "no oops" is enough, while their linux might be still worse
> >    than before. (So I intended Berck to re-consider or even re-check
> >    this problem wrt. 2.6.31, and maybe even reporting another
> >    regression.)
> 
> Well, the problem with this bug is how hard it is for me to reproduce.
> I'm willing to admit that just because I never got it before 2.6.32
> isn't proof that it wasn't there in 2.6.31.  But it's a regression
> somewhere along the line, since I've been using this hardware for over 3
> years now.  There were lots of bugs in the sky2 driver years ago, but
> for the last 2+ years or so, I haven't had any trouble at all until now.
> 
> The bug only shows up for me with bittorrent traffic.  I also use the
> same adapter to transfer backups over the network from several
> computers, and that doesn't trigger it...
> 
> I used 2.6.31 for however long it was the current stable, and I never
> got a crash with it.  After I got several crashes in 2.6.32, I reverted
> to 2.6.31 until Jarek sent this patch that seems to have fixed it.  I've
> never gotten it to crash in 2.6.31, so I'm pretty sure it's a 2.6.32
> regression, but I can't prove it.
> 
> I would love to do more testing, but since I can't reproduce the bug at
> will, I'm not really sure what to offer?

I mainly thought about looking at the logs during such bittorrent to
compare if there is any visible change in "sky2 eth0: disabling
interface"/"sky2 eth0: enabling interface" etc. frequency between
2.6.31 and 2.6.32.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index 1c01b96..7650f73 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1844,7 +1844,8 @@  static void sky2_tx_complete(struct sky2_port
*sky2, u16 done)
 	sky2->tx_cons = idx;
 	smp_mb();

-	if (tx_avail(sky2) > MAX_SKB_TX_LE + 4)
+	/* Wake unless it's detached, and called e.g. from sky2_down() */
+	if (tx_avail(sky2) > MAX_SKB_TX_LE + 4 && netif_device_present(dev))
 		netif_wake_queue(dev);
 }