Patchwork mv643xx_eth: fix SMI bus access timeouts

Submitter Lennert Buytenhek
Date Nov. 1, 2008, 5:32 a.m.
Message ID <20081101053220.GA13348@xi.wantstofly.org>
Permalink /patch/6770/
State Accepted
Delegated to: Jeff Garzik

Comments

Lennert Buytenhek - Nov. 1, 2008, 5:32 a.m.
The mv643xx_eth mii bus implementation uses wait_event_timeout() to
wait for SMI completion interrupts.

If wait_event_timeout() would return zero, mv643xx_eth would conclude
that the SMI access timed out, but this is not necessarily true --
wait_event_timeout() can also return zero in the case where the SMI
completion interrupt did happen in time but where it took longer than
the requested timeout for the process performing the SMI access to be
scheduled again.  This would lead to occasional SMI access timeouts
when the system would be under heavy load.

The fix is to ignore the return value of wait_event_timeout(), and
to re-check the SMI done bit after wait_event_timeout() returns to
determine whether or not the SMI access timed out.

Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
---
The commit that introduced this was added in the .28 dev cycle, so
this fix is for .28 only.  Thanks!
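
The difference between the two checks can be sketched in user space.
This is a minimal model under stated assumptions, not driver code:
`struct wakeup`, `check_return_value()` and `check_done_bit()` are
hypothetical names, and the struct only records what the waiting
process would observe once it is scheduled again.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of what the waiter observes when it finally runs
 * again after wait_event_timeout(); none of this is kernel code.
 */
struct wakeup {
	long remaining;	/* modelled return value: 0 means the timeout elapsed */
	bool smi_done;	/* what smi_is_done() would report at that moment */
};

/* Old logic: trust the return value alone. */
static int check_return_value(struct wakeup w)
{
	assert(w.remaining >= 0);
	return w.remaining == 0 ? -ETIMEDOUT : 0;
}

/* New logic: ignore the return value and re-check the done bit. */
static int check_done_bit(struct wakeup w)
{
	return w.smi_done ? 0 : -ETIMEDOUT;
}
```

For a wakeup of `{ .remaining = 0, .smi_done = true }` (the interrupt
arrived in time but the process was rescheduled late), the first check
reports -ETIMEDOUT while the second returns 0.  For a genuine timeout,
`{ .remaining = 0, .smi_done = false }`, both report -ETIMEDOUT.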

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier - Nov. 1, 2008, 5:47 a.m.
> If wait_event_timeout() would return zero, mv643xx_eth would conclude
> that the SMI access timed out, but this is not necessarily true --
> wait_event_timeout() can also return zero in the case where the SMI
> completion interrupt did happen in time but where it took longer than
> the requested timeout for the process performing the SMI access to be
> scheduled again.  This would lead to occasional SMI access timeouts
> when the system would be under heavy load.

Would it make more sense to fix this in the wait_event_timeout() code
itself a la bb10ed09 ("sched: fix wait_for_completion_timeout() spurious
failure under heavy load")?

 - R.
Lennert Buytenhek - Nov. 1, 2008, 5:53 a.m.
On Fri, Oct 31, 2008 at 10:47:47PM -0700, Roland Dreier wrote:

> > If wait_event_timeout() would return zero, mv643xx_eth would conclude
> > that the SMI access timed out, but this is not necessarily true --
> > wait_event_timeout() can also return zero in the case where the SMI
> > completion interrupt did happen in time but where it took longer than
> > the requested timeout for the process performing the SMI access to be
> > scheduled again.  This would lead to occasional SMI access timeouts
> > when the system would be under heavy load.
> 
> Would it make more sense to fix this in the wait_event_timeout() code
> itself a la bb10ed09 ("sched: fix wait_for_completion_timeout() spurious
> failure under heavy load")?

Well, wait_event_timeout() does (or did, before that commit) exactly
what its docbook comment says it does:

	 * The function returns 0 if the @timeout elapsed, and the remaining
	 * jiffies if the condition evaluated to true before the timeout elapsed.

Making it return 1 jiffy seems a bit hacky.  Why not go all the way
and just make it return 0 or 1 in all cases and audit all the callers
(and update the docbook)?
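
The return-value conventions being compared here can be laid out in a
small user-space model.  The function names are hypothetical and this
is not the kernel's implementation; it assumes only the docbook
contract quoted above and the "return 1 jiffy" adjustment as it is
characterized in this thread.  The inputs describe the moment the
waiter returns: whether the condition holds, and how many jiffies of
the timeout are left (0 if it elapsed).

```c
#include <assert.h>
#include <stdbool.h>

/* Documented contract: 0 if the timeout elapsed, else remaining jiffies. */
static long ret_documented(bool cond_true, long remaining)
{
	(void)cond_true;	/* the condition is not consulted once time is up */
	assert(remaining >= 0);
	return remaining;
}

/* "Return 1 jiffy" variant: never report 0 while the condition holds. */
static long ret_min_one_jiffy(bool cond_true, long remaining)
{
	if (cond_true && remaining == 0)
		return 1;
	return remaining;
}

/* "0 or 1" variant: report only whether the condition holds. */
static long ret_boolean(bool cond_true, long remaining)
{
	(void)remaining;
	return cond_true ? 1 : 0;
}
```

With the condition true and no jiffies left, `ret_documented()` yields
0, whereas the other two yield 1.  The last variant would stop telling
callers how much of the timeout remains, which is why every caller
would need auditing.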
Jeff Garzik - Nov. 3, 2008, 8:25 p.m.
Lennert Buytenhek wrote:
> The mv643xx_eth mii bus implementation uses wait_event_timeout() to
> wait for SMI completion interrupts.
> 
> If wait_event_timeout() would return zero, mv643xx_eth would conclude
> that the SMI access timed out, but this is not necessarily true --
> wait_event_timeout() can also return zero in the case where the SMI
> completion interrupt did happen in time but where it took longer than
> the requested timeout for the process performing the SMI access to be
> scheduled again.  This would lead to occasional SMI access timeouts
> when the system would be under heavy load.
> 
> The fix is to ignore the return value of wait_event_timeout(), and
> to re-check the SMI done bit after wait_event_timeout() returns to
> determine whether or not the SMI access timed out.
> 
> Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
> ---
> The commit that introduced this was added in the .28 dev cycle, so
> this fix is for .28 only.  Thanks!
> 
> diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
> index d25a302..0e94ed3 100644
> --- a/drivers/net/mv643xx_eth.c
> +++ b/drivers/net/mv643xx_eth.c
> @@ -1065,9 +1065,12 @@ static int smi_wait_ready(struct mv643xx_eth_shared_private *msp)
>  		return 0;
>  	}
>  
> -	if (!wait_event_timeout(msp->smi_busy_wait, smi_is_done(msp),
> -				msecs_to_jiffies(100)))
> -		return -ETIMEDOUT;
> +	if (!smi_is_done(msp)) {
> +		wait_event_timeout(msp->smi_busy_wait, smi_is_done(msp),
> +				   msecs_to_jiffies(100));
> +		if (!smi_is_done(msp))
> +			return -ETIMEDOUT;
> +	}

applied



Patch

diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index d25a302..0e94ed3 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -1065,9 +1065,12 @@  static int smi_wait_ready(struct mv643xx_eth_shared_private *msp)
 		return 0;
 	}
 
-	if (!wait_event_timeout(msp->smi_busy_wait, smi_is_done(msp),
-				msecs_to_jiffies(100)))
-		return -ETIMEDOUT;
+	if (!smi_is_done(msp)) {
+		wait_event_timeout(msp->smi_busy_wait, smi_is_done(msp),
+				   msecs_to_jiffies(100));
+		if (!smi_is_done(msp))
+			return -ETIMEDOUT;
+	}
 
 	return 0;
 }