diff mbox

2.6.29 forcedeth hang W/O NAPI enabled

Message ID 20090326033658.GA14592@gondor.apana.org.au
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Herbert Xu March 26, 2009, 3:36 a.m. UTC
Adam Richter <adam_richter2004@yahoo.com> wrote:
> 
> The patch you forwarded to me seems to work.  Thank you for bringing it
> to my attention.

Hi Adam:

Any chance you can test this patch instead of the previous one?

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

This patch fixes this by essentially open-coding __napi_complete.

Note we no longer need the memory barrier because this function
is per-cpu.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,

Comments

Adam Richter March 26, 2009, 5:24 a.m. UTC | #1
--- On Wed, 3/25/09, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Hi Adam:
> 
> Any chance you can test this patch instead of the previous one?

OK.  I tried your patch.  It seems fine.  I was able to do "ping -f" for five minutes with no problems, and surf the web during that time.

I am preserving the rest of your message below, just to be clear about which patch I tested.

Adam


> 
> net: Fix netpoll lockup in legacy receive path
> 
> When I fixed the GRO crash in the legacy receive path I used
> napi_complete to replace __napi_complete.  Unfortunately they're
> not the same when NETPOLL is enabled, which may result in us
> not calling __napi_complete at all.
> 
> What's more, we really do need to keep the __napi_complete call
> within the IRQ-off section since in theory an IRQ can occur in
> between and fill up the backlog to the maximum, causing us to
> lock up.
> 
> This patch fixes this by essentially open-coding __napi_complete.
> 
> Note we no longer need the memory barrier because this function
> is per-cpu.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e3fe5c7..2a7f6b3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
> +			list_del(&napi->poll_list);
> +			clear_bit(NAPI_STATE_SCHED, &napi->state);
>  			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> +			break;
>  		}
>  		local_irq_enable();
>  
> @@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  
>  	napi_gro_flush(napi);
>  
> -out:
>  	return work;
>  }
> 
> Thanks,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 26, 2009, 6:58 a.m. UTC | #2
On Wed, Mar 25, 2009 at 10:24:42PM -0700, Adam Richter wrote:
> 
> OK.  I tried your patch.  It seems fine.  I was able to do "ping -f" for five minutes with no problems, and surf the web during that time.

Thanks for testing! Since you also have a forcedeth card, could
you try rebuilding your kernel with FORCEDETH_NAPI enabled and
see if that is stable?

> I am preserving the rest of your message below, just to be clear about which patch I tested.

Yep it's the right one.

Cheers,
Adam Richter March 26, 2009, 11:29 p.m. UTC | #3
On Wed, 3/25/09, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Thanks for testing! Since you also have a forcedeth card, could
> you try rebuilding your kernel with FORCEDETH_NAPI enabled and
> see if that is stable?

It works so far, doing "ping -f" for more than five minutes while I
browse the web. Bear in mind, though, that I was never able to reproduce
the problem on demand with FORCEDETH_NAPI; I only saw the ethernet lock
up once with NAPI enabled (on the original 2.6.29 code). I will let you
know if the lockup occurs with your code.

Just to be clear which version I am now testing, I have attached a
copy of process_backlog() from my net/core/dev.c.

Adam



static int process_backlog(struct napi_struct *napi, int quota)
{
	int work = 0;
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	unsigned long start_time = jiffies;

	napi->weight = weight_p;
	do {
		struct sk_buff *skb;

		local_irq_disable();
		skb = __skb_dequeue(&queue->input_pkt_queue);
		if (!skb) {
			list_del(&napi->poll_list);
			clear_bit(NAPI_STATE_SCHED, &napi->state);
			local_irq_enable();
			break;
		}
		local_irq_enable();

		napi_gro_receive(napi, skb);
	} while (++work < quota && jiffies == start_time);

	napi_gro_flush(napi);

	return work;
}




Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }