
net: Fix race condition on receive path.

Message ID 1267817699.30393.8.camel@c-dwalke-linux.qualcomm.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Daniel Walker March 5, 2010, 7:34 p.m. UTC
From: Jim Harford <c_jharfo@quicinc.com>

Fixes a race condition on the networking receive path that causes all
received packets to be dropped after 15-60 minutes of heavy network usage.
Function process_backlog() empties the receive queue, re-enables
interrupts, then "completes" the softIRQ.  This provides a time window for
netif_rx() to execute (in IRQ context) and enqueue a received packet
without re-scheduling the softIRQ.  After this, the receive queue is never
processed and the system eventually begins to drop all received packets.

The fix has netif_rx() calling napi_schedule(), whether or not the
receive queue is empty.  There are two questions that must be addressed:
  1) Is the fix correct?  Or, is it safe to always call napi_schedule()?
  2) Does the fix result in a performance degradation?

With respect to correctness, the NAPI softIRQ routine should NOT be
scheduled if it is already pending or running.  This is controlled by the
NAPI_STATE_SCHED bit in the "state" field of struct napi_struct.  While
this flag is set, the softIRQ routine is either already scheduled or
currently running.  When the softIRQ finishes, the bit is cleared via
__napi_complete(), called from napi_complete(), which process_backlog()
invokes after re-enabling interrupts, just before returning.

When netif_rx() calls napi_schedule(), that routine checks the
NAPI_STATE_SCHED bit flag by calling napi_schedule_prep(), which calls
test_and_set_bit(); if the flag is already set, napi_schedule() returns
without doing anything.  Note that test_and_set_bit() implements an
atomic "test and set" operation that is commonly used as the foundation
of mutual exclusion mechanisms.

So it is safe for netif_rx() to call napi_schedule(), whether or not the
receive queue is empty.

With respect to performance, the following behavior was observed using an
ARM CPU and cross-compiling for ARM using gcc.  Removing the test of
whether the receive queue is empty saves 2 assembly instructions.  The
addition of calling napi_schedule(), when that routine does nothing
(because the NAPI_STATE_SCHED flag is set) adds 11 assembly instructions.
Thus, there is a net addition of 9 assembly instructions.  Furthermore,
we observe in our testing that even under heavy network load, the receive
queue is empty on at least 95% of netif_rx() invocations.  The 9 assembly
instructions amortized over all netif_rx() invocations yield an average
performance penalty of less than 1 assembly instruction per invocation of
netif_rx().  Finally, the test-and-set operation does not entail testing
common memory shared by multiple CPU cores (which can be costly), because
the data structures in question are specific to a single CPU by design.

So calling napi_schedule() at every invocation of netif_rx() does not
impair performance.

CRs-fixed: 184599
Signed-off-by: Jim Harford <c_jharfo@quicinc.com>
Signed-off-by: Daniel Walker <dwalker@codeaurora.org>
---
 net/core/dev.c |   11 +++--------
 1 files changed, 3 insertions(+), 8 deletions(-)

Comments

stephen hemminger March 5, 2010, 9:21 p.m. UTC | #1
On Fri, 05 Mar 2010 11:34:59 -0800
Daniel Walker <dwalker@codeaurora.org> wrote:

> Fixes a race condition on the networking receive path that causes all
> received packets to be dropped after 15-60 minutes of heavy network usage.
> Function process_backlog() empties the receive queue, re-enables
> interrupts, then "completes" the softIRQ.  This provides a time window for
> netif_rx() to execute (in IRQ context) and enqueue a received packet
> without re-scheduling the softIRQ.  After this, the receive queue is never
> processed and the system eventually begins to drop all received packets.

I wonder why this hasn't shown up before?

Where exactly is the window between empty process_backlog and netif_rx?

Maybe it is ARM specific behavior of softirq?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Harford, Jim March 5, 2010, 11:41 p.m. UTC | #2
It appears that this patch is no longer necessary.  It was made against 2.6.29, but I see that more recent kernel versions don't have the problem code.  For a more detailed explanation, see below.  All code references are in routine process_backlog(), file net/core/dev.c.

In kernel version 2.6.27.45, __napi_complete() is invoked BEFORE interrupts are re-enabled.  Thus, the receive queue status is cleaned up before another interrupt (due to a receive packet) can occur.  This is good design.

In kernel version 2.6.29, git commit ID 303c6a025 inverts this ordering.  Routine napi_complete() is invoked AFTER interrupts are re-enabled.  We observed interrupts taken after interrupts were re-enabled, but before napi_complete cleaned up the receive queue.  This would then shut down the processing of subsequent received packets.

In kernel versions 2.6.30.10 and later, the sequence of operations is identical to 2.6.27.45, so there is no problem.

Jim Harford
Qualcomm Innovation Center


Daniel Walker March 5, 2010, 11:44 p.m. UTC | #3
On Fri, 2010-03-05 at 15:41 -0800, Harford, Jim wrote:
> It appears that this patch is no longer necessary.  It was made against 2.6.29, but I see that more recent kernel versions don't have the problem code.  For a more detailed explanation, see below.  All code references are in routine process_backlog(), file net/core/dev.c.
> 
> In kernel version 2.6.27.45, __napi_complete() is invoked BEFORE interrupts are re-enabled.  Thus, the receive queue status is cleaned up before another interrupt (due to a receive packet) can occur.  This is good design.
> 
> In kernel version 2.6.29, git commit ID 303c6a025 inverts this ordering.  Routine napi_complete() is invoked AFTER interrupts are re-enabled.  We observed interrupts taken after interrupts were re-enabled, but before napi_complete cleaned up the receive queue.  This would then shut down the processing of subsequent received packets.
> 
> In kernel versions 2.6.30.10 and later, the sequence of operations is identical to 2.6.27.45, so there is no problem.
> 
> Jim Harford
> Qualcomm Innovation Center

Ok, I guess we can ignore this one then.

Daniel

stephen hemminger March 5, 2010, 11:48 p.m. UTC | #4
On Fri, 05 Mar 2010 15:44:27 -0800
Daniel Walker <dwalker@codeaurora.org> wrote:

> On Fri, 2010-03-05 at 15:41 -0800, Harford, Jim wrote:
> > It appears that this patch is no longer necessary.  It was made against 2.6.29, but I see that more recent kernel versions don't have the problem code.  For a more detailed explanation, see below.  All code references are in routine process_backlog(), file net/core/dev.c.
> > 
> > In kernel version 2.6.27.45, __napi_complete() is invoked BEFORE interrupts are re-enabled.  Thus, the receive queue status is cleaned up before another interrupt (due to a receive packet) can occur.  This is good design.
> > 
> > In kernel version 2.6.29, git commit ID 303c6a025 inverts this ordering.  Routine napi_complete() is invoked AFTER interrupts are re-enabled.  We observed interrupts taken after interrupts were re-enabled, but before napi_complete cleaned up the receive queue.  This would then shut down the processing of subsequent received packets.
> > 
> > In kernel versions 2.6.30.10 and later, the sequence of operations is identical to 2.6.27.45, so there is no problem.
> > 
> > Jim Harford
> > Qualcomm Innovation Center
> 
> Ok, I guess we can ignore this one then.
> 
> Daniel
> 

It was fixed by:

commit 8f1ead2d1a626ed0c85b3d2c2046a49081d5933f
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Mar 26 00:59:10 2009 -0700

    GRO: Disable GRO on legacy netif_rx path
    
    When I fixed the GRO crash in the legacy receive path I used
    napi_complete to replace __napi_complete.  Unfortunately they're
    not the same when NETPOLL is enabled, which may result in us
    not calling __napi_complete at all.
    
    What's more, we really do need to keep the __napi_complete call
    within the IRQ-off section since in theory an IRQ can occur in
    between and fill up the backlog to the maximum, causing us to
    lock up.
    
    Since we can't seem to find a fix that works properly right now,
    this patch reverts all the GRO support from the netif_rx path.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index be9924f..43161c6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2150,15 +2150,10 @@  int netif_rx(struct sk_buff *skb)
 
 	__get_cpu_var(netdev_rx_stat).total++;
 	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
+		__skb_queue_tail(&queue->input_pkt_queue, skb);
 		napi_schedule(&queue->backlog);
-		goto enqueue;
+		local_irq_restore(flags);
+		return NET_RX_SUCCESS;
 	}
 
 	__get_cpu_var(netdev_rx_stat).dropped++;