From patchwork Fri Mar 5 19:34:59 2010
X-Patchwork-Submitter: Daniel Walker
X-Patchwork-Id: 47016
X-Patchwork-Delegate: davem@davemloft.net
Subject: [PATCH] net: Fix race condition on receive path.
From: Daniel Walker
To: "David S. Miller"
Cc: Jim Harford, netdev@vger.kernel.org
Date: Fri, 05 Mar 2010 11:34:59 -0800
Message-ID: <1267817699.30393.8.camel@c-dwalke-linux.qualcomm.com>
X-Mailing-List: netdev@vger.kernel.org

From: Jim Harford

Fixes a race condition on the networking receive path that causes all
received packets to be dropped after 15-60 minutes of heavy network
usage.

Function process_backlog() empties the receive queue, re-enables
interrupts, and then "completes" the softIRQ. This provides a time
window in which netif_rx() can execute (in IRQ context) and enqueue a
received packet without re-scheduling the softIRQ. After this, the
receive queue is never processed again and the system eventually begins
to drop all received packets.

The fix has netif_rx() call napi_schedule() whether or not the receive
queue is empty. Two questions must be addressed:

1) Is the fix correct? That is, is it safe to always call
   napi_schedule()?
2) Does the fix degrade performance?

With respect to correctness, the NAPI softIRQ routine must NOT be
scheduled if it is already pending or running. This is controlled by
the NAPI_STATE_SCHED bit flag in the "state" field of struct
napi_struct. When this flag is set, the softIRQ routine is either
already pending or currently running. When the softIRQ finishes, the
bit is cleared via __napi_complete(), which is called by
napi_complete(), which process_backlog() calls after re-enabling
interrupts and just before returning. When netif_rx() calls
napi_schedule(), that routine checks the NAPI_STATE_SCHED flag by
calling napi_schedule_prep(), which uses test_and_set_bit(); if the
flag is already set, napi_schedule() returns without doing anything.
Note that test_and_set_bit() is an atomic test-and-set operation of the
kind commonly used as the foundation of mutual-exclusion mechanisms.
So it is safe for netif_rx() to call napi_schedule(), whether or not
the receive queue is empty.
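For reference, here is a simplified sketch of that guard, modeled on the
2.6.33-era helpers in include/linux/netdevice.h (the real
napi_schedule_prep() also checks for a pending NAPI disable, which is
omitted here):

static inline int napi_schedule_prep(struct napi_struct *n)
{
	/* Atomically sets NAPI_STATE_SCHED; returns non-zero only if
	 * the bit was previously clear, i.e. the backlog softIRQ is
	 * not already pending or running.
	 */
	return !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
}

static inline void napi_schedule(struct napi_struct *n)
{
	/* No-op when the softIRQ is already scheduled. */
	if (napi_schedule_prep(n))
		__napi_schedule(n);
}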
With respect to performance, the following behavior was observed using
an ARM CPU and cross-compiling for ARM using gcc. Removing the test of
whether the receive queue is empty saves 2 assembly instructions. The
added call to napi_schedule(), when that routine does nothing (because
the NAPI_STATE_SCHED flag is already set), costs 11 assembly
instructions, for a net addition of 9 assembly instructions on that
path. Furthermore, we observe in our testing that, even under heavy
network load, the receive queue is empty on at least 95% of netif_rx()
invocations. Amortized over all netif_rx() invocations, the 9
additional assembly instructions yield an average penalty of less than
1 assembly instruction per invocation (at most 0.05 x 9 = 0.45).
Finally, the test-and-set operation does not operate on memory shared
by multiple CPU cores (which can be costly), because the data
structures in question are per-CPU by design. So calling
napi_schedule() on every invocation of netif_rx() does not impair
performance.

CRs-fixed: 184599
Signed-off-by: Jim Harford
Signed-off-by: Daniel Walker
---
 net/core/dev.c |   11 +++--------
 1 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be9924f..43161c6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2150,15 +2150,10 @@ int netif_rx(struct sk_buff *skb)
 
 	__get_cpu_var(netdev_rx_stat).total++;
 	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
+		__skb_queue_tail(&queue->input_pkt_queue, skb);
 		napi_schedule(&queue->backlog);
-		goto enqueue;
+		local_irq_restore(flags);
+		return NET_RX_SUCCESS;
 	}
 
 	__get_cpu_var(netdev_rx_stat).dropped++;
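For clarity, with this patch applied the enqueue path of netif_rx()
reads roughly as follows (a sketch assembled from the hunk above and
its unchanged surroundings in the 2.6.33-era net/core/dev.c):

	__get_cpu_var(netdev_rx_stat).total++;
	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
		/* Always enqueue, and always ask for the backlog
		 * softIRQ; napi_schedule() is a no-op if
		 * NAPI_STATE_SCHED is already set, so no double
		 * scheduling can occur.
		 */
		__skb_queue_tail(&queue->input_pkt_queue, skb);
		napi_schedule(&queue->backlog);
		local_irq_restore(flags);
		return NET_RX_SUCCESS;
	}

	__get_cpu_var(netdev_rx_stat).dropped++;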