From patchwork Wed May 6 03:36:08 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 26893
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A0105A8.3060707@cosmosbay.com>
Date: Wed, 06 May 2009 05:36:08 +0200
From: Eric Dumazet
To: Vladimir Ivashchenko
CC: netdev@vger.kernel.org
Subject: Re: bond + tc regression ?
References: <1241538358.27647.9.camel@hazard2.francoudi.com> <4A0069F3.5030607@cosmosbay.com> <20090505174135.GA29716@francoudi.com> <4A008A72.6030607@cosmosbay.com> <20090505235008.GA17690@francoudi.com>
In-Reply-To: <20090505235008.GA17690@francoudi.com>

Vladimir Ivashchenko wrote:
> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
>
>>> I have tried with IRQs bound to one CPU per NIC. Same result.
>>
>> Did you check with "grep eth /proc/interrupts" that your affinity
>> settings were indeed taken into account?
>>
>> You should use the same CPU for eth0 and eth2 (bond0),
>> and another CPU for eth1 and eth3 (bond1).
>
> Ok, the best result is when I assign all IRQs to the same CPU. Zero drops.
>
> When I bind slaves of bond interfaces to the same CPU, I start to get
> some drops, but much less than before. I didn't play with combinations.
>
> My problem is, after applying your accounting patch below, one of my
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't
> last me very long; load balancing across cores is needed.
>
> Is there any way to at least balance individual NICs on a per-core basis?

The problem with this setup is that you have four NICs, but two logical
devices (bond0 & bond1) and a central HTB thing. This essentially makes
all flows go through the same locks (some rwlocks guarding the bonding
driver, and others guarding the HTB structures).

Also, when a CPU receives a frame on ethX, it has to forward it on ethY,
and another lock guards access to the TX queue of the ethY device. If
another CPU receives a frame on ethZ and wants to forward it to the ethY
device, that CPU will need the same locks, and everything slows down.

I am pretty sure you could get good results choosing two CPUs sharing
the same L2 cache.
The L2 on your CPU is 6MB.

Another point would be to carefully choose the size of the RX rings on
the ethX devices. You could try to *reduce* them so that the number of
in-flight skbs is small enough that everything fits in this 6MB cache.

The problem is not really CPU power, but RAM bandwidth. Having two cores
instead of one attached to one central memory bank won't increase RAM
bandwidth, but reduce it. And making several cores compete for locks on
this RAM only slows down processing.

The only real fix would be to change bonding so that the driver uses RCU
instead of rwlocks, but that is probably a complex task. Multiple CPUs
accessing bonding structures could then share memory structures without
dirtying them and ping-ponging cache lines.

Ah, I forgot about one patch that could help your setup too (if using
more than one CPU on NIC IRQs, of course), queued for 2.6.31
(commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0).

You could post oprofile results to help us find other hot spots.

[PATCH] net: netif_tx_queue_stopped too expensive

netif_tx_queue_stopped(txq) is false most of the time. Yet it is very
expensive on SMP:

static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
{
	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}

I saw this while hunting with oprofile, in the bnx2 driver's bnx2_tx_int().

We probably should split "struct netdev_queue" in two parts, one of them
being read mostly.

__netif_tx_lock() touches _xmit_lock & xmit_lock_owner; these deserve a
separate cache line.
Signed-off-by: Eric Dumazet

---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..1caaebb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -447,12 +447,18 @@ enum netdev_queue_state_t
 };

 struct netdev_queue {
+/*
+ * read mostly part
+ */
 	struct net_device	*dev;
 	struct Qdisc		*qdisc;
 	unsigned long		state;
-	spinlock_t		_xmit_lock;
-	int			xmit_lock_owner;
 	struct Qdisc		*qdisc_sleeping;
+/*
+ * write mostly part
+ */
+	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
+	int			xmit_lock_owner;
 } ____cacheline_aligned_in_smp;