From patchwork Wed May 6 03:36:08 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 26893
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A0105A8.3060707@cosmosbay.com>
Date: Wed, 06 May 2009 05:36:08 +0200
From: Eric Dumazet
To: Vladimir Ivashchenko
CC: netdev@vger.kernel.org
Subject: Re: bond + tc regression ?
References: <1241538358.27647.9.camel@hazard2.francoudi.com> <4A0069F3.5030607@cosmosbay.com> <20090505174135.GA29716@francoudi.com> <4A008A72.6030607@cosmosbay.com> <20090505235008.GA17690@francoudi.com>
In-Reply-To: <20090505235008.GA17690@francoudi.com>

Vladimir Ivashchenko wrote:
> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
>
>>> I have tried with IRQs bound to one CPU per NIC. Same result.
>>
>> Did you check with "grep eth /proc/interrupts" that your affinity
>> settings were indeed taken into account?
>>
>> You should use the same CPU for eth0 and eth2 (bond0),
>> and another CPU for eth1 and eth3 (bond1).
>
> Ok, the best result is when I assign all IRQs to the same CPU. Zero drops.
>
> When I bind slaves of bond interfaces to the same CPU, I start to get
> some drops, but much less than before. I didn't play with combinations.
>
> My problem is, after applying your accounting patch below, one of my
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't
> last me very long; load balancing across cores is needed.
>
> Is there any way to at least balance individual NICs on a per-core basis?

The problem with this setup is that you have four NICs, but two logical
devices (bond0 & bond1) and a central HTB thing. This essentially makes
all flows go through the same locks (some rwlocks guarding the bonding
driver, and others guarding the HTB structures).

Also, when a CPU receives a frame on ethX, it has to forward it on ethY,
and another lock guards access to the TX queue of the ethY device. If
another CPU receives a frame on ethZ and wants to forward it to the ethY
device, that CPU will need the same locks, and everything slows down.

I am pretty sure you could get good results choosing two CPUs sharing
the same L2 cache.
The L2 on your CPU is 6MB.

Another point would be to carefully choose the size of the RX rings on
the ethX devices. You could try to *reduce* them so that the number of
in-flight skbs is small enough that everything fits in this 6MB cache.

The problem is not really CPU power, but RAM bandwidth. Having two cores
instead of one attached to one central memory bank won't increase RAM
bandwidth, but reduce it. And making several cores compete for locks on
this RAM only slows down processing.

The only real fix would be to change bonding so that the driver uses RCU
instead of rwlocks, but that is probably a complex task. Multiple CPUs
accessing bonding structures could then share memory structures without
dirtying them and ping-ponging cache lines.

Ah, I forgot about one patch that could help your setup too (if using
more than one CPU on NIC IRQs, of course), queued for 2.6.31
(commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0).

You could post oprofile results to help us find other hot spots.

[PATCH] net: netif_tx_queue_stopped too expensive

netif_tx_queue_stopped(txq) is false most of the time. Yet it is very
expensive on SMP:

static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
{
	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}

I saw this while hunting with oprofile, in the bnx2 driver's bnx2_tx_int().

We probably should split "struct netdev_queue" in two parts, one of them
being read mostly.

__netif_tx_lock() touches _xmit_lock & xmit_lock_owner; these deserve a
separate cache line.
Signed-off-by: Eric Dumazet

---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..1caaebb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -447,12 +447,18 @@ enum netdev_queue_state_t
 };

 struct netdev_queue {
+/*
+ * read mostly part
+ */
 	struct net_device	*dev;
 	struct Qdisc		*qdisc;
 	unsigned long		state;
-	spinlock_t		_xmit_lock;
-	int			xmit_lock_owner;
 	struct Qdisc		*qdisc_sleeping;
+/*
+ * write mostly part
+ */
+	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
+	int			xmit_lock_owner;
 } ____cacheline_aligned_in_smp;