[tbench,regression,fixes] : digging out smelly deadmen.

Message ID	490AE1CD.9040207@cosmosbay.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Message-ID: <490AE1CD.9040207@cosmosbay.com> Date: Fri, 31 Oct 2008 11:45:33 +0100 From: Eric Dumazet <dada1@cosmosbay.com> User-Agent: Thunderbird 2.0.0.17 (Windows/20080914) MIME-Version: 1.0 To: David Miller <davem@davemloft.net> CC: ilpo.jarvinen@helsinki.fi, shemminger@vyatta.com, zbr@ioremap.net, rjw@sisk.pl, mingo@elte.hu, s0mbre@tservice.net.ru, a.p.zijlstra@chello.nl, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, efault@gmx.de, akpm@linux-foundation.org Subject: Re: [tbench regression fixes]: digging out smelly deadmen. References: <Pine.LNX.4.64.0810302047030.22084@wrl-59.cs.helsinki.fi> <20081031.005219.141937694.davem@davemloft.net> <Pine.LNX.4.64.0810311119510.7072@wrl-59.cs.helsinki.fi> <20081031.025159.51432990.davem@davemloft.net> In-Reply-To: <20081031.025159.51432990.davem@davemloft.net> Content-Type: multipart/mixed; boundary="------------030404020908030700070809" Sender: netdev-owner@vger.kernel.org Precedence: bulk

Eric Dumazet Oct. 31, 2008, 10:45 a.m. UTC

David Miller a écrit :
> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 31 Oct 2008 11:40:16 +0200 (EET)
> 
>> Let me remind that it is just a single process, so no ping-pong & other 
>> lock related cache effects should play any significant role here, no? (I'm 
>> no expert though :-)).
> 
> Not locks or ping-pongs perhaps, I guess.  So it just sends and
> receives over a socket, implementing both ends of the communication
> in the same process?
> 
> If hash chain conflicts do happen for those 2 sockets, just traversing
> the chain 2 entries deep could show up.

tbench is very sensible to cache line ping-pongs (on SMP machines of course)

Just to prove my point, I coded the following patch and tried it
on a HP BL460c G1. This machine has 2 quad cores cpu 
(Intel(R) Xeon(R) CPU E5450  @3.00GHz)

tbench 8 went from 2240 MB/s to 2310 MB/s after this patch applied

[PATCH] net: Introduce netif_set_last_rx() helper

On SMP machine, loopback device (and possibly others net device)
should try to avoid dirty the memory cache line containing "last_rx"
field. Got 3% increase on tbench on a 8 cpus machine.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/loopback.c    |    2 +-
 include/linux/netdevice.h |   16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

Ilpo Järvinen Oct. 31, 2008, 11:01 a.m. UTC | #1

On Fri, 31 Oct 2008, Eric Dumazet wrote:

> David Miller a écrit :
> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Fri, 31 Oct 2008 11:40:16 +0200 (EET)
> > 
> > > Let me remind that it is just a single process, so no ping-pong & other
> > > lock related cache effects should play any significant role here, no? (I'm
> > > no expert though :-)).
> > 
> > Not locks or ping-pongs perhaps, I guess.  So it just sends and
> > receives over a socket, implementing both ends of the communication
> > in the same process?
> > 
> > If hash chain conflicts do happen for those 2 sockets, just traversing
> > the chain 2 entries deep could show up.
> 
> tbench is very sensible to cache line ping-pongs (on SMP machines of course)

...Sorry to disappoint you but we were discussion there on my AIM9 
tcp_test results :-).

Eric Dumazet Oct. 31, 2008, 11:10 a.m. UTC | #2

Ilpo Järvinen a écrit :
> On Fri, 31 Oct 2008, Eric Dumazet wrote:
> 
>> David Miller a écrit :
>> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
>> > Date: Fri, 31 Oct 2008 11:40:16 +0200 (EET)
>> > > > Let me remind that it is just a single process, so no ping-pong 
>> & other
>> > > lock related cache effects should play any significant role here, 
>> no? (I'm
>> > > no expert though :-)).
>> > > Not locks or ping-pongs perhaps, I guess.  So it just sends and
>> > receives over a socket, implementing both ends of the communication
>> > in the same process?
>> > > If hash chain conflicts do happen for those 2 sockets, just 
>> traversing
>> > the chain 2 entries deep could show up.
>>
>> tbench is very sensible to cache line ping-pongs (on SMP machines of 
>> course)
> 
> ...Sorry to disappoint you but we were discussion there on my AIM9 
> tcp_test results :-).
> 

Well, before you added AIM9 on this topic, we were focusing on tbench :)

Sorry to disappoint you :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ilpo Järvinen Oct. 31, 2008, 11:15 a.m. UTC | #3

On Fri, 31 Oct 2008, Eric Dumazet wrote:

> Ilpo Järvinen a écrit :
> > On Fri, 31 Oct 2008, Eric Dumazet wrote:
> > 
> > > David Miller a écrit :
> > > > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > > > Date: Fri, 31 Oct 2008 11:40:16 +0200 (EET)
> > > > > > Let me remind that it is just a single process, so no ping-pong 
> >> & other
> > > > > lock related cache effects should play any significant role here, 
> > > no? (I'm
> > > > > no expert though :-)).
> > > > > Not locks or ping-pongs perhaps, I guess.  So it just sends and
> > > > receives over a socket, implementing both ends of the communication
> > > > in the same process?
> > > > > If hash chain conflicts do happen for those 2 sockets, just 
> > > traversing
> > > > the chain 2 entries deep could show up.
> > >
> > > tbench is very sensible to cache line ping-pongs (on SMP machines of
> > > course)
> > 
> > ...Sorry to disappoint you but we were discussion there on my AIM9 tcp_test
> > results :-).
> > 
> 
> Well, before you added AIM9 on this topic, we were focusing on tbench :)
>
> Sorry to disappoint you :)

It's all Stephen's fault, he added port randomization first... ;-)

stephen hemminger Oct. 31, 2008, 7:57 p.m. UTC | #4

On Fri, 31 Oct 2008 11:45:33 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> David Miller a écrit :
> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Fri, 31 Oct 2008 11:40:16 +0200 (EET)
> > 
> >> Let me remind that it is just a single process, so no ping-pong & other 
> >> lock related cache effects should play any significant role here, no? (I'm 
> >> no expert though :-)).
> > 
> > Not locks or ping-pongs perhaps, I guess.  So it just sends and
> > receives over a socket, implementing both ends of the communication
> > in the same process?
> > 
> > If hash chain conflicts do happen for those 2 sockets, just traversing
> > the chain 2 entries deep could show up.
> 
> tbench is very sensible to cache line ping-pongs (on SMP machines of course)
> 
> Just to prove my point, I coded the following patch and tried it
> on a HP BL460c G1. This machine has 2 quad cores cpu 
> (Intel(R) Xeon(R) CPU E5450  @3.00GHz)
> 
> tbench 8 went from 2240 MB/s to 2310 MB/s after this patch applied
> 
> [PATCH] net: Introduce netif_set_last_rx() helper
> 
> On SMP machine, loopback device (and possibly others net device)
> should try to avoid dirty the memory cache line containing "last_rx"
> field. Got 3% increase on tbench on a 8 cpus machine.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  drivers/net/loopback.c    |    2 +-
>  include/linux/netdevice.h |   16 ++++++++++++++++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> 

Why bother with last_rx at all on loopback.  I have been thinking
we should figure out a way to get rid of last_rx all together. It only
seems to be used by bonding, and the bonding driver could do the calculation
in its receive handling.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Oct. 31, 2008, 8:10 p.m. UTC | #5

On Fri, Oct 31, 2008 at 12:57:13PM -0700, Stephen Hemminger (shemminger@vyatta.com) wrote:
> Why bother with last_rx at all on loopback.  I have been thinking
> we should figure out a way to get rid of last_rx all together. It only
> seems to be used by bonding, and the bonding driver could do the calculation
> in its receive handling.

Not related to the regression: bug will be just papered out by this
changes. Having bonding on loopback is somewhat strange idea, but still
this kind of changes is an attempt to make a good play in the bad game:
this loopback-only optimization does not fix the problem.

Eric Dumazet Oct. 31, 2008, 9:03 p.m. UTC | #6

Evgeniy Polyakov a écrit :
> On Fri, Oct 31, 2008 at 12:57:13PM -0700, Stephen Hemminger (shemminger@vyatta.com) wrote:
>> Why bother with last_rx at all on loopback.  I have been thinking
>> we should figure out a way to get rid of last_rx all together. It only
>> seems to be used by bonding, and the bonding driver could do the calculation
>> in its receive handling.
> 
> Not related to the regression: bug will be just papered out by this
> changes. Having bonding on loopback is somewhat strange idea, but still
> this kind of changes is an attempt to make a good play in the bad game:
> this loopback-only optimization does not fix the problem.
> 

Just to be clear, this change was not meant to be committed.
It already was rejected by David some years ago (2005, and 2006)

http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html

If you read my mail, I was *only* saying that tbench results can be sensible to
cache line ping pongs. tbench is a crazy benchmark, and only is a crazy benchmark.

Optimizing linux for tbench sake would be .... crazy ?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Evgeniy Polyakov Oct. 31, 2008, 9:18 p.m. UTC | #7

Hi Eric.

On Fri, Oct 31, 2008 at 10:03:00PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Just to be clear, this change was not meant to be committed.
> It already was rejected by David some years ago (2005, and 2006)
> 
> http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html
> 
> If you read my mail, I was *only* saying that tbench results can be 
> sensible to
> cache line ping pongs. tbench is a crazy benchmark, and only is a crazy 
> benchmark.

No problem Eric, I just pointed that this particular case is rather
fluffy, which really does not fix anything. It improves the case, but
the way it does it, is not the right one imho.
We would definitely want to eliminate assignment of global constantly
updated variables in the pathes where it is not required, but in a
way which does improve the design and implementation, but not to
hide some other problem.

Tbench is, well, as is it is quite usual network server :)
Dbench side is rather non-optimized, but still it is quite common
pattern of small-sized IO. Anyway, optimizing for some kind of the
workload tends to force other side to become slower, so I agree of
course that any narrow-viewed optimizations are bad, and instead we
should focus on searching error patter more widerspread.

David Miller Oct. 31, 2008, 11:51 p.m. UTC | #8

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 31 Oct 2008 22:03:00 +0100

> Evgeniy Polyakov a écrit :
> > On Fri, Oct 31, 2008 at 12:57:13PM -0700, Stephen Hemminger (shemminger@vyatta.com) wrote:
> >> Why bother with last_rx at all on loopback.  I have been thinking
> >> we should figure out a way to get rid of last_rx all together. It only
> >> seems to be used by bonding, and the bonding driver could do the calculation
> >> in its receive handling.
> > Not related to the regression: bug will be just papered out by this
> > changes. Having bonding on loopback is somewhat strange idea, but still
> > this kind of changes is an attempt to make a good play in the bad game:
> > this loopback-only optimization does not fix the problem.
> 
> Just to be clear, this change was not meant to be committed.
> It already was rejected by David some years ago (2005, and 2006)
> 
> http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html

However, I do like Stephen's suggestion that maybe we can get rid of
this ->last_rx thing by encapsulating the logic completely in the
bonding driver.

> If you read my mail, I was *only* saying that tbench results can be sensible to
> cache line ping pongs. tbench is a crazy benchmark, and only is a crazy benchmark.
> 
> Optimizing linux for tbench sake would be .... crazy ?

Unlike dbench I think tbench is worth cranking up as much as possible.

It doesn't have a huge memory working set, it just writes mostly small
messages over a TCP socket back and forth, and does a lot of blocking

And I think we'd like all of those operating to run as fast as possible.

When Tridge first wrote tbench I would see the expected things at the
top of the profiles.  Things like tcp_ack(), copy to/from user, and
perhaps SLAB.

Things have changed considerably.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

stephen hemminger Oct. 31, 2008, 11:56 p.m. UTC | #9

On Fri, 31 Oct 2008 16:51:44 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 31 Oct 2008 22:03:00 +0100
> 
> > Evgeniy Polyakov a écrit :
> > > On Fri, Oct 31, 2008 at 12:57:13PM -0700, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > >> Why bother with last_rx at all on loopback.  I have been thinking
> > >> we should figure out a way to get rid of last_rx all together. It only
> > >> seems to be used by bonding, and the bonding driver could do the calculation
> > >> in its receive handling.
> > > Not related to the regression: bug will be just papered out by this
> > > changes. Having bonding on loopback is somewhat strange idea, but still
> > > this kind of changes is an attempt to make a good play in the bad game:
> > > this loopback-only optimization does not fix the problem.
> > 
> > Just to be clear, this change was not meant to be committed.
> > It already was rejected by David some years ago (2005, and 2006)
> > 
> > http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html
> 
> However, I do like Stephen's suggestion that maybe we can get rid of
> this ->last_rx thing by encapsulating the logic completely in the
> bonding driver.

Since bonding driver doesn't actually see the rx packets, that isn't
really possible.  But it would be possible to change last_rx from a variable
to an function pointer, so that device's could apply other logic to derive
the last value.  One example would be to keep it per cpu and then take the
maximum.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[tbench,regression,fixes] : digging out smelly deadmen.

Commit Message

Comments

Patch