Patchwork netfilter: finer grained nf_conn locking

login
register
mail settings
Submitter Eric Dumazet
Date March 28, 2009, 4:55 p.m.
Message ID <49CE568A.9090104@cosmosbay.com>
Download mbox | patch
Permalink /patch/25260/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - March 28, 2009, 4:55 p.m.
Eric Dumazet a écrit :
> Patrick McHardy a écrit :
>> Stephen Hemminger wrote:
>>
>>> @@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
>>>  
>>>  struct ip_ct_tcp
>>>  {
>>> +    spinlock_t    lock;
>>>      struct ip_ct_tcp_state seen[2];    /* connection parameters per
>>> direction */
>>>      u_int8_t    state;        /* state of the connection (enum
>>> tcp_conntrack) */
>>>      /* For detecting stale connections */
>> Eric already posted a patch to use an array of locks, which is
>> a better approach IMO since it keeps the size of the conntrack
>> entries down.
> 
> Yes, we probably can use an array for short lived lock sections.
> 
> The extra cost to compute the lock address is quite small, if
> the array size is known at compile time.
> 
> Stephen we might also use this same array to protect the nf_conn_acct_find(ct)
> thing as well (I am referring to your RFT 3/4 patch :
> Use mod_timer_noact to remove nf_conntrack_lock)
> 
> Here is a repost of patch Patrick is referring to :
> 
> 
> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
> 
> TCP connection tracking suffers of huge contention on a global rwlock,
> used to protect tcp conntracking state.
> 
> As each tcp conntrack state have no relations between each others, we
> can switch to fine grained lock. Using an array of spinlocks avoids
> enlarging size of connection tracking structures, yet giving reasonable
> fanout.
> 
> tcp_print_conntrack() doesnt need to lock anything to read
> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.
> 
> nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly
> 

Hi Patrick

Apparently we could not finish the removal of tcp_lock for 2.6.30 :(

Stephen suggested using a 4 bytes hole in struct nf_conntrack,
but this is ok only if sizeof(spinlock_t) <= 4 and 64 bit arches.

We could do an hybrid thing : use nf_conn.ct_general.lock if 64 bit arches
and sizeof(spinlock_t) <= 4.

Other cases would use a carefuly sized array of spinlocks...

Thank you

[PATCH] netfilter: finer grained nf_conn locking

Introduction of fine grained lock infrastructure for nf_conn.
If possible, we use a 32bit hole on 64bit arches.
Else we use a global array of hashed spinlocks, so we dont
change size of "struct nf_conn"

Get rid of central tcp_lock rwlock used in TCP conntracking
using this infrastructure for better performance on SMP.

"tbench 8" results on my 8 core machine (32bit kernel, with
conntracking on) : 2319 MB/s instead of 2284 MB/s


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/skbuff.h                       |    9 ++-
 include/net/netfilter/nf_conntrack_l4proto.h |    2
 net/netfilter/nf_conntrack_core.c            |   47 +++++++++++++++++
 net/netfilter/nf_conntrack_netlink.c         |    4 -
 net/netfilter/nf_conntrack_proto_tcp.c       |   35 +++++-------
 5 files changed, 74 insertions(+), 23 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - March 29, 2009, 12:48 a.m.
On Sat, 28 Mar 2009 17:55:38 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Eric Dumazet a écrit :
> > Patrick McHardy a écrit :
> >> Stephen Hemminger wrote:
> >>
> >>> @@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
> >>>  
> >>>  struct ip_ct_tcp
> >>>  {
> >>> +    spinlock_t    lock;
> >>>      struct ip_ct_tcp_state seen[2];    /* connection parameters per
> >>> direction */
> >>>      u_int8_t    state;        /* state of the connection (enum
> >>> tcp_conntrack) */
> >>>      /* For detecting stale connections */
> >> Eric already posted a patch to use an array of locks, which is
> >> a better approach IMO since it keeps the size of the conntrack
> >> entries down.
> > 
> > Yes, we probably can use an array for short lived lock sections.

I am not a fan of the array of locks. Sizing it is awkward and
it is vulnerable to hash collisions. Let's see if there is another
better way.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - March 30, 2009, 6:57 p.m.
Eric Dumazet wrote:
> Hi Patrick
> 
> Apparently we could not finish the removal of tcp_lock for 2.6.30 :(
> 
> Stephen suggested using a 4 bytes hole in struct nf_conntrack,
> but this is ok only if sizeof(spinlock_t) <= 4 and 64 bit arches.
> 
> We could do an hybrid thing : use nf_conn.ct_general.lock if 64 bit arches
> and sizeof(spinlock_t) <= 4.
> 
> Other cases would use a carefuly sized array of spinlocks...
> 
> Thank you
> 
> [PATCH] netfilter: finer grained nf_conn locking
> 
> Introduction of fine grained lock infrastructure for nf_conn.
> If possible, we use a 32bit hole on 64bit arches.
> Else we use a global array of hashed spinlocks, so we dont
> change size of "struct nf_conn"
> 
> Get rid of central tcp_lock rwlock used in TCP conntracking
> using this infrastructure for better performance on SMP.
> 
> "tbench 8" results on my 8 core machine (32bit kernel, with
> conntracking on) : 2319 MB/s instead of 2284 MB/s

Is this an implicit request for me to try to resurrect the 32-core setup?

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 30, 2009, 7:20 p.m.
Rick Jones a écrit :
> Eric Dumazet wrote:
>> Hi Patrick
>>
>> Apparently we could not finish the removal of tcp_lock for 2.6.30 :(
>>
>> Stephen suggested using a 4 bytes hole in struct nf_conntrack,
>> but this is ok only if sizeof(spinlock_t) <= 4 and 64 bit arches.
>>
>> We could do an hybrid thing : use nf_conn.ct_general.lock if 64 bit
>> arches
>> and sizeof(spinlock_t) <= 4.
>>
>> Other cases would use a carefuly sized array of spinlocks...
>>
>> Thank you
>>
>> [PATCH] netfilter: finer grained nf_conn locking
>>
>> Introduction of fine grained lock infrastructure for nf_conn.
>> If possible, we use a 32bit hole on 64bit arches.
>> Else we use a global array of hashed spinlocks, so we dont
>> change size of "struct nf_conn"
>>
>> Get rid of central tcp_lock rwlock used in TCP conntracking
>> using this infrastructure for better performance on SMP.
>>
>> "tbench 8" results on my 8 core machine (32bit kernel, with
>> conntracking on) : 2319 MB/s instead of 2284 MB/s
> 
> Is this an implicit request for me to try to resurrect the 32-core setup?
> 

Not at all, just to keep you informed of work in progress :)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer - March 30, 2009, 7:38 p.m.
> Eric Dumazet wrote:
>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>  conntracking on) : 2319 MB/s instead of 2284 MB/s

How do you achieve this impressing numbers?
Is it against localhost? (10Gbit/s is max 1250 MB/s)

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 30, 2009, 7:54 p.m.
Jesper Dangaard Brouer a écrit :
> 
>> Eric Dumazet wrote:
>>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>>  conntracking on) : 2319 MB/s instead of 2284 MB/s
> 
> How do you achieve this impressing numbers?
> Is it against localhost? (10Gbit/s is max 1250 MB/s)
> 

tbench is a tcp test on localhost yes :)

Good to test tcp stack without going to NIC hardware


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - March 30, 2009, 7:57 p.m.
Stephen Hemminger a écrit :
> On Sat, 28 Mar 2009 17:55:38 +0100
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Eric Dumazet a écrit :
>>> Patrick McHardy a écrit :
>>>> Stephen Hemminger wrote:
>>>>
>>>>> @@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
>>>>>  
>>>>>  struct ip_ct_tcp
>>>>>  {
>>>>> +    spinlock_t    lock;
>>>>>      struct ip_ct_tcp_state seen[2];    /* connection parameters per
>>>>> direction */
>>>>>      u_int8_t    state;        /* state of the connection (enum
>>>>> tcp_conntrack) */
>>>>>      /* For detecting stale connections */
>>>> Eric already posted a patch to use an array of locks, which is
>>>> a better approach IMO since it keeps the size of the conntrack
>>>> entries down.
>>> Yes, we probably can use an array for short lived lock sections.
> 
> I am not a fan of the array of locks. Sizing it is awkward and
> it is vulnerable to hash collisions. Let's see if there is another
> better way.

On normal machines, (no debugging spinlocks), patch uses an embedded
spinlock. We probably can use this even on 32bit kernels, considering
previous patch removed the rcu_head (8 bytes on 32bit arches) from
nf_conn :)

if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
Adding a spinlock on each nf_conn would be too expensive. In this
case, an array of spinlock is a good compromise, as done in
IP route cache, tcp ehash, ...

I agree sizing of this hash table is not pretty, and should be
a generic kernel service (I wanted such service for futexes for example)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - March 30, 2009, 8:05 p.m.
On Mon, 30 Mar 2009 21:57:15 +0200
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Stephen Hemminger a écrit :
> > On Sat, 28 Mar 2009 17:55:38 +0100
> > Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> >> Eric Dumazet a écrit :
> >>> Patrick McHardy a écrit :
> >>>> Stephen Hemminger wrote:
> >>>>
> >>>>> @@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
> >>>>>  
> >>>>>  struct ip_ct_tcp
> >>>>>  {
> >>>>> +    spinlock_t    lock;
> >>>>>      struct ip_ct_tcp_state seen[2];    /* connection parameters per
> >>>>> direction */
> >>>>>      u_int8_t    state;        /* state of the connection (enum
> >>>>> tcp_conntrack) */
> >>>>>      /* For detecting stale connections */
> >>>> Eric already posted a patch to use an array of locks, which is
> >>>> a better approach IMO since it keeps the size of the conntrack
> >>>> entries down.
> >>> Yes, we probably can use an array for short lived lock sections.
> > 
> > I am not a fan of the array of locks. Sizing it is awkward and
> > it is vulnerable to hash collisions. Let's see if there is another
> > better way.
> 
> On normal machines, (no debugging spinlocks), patch uses an embedded
> spinlock. We probably can use this even on 32bit kernels, considering
> previous patch removed the rcu_head (8 bytes on 32bit arches) from
> nf_conn :)
> 
> if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
> Adding a spinlock on each nf_conn would be too expensive. In this
> case, an array of spinlock is a good compromise, as done in
> IP route cache, tcp ehash, ...
> 
> I agree sizing of this hash table is not pretty, and should be
> a generic kernel service (I wanted such service for futexes for example)
> 

IMO having different locking based on lockdep and architecture is an invitation
to future obscure problems. Perhaps some other locking method or shrinking
ct entry would be better.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer - March 30, 2009, 8:34 p.m.
On Mon, 30 Mar 2009, Eric Dumazet wrote:

> Jesper Dangaard Brouer a écrit :
>>
>>> Eric Dumazet wrote:
>>>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>>>  conntracking on) : 2319 MB/s instead of 2284 MB/s
>>
>> How do you achieve this impressing numbers?
>> Is it against localhost? (10Gbit/s is max 1250 MB/s)
>>
>
> tbench is a tcp test on localhost yes :)

I see!

Using a Sun 10GbE NIC I was only getting a throughput of 556.86 MB/sec 
with 64 procs (between an AMD Phenom X4 and a Core i7).  (Not tuned multi 
queues yet ...)

Against localhost I'm getting (not with applied patch):

  1336.42 MB/sec on my AMD phenom X4 9950 Quad-Core Processor

  1552.81 MB/sec on my Core i7 920 (4 physical cores, plus 4 threads)

  2274.53 MB/sec on my dual CPU Xeon E5420 (8 cores)


> Good to test tcp stack without going to NIC hardware

Yes true, but this also stresses the process scheduler, I'm seeing around 
800.000 context switches per sec on the Dual CPU Xeon system.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Eric Dumazet - March 30, 2009, 8:41 p.m.
Jesper Dangaard Brouer a écrit :
> On Mon, 30 Mar 2009, Eric Dumazet wrote:
> 
>> Jesper Dangaard Brouer a écrit :
>>>
>>>> Eric Dumazet wrote:
>>>>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>>>>  conntracking on) : 2319 MB/s instead of 2284 MB/s
>>>
>>> How do you achieve this impressing numbers?
>>> Is it against localhost? (10Gbit/s is max 1250 MB/s)
>>>
>>
>> tbench is a tcp test on localhost yes :)
> 
> I see!
> 
> Using a Sun 10GbE NIC I was only getting a throughput of 556.86 MB/sec
> with 64 procs (between an AMD Phenom X4 and a Core i7).  (Not tuned
> multi queues yet ...)
> 
> Against localhost I'm getting (not with applied patch):
> 
>  1336.42 MB/sec on my AMD phenom X4 9950 Quad-Core Processor
> 
>  1552.81 MB/sec on my Core i7 920 (4 physical cores, plus 4 threads)

Strange results, compared to my E5420 (I thought i7 was faster ??)

> 
>  2274.53 MB/sec on my dual CPU Xeon E5420 (8 cores)

Yes, my dev machine is a dual E5420 (8 cores) at 3.00 GHz
gcc version here is 4.3.3

> 
> 
>> Good to test tcp stack without going to NIC hardware
> 
> Yes true, but this also stresses the process scheduler, I'm seeing
> around 800.000 context switches per sec on the Dual CPU Xeon system.
> 

Indeed, tbench is a mix of tcp and process scheduler test/bench



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer - March 30, 2009, 9:25 p.m.
On Mon, 30 Mar 2009, Eric Dumazet wrote:

> Jesper Dangaard Brouer a écrit :
>>
>> Against localhost I'm getting (not with applied patch):
>>
>>  1336.42 MB/sec on my AMD phenom X4 9950 Quad-Core Processor
>>
>>  1552.81 MB/sec on my Core i7 920 (4 physical cores, plus 4 threads)
>
> Strange results, compared to my E5420 (I thought i7 was faster ??)

I also though that i7 would be faster, but I think it can be explained by 
the i7 only has 4 real cores even though it shows 8 CPUs (due to 
hyperthreads).

>>  2274.53 MB/sec on my dual CPU Xeon E5420 (8 cores)

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Rick Jones - March 30, 2009, 10:44 p.m.
> Indeed, tbench is a mix of tcp and process scheduler test/bench

If I were inclined to run networking tests (eg netperf) over loopback and wanted 
to maximize the trips up and down the protocol stack while minimizing scheduler 
overheads, I might be inclinded to configure --enable-burst with netperf and then 
run N/2 concurrent instances of something like:

netperf -T M,N -t TCP_RR -l 30 -- -b 128 -D &

where M and N were chosen to have each netperf and netserver pair bound to a pair 
of suitable cores, and the value in the -b option wash picked to maximize the CPU 
utilization on those cores.  Then, in theory there would be little to no process 
to process context switching and presumably little in the way of scheduler effect.

What I don't know is if such a setup would have both netperf and netserver each 
consuming 100% of a CPU or if one of them might "peg" before the other.  If one 
did peg before the other, I might be inclined to switch to running N concurrent 
instances, with -T M to bind each  netperf/netserver pair to the same core. 
There would then be the process to process context switching though it would be 
limited to "related" processes.

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer - March 31, 2009, 7:52 p.m.
On Mon, 30 Mar 2009, Eric Dumazet wrote:
> Jesper Dangaard Brouer a écrit :
>> On Mon, 30 Mar 2009, Eric Dumazet wrote:
>>
>>> Jesper Dangaard Brouer a écrit :
>>>>
>>>>> Eric Dumazet wrote:
>>>>>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>>>>>  conntracking on) : 2319 MB/s instead of 2284 MB/s
>>>>
>>>> How do you achieve this impressing numbers?
>>>> Is it against localhost? (10Gbit/s is max 1250 MB/s)
>>>
>>> tbench is a tcp test on localhost yes :)
>>
>> I see!
>>
>> Using a Sun 10GbE NIC I was only getting a throughput of 556.86 MB/sec
>> with 64 procs (between an AMD Phenom X4 and a Core i7).  (Not tuned
>> multi queues yet ...)
>>
>> Against localhost I'm getting (not with applied patch):
>>
>>  1336.42 MB/sec on my AMD phenom X4 9950 Quad-Core Processor
>>
>>  1552.81 MB/sec on my Core i7 920 (4 physical cores, plus 4 threads)
>
> Strange results, compared to my E5420 (I thought i7 was faster ??)
>
>>  2274.53 MB/sec on my dual CPU Xeon E5420 (8 cores)

I tried using "netperf" instead of "tbench".

A netperf towards localhost (netperf -T 0,1 -l 120 -H localhost)
reveals that the Core i7 is still the fastest.

24218 Mbit/s  on Core i7 920

11684 Mbit/s  on Phenom X4

  8181 Mbit/s  on Xeon E5420

A note to Rick, the process "netperf" would use 100% CPU time and
"netserver" would only use 70%.

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
Eric Dumazet - March 31, 2009, 8:23 p.m.
Jesper Dangaard Brouer a écrit :
> 
> On Mon, 30 Mar 2009, Eric Dumazet wrote:
>> Jesper Dangaard Brouer a écrit :
>>> On Mon, 30 Mar 2009, Eric Dumazet wrote:
>>>
>>>> Jesper Dangaard Brouer a écrit :
>>>>>
>>>>>> Eric Dumazet wrote:
>>>>>>>  "tbench 8" results on my 8 core machine (32bit kernel, with
>>>>>>>  conntracking on) : 2319 MB/s instead of 2284 MB/s
>>>>>
>>>>> How do you achieve this impressing numbers?
>>>>> Is it against localhost? (10Gbit/s is max 1250 MB/s)
>>>>
>>>> tbench is a tcp test on localhost yes :)
>>>
>>> I see!
>>>
>>> Using a Sun 10GbE NIC I was only getting a throughput of 556.86 MB/sec
>>> with 64 procs (between an AMD Phenom X4 and a Core i7).  (Not tuned
>>> multi queues yet ...)
>>>
>>> Against localhost I'm getting (not with applied patch):
>>>
>>>  1336.42 MB/sec on my AMD phenom X4 9950 Quad-Core Processor
>>>
>>>  1552.81 MB/sec on my Core i7 920 (4 physical cores, plus 4 threads)
>>
>> Strange results, compared to my E5420 (I thought i7 was faster ??)
>>
>>>  2274.53 MB/sec on my dual CPU Xeon E5420 (8 cores)
> 
> I tried using "netperf" instead of "tbench".
> 
> A netperf towards localhost (netperf -T 0,1 -l 120 -H localhost)
> reveals that the Core i7 is still the fastest.
> 
> 24218 Mbit/s  on Core i7 920
> 
> 11684 Mbit/s  on Phenom X4
> 
>  8181 Mbit/s  on Xeon E5420

warning, because on my machine, 

cpu0 is on physical id 0, core 0
cpu1 is on physical id 1, core 0
cpu2 is on physical id 0, core 1
cpu3 is on physical id 1, core 1
cpu3 is on physical id 0, core 2
cpu4 is on physical id 1, core 2
cpu5 is on physical id 0, core 3
cpu6 is on physical id 1, core 3

So -T 0,1 might not do what you think...

$ netperf -T 0,1 -l 120 -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    120.00   7423.16
$ netperf -T 0,2 -l 120 -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    120.00   9360.17



> 
> A note to Rick, the process "netperf" would use 100% CPU time and
> "netserver" would only use 70%.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - March 31, 2009, 8:35 p.m.
> 
>>A note to Rick, the process "netperf" would use 100% CPU time and
>>"netserver" would only use 70%.

Thanks.  That makes a decent quantity of sense for a unidirectional test I 
suppose.  If you get bored you might see what happens with a burst 
request/response test.

./configure --enbable-burst
make netperf
netperf -T 0,2 -t TCP_RR -c -C -- -b 64 -D

altering the parms to either -T or -b as you find necessary/desireable.

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer - March 31, 2009, 8:52 p.m.
On Tue, 31 Mar 2009, Eric Dumazet wrote:

>> I tried using "netperf" instead of "tbench".
>>
>> A netperf towards localhost (netperf -T 0,1 -l 120 -H localhost)
>> reveals that the Core i7 is still the fastest.
>>
>> 24218 Mbit/s  on Core i7 920
>>
>> 11684 Mbit/s  on Phenom X4
>>
>>  8181 Mbit/s  on Xeon E5420
>
> warning, because on my machine,
>
> cpu0 is on physical id 0, core 0
> cpu1 is on physical id 1, core 0
> cpu2 is on physical id 0, core 1
> cpu3 is on physical id 1, core 1
> cpu3 is on physical id 0, core 2
> cpu4 is on physical id 1, core 2
> cpu5 is on physical id 0, core 3
> cpu6 is on physical id 1, core 3
>
> So -T 0,1 might not do what you think...

Thanks a lot, I didn't know that!
(hint: cat /proc/cpuinfo | egrep -e 'processor|physical id|core id')

I sort of got a "too large" speedup on the Xeon system:

I want to use core 0 and 1 on physical CPU 0, for some strange reason CPU0 
and CPU4 is these respective cores...

netperf -T 0,4 -l 60  -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    60.00    19797.42


Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy - April 6, 2009, 12:07 p.m.
Stephen Hemminger wrote:
> On Mon, 30 Mar 2009 21:57:15 +0200
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> On normal machines, (no debugging spinlocks), patch uses an embedded
>> spinlock. We probably can use this even on 32bit kernels, considering
>> previous patch removed the rcu_head (8 bytes on 32bit arches) from
>> nf_conn :)
>>
>> if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
>> Adding a spinlock on each nf_conn would be too expensive. In this
>> case, an array of spinlock is a good compromise, as done in
>> IP route cache, tcp ehash, ...
>>
>> I agree sizing of this hash table is not pretty, and should be
>> a generic kernel service (I wanted such service for futexes for example)
>>
> 
> IMO having different locking based on lockdep and architecture is an invitation
> to future obscure problems. Perhaps some other locking method or shrinking
> ct entry would be better.

I agree. Do people enable lockdep on production machines? Otherwise
I'd say the size increase doesn't really matter.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Engelhardt - April 6, 2009, 12:32 p.m.
On Monday 2009-04-06 14:07, Patrick McHardy wrote:
>>>
>>> if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
>>> Adding a spinlock on each nf_conn would be too expensive. In this
>>> case, an array of spinlock is a good compromise, as done in
>>> IP route cache, tcp ehash, ...
>>
>> IMO having different locking based on lockdep and architecture is an
>> invitation
>> to future obscure problems. Perhaps some other locking method or shrinking
>> ct entry would be better.
>
> I agree. Do people enable lockdep on production machines?

They do not.[1]


[1] http://git.opensuse.org/?p=people/jblunck/kernel-source.git;a=blob;f=config/x86_64/default;hb=SL111_BRANCH
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger - April 6, 2009, 5:25 p.m.
On Mon, 6 Apr 2009 14:32:54 +0200 (CEST)
Jan Engelhardt <jengelh@medozas.de> wrote:

> 
> On Monday 2009-04-06 14:07, Patrick McHardy wrote:
> >>>
> >>> if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
> >>> Adding a spinlock on each nf_conn would be too expensive. In this
> >>> case, an array of spinlock is a good compromise, as done in
> >>> IP route cache, tcp ehash, ...
> >>
> >> IMO having different locking based on lockdep and architecture is an
> >> invitation
> >> to future obscure problems. Perhaps some other locking method or shrinking
> >> ct entry would be better.
> >
> > I agree. Do people enable lockdep on production machines?
> 
> They do not.[1]
> 
> 
> [1] http://git.opensuse.org/?p=people/jblunck/kernel-source.git;a=blob;f=config/x86_64/default;hb=SL111_BRANCH

IMHO If they enable lockdep, they can expect that the cost is non-zero.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb1981f..c737f47 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -96,7 +96,14 @@  struct pipe_inode_info;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
-	atomic_t use;
+	atomic_t	use;
+#if BITS_PER_LONG == 64
+	/*
+	 * On 64bit arches, we might use this 32bit hole for spinlock
+	 * (if a spinlock_t fits)
+	 */
+	int		pad;
+#endif
 };
 #endif
 
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index ba32ed7..d66bea9 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -63,7 +63,7 @@  struct nf_conntrack_l4proto
 
 	/* convert protoinfo to nfnetink attributes */
 	int (*to_nlattr)(struct sk_buff *skb, struct nlattr *nla,
-			 const struct nf_conn *ct);
+			 struct nf_conn *ct);
 	/* Calculate protoinfo nlattr size */
 	int (*nlattr_size)(void);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 8020db6..408d287 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -32,6 +32,7 @@ 
 #include <linux/rculist_nulls.h>
 
 #include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_lock.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_expect.h>
@@ -523,6 +524,7 @@  struct nf_conn *nf_conntrack_alloc(struct net *net,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	nf_conn_lock_init(ct);
 	atomic_set(&ct->ct_general.use, 1);
 	ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple = *orig;
 	ct->tuplehash[IP_CT_DIR_REPLY].tuple = *repl;
@@ -1033,11 +1035,50 @@  void nf_conntrack_flush(struct net *net, u32 pid, int report)
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_flush);
 
+spinlock_t *nf_conn_hlocks __read_mostly;
+unsigned int nf_conn_hlocks_mask __read_mostly;
+
+static int nf_conn_hlocks_init(void)
+{
+	if (nf_conn_lock_type() == NF_CONN_LOCK_EXTERNAL) {
+		size_t sz;
+		int i;
+#if defined(CONFIG_PROVE_LOCKING)
+		unsigned int nr_slots = 256;
+#else
+		/* 4 nodes per cpu on VSMP, or 256 slots per cpu */
+		unsigned int nr_slots = max(256UL, ((4UL << INTERNODE_CACHE_SHIFT) /
+					  sizeof(spinlock_t)));
+		nr_slots = roundup_pow_of_two(num_possible_cpus() * nr_slots);
+#endif
+		sz = nr_slots * sizeof(spinlock_t);
+		if (sz > PAGE_SIZE)
+			nf_conn_hlocks = vmalloc(sz);
+		else
+			nf_conn_hlocks = kmalloc(sz, GFP_KERNEL);
+		if (!nf_conn_hlocks)
+			return -ENOMEM;
+		nf_conn_hlocks_mask = nr_slots - 1;
+		for (i = 0; i < nr_slots; i++)
+			spin_lock_init(nf_conn_hlocks + i);
+	}
+	return 0;
+}
+
+static void nf_conn_hlocks_fini(void)
+{
+	if (is_vmalloc_addr(nf_conn_hlocks))
+		vfree(nf_conn_hlocks);
+	else
+		kfree(nf_conn_hlocks);
+}
+
 static void nf_conntrack_cleanup_init_net(void)
 {
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
 	kmem_cache_destroy(nf_conntrack_cachep);
+	nf_conn_hlocks_fini();
 }
 
 static void nf_conntrack_cleanup_net(struct net *net)
@@ -1170,6 +1211,10 @@  static int nf_conntrack_init_init_net(void)
 	int max_factor = 8;
 	int ret;
 
+	ret = nf_conn_hlocks_init();
+	if (ret)
+		goto err_hlocks;
+
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
 	if (!nf_conntrack_htable_size) {
@@ -1217,6 +1262,8 @@  err_helper:
 err_proto:
 	kmem_cache_destroy(nf_conntrack_cachep);
 err_cache:
+	nf_conn_hlocks_fini();
+err_hlocks:
 	return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c6439c7..89ea035 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -144,7 +144,7 @@  nla_put_failure:
 }
 
 static inline int
-ctnetlink_dump_protoinfo(struct sk_buff *skb, const struct nf_conn *ct)
+ctnetlink_dump_protoinfo(struct sk_buff *skb, struct nf_conn *ct)
 {
 	struct nf_conntrack_l4proto *l4proto;
 	struct nlattr *nest_proto;
@@ -351,7 +351,7 @@  nla_put_failure:
 static int
 ctnetlink_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 		    int event, int nowait,
-		    const struct nf_conn *ct)
+		    struct nf_conn *ct)
 {
 	struct nlmsghdr *nlh;
 	struct nfgenmsg *nfmsg;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index b5ccf2b..bb5fc24 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -23,14 +23,13 @@ 
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_ipv6.h>
 #include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_lock.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_log.h>
 #include <net/netfilter/ipv4/nf_conntrack_ipv4.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
 
 /* "Be conservative in what you do,
     be liberal in what you accept from others."
@@ -300,9 +299,7 @@  static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
 {
 	enum tcp_conntrack state;
 
-	read_lock_bh(&tcp_lock);
 	state = ct->proto.tcp.state;
-	read_unlock_bh(&tcp_lock);
 
 	return seq_printf(s, "%s ", tcp_conntrack_names[state]);
 }
@@ -708,14 +705,14 @@  void nf_conntrack_tcp_update(const struct sk_buff *skb,
 
 	end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph);
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(nf_conn_lock_addr(ct));
 	/*
 	 * We have to worry for the ack in the reply packet only...
 	 */
 	if (after(end, ct->proto.tcp.seen[dir].td_end))
 		ct->proto.tcp.seen[dir].td_end = end;
 	ct->proto.tcp.last_end = end;
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(nf_conn_lock_addr(ct));
 	pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i "
 		 "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
 		 sender->td_end, sender->td_maxend, sender->td_maxwin,
@@ -824,7 +821,7 @@  static int tcp_packet(struct nf_conn *ct,
 	th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
 	BUG_ON(th == NULL);
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(nf_conn_lock_addr(ct));
 	old_state = ct->proto.tcp.state;
 	dir = CTINFO2DIR(ctinfo);
 	index = get_conntrack_index(th);
@@ -854,7 +851,7 @@  static int tcp_packet(struct nf_conn *ct,
 		        && ct->proto.tcp.last_index == TCP_RST_SET)) {
 			/* Attempt to reopen a closed/aborted connection.
 			 * Delete this connection and look up again. */
-			write_unlock_bh(&tcp_lock);
+			spin_unlock_bh(nf_conn_lock_addr(ct));
 
 			/* Only repeat if we can actually remove the timer.
 			 * Destruction may already be in progress in process
@@ -890,7 +887,7 @@  static int tcp_packet(struct nf_conn *ct,
 			 * that the client cannot but retransmit its SYN and
 			 * thus initiate a clean new session.
 			 */
-			write_unlock_bh(&tcp_lock);
+			spin_unlock_bh(nf_conn_lock_addr(ct));
 			if (LOG_INVALID(net, IPPROTO_TCP))
 				nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 					  "nf_ct_tcp: killing out of sync session ");
@@ -903,7 +900,7 @@  static int tcp_packet(struct nf_conn *ct,
 		ct->proto.tcp.last_end =
 		    segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th);
 
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(nf_conn_lock_addr(ct));
 		if (LOG_INVALID(net, IPPROTO_TCP))
 			nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 				  "nf_ct_tcp: invalid packet ignored ");
@@ -912,7 +909,7 @@  static int tcp_packet(struct nf_conn *ct,
 		/* Invalid packet */
 		pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
 			 dir, get_conntrack_index(th), old_state);
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(nf_conn_lock_addr(ct));
 		if (LOG_INVALID(net, IPPROTO_TCP))
 			nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 				  "nf_ct_tcp: invalid state ");
@@ -943,7 +940,7 @@  static int tcp_packet(struct nf_conn *ct,
 
 	if (!tcp_in_window(ct, &ct->proto.tcp, dir, index,
 			   skb, dataoff, th, pf)) {
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(nf_conn_lock_addr(ct));
 		return -NF_ACCEPT;
 	}
      in_window:
@@ -972,7 +969,7 @@  static int tcp_packet(struct nf_conn *ct,
 		timeout = nf_ct_tcp_timeout_unacknowledged;
 	else
 		timeout = tcp_timeouts[new_state];
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(nf_conn_lock_addr(ct));
 
 	nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct);
 	if (new_state != old_state)
@@ -1090,12 +1087,12 @@  static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
 #include <linux/netfilter/nfnetlink_conntrack.h>
 
 static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
-			 const struct nf_conn *ct)
+			 struct nf_conn *ct)
 {
 	struct nlattr *nest_parms;
 	struct nf_ct_tcp_flags tmp = {};
 
-	read_lock_bh(&tcp_lock);
+	spin_lock_bh(nf_conn_lock_addr(ct));
 	nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
@@ -1115,14 +1112,14 @@  static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 	tmp.flags = ct->proto.tcp.seen[1].flags;
 	NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY,
 		sizeof(struct nf_ct_tcp_flags), &tmp);
-	read_unlock_bh(&tcp_lock);
+	spin_unlock_bh(nf_conn_lock_addr(ct));
 
 	nla_nest_end(skb, nest_parms);
 
 	return 0;
 
 nla_put_failure:
-	read_unlock_bh(&tcp_lock);
+	spin_unlock_bh(nf_conn_lock_addr(ct));
 	return -1;
 }
 
@@ -1153,7 +1150,7 @@  static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
 	    nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX)
 		return -EINVAL;
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(nf_conn_lock_addr(ct));
 	if (tb[CTA_PROTOINFO_TCP_STATE])
 		ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]);
 
@@ -1180,7 +1177,7 @@  static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
 		ct->proto.tcp.seen[1].td_scale =
 			nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]);
 	}
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(nf_conn_lock_addr(ct));
 
 	return 0;
 }