Patchwork iptables: lock free counters

Submitter stephen hemminger
Date Feb. 19, 2009, 7:47 p.m.
Message ID <20090219114719.560999b5@extreme>
Permalink /patch/23446/
State Not Applicable
Delegated to: David Miller

Comments

stephen hemminger - Feb. 19, 2009, 7:47 p.m.
The reader/writer lock in ip_tables is acquired in the critical path of
processing packets and is one of the reasons just loading iptables can cause
a 20% performance loss. The rwlock serves two functions:

1) It prevents changes to table state (xt_replace) while the table is in use.
   This is now handled by using RCU on the xt_table. When a table is
   replaced, the new table(s) are put in place and the old table(s) are freed
   after an RCU grace period.

2) It provides synchronization when accessing the counter values.
   This is now handled by swapping in new table_info entries for each CPU,
   then summing the old values and adding the result back onto one
   CPU. On a busy system this may cause sampling to occur at different
   times on each CPU, but no packet/byte counts are lost in the process.
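
A minimal userspace sketch of the swap-and-sum scheme from item 2, under
stated assumptions: it is single-threaded, it keeps counters in separate
per-CPU arrays (the patch clones whole rule blobs instead), and every name in
it (struct table, swap_in, sum_counters, put_back, snapshot) is hypothetical
rather than taken from the patch. In the kernel the swap would use
rcu_assign_pointer() and be followed by synchronize_net() before the old
counters are read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU   4	/* hypothetical CPU count */
#define NRULES 3	/* hypothetical number of rules */

struct counter {
	uint64_t pcnt;	/* packets */
	uint64_t bcnt;	/* bytes */
};

/* One counter array per CPU, reached through a swappable pointer. */
struct table {
	struct counter *percpu[NCPU];
};

/* Exchange every per-CPU pointer; 'spare' ends up holding the old arrays. */
static void swap_in(struct table *live, struct table *spare)
{
	for (int cpu = 0; cpu < NCPU; cpu++) {
		struct counter *old = live->percpu[cpu];

		live->percpu[cpu] = spare->percpu[cpu];
		spare->percpu[cpu] = old;
	}
}

/* Sum the counters of all CPUs of 't' into out[]. */
static void sum_counters(const struct table *t, struct counter out[NRULES])
{
	memset(out, 0, NRULES * sizeof(out[0]));
	for (int cpu = 0; cpu < NCPU; cpu++) {
		for (int r = 0; r < NRULES; r++) {
			out[r].pcnt += t->percpu[cpu][r].pcnt;
			out[r].bcnt += t->percpu[cpu][r].bcnt;
		}
	}
}

/* Add the totals back onto a single CPU of the live table. */
static void put_back(struct table *live, int cpu,
		     const struct counter sum[NRULES])
{
	assert(cpu >= 0 && cpu < NCPU);
	for (int r = 0; r < NRULES; r++) {
		live->percpu[cpu][r].pcnt += sum[r].pcnt;
		live->percpu[cpu][r].bcnt += sum[r].bcnt;
	}
}

/* Snapshot: swap in zeroed arrays, sum the retired ones, restore totals. */
static void snapshot(struct table *live, struct table *spare,
		     struct counter out[NRULES])
{
	swap_in(live, spare);
	/* Kernel: wait for a grace period here so no CPU still writes
	 * to the retired arrays. This model has no concurrency. */
	sum_counters(spare, out);
	put_back(live, 0, out);
}
```

Packets that arrive after the swap are counted in the fresh arrays, and the
old totals are added back rather than overwritten, so nothing is dropped. The
only cost is that the snapshot is not taken at one instant across all CPUs.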

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
Added missing preempt_enable.  Patch against nf-next-2.6 git tree.

 include/linux/netfilter/x_tables.h |    6 +
 net/ipv4/netfilter/arp_tables.c    |  115 +++++++++++++++++++++++++++--------
 net/ipv4/netfilter/ip_tables.c     |  120 ++++++++++++++++++++++++++-----------
 net/ipv6/netfilter/ip6_tables.c    |  119 +++++++++++++++++++++++++-----------
 net/netfilter/x_tables.c           |   26 ++++++--
 5 files changed, 284 insertions(+), 102 deletions(-)
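
For item 1 of the description, the reader and writer sides can be
approximated in userspace with C11 atomics. This is only a sketch:
acquire/release ordering stands in for rcu_dereference() and
rcu_assign_pointer(), there is no grace period here (the kernel code calls
synchronize_net() before the old table is freed), and struct ruleset,
reader_rule_count() and replace_rules() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for struct xt_table_info. */
struct ruleset {
	int nrules;
};

struct table {
	_Atomic(struct ruleset *) private;	/* published rule set */
};

/* Reader: load the current rule set once and use only that copy.
 * An acquire load is a conservative substitute for rcu_dereference(). */
static int reader_rule_count(struct table *t)
{
	struct ruleset *r = atomic_load_explicit(&t->private,
						 memory_order_acquire);

	return r->nrules;
}

/* Writer: publish the new rule set and hand back the old one.
 * The caller must not free the old set until all readers that might
 * still hold it have finished (the grace period in the kernel). */
static struct ruleset *replace_rules(struct table *t, struct ruleset *newr)
{
	assert(newr != NULL);
	return atomic_exchange_explicit(&t->private, newr,
					memory_order_acq_rel);
}
```

Readers never block and never see a half-updated table: they get either the
old pointer or the new one. The cost moves to the writer, which has to wait
before it can free the old rule set.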

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Feb. 19, 2009, 11:46 p.m.
Acked-by: Eric Dumazet <dada1@cosmosbay.com>

Successfully tested on my dual quad-core machine too, but iptables only (no IPv6 here)

BTW, my new "tbench 8" result is 2450 MB/s (it was 2150 MB/s not so long ago)

Thanks Stephen, that's very cool stuff: yet another rwlock out of the kernel :)


Rick Jones - Feb. 19, 2009, 11:56 p.m.
Do you folks need/want further testing against the 32-core setup?

rick jones

stephen hemminger - Feb. 20, 2009, 1:03 a.m.
On Thu, 19 Feb 2009 15:56:18 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> Do you folks need/want further testing against the 32-core setup?

It would be good to combine all 3 (iptables-rcu, timer change, and conntrack lock)
to see what the overhead change is.
Rick Jones - Feb. 20, 2009, 1:18 a.m.
>>Do you folks need/want further testing against the 32-core setup?
> 
> 
> It would be good to combine all 3 (iptables-rcu, timer change, and conntrack lock)
> to see what the overhead change is.

Fair enough. Is there a tree somewhere I can pull with all those in it, or do I 
need to go back through the emails and apply patches?

rick jones
Patrick McHardy - Feb. 20, 2009, 9:37 a.m.
Applied, thanks everyone. I've also added Eric's tbench results
to the changelog.
Patrick McHardy - Feb. 20, 2009, 9:42 a.m.
Rick Jones wrote:
>>> Do you folks need/want further testing against the 32-core setup?
>>
>>
>> It would be good to combine all 3 (iptables-rcu, timer change, and 
>> conntrack lock)
>> to see what the overhead change is.
> 
> Fair enough. Is there a tree somewhere I can pull with all those in it, 
> or do I need to go back through the emails and apply patches?

You can use my nf-next.git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git

It contains the lock free counters, as well as smaller optimizations
from Eric. The last timer patch I've seen missed the actual conversion
to use mod_timer_pending(), but it would be great to have some numbers
on the conntrack lock changes. Thanks Rick!
Rick Jones - Feb. 20, 2009, 10:57 p.m.
>> Fair enough. Is there a tree somewhere I can pull with all those in 
>> it, or do I need to go back through the emails and apply patches?
> 
> 
> You can use my nf-next.git tree from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git
> 
> It contains the lock free counters, as well as smaller optimizations
> from Eric. 

So, by the time this hits inboxes, under:

ftp://ftp.netperf.org/nf-next-2.6-results

should be a directory called "baseline" containing the results from a plain clone 
of your tree from earlier today.  There you will find the config file, the log of 
the build and then three subdirectories:

none - results without doing iptables --list
empty - results after doing iptables --list
full - results after doing an iptables-restore of a config from the "iptables" 
file also up there

In each will be the netperf results in csv format, and four different caliper 
(using the perfmon interface) profiles:

"cycles" uses a profile which is able to take samples with interrupts disabled
"fprof" is a plain flat profile that does not see things happening with
interrupts disabled - comparing an fprof to cycles is sometimes interesting
"dcache" tries to take cache miss profiles. iirc that uses the data EAR in the 
Itanium PMU to do its thing - I cannot recall the effect of interrupt disabling there
"scgprof" is a sampled call graph profile - likely as not with interrupt 
limitations similar to those of an fprof profile.

> The last timer patch I've seen missed the actual conversion
> to use mod_timer_pending(), but it would be great to have some numbers
> on the conntrack lock changes. Thanks Rick!

I will go back through my email now and try to find the conntrack lock changes 
and apply them to the tree and turn the crank.

happy benchmarking

rick jones
Rick Jones - Feb. 21, 2009, 12:35 a.m.
Rick Jones wrote:
> So, by the time this hits inboxes, under:
> 
> ftp://ftp.netperf.org/nf-next-2.6-results
> ...
> I will go back through my email now and try to find the conntrack lock 
> changes and apply them to the tree and turn the crank.

Under the base URL above there is now a "conntrack" subdir with the usual "none," 
"empty," and "full" subdirs.  This is with the patch from message ID 
<20090219140303.4329f860@extreme> titled "Re: [RFT 4/4] netfilter: Get rid of 
central rwlock in tcp conntracking" which my mail client says has a date of 
02/19/09 14:03.

On the plus side, only one of the 64 concurrent netperfs died during the "full" 
test compared with more than 10 without the patch.  Also, there were no soft 
lockups reported as there were without the patch.

The rwlock time is gone, naturally, replaced with boatloads of spinlock 
contention.  Hopefully the scgprof profile will help show the source.  Perhaps 
there is yet another patch I should have applied :)

happy benchmarking,

rick jones
Eric Dumazet - Feb. 27, 2009, 2:02 p.m.
While testing multicast flooding stuff, I found that "iptables -nvL" can 
have a *very* slow response time on my dual quad core machine...

   LatencyTOP version 0.5       (C) 2008 Intel Corporation

Cause                                                Maximum     Percentage
synchronize_rcu synchronize_net do_ipt_get_ctl nf_ 1878.6 msec          3.1 %
Scheduler: waiting for cpu                        160.3 msec         13.6 %
do_get_write_access journal_get_write_access __ext 11.0 msec          0.0 %
do_get_write_access journal_get_write_access __ext  7.7 msec          0.0 %
poll_schedule_timeout do_select core_sys_select sy  4.9 msec          0.0 %
do_wait sys_wait4 sys_waitpid sysenter_do_call      3.4 msec          0.1 %
call_usermodehelper_exec request_module netlink_cr  1.6 msec          0.0 %
__skb_recv_datagram skb_recv_datagram raw_recvmsg   1.5 msec          0.0 %
do_wait sys_wait4 sysenter_do_call                  0.7 msec          0.0 %


# time iptables -nvL
Chain INPUT (policy ACCEPT 416M packets, 64G bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 401M packets, 62G bytes)
 pkts bytes target     prot opt in     out     source               destination

real    0m1.810s
user    0m0.000s
sys     0m0.001s


CONFIG_NO_HZ=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000

One CPU is spending 100% of its time handling softirqs; could that be the problem?

Cpu0  :  1.0%us, 14.7%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu1  :  3.6%us, 23.2%sy,  0.0%ni, 71.6%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,100.0%si,  0.0%st
Cpu3  :  2.7%us, 23.9%sy,  0.0%ni, 71.1%id,  0.7%wa,  0.0%hi,  1.7%si,  0.0%st
Cpu4  :  1.3%us, 14.3%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu5  :  1.0%us, 14.2%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu6  :  0.3%us,  7.0%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  0.7%us,  8.0%sy,  0.0%ni, 90.0%id,  0.7%wa,  0.0%hi,  0.7%si,  0.0%st
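
One way to read the numbers above, offered as an assumption rather than a
diagnosis: synchronize_rcu() cannot return until every CPU has passed through
a quiescent state, so the wait is set by the slowest CPU. The toy model below
uses made-up per-CPU delays and a hypothetical helper, grace_period_ms(); it
is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * next_qs_ms[i] is the (hypothetical) time in milliseconds until CPU i
 * next reports a quiescent state. A grace period ends only when the
 * last CPU has reported, so the wait is the maximum over all CPUs.
 */
static unsigned int grace_period_ms(const unsigned int next_qs_ms[],
				    size_t ncpu)
{
	unsigned int wait = 0;

	assert(ncpu > 0);
	for (size_t i = 0; i < ncpu; i++) {
		if (next_qs_ms[i] > wait)
			wait = next_qs_ms[i];
	}
	return wait;
}
```

In this model seven idle CPUs do not help: a single CPU that stays busy for a
long stretch without reporting determines how long the caller of
"iptables -nvL" sits in synchronize_net().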

Patrick McHardy - March 2, 2009, 10:55 a.m.
Eric Dumazet wrote:
> # time iptables -nvL
> Chain INPUT (policy ACCEPT 416M packets, 64G bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> Chain OUTPUT (policy ACCEPT 401M packets, 62G bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> real    0m1.810s
> user    0m0.000s
> sys     0m0.001s

That's really slow ...

> CONFIG_NO_HZ=y
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> 
> One cpu is 100% handling softirqs, could it be the problem ?

Is this fixed by your RCU quiescent state fix?
Eric Dumazet - March 2, 2009, 5:47 p.m.
Patrick McHardy wrote:
> Is this fixed by your RCU quiescent state fix?

Yes it is :)

Patrick McHardy - March 2, 2009, 9:56 p.m.
Eric Dumazet wrote:
> Patrick McHardy wrote:
>> Is this fixed by your RCU quiescent state fix?
> 
> Yes it is :)

Great, thanks :)
stephen hemminger - March 2, 2009, 10:02 p.m.
On Mon, 02 Mar 2009 22:56:39 +0100
Patrick McHardy <kaber@trash.net> wrote:

> Eric Dumazet wrote:
> > Yes it is :)
> 
> Great, thanks :)

I wonder if the RCU quiescent state fix should go into 2.6.29, because it
also fixes other issues, such as RCU latency on route changes under a DoS attack.
Patrick McHardy - March 2, 2009, 10:07 p.m.
Stephen Hemminger wrote:
> I wonder if the RCU quiescent fix should go in 2.6.29 because it
> fixes other issues like route changing RCU latency under DoS attack.

From what I can tell, it should.
Paul E. McKenney - March 2, 2009, 10:17 p.m.
On Mon, Mar 02, 2009 at 11:07:18PM +0100, Patrick McHardy wrote:
> Stephen Hemminger wrote:
>> I wonder if the RCU quiescent fix should go in 2.6.29 because it
>> fixes other issues like route changing RCU latency under DoS attack.
>
> From what I can tell, it should.

I agree.

							Thanx, Paul
Eric Dumazet - March 2, 2009, 10:27 p.m.
Stephen Hemminger wrote:
> I wonder if the RCU quiescent fix should go in 2.6.29 because it
> fixes other issues like route changing RCU latency under DoS attack.

Yes probably, and on stable versions too, since this problem is quite old...


Patch

--- a/include/linux/netfilter/x_tables.h	2009-02-19 11:42:43.060110657 -0800
+++ b/include/linux/netfilter/x_tables.h	2009-02-19 11:42:58.863663575 -0800
@@ -353,7 +353,7 @@  struct xt_table
 	unsigned int valid_hooks;
 
 	/* Lock for the curtain */
-	rwlock_t lock;
+	struct mutex lock;
 
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
@@ -385,7 +385,7 @@  struct xt_table_info
 
 	/* ipt_entry tables: one per CPU */
 	/* Note : this field MUST be the last one, see XT_TABLE_INFO_SZ */
-	char *entries[1];
+	void *entries[1];
 };
 
 #define XT_TABLE_INFO_SZ (offsetof(struct xt_table_info, entries) \
@@ -432,6 +432,8 @@  extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
+extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
+				    struct xt_table_info *new);
 
 #ifdef CONFIG_COMPAT
 #include <net/compat.h>
--- a/net/ipv4/netfilter/ip_tables.c	2009-02-19 11:42:12.968410890 -0800
+++ b/net/ipv4/netfilter/ip_tables.c	2009-02-19 11:42:58.863663575 -0800
@@ -347,10 +347,12 @@  ipt_do_table(struct sk_buff *skb,
 	mtpar.family  = tgpar.family = NFPROTO_IPV4;
 	tgpar.hooknum = hook;
 
-	read_lock_bh(&table->lock);
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
-	private = table->private;
-	table_base = (void *)private->entries[smp_processor_id()];
+
+	rcu_read_lock();
+	private = rcu_dereference(table->private);
+	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+
 	e = get_entry(table_base, private->hook_entry[hook]);
 
 	/* For return from builtin chain */
@@ -445,7 +447,7 @@  ipt_do_table(struct sk_buff *skb,
 		}
 	} while (!hotdrop);
 
-	read_unlock_bh(&table->lock);
+	rcu_read_unlock();
 
 #ifdef DEBUG_ALLOW_ALL
 	return NF_ACCEPT;
@@ -924,13 +926,68 @@  get_counters(const struct xt_table_info 
 				  counters,
 				  &i);
 	}
+
+}
+
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct ipt_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
+
+/* Take values from counters and add them back onto the current cpu */
+static void put_counters(struct xt_table_info *t,
+			 const struct xt_counters counters[])
+{
+	unsigned int i, cpu;
+
+	local_bh_disable();
+	cpu = smp_processor_id();
+	i = 0;
+	IPT_ENTRY_ITERATE(t->entries[cpu],
+			  t->size,
+			  add_counter_to_entry,
+			  counters,
+			  &i);
+	local_bh_enable();
+}
+
+
+static inline int
+zero_entry_counter(struct ipt_entry *e, void *arg)
+{
+	e->counters.bcnt = 0;
+	e->counters.pcnt = 0;
+	return 0;
+}
+
+static void
+clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
+{
+	unsigned int cpu;
+	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
+
+	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
+	for_each_possible_cpu(cpu) {
+		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
+		IPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
+				  zero_entry_counter, NULL);
+	}
 }
 
 static struct xt_counters * alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	struct xt_table_info *private = table->private;
+	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	   (other than comefrom, which userspace doesn't care
@@ -939,14 +996,30 @@  static struct xt_counters * alloc_counte
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		return ERR_PTR(-ENOMEM);
+		goto nomem;
+
+	info = xt_alloc_table_info(private->size);
+	if (!info)
+		goto free_counters;
+
+	clone_counters(info, private);
 
-	/* First, sum counters... */
-	write_lock_bh(&table->lock);
-	get_counters(private, counters);
-	write_unlock_bh(&table->lock);
+	mutex_lock(&table->lock);
+	xt_table_entry_swap_rcu(private, info);
+	synchronize_net();	/* Wait until smoke has cleared */
+
+	get_counters(info, counters);
+	put_counters(private, counters);
+	mutex_unlock(&table->lock);
+
+	xt_free_table_info(info);
 
 	return counters;
+
+ free_counters:
+	vfree(counters);
+ nomem:
+	return ERR_PTR(-ENOMEM);
 }
 
 static int
@@ -1312,27 +1385,6 @@  do_replace(struct net *net, void __user 
 	return ret;
 }
 
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK. */
-static int
-add_counter_to_entry(struct ipt_entry *e,
-		     const struct xt_counters addme[],
-		     unsigned int *i)
-{
-#if 0
-	duprintf("add_counter: Entry %u %lu/%lu + %lu/%lu\n",
-		 *i,
-		 (long unsigned int)e->counters.pcnt,
-		 (long unsigned int)e->counters.bcnt,
-		 (long unsigned int)addme[*i].pcnt,
-		 (long unsigned int)addme[*i].bcnt);
-#endif
-
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
 
 static int
 do_add_counters(struct net *net, void __user *user, unsigned int len, int compat)
@@ -1393,13 +1445,14 @@  do_add_counters(struct net *net, void __
 		goto free;
 	}
 
-	write_lock_bh(&t->lock);
+	mutex_lock(&t->lock);
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
+	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
 	loc_cpu_entry = private->entries[raw_smp_processor_id()];
@@ -1408,8 +1461,9 @@  do_add_counters(struct net *net, void __
 			  add_counter_to_entry,
 			  paddc,
 			  &i);
+	preempt_enable();
  unlock_up_free:
-	write_unlock_bh(&t->lock);
+	mutex_unlock(&t->lock);
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
--- a/net/netfilter/x_tables.c	2009-02-19 11:42:12.988414682 -0800
+++ b/net/netfilter/x_tables.c	2009-02-19 11:42:58.863663575 -0800
@@ -625,6 +625,20 @@  void xt_free_table_info(struct xt_table_
 }
 EXPORT_SYMBOL(xt_free_table_info);
 
+void xt_table_entry_swap_rcu(struct xt_table_info *oldinfo,
+			     struct xt_table_info *newinfo)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		void *p = oldinfo->entries[cpu];
+		rcu_assign_pointer(oldinfo->entries[cpu], newinfo->entries[cpu]);
+		newinfo->entries[cpu] = p;
+	}
+
+}
+EXPORT_SYMBOL_GPL(xt_table_entry_swap_rcu);
+
 /* Find table by name, grabs mutex & ref.  Returns ERR_PTR() on error. */
 struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
 				    const char *name)
@@ -671,21 +685,22 @@  xt_replace_table(struct xt_table *table,
 	struct xt_table_info *oldinfo, *private;
 
 	/* Do the substitution. */
-	write_lock_bh(&table->lock);
+	mutex_lock(&table->lock);
 	private = table->private;
 	/* Check inside lock: is the old number correct? */
 	if (num_counters != private->number) {
 		duprintf("num_counters != table->private->number (%u/%u)\n",
 			 num_counters, private->number);
-		write_unlock_bh(&table->lock);
+		mutex_unlock(&table->lock);
 		*error = -EAGAIN;
 		return NULL;
 	}
 	oldinfo = private;
-	table->private = newinfo;
+	rcu_assign_pointer(table->private, newinfo);
 	newinfo->initial_entries = oldinfo->initial_entries;
-	write_unlock_bh(&table->lock);
+	mutex_unlock(&table->lock);
 
+	synchronize_net();
 	return oldinfo;
 }
 EXPORT_SYMBOL_GPL(xt_replace_table);
@@ -719,7 +734,8 @@  struct xt_table *xt_register_table(struc
 
 	/* Simplifies replace_table code. */
 	table->private = bootstrap;
-	rwlock_init(&table->lock);
+	mutex_init(&table->lock);
+
 	if (!xt_replace_table(table, 0, newinfo, &ret))
 		goto unlock;
 
--- a/net/ipv4/netfilter/arp_tables.c	2009-02-19 11:42:43.064477910 -0800
+++ b/net/ipv4/netfilter/arp_tables.c	2009-02-19 11:42:58.863663575 -0800
@@ -261,9 +261,10 @@  unsigned int arpt_do_table(struct sk_buf
 	indev = in ? in->name : nulldevname;
 	outdev = out ? out->name : nulldevname;
 
-	read_lock_bh(&table->lock);
-	private = table->private;
-	table_base = (void *)private->entries[smp_processor_id()];
+	rcu_read_lock();
+	private = rcu_dereference(table->private);
+	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+
 	e = get_entry(table_base, private->hook_entry[hook]);
 	back = get_entry(table_base, private->underflow[hook]);
 
@@ -335,7 +336,8 @@  unsigned int arpt_do_table(struct sk_buf
 			e = (void *)e + e->next_offset;
 		}
 	} while (!hotdrop);
-	read_unlock_bh(&table->lock);
+
+	rcu_read_unlock();
 
 	if (hotdrop)
 		return NF_DROP;
@@ -738,11 +740,65 @@  static void get_counters(const struct xt
 	}
 }
 
-static inline struct xt_counters *alloc_counters(struct xt_table *table)
+
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct arpt_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
+
+/* Take values from counters and add them back onto the current cpu */
+static void put_counters(struct xt_table_info *t,
+			 const struct xt_counters counters[])
+{
+	unsigned int i, cpu;
+
+	local_bh_disable();
+	cpu = smp_processor_id();
+	i = 0;
+	ARPT_ENTRY_ITERATE(t->entries[cpu],
+			  t->size,
+			  add_counter_to_entry,
+			  counters,
+			  &i);
+	local_bh_enable();
+}
+
+static inline int
+zero_entry_counter(struct arpt_entry *e, void *arg)
+{
+	e->counters.bcnt = 0;
+	e->counters.pcnt = 0;
+	return 0;
+}
+
+static void
+clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
+{
+	unsigned int cpu;
+	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
+
+	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
+	for_each_possible_cpu(cpu) {
+		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
+		ARPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
+				  zero_entry_counter, NULL);
+	}
+}
+
+static struct xt_counters *alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	struct xt_table_info *private = table->private;
+	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	 * (other than comefrom, which userspace doesn't care
@@ -752,14 +808,30 @@  static inline struct xt_counters *alloc_
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		return ERR_PTR(-ENOMEM);
+		goto nomem;
 
-	/* First, sum counters... */
-	write_lock_bh(&table->lock);
-	get_counters(private, counters);
-	write_unlock_bh(&table->lock);
+	info = xt_alloc_table_info(private->size);
+	if (!info)
+		goto free_counters;
+
+	clone_counters(info, private);
+
+	mutex_lock(&table->lock);
+	xt_table_entry_swap_rcu(private, info);
+	synchronize_net();	/* Wait until smoke has cleared */
+
+	get_counters(info, counters);
+	put_counters(private, counters);
+	mutex_unlock(&table->lock);
+
+	xt_free_table_info(info);
 
 	return counters;
+
+ free_counters:
+	vfree(counters);
+ nomem:
+	return ERR_PTR(-ENOMEM);
 }
 
 static int copy_entries_to_user(unsigned int total_size,
@@ -1099,20 +1171,6 @@  static int do_replace(struct net *net, v
 	return ret;
 }
 
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK.
- */
-static inline int add_counter_to_entry(struct arpt_entry *e,
-				       const struct xt_counters addme[],
-				       unsigned int *i)
-{
-
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
-
 static int do_add_counters(struct net *net, void __user *user, unsigned int len,
 			   int compat)
 {
@@ -1172,13 +1230,14 @@  static int do_add_counters(struct net *n
 		goto free;
 	}
 
-	write_lock_bh(&t->lock);
+	mutex_lock(&t->lock);
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
+	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
 	loc_cpu_entry = private->entries[smp_processor_id()];
@@ -1187,8 +1246,10 @@  static int do_add_counters(struct net *n
 			   add_counter_to_entry,
 			   paddc,
 			   &i);
+	preempt_enable();
  unlock_up_free:
-	write_unlock_bh(&t->lock);
+	mutex_unlock(&t->lock);
+
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
--- a/net/ipv6/netfilter/ip6_tables.c	2009-02-19 11:42:54.219410544 -0800
+++ b/net/ipv6/netfilter/ip6_tables.c	2009-02-19 11:42:58.867668311 -0800
@@ -382,10 +382,12 @@  ip6t_do_table(struct sk_buff *skb,
 	mtpar.family  = tgpar.family = NFPROTO_IPV6;
 	tgpar.hooknum = hook;
 
-	read_lock_bh(&table->lock);
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
-	private = table->private;
-	table_base = (void *)private->entries[smp_processor_id()];
+
+	rcu_read_lock();
+	private = rcu_dereference(table->private);
+	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+
 	e = get_entry(table_base, private->hook_entry[hook]);
 
 	/* For return from builtin chain */
@@ -483,7 +485,7 @@  ip6t_do_table(struct sk_buff *skb,
 #ifdef CONFIG_NETFILTER_DEBUG
 	((struct ip6t_entry *)table_base)->comefrom = NETFILTER_LINK_POISON;
 #endif
-	read_unlock_bh(&table->lock);
+	rcu_read_unlock();
 
 #ifdef DEBUG_ALLOW_ALL
 	return NF_ACCEPT;
@@ -964,11 +966,64 @@  get_counters(const struct xt_table_info 
 	}
 }
 
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct ip6t_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
+
+/* Take values from counters and add them back onto the current cpu */
+static void put_counters(struct xt_table_info *t,
+			 const struct xt_counters counters[])
+{
+	unsigned int i, cpu;
+
+	local_bh_disable();
+	cpu = smp_processor_id();
+	i = 0;
+	IP6T_ENTRY_ITERATE(t->entries[cpu],
+			   t->size,
+			   add_counter_to_entry,
+			   counters,
+			   &i);
+	local_bh_enable();
+}
+
+static inline int
+zero_entry_counter(struct ip6t_entry *e, void *arg)
+{
+	e->counters.bcnt = 0;
+	e->counters.pcnt = 0;
+	return 0;
+}
+
+static void
+clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
+{
+	unsigned int cpu;
+	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
+
+	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
+	for_each_possible_cpu(cpu) {
+		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
+		IP6T_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
+				   zero_entry_counter, NULL);
+	}
+}
+
 static struct xt_counters *alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
-	const struct xt_table_info *private = table->private;
+	struct xt_table_info *private = table->private;
+	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	   (other than comefrom, which userspace doesn't care
@@ -977,14 +1032,28 @@  static struct xt_counters *alloc_counter
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		return ERR_PTR(-ENOMEM);
+		goto nomem;
 
-	/* First, sum counters... */
-	write_lock_bh(&table->lock);
-	get_counters(private, counters);
-	write_unlock_bh(&table->lock);
+	info = xt_alloc_table_info(private->size);
+	if (!info)
+		goto free_counters;
+
+	clone_counters(info, private);
 
-	return counters;
+	mutex_lock(&table->lock);
+	xt_table_entry_swap_rcu(private, info);
+	synchronize_net();	/* Wait until smoke has cleared */
+
+	get_counters(info, counters);
+	put_counters(private, counters);
+	mutex_unlock(&table->lock);
+
+	xt_free_table_info(info);
+	return counters;
+ free_counters:
+	vfree(counters);
+ nomem:
+	return ERR_PTR(-ENOMEM);
 }
 
 static int
@@ -1351,28 +1420,6 @@  do_replace(struct net *net, void __user 
 	return ret;
 }
 
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK. */
-static inline int
-add_counter_to_entry(struct ip6t_entry *e,
-		     const struct xt_counters addme[],
-		     unsigned int *i)
-{
-#if 0
-	duprintf("add_counter: Entry %u %lu/%lu + %lu/%lu\n",
-		 *i,
-		 (long unsigned int)e->counters.pcnt,
-		 (long unsigned int)e->counters.bcnt,
-		 (long unsigned int)addme[*i].pcnt,
-		 (long unsigned int)addme[*i].bcnt);
-#endif
-
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
-
 static int
 do_add_counters(struct net *net, void __user *user, unsigned int len,
 		int compat)
@@ -1433,13 +1480,14 @@  do_add_counters(struct net *net, void __
 		goto free;
 	}
 
-	write_lock_bh(&t->lock);
+	mutex_lock(&t->lock);
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
+	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
 	loc_cpu_entry = private->entries[raw_smp_processor_id()];
@@ -1448,8 +1496,9 @@  do_add_counters(struct net *net, void __
 			  add_counter_to_entry,
 			  paddc,
 			  &i);
+	preempt_enable();
  unlock_up_free:
-	write_unlock_bh(&t->lock);
+	mutex_unlock(&t->lock);
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
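
The counter snapshot that alloc_counters() performs in the hunks above
(swap in zeroed per-cpu entries, sum the old ones, add the totals back onto
one cpu) can be sketched in userspace C. This is only an illustration under
loud assumptions: it is single-threaded, it has no RCU, and every name in it
(NCPU, NRULE, struct table, swap_entries, sum_counters, put_back, snapshot)
is hypothetical and not part of the patch. The point where the kernel code
calls synchronize_net() is marked with a comment only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU  4	/* hypothetical number of CPUs */
#define NRULE 3	/* hypothetical number of rules */

struct counter { uint64_t pcnt, bcnt; };

/* One counter array per CPU; stands in for xt_table_info->entries[cpu]. */
struct table { struct counter *percpu[NCPU]; };

/* Exchange the per-CPU arrays of two tables (compare xt_table_entry_swap_rcu). */
static void swap_entries(struct table *live, struct table *spare)
{
	for (int cpu = 0; cpu < NCPU; cpu++) {
		struct counter *p = live->percpu[cpu];

		live->percpu[cpu] = spare->percpu[cpu];
		spare->percpu[cpu] = p;
	}
}

/* Sum the counters of every CPU into out[] (compare get_counters). */
static void sum_counters(const struct table *t, struct counter out[NRULE])
{
	memset(out, 0, NRULE * sizeof(out[0]));
	for (int cpu = 0; cpu < NCPU; cpu++) {
		for (int r = 0; r < NRULE; r++) {
			out[r].pcnt += t->percpu[cpu][r].pcnt;
			out[r].bcnt += t->percpu[cpu][r].bcnt;
		}
	}
}

/* Add the totals back onto a single CPU's array (compare put_counters). */
static void put_back(struct table *t, int cpu, const struct counter in[NRULE])
{
	assert(cpu >= 0 && cpu < NCPU);
	for (int r = 0; r < NRULE; r++) {
		t->percpu[cpu][r].pcnt += in[r].pcnt;
		t->percpu[cpu][r].bcnt += in[r].bcnt;
	}
}

/*
 * Take a snapshot: spare must hold zeroed counters on entry. Afterwards
 * out[] holds the totals, live uses the arrays that spare brought in, and
 * the totals sit on CPU 0 of live so that nothing is lost.
 */
static void snapshot(struct table *live, struct table *spare,
		     struct counter out[NRULE])
{
	swap_entries(live, spare);
	/* The kernel waits here (synchronize_net) until no reader
	 * can still be updating the old arrays. */
	sum_counters(spare, out);
	put_back(live, 0, out);
}
```

A caller would point live.percpu[] and spare.percpu[] at two sets of arrays,
zero the spare set, and call snapshot(&live, &spare, out). Because put_back()
adds rather than overwrites, increments that land on the freshly swapped-in
arrays before the totals are restored are preserved, which is the property
the patch description relies on.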