Patchwork 32 core net-next stack/netfilter "scaling"

Submitter Eric Dumazet
Date Jan. 27, 2009, 4:23 p.m.
Message ID <497F350A.9020509@cosmosbay.com>
Permalink /patch/20461/
State Not Applicable
Delegated to: David Miller

Comments

Eric Dumazet - Jan. 27, 2009, 4:23 p.m.
Patrick McHardy wrote:
> Eric Dumazet wrote:
>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>>
>> TCP connection tracking suffers from huge contention on a global rwlock,
>> used to protect tcp conntracking state.
>> As tcp conntrack states have no relation to one another, we
>> can switch to fine-grained locking, using a spinlock per "struct ip_ct_tcp".
>>
>> tcp_print_conntrack() doesn't need to lock anything to read
>> ct->proto.tcp.state,
>> so this speeds up /proc/net/ip_conntrack as well.
> 
> That's an interesting test-case, but one lock per conntrack just for
> TCP tracking seems like overkill. We're trying to keep the conntrack
> structures as small as possible, so I'd prefer an array of spinlocks
> or something like that.

Please find a new version of the patch, using an array of spinlocks.

I chose to define the array at the nf_conntrack level, since tcp
conntracking is one user of it, but we might find other users later.

Thanks

[PATCH] netfilter: Get rid of central rwlock in tcp conntracking

TCP connection tracking suffers from huge contention on a global rwlock
used to protect tcp conntracking state.

As tcp conntrack states have no relation to one another, we
can switch to fine-grained locking. Using an array of spinlocks avoids
enlarging the connection tracking structures, yet gives reasonable
fanout.

tcp_print_conntrack() doesn't need to lock anything to read
ct->proto.tcp.state, so this speeds up /proc/net/ip_conntrack as well.

nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd are declared __read_mostly.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
 include/net/netfilter/nf_conntrack.h   |   32 +++++++++++++++++++++
 net/netfilter/nf_conntrack_core.c      |   10 ++++--
 net/netfilter/nf_conntrack_proto_tcp.c |   34 ++++++++++-------------
 3 files changed, 53 insertions(+), 23 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy - Jan. 27, 2009, 5:33 p.m.
Eric Dumazet wrote:
> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>
> TCP connection tracking suffers from huge contention on a global rwlock
> used to protect tcp conntracking state.
>
> As tcp conntrack states have no relation to one another, we
> can switch to fine-grained locking. Using an array of spinlocks avoids
> enlarging the connection tracking structures, yet gives reasonable
> fanout.
>
> tcp_print_conntrack() doesn't need to lock anything to read
> ct->proto.tcp.state, so this speeds up /proc/net/ip_conntrack as well.
>
> nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd are declared __read_mostly.
>   

This looks good to me. Rick, would you like to give it a try?

I'll convert the remaining conntrack protocols when applying it.
Rick Jones - Jan. 27, 2009, 6:02 p.m.
Patrick McHardy wrote:
> Eric Dumazet wrote:
> 
>>[PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>>
>>TCP connection tracking suffers from huge contention on a global rwlock
>>used to protect tcp conntracking state.
>>
>>As tcp conntrack states have no relation to one another, we
>>can switch to fine-grained locking. Using an array of spinlocks avoids
>>enlarging the connection tracking structures, yet gives reasonable
>>fanout.
>>
>>tcp_print_conntrack() doesn't need to lock anything to read
>>ct->proto.tcp.state, so this speeds up /proc/net/ip_conntrack as well.
>>
>>nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd are declared __read_mostly.
>>  
> 
> 
> This looks good to me. Rick, would you like to give it a try?
> 
> I'll convert the remaining conntrack protocols when applying it.

I will give it a try and let folks know the results - unless told otherwise, I 
will ass-u-me I only need to rerun the "full_iptables" test case.

rick jones

Rick Jones - Jan. 27, 2009, 7:09 p.m.
Rick Jones wrote:
> Patrick McHardy wrote:
> 
>> Eric Dumazet wrote:
>>
>>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>>>
>>> TCP connection tracking suffers from huge contention on a global rwlock
>>> used to protect tcp conntracking state.
>>>
>>> As tcp conntrack states have no relation to one another, we
>>> can switch to fine-grained locking. Using an array of spinlocks avoids
>>> enlarging the connection tracking structures, yet gives reasonable
>>> fanout.
>>>
>>> tcp_print_conntrack() doesn't need to lock anything to read
>>> ct->proto.tcp.state, so this speeds up /proc/net/ip_conntrack as well.
>>>
>>> nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd are declared
>>> __read_mostly.
>>>  
>>
>>
>>
>> This looks good to me. Rick, would you like to give it a try?
>>
>> I'll convert the remaining conntrack protocols when applying it.
> 
> 
> I will give it a try and let folks know the results - unless told 
> otherwise, I will ass-u-me I only need rerun the "full_iptables" test case.

The runemomniagg2.sh script is still running, but the initial cycles profile 
suggests that the main change is converting the write_lock time into spinlock 
contention time, with 78.39% of the cycles spent in ia64_spinlock_contention. 
When the script completes I'll upload the profiles and the netperf results to 
the same base URL as in the base note, under "contrack01/".

rick jones
Rick Jones - Jan. 27, 2009, 7:24 p.m.
>> I will give it a try and let folks know the results - unless told 
>> otherwise, I will ass-u-me I only need rerun the "full_iptables" test 
>> case.
> 
> 
> The runemomniagg2.sh script is still running, but the initial cycles 
> profile suggests that the main change is converting the write_lock time 
> into spinlock contention time with 78.39% of the cycles spent in 
> ia64_spinlock_contention. When the script completes I'll upload the 
> profiles and the netperf results to the same base URL as in the basenote 
> under "contrack01/"

The script completed - although at some point I hit an fd limit - I think I 
have an fd leak in netperf somewhere :(

Anyhow, there are still some netperfs that end up kicking the bucket during 
the run - I suspect starvation, because in the other configs (no iptables, and 
empty iptables) each netperf seems to consume about 50% of a CPU - stands to 
reason: 64 netperfs, 32 cores - while in the "full" case I see many netperfs 
consuming 100% of a CPU.  My gut is thinking that one or more netperf contexts 
gets stuck doing something on behalf of others.  There is also ksoftirqd time 
for a few of those processes.

Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which does 
represent an improvement.

rick jones

PS - just to be certain that running out of fds didn't skew the results, I'm 
rerunning the script with ulimit -n 10240 and will see if that changes the 
results any.  And I suppose I need to go fd-leak hunting in the netperf omni 
code :(
Eric Dumazet - Jan. 27, 2009, 10:17 p.m.
Rick Jones wrote:
>>> I will give it a try and let folks know the results - unless told 
>>> otherwise, I will ass-u-me I only need rerun the "full_iptables" 
>>> test case.
>>
>>
>> The runemomniagg2.sh script is still running, but the initial cycles 
>> profile suggests that the main change is converting the write_lock 
>> time into spinlock contention time with 78.39% of the cycles spent in 
>> ia64_spinlock_contention. When the script completes I'll upload the 
>> profiles and the netperf results to the same base URL as in the 
>> basenote under "contrack01/"
>
> The script completed - although at some point I hit an fd limit - I 
> think I have an fd leak in netperf somewhere :( .
>
> Anyhow, there are still some netperfs that end-up kicking the bucket 
> during the run - I suspect starvation because where in the other 
> configs (no iptables, and empty iptables) each netperf seems to 
> consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores 
> - in the "full" case I see many netperfs consuming 100% of a CPU.  My 
> gut is thinking that one or more netperf contexts gets stuck doing 
> something on behalf of others.  There is also ksoftirqd time for a few 
> of those processes.
>
> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which 
> does represent an improvement.
>
> rick jones
>
> PS - just to be certain that running-out of fd's didn't skew the 
> results I'm rerunning the script with ulimit -n 10240 and will see if 
> that changes the results any.  And I suppose I need to go fd leak 
> hunting in netperf omni code :(
> -- 
>

Thanks for the report.

If you have so much contention on spinlocks, maybe the hash function is not 
good at all...

hash = (unsigned long)ct;
hash ^= hash >> 16;
hash ^= hash >> 8;

I ass-u-me you compiled your kernel with NR_CPUS=32?

Rick Jones - Jan. 27, 2009, 10:29 p.m.
> Thanks for the report
> 
> If you have so much contention on spinlocks, maybe hash function is not 
> good at all...
> 
> hash = (unsigned long)ct;
> hash ^= hash >> 16;
> hash ^= hash >> 8;
> 
> I ass-u-me you compiled your kernel with NR_CPUS=32 ?

I believe so - CONFIG_NR_CPUS in .config is 64 anyway.  If there is a more 
definitive place to check I'd be happy to look.

rick

Eric Dumazet - Jan. 27, 2009, 10:34 p.m.
Rick Jones wrote:
>
>> Thanks for the report
>>
>> If you have so much contention on spinlocks, maybe hash function is 
>> not good at all...
>>
>> hash = (unsigned long)ct;
>> hash ^= hash >> 16;
>> hash ^= hash >> 8;
>>
>> I ass-u-me you compiled your kernel with NR_CPUS=32 ?
>
> I believe so - CONFIG_NR_CPUS in .config is 64 anyway.  If there is a 
> more definitive place to check I'd be happy to look.
>
> rick
>
>

By any chance, are you using SLAB instead of SLUB?

While your netperf session is running, issuing:

grep conntrack /proc/slabinfo

would help show the hash dispersion on your machine.


Rick Jones - Jan. 27, 2009, 10:43 p.m.
> 
> By any chance, are you using SLAB instead of SLUB ?

Seems that way:

sut42:~/net-next-2.6# grep SLAB .config
CONFIG_SLAB=y
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set
sut42:~/net-next-2.6# grep SLUB .config
# CONFIG_SLUB is not set

The config file should be up under the base URL:

ftp://ftp.netperf.org/iptable_scaling

The system started as a Debian Lenny with a 2.6.26 kernel, and I did a make 
oldconfig and mostly took the defaults (save multiq).


> While running your netperf session, issuing :
> 
> grep conntrack /proc/slabinfo
> 
> Would help to know hash dispertion on your machine.


sut42:~/net-next-2.6# grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : slabdata      0      0      0
nf_conntrack         276    742    304   53    1 : tunables   54   27    8 : slabdata     14     14      0

about a minute later it is:

grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : slabdata      0      0      0
nf_conntrack         315    742    304   53    1 : tunables   54   27    8 : slabdata     14     14      0

after about another 40 to 60 seconds it was:

grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : slabdata      0      0      0
nf_conntrack         314    742    304   53    1 : tunables   54   27    8 : slabdata     14     14      7

Eric Dumazet - Jan. 28, 2009, 1:55 p.m.
Rick Jones wrote:
>>> I will give it a try and let folks know the results - unless told
>>> otherwise, I will ass-u-me I only need rerun the "full_iptables" test
>>> case.
>>
>>
>> The runemomniagg2.sh script is still running, but the initial cycles
>> profile suggests that the main change is converting the write_lock
>> time into spinlock contention time with 78.39% of the cycles spent in
>> ia64_spinlock_contention. When the script completes I'll upload the
>> profiles and the netperf results to the same base URL as in the
>> basenote under "contrack01/"
> 
> The script completed - although at some point I hit an fd limit - I
> think I have an fd leak in netperf somewhere :( .
> 
> Anyhow, there are still some netperfs that end-up kicking the bucket
> during the run - I suspect starvation because where in the other configs
> (no iptables, and empty iptables) each netperf seems to consume about
> 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full"
> case I see many netperfs consuming 100% of a CPU.  My gut is thinking
> that one or more netperf contexts gets stuck doing something on behalf
> of others.  There is also ksoftirqd time for a few of those processes.
> 
> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which
> does represent an improvement.
>

Yes, indeed you have a speedup; the tcp conntracking is OK.

You now hit the nf_conntrack_lock spinlock we have in the generic conntrack 
code (net/netfilter/nf_conntrack_core.c).

nf_ct_refresh_acct(), for instance, has to take it.

We really want some finer locking here.

Patrick McHardy - Jan. 28, 2009, 4:25 p.m.
Eric Dumazet wrote:
> Rick Jones wrote:
>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which
>> does represent an improvement.
>>
> 
> Yes indeed you have a speedup, tcp conntracking is OK.
> 
> You now hit the nf_conntrack_lock spinlock we have in generic conntrack code 
> (net/netfilter/nf_conntrack_core.c)
> 
> nf_ct_refresh_acct() for instance has to lock it.
> 
> We really want some finer locking here.

That looks more complicated, since it requires taking multiple locks
occasionally (e.g. hash insertion, potentially helper-related and
expectation-related stuff), and there is the unconfirmed_list, where
fine-grained locking can't really be used without changing it to
a hash.
Eric Dumazet - Jan. 28, 2009, 5:07 p.m.
Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Rick Jones wrote:
>>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which
>>> does represent an improvement.
>>>
>>
>> Yes indeed you have a speedup, tcp conntracking is OK.
>>
>> You now hit the nf_conntrack_lock spinlock we have in generic
>> conntrack code (net/netfilter/nf_conntrack_core.c)
>>
>> nf_ct_refresh_acct() for instance has to lock it.
>>
>> We really want some finer locking here.
> 
> That looks more complicated since it requires to take multiple locks
> occasionally (f.i. hash insertion, potentially helper-related and
> expectation-related stuff), and there is the unconfirmed_list, where
> fine-grained locking can't really be used without changing it to
> a hash.
>

Yes, it's more complicated, but look at what we did in 2.6.29 for tcp/udp
sockets, using RCU for lockless lookups.
We still take a lock when doing an insert or delete at socket
bind/unbind time.

We could keep a central nf_conntrack_lock to guard insertions/deletions
in the hash and the unconfirmed_list.

But *normal* packets that only need to change the state of one particular
connection could use RCU (without a spinlock) to locate the conntrack,
then lock the found conntrack to perform all state changes.


Eric Dumazet - Jan. 28, 2009, 5:34 p.m.
Eric Dumazet wrote:
> Patrick McHardy wrote:
>> Eric Dumazet wrote:
>>> Rick Jones wrote:
>>>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which
>>>> does represent an improvement.
>>>>
>>> Yes indeed you have a speedup, tcp conntracking is OK.
>>>
>>> You now hit the nf_conntrack_lock spinlock we have in generic
>>> conntrack code (net/netfilter/nf_conntrack_core.c)
>>>
>>> nf_ct_refresh_acct() for instance has to lock it.
>>>
>>> We really want some finer locking here.
>> That looks more complicated since it requires to take multiple locks
>> occasionally (f.i. hash insertion, potentially helper-related and
>> expectation-related stuff), and there is the unconfirmed_list, where
>> fine-grained locking can't really be used without changing it to
>> a hash.
>>
> 
> Yes its more complicated, but look what we did in 2.6.29 for tcp/udp
>  sockets, using RCU to have lockless lookups.
> Yes, we still take a lock when doing an insert or delete at socket
> bind/unbind time.
> 
> We could keep a central nf_conntrack_lock to guard insertions/deletes
> from hash and unconfirmed_list.
> 
> But *normal* packets that only need to change state of one particular
> connection could use RCU (without spinlock) to locate the conntrack,
> then lock the found conntrack to perform all state changes.

Well... RCU is already used by conntrack :)

Maybe only __nf_ct_refresh_acct() needs to stop taking nf_conntrack_lock.



Patrick McHardy - Feb. 9, 2009, 2:57 p.m.
Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Patrick McHardy wrote:
>>> That looks more complicated since it requires to take multiple locks
>>> occasionally (f.i. hash insertion, potentially helper-related and
>>> expectation-related stuff), and there is the unconfirmed_list, where
>>> fine-grained locking can't really be used without changing it to
>>> a hash.
>>>
>> Yes its more complicated, but look what we did in 2.6.29 for tcp/udp
>>  sockets, using RCU to have lockless lookups.
>> Yes, we still take a lock when doing an insert or delete at socket
>> bind/unbind time.
>>
>> We could keep a central nf_conntrack_lock to guard insertions/deletes
>> from hash and unconfirmed_list.
>>
>> But *normal* packets that only need to change state of one particular
>> connection could use RCU (without spinlock) to locate the conntrack,
>> then lock the found conntrack to perform all state changes.
> 
> Well... RCU is already used by conntrack :)
> 
> Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock

The lock is currently used to avoid timer update races.
Martin has two old patches to remove it, but they require
changes to the timer code to support a mod_timer variant
that doesn't enable an inactive timer:

http://people.netfilter.org/gandalf/performance/patches/mod_timer_noact
http://people.netfilter.org/gandalf/performance/patches/__nf_ct_refresh_acct-locking

Patch

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2e0c536..288aff5 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -129,6 +129,38 @@  struct nf_conn
 	struct rcu_head rcu;
 };
 
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_LOCKING)
+
+/*
+ * We reserve 16 locks per cpu (typical cache line size is 64 bytes)
+ * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory.
+ * (on lockdep we have a quite big spinlock_t, so keep the size down there)
+ */
+#ifdef CONFIG_LOCKDEP
+#define CT_HASH_LOCK_SZ 64
+#elif NR_CPUS >= 32
+#define CT_HASH_LOCK_SZ 512
+#else
+#define CT_HASH_LOCK_SZ (roundup_pow_of_two(NR_CPUS) * 16)
+#endif
+
+extern spinlock_t ct_hlock[CT_HASH_LOCK_SZ];
+
+#else
+#define CT_HASH_LOCK_SZ 0
+#endif
+static inline spinlock_t *ct_lock_addr(const struct nf_conn *ct)
+{
+	if (CT_HASH_LOCK_SZ) {
+		unsigned long hash = (unsigned long)ct;
+		hash ^= hash >> 16;
+		hash ^= hash >> 8;
+		return &ct_hlock[hash % CT_HASH_LOCK_SZ];
+	}
+	return NULL;
+}
+
 static inline struct nf_conn *
 nf_ct_tuplehash_to_ctrack(const struct nf_conntrack_tuple_hash *hash)
 {
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 90ce9dd..68822d8 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -61,9 +61,9 @@  struct nf_conn nf_conntrack_untracked __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
 
 static struct kmem_cache *nf_conntrack_cachep __read_mostly;
-
-static int nf_conntrack_hash_rnd_initted;
-static unsigned int nf_conntrack_hash_rnd;
+spinlock_t ct_hlock[CT_HASH_LOCK_SZ];
+static int nf_conntrack_hash_rnd_initted __read_mostly;
+static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
 static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
 				  unsigned int size, unsigned int rnd)
@@ -1141,7 +1141,7 @@  module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint,
 static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
-	int ret;
+	int i, ret;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1174,6 +1174,8 @@  static int nf_conntrack_init_init_net(void)
 		ret = -ENOMEM;
 		goto err_cache;
 	}
+	for (i = 0; i < CT_HASH_LOCK_SZ; i++)
+		spin_lock_init(&ct_hlock[i]);
 
 	ret = nf_conntrack_proto_init();
 	if (ret < 0)
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index a1edb9c..247e82f 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -26,9 +26,6 @@ 
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_log.h>
 
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
 /* "Be conservative in what you do,
     be liberal in what you accept from others."
     If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@  static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
 {
 	enum tcp_conntrack state;
 
-	read_lock_bh(&tcp_lock);
 	state = ct->proto.tcp.state;
-	read_unlock_bh(&tcp_lock);
 
 	return seq_printf(s, "%s ", tcp_conntrack_names[state]);
 }
@@ -705,14 +700,14 @@  void nf_conntrack_tcp_update(const struct sk_buff *skb,
 
 	end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph);
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(ct_lock_addr(ct));
 	/*
 	 * We have to worry for the ack in the reply packet only...
 	 */
 	if (after(end, ct->proto.tcp.seen[dir].td_end))
 		ct->proto.tcp.seen[dir].td_end = end;
 	ct->proto.tcp.last_end = end;
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(ct_lock_addr(ct));
 	pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i "
 		 "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
 		 sender->td_end, sender->td_maxend, sender->td_maxwin,
@@ -821,7 +816,7 @@  static int tcp_packet(struct nf_conn *ct,
 	th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
 	BUG_ON(th == NULL);
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(ct_lock_addr(ct));
 	old_state = ct->proto.tcp.state;
 	dir = CTINFO2DIR(ctinfo);
 	index = get_conntrack_index(th);
@@ -851,7 +846,7 @@  static int tcp_packet(struct nf_conn *ct,
 		        && ct->proto.tcp.last_index == TCP_RST_SET)) {
 			/* Attempt to reopen a closed/aborted connection.
 			 * Delete this connection and look up again. */
-			write_unlock_bh(&tcp_lock);
+			spin_unlock_bh(ct_lock_addr(ct));
 
 			/* Only repeat if we can actually remove the timer.
 			 * Destruction may already be in progress in process
@@ -887,7 +882,7 @@  static int tcp_packet(struct nf_conn *ct,
 			 * that the client cannot but retransmit its SYN and
 			 * thus initiate a clean new session.
 			 */
-			write_unlock_bh(&tcp_lock);
+			spin_unlock_bh(ct_lock_addr(ct));
 			if (LOG_INVALID(net, IPPROTO_TCP))
 				nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 					  "nf_ct_tcp: killing out of sync session ");
@@ -900,7 +895,7 @@  static int tcp_packet(struct nf_conn *ct,
 		ct->proto.tcp.last_end =
 		    segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th);
 
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(ct_lock_addr(ct));
 		if (LOG_INVALID(net, IPPROTO_TCP))
 			nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 				  "nf_ct_tcp: invalid packet ignored ");
@@ -909,7 +904,7 @@  static int tcp_packet(struct nf_conn *ct,
 		/* Invalid packet */
 		pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
 			 dir, get_conntrack_index(th), old_state);
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(ct_lock_addr(ct));
 		if (LOG_INVALID(net, IPPROTO_TCP))
 			nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
 				  "nf_ct_tcp: invalid state ");
@@ -940,7 +935,7 @@  static int tcp_packet(struct nf_conn *ct,
 
 	if (!tcp_in_window(ct, &ct->proto.tcp, dir, index,
 			   skb, dataoff, th, pf)) {
-		write_unlock_bh(&tcp_lock);
+		spin_unlock_bh(ct_lock_addr(ct));
 		return -NF_ACCEPT;
 	}
      in_window:
@@ -969,7 +964,7 @@  static int tcp_packet(struct nf_conn *ct,
 		timeout = nf_ct_tcp_timeout_unacknowledged;
 	else
 		timeout = tcp_timeouts[new_state];
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(ct_lock_addr(ct));
 
 	nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct);
 	if (new_state != old_state)
@@ -1022,6 +1017,7 @@  static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
 		pr_debug("nf_ct_tcp: invalid new deleting.\n");
 		return false;
 	}
+	spin_lock_init(ct_lock_addr(ct));
 
 	if (new_state == TCP_CONNTRACK_SYN_SENT) {
 		/* SYN packet */
@@ -1092,7 +1088,7 @@  static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 	struct nlattr *nest_parms;
 	struct nf_ct_tcp_flags tmp = {};
 
-	read_lock_bh(&tcp_lock);
+	spin_lock_bh(ct_lock_addr(ct));
 	nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED);
 	if (!nest_parms)
 		goto nla_put_failure;
@@ -1112,14 +1108,14 @@  static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 	tmp.flags = ct->proto.tcp.seen[1].flags;
 	NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY,
 		sizeof(struct nf_ct_tcp_flags), &tmp);
-	read_unlock_bh(&tcp_lock);
+	spin_unlock_bh(ct_lock_addr(ct));
 
 	nla_nest_end(skb, nest_parms);
 
 	return 0;
 
 nla_put_failure:
-	read_unlock_bh(&tcp_lock);
+	spin_unlock_bh(ct_lock_addr(ct));
 	return -1;
 }
 
@@ -1150,7 +1146,7 @@  static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
 	    nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX)
 		return -EINVAL;
 
-	write_lock_bh(&tcp_lock);
+	spin_lock_bh(ct_lock_addr(ct));
 	if (tb[CTA_PROTOINFO_TCP_STATE])
 		ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]);
 
@@ -1177,7 +1173,7 @@  static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
 		ct->proto.tcp.seen[1].td_scale =
 			nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]);
 	}
-	write_unlock_bh(&tcp_lock);
+	spin_unlock_bh(ct_lock_addr(ct));
 
 	return 0;
 }