diff mbox

[net-next-2.6] net: inet_add_protocol() can use cmpxchg()

Message ID 1283986576.2428.21.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 8, 2010, 10:56 p.m. UTC
Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
friends.

inet_protos[] & inet6_protos[] are moved to read_mostly section

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/protocol.c |   31 +++++--------------------------
 net/ipv6/protocol.c |   32 ++++----------------------------
 2 files changed, 9 insertions(+), 54 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jesse Gross Sept. 8, 2010, 11:49 p.m. UTC | #1
On Wed, Sep 8, 2010 at 3:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
> friends.
>
> inet_protos[] & inet6_protos[] are moved to read_mostly section
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

What's the benefit to this?  It's hard to imagine that add/deleting
protocols is highly contended.  On the other hand, a simple spinlock
is very easy to look at and verify correct, while this takes a little
more thought.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 9, 2010, 4:30 a.m. UTC | #2
From: Jesse Gross <jesse@nicira.com>
Date: Wed, 8 Sep 2010 16:49:27 -0700

> On Wed, Sep 8, 2010 at 3:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
>> friends.
>>
>> inet_protos[] & inet6_protos[] are moved to read_mostly section
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> What's the benefit to this?  It's hard to imagine that add/deleting
> protocols is highly contended.  On the other hand, a simple spinlock
> is very easy to look at and verify correct, while this takes a little
> more thought.

Smaller data section is the benefit, thus less cache pressure there.

It's not about contention at all.

We've already applied other similar changes from Eric doing exactly
this kind of transformation and it's not all that non-trivial to me,
frankly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 9, 2010, 4:31 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 09 Sep 2010 00:56:16 +0200

> Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
> friends.
> 
> inet_protos[] & inet6_protos[] are moved to read_mostly section
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 9, 2010, 5:23 a.m. UTC | #4
Le mercredi 08 septembre 2010 à 16:49 -0700, Jesse Gross a écrit :
> On Wed, Sep 8, 2010 at 3:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
> > friends.
> >
> > inet_protos[] & inet6_protos[] are moved to read_mostly section
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> What's the benefit to this?  It's hard to imagine that add/deleting
> protocols is highly contended.  On the other hand, a simple spinlock
> is very easy to look at and verify correct, while this takes a little
> more thought.

Good question.

Did you know that _bh() was not necessary for example at this point ?

Its a leftover of an RCU conversion, done many years ago. It _was_
working of course, but it really hurts my mind. We could use
stop_machine or other brute force thing too ;)


Before patch :

# size net/ipv4/protocol.o net/ipv6/protocol.o
   text	   data	    bss	    dec	    hex	filename
    340	      0	   2052	   2392	    958	net/ipv4/protocol.o
    342	      0	   2052	   2394	    95a	net/ipv6/protocol.o

After patch :

# size net/ipv4/protocol.o net/ipv6/protocol.o
   text	   data	    bss	    dec	    hex	filename
    220	   2048	      0	   2268	    8dc	net/ipv4/protocol.o
    222	   2048	      0	   2270	    8de	net/ipv6/protocol.o

45 lines removed, 240 bytes of text removed, I found this interesting.

Removing the ____cacheline_aligned_in_smp is also good, because
the .data alignement, is reduced and linker is not force to propagate
this alignement to upper .o files, with many holes.

If you take a look at most patches flying around, this one at least took
me to think a bit ;)

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 6:42 a.m. UTC | #5
On 2010-09-09 06:30, David Miller wrote:
> From: Jesse Gross <jesse@nicira.com>
> Date: Wed, 8 Sep 2010 16:49:27 -0700
> 
>> On Wed, Sep 8, 2010 at 3:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
>>> friends.
>>>
>>> inet_protos[] & inet6_protos[] are moved to read_mostly section
>>>
>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>
>> What's the benefit to this?  It's hard to imagine that add/deleting
>> protocols is highly contended.  On the other hand, a simple spinlock
>> is very easy to look at and verify correct, while this takes a little
>> more thought.
> 
> Smaller data section is the benefit, thus less cache pressure there.
> 
Of course, each case is different, but generally I understand Jesse's
concern. Smaller data section argument shouldn't be enough. There is
some reason we do it in C not asm, and there is some cost of
unreadable code too.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 9, 2010, 6:55 a.m. UTC | #6
Le jeudi 09 septembre 2010 à 06:42 +0000, Jarek Poplawski a écrit :
> Of course, each case is different, but generally I understand Jesse's
> concern. Smaller data section argument shouldn't be enough. There is
> some reason we do it in C not asm, and there is some cost of
> unreadable code too.

Sure, but is cmpxchg() really not readable for you ? ;)

Now it is available for all arches, you can bet it is going to be
adopted.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 7:09 a.m. UTC | #7
On Thu, Sep 09, 2010 at 08:55:45AM +0200, Eric Dumazet wrote:
> Le jeudi 09 septembre 2010 ?? 06:42 +0000, Jarek Poplawski a écrit :
> > Of course, each case is different, but generally I understand Jesse's
> > concern. Smaller data section argument shouldn't be enough. There is
> > some reason we do it in C not asm, and there is some cost of
> > unreadable code too.
> 
> Sure, but is cmpxchg() really not readable for you ? ;)

Just like for Jesse: "this takes a little more thought". ;-)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 7:24 a.m. UTC | #8
On Thu, Sep 09, 2010 at 07:09:26AM +0000, Jarek Poplawski wrote:
> On Thu, Sep 09, 2010 at 08:55:45AM +0200, Eric Dumazet wrote:
> > Le jeudi 09 septembre 2010 ?? 06:42 +0000, Jarek Poplawski a écrit :
> > > Of course, each case is different, but generally I understand Jesse's
> > > concern. Smaller data section argument shouldn't be enough. There is
> > > some reason we do it in C not asm, and there is some cost of
> > > unreadable code too.
> > 
> > Sure, but is cmpxchg() really not readable for you ? ;)
> 
> Just like for Jesse: "this takes a little more thought". ;-)

But more exactly, it's not about this one line, but the time you need
when reading this after some time to tell: yes, this place is properly
locked for sure.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 9:08 a.m. UTC | #9
On Thu, Sep 09, 2010 at 07:24:53AM +0000, Jarek Poplawski wrote:
> On Thu, Sep 09, 2010 at 07:09:26AM +0000, Jarek Poplawski wrote:
> > On Thu, Sep 09, 2010 at 08:55:45AM +0200, Eric Dumazet wrote:
> > > Le jeudi 09 septembre 2010 ?? 06:42 +0000, Jarek Poplawski a écrit :
> > > > Of course, each case is different, but generally I understand Jesse's
> > > > concern. Smaller data section argument shouldn't be enough. There is
> > > > some reason we do it in C not asm, and there is some cost of
> > > > unreadable code too.
> > > 
> > > Sure, but is cmpxchg() really not readable for you ? ;)
> > 
> > Just like for Jesse: "this takes a little more thought". ;-)
> 
> But more exactly, it's not about this one line, but the time you need
> when reading this after some time to tell: yes, this place is properly
> locked for sure.

And most exactly, without "after some time" I couldn't tell this
either. Is seems some barriers might have been changed, but I might
be wrong. Anyway, it would be nice to have some comment on that in
the changelog.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 9, 2010, 9:23 a.m. UTC | #10
Le jeudi 09 septembre 2010 à 09:08 +0000, Jarek Poplawski a écrit :
> xactly, without "after some time" I couldn't tell this
> either. Is seems some barriers might have been changed, but I might
> be wrong. Anyway, it would be nice to have some comment on that in
> the changelog.

Jarek, cmpxchg() is a very strong primitive, as documented
in Documentation/memory-barriers.txt :

<begin of quote>

Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
explicit lock operations, described later).  These include:

        xchg();
        cmpxchg();
...
These are used for such things as implementing LOCK-class and UNLOCK-class
operations and adjusting reference counters towards object destruction, and as
such the implicit memory barrier effects are necessary.

<end of quote>

Its not needed to document each cmpxchg() use, because of these
strong requirements.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 9:42 a.m. UTC | #11
On Thu, Sep 09, 2010 at 11:23:57AM +0200, Eric Dumazet wrote:
> Le jeudi 09 septembre 2010 ?? 09:08 +0000, Jarek Poplawski a écrit :
> > xactly, without "after some time" I couldn't tell this
> > either. Is seems some barriers might have been changed, but I might
> > be wrong. Anyway, it would be nice to have some comment on that in
> > the changelog.
> 
> Jarek, cmpxchg() is a very strong primitive, as documented
> in Documentation/memory-barriers.txt :
> 
> <begin of quote>
> 
> Any atomic operation that modifies some state in memory and returns information
> about the state (old or new) implies an SMP-conditional general memory barrier
> (smp_mb()) on each side of the actual operation (with the exception of
> explicit lock operations, described later).  These include:
> 
>         xchg();
>         cmpxchg();
> ...
> These are used for such things as implementing LOCK-class and UNLOCK-class
> operations and adjusting reference counters towards object destruction, and as
> such the implicit memory barrier effects are necessary.
> 
> <end of quote>
> 
> Its not needed to document each cmpxchg() use, because of these
> strong requirements.

OK! I should remember this until "after some time". ;-)

Thanks for the explanation,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 9:59 a.m. UTC | #12
On Thu, Sep 09, 2010 at 11:23:57AM +0200, Eric Dumazet wrote:
> Le jeudi 09 septembre 2010 ?? 09:08 +0000, Jarek Poplawski a écrit :
> > xactly, without "after some time" I couldn't tell this
> > either. Is seems some barriers might have been changed, but I might
> > be wrong. Anyway, it would be nice to have some comment on that in
> > the changelog.
> 
> Jarek, cmpxchg() is a very strong primitive, as documented
> in Documentation/memory-barriers.txt :
...
> Its not needed to document each cmpxchg() use, because of these
> strong requirements.
> 

Btw, I wonder if for readability and debuging it shouldn't be made
generic as rcu_cmpxchg_pointer() etc.?

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 9, 2010, 12:57 p.m. UTC | #13
Le jeudi 09 septembre 2010 à 09:59 +0000, Jarek Poplawski a écrit :

> Btw, I wonder if for readability and debuging it shouldn't be made
> generic as rcu_cmpxchg_pointer() etc.?

There are a lot of things to do in this area (with proper __rcu sparse
annotations, this can be not very readable ...)


/**
 * rcu_cmpxchg - atomic compare and exchange, SMP & rcu safe
 * @p: pointer to value
 * @old: old value
 * @new: new value
 *
 * Equivalent to :
 *    ATOMIC_BEGIN
 *     ret = *p;
 *     if (ret == old)
 *             rcu_assign_pointer(*p, new);
 *    ATOMIC_END
 *    return ret;
 *
 * cmpxpchg() contains full memory barriers, so can be used
 * in rcu write side without additional smp_wmb() barrier
 */
#define rcu_cmpxchg(p, old, new) cmpxchg(p, old, new)


const struct net_protocol __rcu *
        inet_protos[MAX_INET_PROTOS] __read_mostly;

...

return !rcu_cmpxchg(&inet_protos[hash], 
                    NULL,
                    (const struct net_protocol __force __rcu *)prot) ?
		0 : -1;

...


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Sept. 9, 2010, 3:38 p.m. UTC | #14
On Thu, Sep 09, 2010 at 02:57:38PM +0200, Eric Dumazet wrote:
> Le jeudi 09 septembre 2010 ?? 09:59 +0000, Jarek Poplawski a écrit :
> 
> > Btw, I wonder if for readability and debuging it shouldn't be made
> > generic as rcu_cmpxchg_pointer() etc.?
> 
> There are a lot of things to do in this area (with proper __rcu sparse
> annotations, this can be not very readable ...)

Well, this is more than I could dream ;-) The main point to me was
autodocumentation in use, similarly to rcu_assign_pointer(). But, of
course, with the description below it's much better.

Thanks,
Jarek P.

> 
> /**
>  * rcu_cmpxchg - atomic compare and exchange, SMP & rcu safe
>  * @p: pointer to value
>  * @old: old value
>  * @new: new value
>  *
>  * Equivalent to :
>  *    ATOMIC_BEGIN
>  *     ret = *p;
>  *     if (ret == old)
>  *             rcu_assign_pointer(*p, new);
>  *    ATOMIC_END
>  *    return ret;
>  *
>  * cmpxpchg() contains full memory barriers, so can be used
>  * in rcu write side without additional smp_wmb() barrier
>  */
> #define rcu_cmpxchg(p, old, new) cmpxchg(p, old, new)
> 
> 
> const struct net_protocol __rcu *
>         inet_protos[MAX_INET_PROTOS] __read_mostly;
> 
> ...
> 
> return !rcu_cmpxchg(&inet_protos[hash], 
>                     NULL,
>                     (const struct net_protocol __force __rcu *)prot) ?
> 		0 : -1;
> 
> ...
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index f2d2973..65699c2 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -28,8 +28,7 @@ 
 #include <linux/spinlock.h>
 #include <net/protocol.h>
 
-const struct net_protocol *inet_protos[MAX_INET_PROTOS] ____cacheline_aligned_in_smp;
-static DEFINE_SPINLOCK(inet_proto_lock);
+const struct net_protocol *inet_protos[MAX_INET_PROTOS] __read_mostly;
 
 /*
  *	Add a protocol handler to the hash tables
@@ -37,20 +36,9 @@  static DEFINE_SPINLOCK(inet_proto_lock);
 
 int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
 {
-	int hash, ret;
+	int hash = protocol & (MAX_INET_PROTOS - 1);
 
-	hash = protocol & (MAX_INET_PROTOS - 1);
-
-	spin_lock_bh(&inet_proto_lock);
-	if (inet_protos[hash]) {
-		ret = -1;
-	} else {
-		inet_protos[hash] = prot;
-		ret = 0;
-	}
-	spin_unlock_bh(&inet_proto_lock);
-
-	return ret;
+	return !cmpxchg(&inet_protos[hash], NULL, prot) ? 0 : -1;
 }
 EXPORT_SYMBOL(inet_add_protocol);
 
@@ -60,18 +48,9 @@  EXPORT_SYMBOL(inet_add_protocol);
 
 int inet_del_protocol(const struct net_protocol *prot, unsigned char protocol)
 {
-	int hash, ret;
-
-	hash = protocol & (MAX_INET_PROTOS - 1);
+	int ret, hash = protocol & (MAX_INET_PROTOS - 1);
 
-	spin_lock_bh(&inet_proto_lock);
-	if (inet_protos[hash] == prot) {
-		inet_protos[hash] = NULL;
-		ret = 0;
-	} else {
-		ret = -1;
-	}
-	spin_unlock_bh(&inet_proto_lock);
+	ret = (cmpxchg(&inet_protos[hash], prot, NULL) == prot) ? 0 : -1;
 
 	synchronize_net();
 
diff --git a/net/ipv6/protocol.c b/net/ipv6/protocol.c
index 1fa3468..9bb936a 100644
--- a/net/ipv6/protocol.c
+++ b/net/ipv6/protocol.c
@@ -25,28 +25,14 @@ 
 #include <linux/spinlock.h>
 #include <net/protocol.h>
 
-const struct inet6_protocol *inet6_protos[MAX_INET_PROTOS];
-static DEFINE_SPINLOCK(inet6_proto_lock);
-
+const struct inet6_protocol *inet6_protos[MAX_INET_PROTOS] __read_mostly;
 
 int inet6_add_protocol(const struct inet6_protocol *prot, unsigned char protocol)
 {
-	int ret, hash = protocol & (MAX_INET_PROTOS - 1);
-
-	spin_lock_bh(&inet6_proto_lock);
-
-	if (inet6_protos[hash]) {
-		ret = -1;
-	} else {
-		inet6_protos[hash] = prot;
-		ret = 0;
-	}
-
-	spin_unlock_bh(&inet6_proto_lock);
+	int hash = protocol & (MAX_INET_PROTOS - 1);
 
-	return ret;
+	return !cmpxchg(&inet6_protos[hash], NULL, prot) ? 0 : -1;
 }
-
 EXPORT_SYMBOL(inet6_add_protocol);
 
 /*
@@ -57,20 +43,10 @@  int inet6_del_protocol(const struct inet6_protocol *prot, unsigned char protocol
 {
 	int ret, hash = protocol & (MAX_INET_PROTOS - 1);
 
-	spin_lock_bh(&inet6_proto_lock);
-
-	if (inet6_protos[hash] != prot) {
-		ret = -1;
-	} else {
-		inet6_protos[hash] = NULL;
-		ret = 0;
-	}
-
-	spin_unlock_bh(&inet6_proto_lock);
+	ret = (cmpxchg(&inet6_protos[hash], prot, NULL) == prot) ? 0 : -1;
 
 	synchronize_net();
 
 	return ret;
 }
-
 EXPORT_SYMBOL(inet6_del_protocol);