
[RFC,3/5] powerpc: atomic: implement atomic{,64}_{add,sub}_return_* variants

Message ID 1440730099-29133-4-git-send-email-boqun.feng@gmail.com (mailing list archive)
State Superseded

Commit Message

Boqun Feng Aug. 28, 2015, 2:48 a.m. UTC
On powerpc, we don't need a general memory barrier to achieve acquire and
release semantics, so arch_atomic_op_{acquire,release} can be implemented
using "lwsync" and "isync".

For release semantics, we only need to ensure that all memory accesses
issued before the operation take effect before the -store- part of the
atomic, so "lwsync" is all we need. On platforms without "lwsync",
"sync" should be used instead. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is likewise all we need, for the same
reason. However, on platforms without "lwsync" we can use "isync" rather
than "sync" as the acquire barrier. So a new kind of barrier,
smp_acquire_barrier__after_atomic(), is introduced; it is barrier() on
UP, "lwsync" if available, and "isync" otherwise.

For fully ordered semantics, like the original ones, smp_lwsync() is
placed before the relaxed variant and smp_mb__after_atomic() after it.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 88 ++++++++++++++++++++++++++++-----------
 1 file changed, 64 insertions(+), 24 deletions(-)
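
[Editor's illustration, not part of the patch: with only the _relaxed
form implemented in assembly, the arch_atomic_op_{acquire,release,fence}
hooks added here compose the other variants roughly as in the sketch
below.  The sketch_* names are hypothetical and assume the kernel's
atomic_t plus the barriers introduced by this patch.

	static inline int sketch_atomic_add_return_acquire(int a, atomic_t *v)
	{
		int ret = atomic_add_return_relaxed(a, v);	/* bare ll/sc loop */

		smp_acquire_barrier__after_atomic();	/* lwsync, or isync without lwsync */
		return ret;
	}

	static inline int sketch_atomic_add_return_release(int a, atomic_t *v)
	{
		smp_lwsync();		/* order prior accesses before the store */
		return atomic_add_return_relaxed(a, v);
	}

	static inline int sketch_atomic_add_return(int a, atomic_t *v)	/* fully ordered */
	{
		int ret;

		smp_lwsync();			/* entry barrier */
		ret = atomic_add_return_relaxed(a, v);
		smp_mb__after_atomic();		/* exit barrier ("sync") */
		return ret;
	}
]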

Comments

Peter Zijlstra Aug. 28, 2015, 10:48 a.m. UTC | #1
On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> +/*
> + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> + * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
> + * on the platform without lwsync.
> + */
> +#ifdef CONFIG_SMP
> +#define smp_acquire_barrier__after_atomic() \
> +	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
> +#else
> +#define smp_acquire_barrier__after_atomic() barrier()
> +#endif
> +#define arch_atomic_op_acquire(op, args...)				\
> +({									\
> +	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
> +	smp_acquire_barrier__after_atomic();				\
> +	__ret;								\
> +})
> +
> +#define arch_atomic_op_release(op, args...)				\
> +({									\
> +	smp_lwsync();							\
> +	op##_relaxed(args);						\
> +})

Urgh, so this is RCpc. We were trying to get rid of that if possible.
Let's wait until that's settled before introducing more of it.

lkml.kernel.org/r/20150820155604.GB24100@arm.com
Boqun Feng Aug. 28, 2015, 12:06 p.m. UTC | #2
Hi Peter,

On Fri, Aug 28, 2015 at 12:48:54PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> > +/*
> > + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> > + * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
> > + * on the platform without lwsync.
> > + */
> > +#ifdef CONFIG_SMP
> > +#define smp_acquire_barrier__after_atomic() \
> > +	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
> > +#else
> > +#define smp_acquire_barrier__after_atomic() barrier()
> > +#endif
> > +#define arch_atomic_op_acquire(op, args...)				\
> > +({									\
> > +	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
> > +	smp_acquire_barrier__after_atomic();				\
> > +	__ret;								\
> > +})
> > +
> > +#define arch_atomic_op_release(op, args...)				\
> > +({									\
> > +	smp_lwsync();							\
> > +	op##_relaxed(args);						\
> > +})
> 
> Urgh, so this is RCpc. We were trying to get rid of that if possible.
> Lets wait until that's settled before introducing more of it.
> 
> lkml.kernel.org/r/20150820155604.GB24100@arm.com

OK, got it. Thanks.

So I'm not going to introduce these arch specific macros, I think what I
need to implement are just _relaxed variants and cmpxchg_acquire.

Regards,
Boqun
Boqun Feng Aug. 28, 2015, 2:16 p.m. UTC | #3
On Fri, Aug 28, 2015 at 08:06:14PM +0800, Boqun Feng wrote:
> Hi Peter,
> 
> On Fri, Aug 28, 2015 at 12:48:54PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> > > +/*
> > > + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> > > + * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
> > > + * on the platform without lwsync.
> > > + */
> > > +#ifdef CONFIG_SMP
> > > +#define smp_acquire_barrier__after_atomic() \
> > > +	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
> > > +#else
> > > +#define smp_acquire_barrier__after_atomic() barrier()
> > > +#endif
> > > +#define arch_atomic_op_acquire(op, args...)				\
> > > +({									\
> > > +	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
> > > +	smp_acquire_barrier__after_atomic();				\
> > > +	__ret;								\
> > > +})
> > > +
> > > +#define arch_atomic_op_release(op, args...)				\
> > > +({									\
> > > +	smp_lwsync();							\
> > > +	op##_relaxed(args);						\
> > > +})
> > 
> > Urgh, so this is RCpc. We were trying to get rid of that if possible.
> > Lets wait until that's settled before introducing more of it.
> > 
> > lkml.kernel.org/r/20150820155604.GB24100@arm.com
> 
> OK, get it. Thanks.
> 
> So I'm not going to introduce these arch specific macros, I think what I
> need to implement are just _relaxed variants and cmpxchg_acquire.

Ah.. just read through the thread you mentioned, I might misunderstand
you, probably because I didn't understand RCpc well..

You are saying that in a RELEASE we -might- switch from smp_lwsync() to
smp_mb() semantically, right? I guess this means we -might- switch from
RCpc to RCsc, right?

If so, I think I'd better wait until we have a conclusion for this.

Thank you for your comments!

Regards,
Boqun
Peter Zijlstra Aug. 28, 2015, 3:39 p.m. UTC | #4
On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
> On Fri, Aug 28, 2015 at 08:06:14PM +0800, Boqun Feng wrote:
> > Hi Peter,
> > 
> > On Fri, Aug 28, 2015 at 12:48:54PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> > > > +/*
> > > > + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> > > > + * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
> > > > + * on the platform without lwsync.
> > > > + */
> > > > +#ifdef CONFIG_SMP
> > > > +#define smp_acquire_barrier__after_atomic() \
> > > > +	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
> > > > +#else
> > > > +#define smp_acquire_barrier__after_atomic() barrier()
> > > > +#endif
> > > > +#define arch_atomic_op_acquire(op, args...)				\
> > > > +({									\
> > > > +	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
> > > > +	smp_acquire_barrier__after_atomic();				\
> > > > +	__ret;								\
> > > > +})
> > > > +
> > > > +#define arch_atomic_op_release(op, args...)				\
> > > > +({									\
> > > > +	smp_lwsync();							\
> > > > +	op##_relaxed(args);						\
> > > > +})
> > > 
> > > Urgh, so this is RCpc. We were trying to get rid of that if possible.
> > > Lets wait until that's settled before introducing more of it.
> > > 
> > > lkml.kernel.org/r/20150820155604.GB24100@arm.com
> > 
> > OK, get it. Thanks.
> > 
> > So I'm not going to introduce these arch specific macros, I think what I
> > need to implement are just _relaxed variants and cmpxchg_acquire.
> 
> Ah.. just read through the thread you mentioned, I might misunderstand
> you, probably because I didn't understand RCpc well..
> 
> You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> smp_mb() semantically, right? I guess this means we -might- switch from
> RCpc to RCsc, right?
> 
> If so, I think I'd better to wait until we have a conclusion for this.

Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
not.

Currently PowerPC is the only arch that (can, and) does RCpc and gives a
weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
to see the stores of the CPU which did the RELEASE in order.

As it stands, RCU is the only _known_ codebase where this matters, but
we did in fact write code for a fair number of years 'assuming' RELEASE
+ ACQUIRE was a full barrier, so who knows what else is out there.


RCsc - release consistency sequential consistency
RCpc - release consistency processor consistency

https://en.wikipedia.org/wiki/Processor_consistency (where they have
s/sequential/causal/)
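
[Editor's sketch of the distinction, with hypothetical variables.  The
practically interesting case is an unlock followed by a lock of a
different lock:

	CPU0
	WRITE_ONCE(a, 1);
	spin_unlock(&r);		/* RELEASE */
	spin_lock(&s);			/* ACQUIRE, different lock */
	r1 = READ_ONCE(b);

Under RCsc the RELEASE+ACQUIRE pair acts as a full barrier, so CPU0's
store to a and load from b cannot appear reordered to other CPUs.
Under RCpc (PowerPC, where the release side is lwsync) that guarantee
is absent, which is why RCU adds smp_mb__after_unlock_lock() after the
lock where it needs full ordering; Paul's litmus test later in the
thread makes the weaker outcome concrete.]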
Boqun Feng Aug. 28, 2015, 4:59 p.m. UTC | #5
On Fri, Aug 28, 2015 at 05:39:21PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
<snip>
> > 
> > Ah.. just read through the thread you mentioned, I might misunderstand
> > you, probably because I didn't understand RCpc well..
> > 
> > You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> > smp_mb() semantically, right? I guess this means we -might- switch from
> > RCpc to RCsc, right?
> > 
> > If so, I think I'd better to wait until we have a conclusion for this.
> 
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.
> 
> Currently PowerPC is the only arch that (can, and) does RCpc and gives a
> weaker RELEASE + ACQUIRE. Only the CPU who did the ACQUIRE is guaranteed
> to see the stores of the CPU which did the RELEASE in order.
> 
> As it stands, RCU is the only _known_ codebase where this matters, but
> we did in fact write code for a fair number of years 'assuming' RELEASE
> + ACQUIRE was a full barrier, so who knows what else is out there.
> 
> 
> RCsc - release consistency sequential consistency
> RCpc - release consistency processor consistency
> 
> https://en.wikipedia.org/wiki/Processor_consistency (where they have
> s/sequential/causal/)

Thank you for your detailed explanation! Much clear now ;-)

Regards,
Boqun
Will Deacon Sept. 1, 2015, 7 p.m. UTC | #6
On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
> > Ah.. just read through the thread you mentioned, I might misunderstand
> > you, probably because I didn't understand RCpc well..
> > 
> > You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> > smp_mb() semantically, right? I guess this means we -might- switch from
> > RCpc to RCsc, right?
> > 
> > If so, I think I'd better to wait until we have a conclusion for this.
> 
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.

We've discussed this before, but for the sake of completeness, I don't
think we're fully RCsc either because we don't order the actual RELEASE
operation against a subsequent ACQUIRE operation:


P0
smp_store_release(&x, 1);
foo = smp_load_acquire(&y);

P1
smp_store_release(&y, 1);
bar = smp_load_acquire(&x);

We allow foo == bar == 0, which is prohibited by SC.


However, we *do* enforce ordering on any prior or subsequent accesses
for the code snippet above (the release and acquire combine to give a
full barrier), which makes these primitives well suited to things like
message passing.

Will
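
[Editor's sketch of the message-passing case Will mentions, with
hypothetical variables, both initially zero:

	P0 (producer)
	WRITE_ONCE(data, 42);
	smp_store_release(&flag, 1);	/* publish */

	P1 (consumer)
	if (smp_load_acquire(&flag))	/* saw the publish... */
		r = READ_ONCE(data);	/* ...then r is guaranteed to be 42 */

The acquire pairs with the release, so a consumer that observes
flag == 1 is also guaranteed to observe the data written before the
release.]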
Paul E. McKenney Sept. 1, 2015, 9:45 p.m. UTC | #7
On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
> > > Ah.. just read through the thread you mentioned, I might misunderstand
> > > you, probably because I didn't understand RCpc well..
> > > 
> > > You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> > > smp_mb() semantically, right? I guess this means we -might- switch from
> > > RCpc to RCsc, right?
> > > 
> > > If so, I think I'd better to wait until we have a conclusion for this.
> > 
> > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > not.
> 
> We've discussed this before, but for the sake of completeness, I don't
> think we're fully RCsc either because we don't order the actual RELEASE
> operation again a subsequent ACQUIRE operation:
> 
> P0
> smp_store_release(&x, 1);
> foo = smp_load_acquire(&y);
> 
> P1
> smp_store_release(&y, 1);
> bar = smp_load_acquire(&x);
> 
> We allow foo == bar == 0, which is prohibited by SC.

I certainly hope that no one expects foo == bar == 0 to be prohibited!!!

On the other hand, in this case, foo == bar == 1 will be prohibited:

P0
foo = smp_load_acquire(&y);
smp_store_release(&x, 1);

P1
bar = smp_load_acquire(&x);
smp_store_release(&y, 1);

> However, we *do* enforce ordering on any prior or subsequent accesses
> for the code snippet above (the release and acquire combine to give a
> full barrier), which makes these primitives well suited to things like
> message passing.

If I understand your example correctly, neither x86 nor Power implement
a full barrier in this case.  For example:

	P0
	WRITE_ONCE(a, 1);
	smp_store_release(b, 1);
	r1 = smp_load_acquire(c);
	r2 = READ_ONCE(d);

	P1
	WRITE_ONCE(d, 1);
	smp_mb();
	r3 = READ_ONCE(a);

Both x86 and Power can reorder P0 as follows:

	P0
	r1 = smp_load_acquire(c);
	r2 = READ_ONCE(d);
	WRITE_ONCE(a, 1);
	smp_store_release(b, 1);

Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is allowed.

Or am I missing your point here?

							Thanx, Paul
Will Deacon Sept. 2, 2015, 9:59 a.m. UTC | #8
Hi Paul,

On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > > not.
> > 
> > We've discussed this before, but for the sake of completeness, I don't
> > think we're fully RCsc either because we don't order the actual RELEASE
> > operation again a subsequent ACQUIRE operation:
> > 
> > P0
> > smp_store_release(&x, 1);
> > foo = smp_load_acquire(&y);
> > 
> > P1
> > smp_store_release(&y, 1);
> > bar = smp_load_acquire(&x);
> > 
> > We allow foo == bar == 0, which is prohibited by SC.
> 
> I certainly hope that no one expects foo == bar == 0 to be prohibited!!!

I just thought it was worth making this point, because it is prohibited
in SC and I don't want people to think that our RELEASE/ACQUIRE operations
are SC (even though they happen to be on arm64).

> On the other hand, in this case, foo == bar == 1 will be prohibited:
> 
> P0
> foo = smp_load_acquire(&y);
> smp_store_release(&x, 1);
> 
> P1
> bar = smp_load_acquire(&x);
> smp_store_release(&y, 1);

Agreed.

> > However, we *do* enforce ordering on any prior or subsequent accesses
> > for the code snippet above (the release and acquire combine to give a
> > full barrier), which makes these primitives well suited to things like
> > message passing.
> 
> If I understand your example correctly, neither x86 nor Power implement
> a full barrier in this case.  For example:
> 
> 	P0
> 	WRITE_ONCE(a, 1);
> 	smp_store_release(b, 1);
> 	r1 = smp_load_acquire(c);
> 	r2 = READ_ONCE(d);
> 
> 	P1
> 	WRITE_ONCE(d, 1);
> 	smp_mb();
> 	r3 = READ_ONCE(a);
> 
> Both x86 and Power can reorder P0 as follows:
> 
> 	P0
> 	r1 = smp_load_acquire(c);
> 	r2 = READ_ONCE(d);
> 	WRITE_ONCE(a, 1);
> 	smp_store_release(b, 1);
> 
> Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is allowed.
> 
> Or am I missing your point here?

I think this example is slightly different. Having the RELEASE/ACQUIRE
operations being reordered with respect to each other is one thing, but
I thought we were heading in a direction where they combined to give a
full barrier with respect to other accesses. In that case, the reordering
above would be forbidden.

Peter -- if the above reordering can happen on x86, then moving away
from RCpc is going to be less popular than I hoped...

Will
Paul E. McKenney Sept. 2, 2015, 10:49 a.m. UTC | #9
On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> > On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > > On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > > > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > > > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > > > not.
> > > 
> > > We've discussed this before, but for the sake of completeness, I don't
> > > think we're fully RCsc either because we don't order the actual RELEASE
> > > operation again a subsequent ACQUIRE operation:
> > > 
> > > P0
> > > smp_store_release(&x, 1);
> > > foo = smp_load_acquire(&y);
> > > 
> > > P1
> > > smp_store_release(&y, 1);
> > > bar = smp_load_acquire(&x);
> > > 
> > > We allow foo == bar == 0, which is prohibited by SC.
> > 
> > I certainly hope that no one expects foo == bar == 0 to be prohibited!!!
> 
> I just thought it was worth making this point, because it is prohibited
> in SC and I don't want people to think that our RELEASE/ACQUIRE operations
> are SC (even though they happen to be on arm64).

OK, good.

> > On the other hand, in this case, foo == bar == 1 will be prohibited:
> > 
> > P0
> > foo = smp_load_acquire(&y);
> > smp_store_release(&x, 1);
> > 
> > P1
> > bar = smp_load_acquire(&x);
> > smp_store_release(&y, 1);
> 
> Agreed.

Good as well.

> > > However, we *do* enforce ordering on any prior or subsequent accesses
> > > for the code snippet above (the release and acquire combine to give a
> > > full barrier), which makes these primitives well suited to things like
> > > message passing.
> > 
> > If I understand your example correctly, neither x86 nor Power implement
> > a full barrier in this case.  For example:
> > 
> > 	P0
> > 	WRITE_ONCE(a, 1);
> > 	smp_store_release(b, 1);
> > 	r1 = smp_load_acquire(c);
> > 	r2 = READ_ONCE(d);
> > 
> > 	P1
> > 	WRITE_ONCE(d, 1);
> > 	smp_mb();
> > 	r3 = READ_ONCE(a);
> > 
> > Both x86 and Power can reorder P0 as follows:
> > 
> > 	P0
> > 	r1 = smp_load_acquire(c);
> > 	r2 = READ_ONCE(d);
> > 	WRITE_ONCE(a, 1);
> > 	smp_store_release(b, 1);
> > 
> > Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is allowed.
> > 
> > Or am I missing your point here?
> 
> I think this example is slightly different. Having the RELEASE/ACQUIRE
> operations being reordered with respect to each other is one thing, but
> I thought we were heading in a direction where they combined to give a
> full barrier with respect to other accesses. In that case, the reordering
> above would be forbidden.

It is certainly less added overhead to make unlock-lock a full barrier
than it is to make smp_store_release()-smp_load_acquire() a full barrier.
I am not fully convinced on either, aside from needing some way to make
unlock-lock a full barrier within the RCU implementation, for which the
now-privatized smp_mb__after_unlock_lock() suffices.

> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

;-)

							Thanx, Paul
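
[Editor's sketch of the (now RCU-private) idiom Paul refers to; the
lock name is hypothetical.  On powerpc smp_mb__after_unlock_lock() is
smp_mb(), elsewhere it is a no-op:

	spin_lock(&node->lock);
	smp_mb__after_unlock_lock();

Together with a prior unlock (by this CPU, or of the same lock by
another CPU), the lock acquisition above then provides full,
smp_mb()-like ordering rather than only acquire ordering.]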
Will Deacon Sept. 11, 2015, 12:45 p.m. UTC | #10
[left the context in the hope that we can make some progress]

On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> > On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > > On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > > > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > > > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > > > not.
> > > 
> > > We've discussed this before, but for the sake of completeness, I don't
> > > think we're fully RCsc either because we don't order the actual RELEASE
> > > operation again a subsequent ACQUIRE operation:
> > > 
> > > P0
> > > smp_store_release(&x, 1);
> > > foo = smp_load_acquire(&y);
> > > 
> > > P1
> > > smp_store_release(&y, 1);
> > > bar = smp_load_acquire(&x);
> > > 
> > > We allow foo == bar == 0, which is prohibited by SC.
> > 
> > I certainly hope that no one expects foo == bar == 0 to be prohibited!!!
> 
> I just thought it was worth making this point, because it is prohibited
> in SC and I don't want people to think that our RELEASE/ACQUIRE operations
> are SC (even though they happen to be on arm64).
> 
> > On the other hand, in this case, foo == bar == 1 will be prohibited:
> > 
> > P0
> > foo = smp_load_acquire(&y);
> > smp_store_release(&x, 1);
> > 
> > P1
> > bar = smp_load_acquire(&x);
> > smp_store_release(&y, 1);
> 
> Agreed.
> 
> > > However, we *do* enforce ordering on any prior or subsequent accesses
> > > for the code snippet above (the release and acquire combine to give a
> > > full barrier), which makes these primitives well suited to things like
> > > message passing.
> > 
> > If I understand your example correctly, neither x86 nor Power implement
> > a full barrier in this case.  For example:
> > 
> > 	P0
> > 	WRITE_ONCE(a, 1);
> > 	smp_store_release(b, 1);
> > 	r1 = smp_load_acquire(c);
> > 	r2 = READ_ONCE(d);
> > 
> > 	P1
> > 	WRITE_ONCE(d, 1);
> > 	smp_mb();
> > 	r3 = READ_ONCE(a);
> > 
> > Both x86 and Power can reorder P0 as follows:
> > 
> > 	P0
> > 	r1 = smp_load_acquire(c);
> > 	r2 = READ_ONCE(d);
> > 	WRITE_ONCE(a, 1);
> > 	smp_store_release(b, 1);
> > 
> > Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is allowed.
> > 
> > Or am I missing your point here?
> 
> I think this example is slightly different. Having the RELEASE/ACQUIRE
> operations being reordered with respect to each other is one thing, but
> I thought we were heading in a direction where they combined to give a
> full barrier with respect to other accesses. In that case, the reordering
> above would be forbidden.
> 
> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

Peter, any thoughts? I'm not au fait with the x86 memory model, but what
Paul's saying is worrying.

Will
Paul E. McKenney Sept. 11, 2015, 5:09 p.m. UTC | #11
On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:
> [left the context in the hope that we can make some progress]
> 
> On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> > On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> > > On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > > > On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > > > > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > > > > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > > > > not.
> > > > 
> > > > We've discussed this before, but for the sake of completeness, I don't
> > > > think we're fully RCsc either because we don't order the actual RELEASE
> > > > operation again a subsequent ACQUIRE operation:
> > > > 
> > > > P0
> > > > smp_store_release(&x, 1);
> > > > foo = smp_load_acquire(&y);
> > > > 
> > > > P1
> > > > smp_store_release(&y, 1);
> > > > bar = smp_load_acquire(&x);
> > > > 
> > > > We allow foo == bar == 0, which is prohibited by SC.
> > > 
> > > I certainly hope that no one expects foo == bar == 0 to be prohibited!!!
> > 
> > I just thought it was worth making this point, because it is prohibited
> > in SC and I don't want people to think that our RELEASE/ACQUIRE operations
> > are SC (even though they happen to be on arm64).
> > 
> > > On the other hand, in this case, foo == bar == 1 will be prohibited:
> > > 
> > > P0
> > > foo = smp_load_acquire(&y);
> > > smp_store_release(&x, 1);
> > > 
> > > P1
> > > bar = smp_load_acquire(&x);
> > > smp_store_release(&y, 1);
> > 
> > Agreed.
> > 
> > > > However, we *do* enforce ordering on any prior or subsequent accesses
> > > > for the code snippet above (the release and acquire combine to give a
> > > > full barrier), which makes these primitives well suited to things like
> > > > message passing.
> > > 
> > > If I understand your example correctly, neither x86 nor Power implement
> > > a full barrier in this case.  For example:
> > > 
> > > 	P0
> > > 	WRITE_ONCE(a, 1);
> > > 	smp_store_release(b, 1);
> > > 	r1 = smp_load_acquire(c);
> > > 	r2 = READ_ONCE(d);
> > > 
> > > 	P1
> > > 	WRITE_ONCE(d, 1);
> > > 	smp_mb();
> > > 	r3 = READ_ONCE(a);
> > > 
> > > Both x86 and Power can reorder P0 as follows:
> > > 
> > > 	P0
> > > 	r1 = smp_load_acquire(c);
> > > 	r2 = READ_ONCE(d);
> > > 	WRITE_ONCE(a, 1);
> > > 	smp_store_release(b, 1);
> > > 
> > > Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is allowed.
> > > 
> > > Or am I missing your point here?
> > 
> > I think this example is slightly different. Having the RELEASE/ACQUIRE
> > operations being reordered with respect to each other is one thing, but
> > I thought we were heading in a direction where they combined to give a
> > full barrier with respect to other accesses. In that case, the reordering
> > above would be forbidden.
> > 
> > Peter -- if the above reordering can happen on x86, then moving away
> > from RCpc is going to be less popular than I hoped...
> 
> Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> Paul's saying is worrying.

The herd tool has an x86 mode, which will allow you to double-check
my scenario.  This tool is described in "Herding Cats: Modelling,
Simulation, Testing, and Data-mining for Weak Memory" by Alglave,
Marenget, and Tautschnig.  The herd tool is available at this git
repository: https://github.com/herd/herdtools.

							Thanx, Paul
Peter Zijlstra Sept. 14, 2015, 11:35 a.m. UTC | #12
Sorry for being tardy, I had a wee spell of feeling horrible and then I
procrastinated longer than I should have.

On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:

> Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> Paul's saying is worrying.

Right, so Paul is right -- and I completely forgot (I used to know about
that).

So all the TSO archs (SPARC-TSO, x86 (!OOSTORE) and s390) can do
smp_load_acquire()/smp_store_release() with just barrier(), and while:

	smp_store_release(&x);
	smp_load_acquire(&x);

will provide full order by means of the address dependency,

	smp_store_release(&x);
	smp_load_acquire(&y);

will not. Because the one reorder TSO allows is exactly that one.

> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

Sadly yes.. We could of course try and split LOCK from ACQUIRE again,
but I'm not sure that's going to help anything except confusion.
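
[Editor's sketch of what "with just barrier()" means in practice; this
is roughly the shape of a TSO implementation, not the exact kernel
macros:

	#define sketch_smp_store_release(p, v)				\
	do {								\
		barrier();	/* compiler-only; TSO keeps stores in order */ \
		WRITE_ONCE(*(p), (v));					\
	} while (0)

	#define sketch_smp_load_acquire(p)				\
	({								\
		typeof(*(p)) ___val = READ_ONCE(*(p));			\
		barrier();	/* compiler-only; TSO orders load->{load,store} */ \
		___val;							\
	})

The only hardware reordering TSO permits is an earlier store passing a
later load to a different address, which is exactly why the
store_release(&x); load_acquire(&y); sequence above is not a full
barrier on these architectures.]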
Peter Zijlstra Sept. 14, 2015, 12:01 p.m. UTC | #13
On Mon, Sep 14, 2015 at 01:35:20PM +0200, Peter Zijlstra wrote:
> 
> Sorry for being tardy, I had a wee spell of feeling horrible and then I
> procrastinated longer than I should have.
> 
> On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:
> 
> > Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> > Paul's saying is worrying.
> 
> Right, so Paul is right -- and I completely forgot (I used to know about
> that).
> 
> So all the TSO archs (SPARC-TSO, x86 (!OOSTORE) and s390) can do
> smp_load_acquire()/smp_store_release() with just barrier(), and while:
> 
> 	smp_store_release(&x);
> 	smp_load_acquire(&x);
> 
> will provide full order by means of the address dependency,
> 
> 	smp_store_release(&x);
> 	smp_load_acquire(&y);
> 
> will not. Because the one reorder TSO allows is exactly that one.
> 
> > Peter -- if the above reordering can happen on x86, then moving away
> > from RCpc is going to be less popular than I hoped...
> 
> Sadly yes.. We could of course try and split LOCK from ACQUIRE again,
> but I'm not sure that's going to help anything except confusion.

This of course also means we need something like:

	smp_mb__release_acquire()

which cannot be a no-op for TSO archs. And it might even mean it needs
to be the same as smp_mb__unlock_lock(), but I need to think more on
this.

The scenario is:

	CPU0			CPU1

				unlock(x)
				  smp_store_release(&x->lock, 0);

	unlock(y)
	  smp_store_release(&next->lock, 1); /* next == &y */

				lock(y)
				  while (!(smp_load_acquire(&y->lock))
					cpu_relax();


Where the lock does _NOT_ issue a store to acquire the lock at all. Now
I don't think any of our current primitives manage this, so we should be
good, but it might just be possible.


And at the same time; having both:

	smp_mb__release_acquire()
	smp_mb__unlock_lock()

is quite horrible, for it clearly shows a LOCK isn't quite the same as
ACQUIRE :/
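
[Editor's sketch only; the macro below was proposed in this thread and
never merged, and its exact placement is hypothetical.  Dropped into
Paul's earlier example it would read:

	WRITE_ONCE(a, 1);
	smp_store_release(b, 1);
	smp_mb__release_acquire();	/* upgrade the pair to full ordering */
	r1 = smp_load_acquire(c);
	r2 = READ_ONCE(d);

On a TSO architecture it could not be a no-op; it would have to emit a
real fence, e.g. smp_mb().]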
Peter Zijlstra Sept. 14, 2015, 12:11 p.m. UTC | #14
On Mon, Sep 14, 2015 at 02:01:53PM +0200, Peter Zijlstra wrote:
> The scenario is:
> 
> 	CPU0			CPU1
> 
> 				unlock(x)
> 				  smp_store_release(&x->lock, 0);
> 
> 	unlock(y)
> 	  smp_store_release(&next->lock, 1); /* next == &y */
> 
> 				lock(y)
> 				  while (!(smp_load_acquire(&y->lock))
> 					cpu_relax();
> 
> 
> Where the lock does _NOT_ issue a store to acquire the lock at all. Now
> I don't think any of our current primitives manage this, so we should be
> good, but it might just be possible.

So with a bit more thought, this seems fundamentally impossible; you
always need some stores in a lock() implementation. The above, for
instance, needs to queue itself, otherwise CPU0 will not be able to find
it, etc.
Will Deacon Sept. 14, 2015, 3:38 p.m. UTC | #15
On Mon, Sep 14, 2015 at 01:11:56PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 14, 2015 at 02:01:53PM +0200, Peter Zijlstra wrote:
> > The scenario is:
> > 
> > 	CPU0			CPU1
> > 
> > 				unlock(x)
> > 				  smp_store_release(&x->lock, 0);
> > 
> > 	unlock(y)
> > 	  smp_store_release(&next->lock, 1); /* next == &y */
> > 
> > 				lock(y)
> > 				  while (!(smp_load_acquire(&y->lock))
> > 					cpu_relax();
> > 
> > 
> > Where the lock does _NOT_ issue a store to acquire the lock at all. Now
> > I don't think any of our current primitives manage this, so we should be
> > good, but it might just be possible.
> 
> So with a bit more through this seems fundamentally impossible, you
> always needs some stores in a lock() implementation, the above for
> instance needs to queue itself, otherwise CPU0 will not be able to find
> it etc..

Which brings us back round to separating LOCK/UNLOCK from ACQUIRE/RELEASE.

If we say that UNLOCK(foo) -> LOCK(bar) is ordered but RELEASE(baz) ->
ACQUIRE(boz) is only ordered by smp_mb__release_acquire(), then I think
we're in a position where we can at least build arbitrary locks portably
out of ACQUIRE/RELEASE operations, even though I don't see any users of
that macro in the imminent future.

I'll have a crack at some documentation.

Will
Paul E. McKenney Sept. 14, 2015, 4:26 p.m. UTC | #16
On Mon, Sep 14, 2015 at 04:38:48PM +0100, Will Deacon wrote:
> On Mon, Sep 14, 2015 at 01:11:56PM +0100, Peter Zijlstra wrote:
> > On Mon, Sep 14, 2015 at 02:01:53PM +0200, Peter Zijlstra wrote:
> > > The scenario is:
> > > 
> > > 	CPU0			CPU1
> > > 
> > > 				unlock(x)
> > > 				  smp_store_release(&x->lock, 0);
> > > 
> > > 	unlock(y)
> > > 	  smp_store_release(&next->lock, 1); /* next == &y */
> > > 
> > > 				lock(y)
> > > 				  while (!(smp_load_acquire(&y->lock))
> > > 					cpu_relax();
> > > 
> > > 
> > > Where the lock does _NOT_ issue a store to acquire the lock at all. Now
> > > I don't think any of our current primitives manage this, so we should be
> > > good, but it might just be possible.
> > 
> > So with a bit more through this seems fundamentally impossible, you
> > always needs some stores in a lock() implementation, the above for
> > instance needs to queue itself, otherwise CPU0 will not be able to find
> > it etc..
> 
> Which brings us back round to separating LOCK/UNLOCK from ACQUIRE/RELEASE.

I believe that we do need to do this, unless we decide to have unlock-lock
continue to imply only acquire and release, rather than full ordering.
I believe that Mike Ellerman is working up additional benchmarking
on this.

							Thanx, Paul

> If we say that UNLOCK(foo) -> LOCK(bar) is ordered but RELEASE(baz) ->
> ACQUIRE(boz) is only ordered by smp_mb__release_acquire(), then I think
> we're in a position where we can at least build arbitrary locks portably
> out of ACQUIRE/RELEASE operations, even though I don't see any users of
> that macro in the imminent future.
> 
> I'll have a crack at some documentation.
> 
> Will
>

Patch

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 55f106e..806ce50 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,39 @@ 
 
 #define ATOMIC_INIT(i)		{ (i) }
 
+/*
+ * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
+ * on the platform without lwsync.
+ */
+#ifdef CONFIG_SMP
+#define smp_acquire_barrier__after_atomic() \
+	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
+#else
+#define smp_acquire_barrier__after_atomic() barrier()
+#endif
+#define arch_atomic_op_acquire(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
+	smp_acquire_barrier__after_atomic();				\
+	__ret;								\
+})
+
+#define arch_atomic_op_release(op, args...)				\
+({									\
+	smp_lwsync();							\
+	op##_relaxed(args);						\
+})
+
+#define arch_atomic_op_fence(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret;				\
+	smp_lwsync();							\
+	__ret = op##_relaxed(args);					\
+	smp_mb__after_atomic();						\
+	__ret;								\
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
 	int t;
@@ -42,27 +75,27 @@  static __inline__ void atomic_##op(int a, atomic_t *v)			\
 	: "cc");							\
 }									\
 
-#define ATOMIC_OP_RETURN(op, asm_op)					\
-static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)				\
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v)	\
 {									\
 	int t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-	PPC405_ERR77(0,%2)						\
-"	stwcx.	%0,0,%2 \n"						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+	PPC405_ERR77(0, %3)						\
+"	stwcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, asm_op)
+#define ATOMIC_OPS(op, asm_op)						\
+	ATOMIC_OP(op, asm_op)						\
+	ATOMIC_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC_OPS(add, add)
 ATOMIC_OPS(sub, subf)
@@ -71,8 +104,11 @@  ATOMIC_OP(and, and)
 ATOMIC_OP(or, or)
 ATOMIC_OP(xor, xor)
 
+#define atomic_add_return_relaxed atomic_add_return_relaxed
+#define atomic_sub_return_relaxed atomic_sub_return_relaxed
+
 #undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
@@ -285,26 +321,27 @@  static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
 	: "cc");							\
 }
 
-#define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+#define ATOMIC64_OP_RETURN_RELAXED(op, asm_op)				\
+static inline long							\
+atomic64_##op##_return_relaxed(long a, atomic64_t *v)			\
 {									\
 	long t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-"	stdcx.	%0,0,%2 \n"						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC64_OPS(op, asm_op) ATOMIC64_OP(op, asm_op) ATOMIC64_OP_RETURN(op, asm_op)
+#define ATOMIC64_OPS(op, asm_op)					\
+	ATOMIC64_OP(op, asm_op)						\
+	ATOMIC64_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC64_OPS(add, add)
 ATOMIC64_OPS(sub, subf)
@@ -312,8 +349,11 @@  ATOMIC64_OP(and, and)
 ATOMIC64_OP(or, or)
 ATOMIC64_OP(xor, xor)
 
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
+#define atomic64_add_return_relaxed atomic64_add_return_relaxed
+#define atomic64_sub_return_relaxed atomic64_sub_return_relaxed
+
+#undef ATOPIC64_OPS
+#undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)