diff mbox series

[RFC,cmpxchg,3/8] ARC: Emulate one-byte and two-byte cmpxchg

Message ID 20240401213950.3910531-3-paulmck@kernel.org
State New
Headers show
Series None | expand

Commit Message

Paul E. McKenney April 1, 2024, 9:39 p.m. UTC
Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
and two-byte cmpxchg() on arc.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: <linux-snps-arc@lists.infradead.org>
---
 arch/arc/Kconfig               |  1 +
 arch/arc/include/asm/cmpxchg.h | 38 ++++++++++++++++++++++++++--------
 2 files changed, 30 insertions(+), 9 deletions(-)

Comments

Arnd Bergmann April 2, 2024, 8:14 a.m. UTC | #1
On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote:
> Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
> and two-byte cmpxchg() on arc.
>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

I'm missing the context here, is it now mandatory to have 16-bit
cmpxchg() everywhere? I think we've historically tried hard to
keep this out of common code since it's expensive on architectures
that don't have native 16-bit load/store instructions (alpha, armv3)
and or sub-word atomics (armv5, riscv, mips).

Does the code that uses this rely on working concurrently with
non-atomic stores to part of the 32-bit word? If we want to
allow that, we need to merge my alpha ev4/45/5 removal series
first.

For the cmpxchg() interface, I would prefer to handle the
8-bit and 16-bit versions the same way as cmpxchg64() and
provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions
by architectures that operate on fixed-size integer values
but not compounds or pointers, and a generic cmpxchg() wrapper
in common code that can handle the abtraction for pointers,
long and (if absolutely necessary) compounds by multiplexing
between cmpxchg32() and cmpxchg64() where needed.

I did a prototype a few years ago and found that there is
probably under a dozen users of the sub-word atomics in
the tree, so this mostly requires changes to architecture
code and less to drivers and core code.

      Arnd
Paul E. McKenney April 2, 2024, 5:06 p.m. UTC | #2
On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote:
> On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote:
> > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
> > and two-byte cmpxchg() on arc.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> I'm missing the context here, is it now mandatory to have 16-bit
> cmpxchg() everywhere? I think we've historically tried hard to
> keep this out of common code since it's expensive on architectures
> that don't have native 16-bit load/store instructions (alpha, armv3)
> and or sub-word atomics (armv5, riscv, mips).

I need 8-bit, and just added 16-bit because it was easy to do so.
I would be OK dropping the 16-bit portions of this series, assuming
that no-one needs it.  And assuming that it is easier to drop it than
to explain why it is not available.  ;-)

> Does the code that uses this rely on working concurrently with
> non-atomic stores to part of the 32-bit word? If we want to
> allow that, we need to merge my alpha ev4/45/5 removal series
> first.

For 8-but cmpxchg(), yes.  There are potentially concurrent
smp_load_acquire() and smp_store_release() operations to this same byte.

Or is your question specific to the 16-bit primitives?  (Full disclosure:
I have no objection to removing Alpha ev4/45/5, having several times
suggested removing Alpha entirely.  And having the scars to prove it.)

> For the cmpxchg() interface, I would prefer to handle the
> 8-bit and 16-bit versions the same way as cmpxchg64() and
> provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions
> by architectures that operate on fixed-size integer values
> but not compounds or pointers, and a generic cmpxchg() wrapper
> in common code that can handle the abtraction for pointers,
> long and (if absolutely necessary) compounds by multiplexing
> between cmpxchg32() and cmpxchg64() where needed.

So as to support _acquire(), _relaxed(), and _release()?

If so, I don't have any use cases for other than full ordering.

> I did a prototype a few years ago and found that there is
> probably under a dozen users of the sub-word atomics in
> the tree, so this mostly requires changes to architecture
> code and less to drivers and core code.

Given this approach, the predominance of changes to architecture code
seems quite likely to me.

But do we really wish to invest that much work into architectures that
might not be all that long for the world?  (Quickly donning my old
asbestos suit, the one with the tungsten pinstripes...)

							Thanx, Paul
Paul E. McKenney April 2, 2024, 8:52 p.m. UTC | #3
On Tue, Apr 02, 2024 at 10:06:14AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote:
> > On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote:
> > > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
> > > and two-byte cmpxchg() on arc.
> > >
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > I'm missing the context here, is it now mandatory to have 16-bit
> > cmpxchg() everywhere? I think we've historically tried hard to
> > keep this out of common code since it's expensive on architectures
> > that don't have native 16-bit load/store instructions (alpha, armv3)
> > and or sub-word atomics (armv5, riscv, mips).
> 
> I need 8-bit, and just added 16-bit because it was easy to do so.
> I would be OK dropping the 16-bit portions of this series, assuming
> that no-one needs it.  And assuming that it is easier to drop it than
> to explain why it is not available.  ;-)
> 
> > Does the code that uses this rely on working concurrently with
> > non-atomic stores to part of the 32-bit word? If we want to
> > allow that, we need to merge my alpha ev4/45/5 removal series
> > first.
> 
> For 8-but cmpxchg(), yes.  There are potentially concurrent
> smp_load_acquire() and smp_store_release() operations to this same byte.
> 
> Or is your question specific to the 16-bit primitives?  (Full disclosure:
> I have no objection to removing Alpha ev4/45/5, having several times
> suggested removing Alpha entirely.  And having the scars to prove it.)
> 
> > For the cmpxchg() interface, I would prefer to handle the
> > 8-bit and 16-bit versions the same way as cmpxchg64() and
> > provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions
> > by architectures that operate on fixed-size integer values
> > but not compounds or pointers, and a generic cmpxchg() wrapper
> > in common code that can handle the abtraction for pointers,
> > long and (if absolutely necessary) compounds by multiplexing
> > between cmpxchg32() and cmpxchg64() where needed.
> 
> So as to support _acquire(), _relaxed(), and _release()?
> 
> If so, I don't have any use cases for other than full ordering.

Nor any use cases other than integers.  (In case another thing you are
after here is good type-checking for non-integers combined with allowing
C-language implicit conversions for integers.)

						Thanx, Paul

> > I did a prototype a few years ago and found that there is
> > probably under a dozen users of the sub-word atomics in
> > the tree, so this mostly requires changes to architecture
> > code and less to drivers and core code.
> 
> Given this approach, the predominance of changes to architecture code
> seems quite likely to me.
> 
> But do we really wish to invest that much work into architectures that
> might not be all that long for the world?  (Quickly donning my old
> asbestos suit, the one with the tungsten pinstripes...)
> 
> 							Thanx, Paul
Arnd Bergmann April 4, 2024, 11:57 a.m. UTC | #4
On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote:
> On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote:
>> On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote:
>> > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
>> > and two-byte cmpxchg() on arc.
>> >
>> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>> 
>> I'm missing the context here, is it now mandatory to have 16-bit
>> cmpxchg() everywhere? I think we've historically tried hard to
>> keep this out of common code since it's expensive on architectures
>> that don't have native 16-bit load/store instructions (alpha, armv3)
>> and or sub-word atomics (armv5, riscv, mips).
>
> I need 8-bit, and just added 16-bit because it was easy to do so.
> I would be OK dropping the 16-bit portions of this series, assuming
> that no-one needs it.  And assuming that it is easier to drop it than
> to explain why it is not available.  ;-)

It certainly makes sense to handle both the same, whichever
way we do this.

>> Does the code that uses this rely on working concurrently with
>> non-atomic stores to part of the 32-bit word? If we want to
>> allow that, we need to merge my alpha ev4/45/5 removal series
>> first.
>
> For 8-but cmpxchg(), yes.  There are potentially concurrent
> smp_load_acquire() and smp_store_release() operations to this same byte.
>
> Or is your question specific to the 16-bit primitives?  (Full disclosure:
> I have no objection to removing Alpha ev4/45/5, having several times
> suggested removing Alpha entirely.  And having the scars to prove it.)

For the native sub-word access, alpha ev5 cannot do either of
them, while armv3 could do byte access but not 16-bit words.

It sounds like the old alphas are already broken then, which
is a good reason to finally drop support. It would appear that
the arm riscpc is not affected by this though.

>> For the cmpxchg() interface, I would prefer to handle the
>> 8-bit and 16-bit versions the same way as cmpxchg64() and
>> provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions
>> by architectures that operate on fixed-size integer values
>> but not compounds or pointers, and a generic cmpxchg() wrapper
>> in common code that can handle the abtraction for pointers,
>> long and (if absolutely necessary) compounds by multiplexing
>> between cmpxchg32() and cmpxchg64() where needed.
>
> So as to support _acquire(), _relaxed(), and _release()?
>
> If so, I don't have any use cases for other than full ordering.

My main goal here would be to simplify the definition of
the very commonly used cmpxchg() macro so it doesn't have
to deal with so many corner cases, and make it work the
same way across all architectures. Without the type
agnostic wrapper, there would also be the benefit of
additional type checking that we get by replacing the
macros with inline functions.

We'd still need all the combinations of cmpxchg() and xchg()
with the four sizes and ordering variants, but at least the
latter should easily collapse on most architectures. At the
moment, most architectures only provide the full-ordering
version.

      Arnd
Paul E. McKenney April 4, 2024, 2:44 p.m. UTC | #5
On Thu, Apr 04, 2024 at 01:57:32PM +0200, Arnd Bergmann wrote:
> On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote:
> > On Tue, Apr 02, 2024 at 10:14:08AM +0200, Arnd Bergmann wrote:
> >> On Mon, Apr 1, 2024, at 23:39, Paul E. McKenney wrote:
> >> > Use the new cmpxchg_emu_u8() and cmpxchg_emu_u16() to emulate one-byte
> >> > and two-byte cmpxchg() on arc.
> >> >
> >> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >> 
> >> I'm missing the context here, is it now mandatory to have 16-bit
> >> cmpxchg() everywhere? I think we've historically tried hard to
> >> keep this out of common code since it's expensive on architectures
> >> that don't have native 16-bit load/store instructions (alpha, armv3)
> >> and or sub-word atomics (armv5, riscv, mips).
> >
> > I need 8-bit, and just added 16-bit because it was easy to do so.
> > I would be OK dropping the 16-bit portions of this series, assuming
> > that no-one needs it.  And assuming that it is easier to drop it than
> > to explain why it is not available.  ;-)
> 
> It certainly makes sense to handle both the same, whichever
> way we do this.

Agreed, at least as long as the properties of the relevant hardware are
consistent with doing so.

> >> Does the code that uses this rely on working concurrently with
> >> non-atomic stores to part of the 32-bit word? If we want to
> >> allow that, we need to merge my alpha ev4/45/5 removal series
> >> first.
> >
> > For 8-but cmpxchg(), yes.  There are potentially concurrent
> > smp_load_acquire() and smp_store_release() operations to this same byte.
> >
> > Or is your question specific to the 16-bit primitives?  (Full disclosure:
> > I have no objection to removing Alpha ev4/45/5, having several times
> > suggested removing Alpha entirely.  And having the scars to prove it.)
> 
> For the native sub-word access, alpha ev5 cannot do either of
> them, while armv3 could do byte access but not 16-bit words.
> 
> It sounds like the old alphas are already broken then, which
> is a good reason to finally drop support. It would appear that
> the arm riscpc is not affected by this though.

Good point, given that single-byte load/store and emulated single-byte
cmpxchg() has been in common code for some time.

So armv3 is OK with one-byte emulated cmpxchg(), but not with two-byte,
which is consistent with the current state of this series in the -rcu
tree.  (My plan is to wait until Monday to re-send the series in order
to allow the test robots to find yet more bugs.)

Or is the plan to also drop support for armv3 in the near term?

> >> For the cmpxchg() interface, I would prefer to handle the
> >> 8-bit and 16-bit versions the same way as cmpxchg64() and
> >> provide separate cmpxchg8()/cmpxchg16()/cmpxchg32() functions
> >> by architectures that operate on fixed-size integer values
> >> but not compounds or pointers, and a generic cmpxchg() wrapper
> >> in common code that can handle the abtraction for pointers,
> >> long and (if absolutely necessary) compounds by multiplexing
> >> between cmpxchg32() and cmpxchg64() where needed.
> >
> > So as to support _acquire(), _relaxed(), and _release()?
> >
> > If so, I don't have any use cases for other than full ordering.
> 
> My main goal here would be to simplify the definition of
> the very commonly used cmpxchg() macro so it doesn't have
> to deal with so many corner cases, and make it work the
> same way across all architectures. Without the type
> agnostic wrapper, there would also be the benefit of
> additional type checking that we get by replacing the
> macros with inline functions.
> 
> We'd still need all the combinations of cmpxchg() and xchg()
> with the four sizes and ordering variants, but at least the
> latter should easily collapse on most architectures. At the
> moment, most architectures only provide the full-ordering
> version.

That does make a lot of sense to me.  Though C-language inline functions
have some trouble with type-generic parameters.

However, my patch is down at architecture-specific level, so should not
affect the cmpxchg() macro, right?  Or am I missing some aspect of your
proposed refactoring?

							Thanx, Paul
Arnd Bergmann April 4, 2024, 3:06 p.m. UTC | #6
On Thu, Apr 4, 2024, at 16:44, Paul E. McKenney wrote:
> On Thu, Apr 04, 2024 at 01:57:32PM +0200, Arnd Bergmann wrote:
>> On Tue, Apr 2, 2024, at 19:06, Paul E. McKenney wrote:
>> > Or is your question specific to the 16-bit primitives?  (Full disclosure:
>> > I have no objection to removing Alpha ev4/45/5, having several times
>> > suggested removing Alpha entirely.  And having the scars to prove it.)
>> 
>> For the native sub-word access, alpha ev5 cannot do either of
>> them, while armv3 could do byte access but not 16-bit words.
>> 
>> It sounds like the old alphas are already broken then, which
>> is a good reason to finally drop support. It would appear that
>> the arm riscpc is not affected by this though.
>
> Good point, given that single-byte load/store and emulated single-byte
> cmpxchg() has been in common code for some time.
>
> So armv3 is OK with one-byte emulated cmpxchg(), but not with two-byte,
> which is consistent with the current state of this series in the -rcu
> tree.  (My plan is to wait until Monday to re-send the series in order
> to allow the test robots to find yet more bugs.)
>
> Or is the plan to also drop support for armv3 in the near term?

Russell still has his RiscPC, which is probably the only one
using armv3 kernels (it's actually v4 but relies on -march=armv3
to work around hardware quirks). Since armv3 support was dropped
in gcc-9, it's a matter of time before we drop this as well, but
it's not now.

>> We'd still need all the combinations of cmpxchg() and xchg()
>> with the four sizes and ordering variants, but at least the
>> latter should easily collapse on most architectures. At the
>> moment, most architectures only provide the full-ordering
>> version.
>
> That does make a lot of sense to me.  Though C-language inline functions
> have some trouble with type-generic parameters.
>
> However, my patch is down at architecture-specific level, so should not
> affect the cmpxchg() macro, right?  Or am I missing some aspect of your
> proposed refactoring?

Today, arch_cmpxchg() and its variants are defined in each
architecture to handle some subset of integer sizes.

The way I'd like to do this in the future would be to remove that
macro in favor of arch_{cmp,}xchg{8,16,32,64}{,_relaxed,_release,
_acquire,_local} inline functions that each architecture needs
to provide either directly or through a generic helper, with
all the macros wrapping them moved to common code.

I've been wanting to change this for years now and it never
quite makes it to the top of my todo pile, so I guess it can wait
a little longer.

     Arnd
diff mbox series

Patch

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 99d2845f3feb9..0b40039f38eb2 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -14,6 +14,7 @@  config ARC
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
+	select ARCH_NEED_CMPXCHG_1_2_EMU
 	select ARCH_SUPPORTS_ATOMIC_RMW if ARC_HAS_LLSC
 	select ARCH_32BIT_OFF_T
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h
index e138fde067dea..1e3e23adaca13 100644
--- a/arch/arc/include/asm/cmpxchg.h
+++ b/arch/arc/include/asm/cmpxchg.h
@@ -46,6 +46,12 @@ 
 	__typeof__(*(ptr)) _prev_;					\
 									\
 	switch(sizeof((_p_))) {						\
+	case 1:								\
+		_prev_ = cmpxchg_emu_u8((volatile u8 *)_p_, _o_, _n_);	\
+		break;							\
+	case 2:								\
+		_prev_ = cmpxchg_emu_u16((volatile u16 *)_p_, _o_, _n_); \
+		break;							\
 	case 4:								\
 		_prev_ = __cmpxchg(_p_, _o_, _n_);			\
 		break;							\
@@ -65,16 +71,30 @@ 
 	__typeof__(*(ptr)) _prev_;					\
 	unsigned long __flags;						\
 									\
-	BUILD_BUG_ON(sizeof(_p_) != 4);					\
+	switch(sizeof((_p_))) {						\
+	case 1:								\
+		__flags = cmpxchg_emu_u8((volatile u8 *)_p_, _o_, _n_);	\
+		_prev_ = (__typeof__(*(ptr)))__flags;			\
+		break;							\
+	case 2:								\
+		__flags = cmpxchg_emu_u16((volatile u16 *)_p_, _o_, _n_); \
+		_prev_ = (__typeof__(*(ptr)))__flags;			\
+		break;							\
+	case 4:								\
+		/*							\
+		 * spin lock/unlock provide the needed smp_mb()		\
+		 * before/after						\
+		 */							\
+		atomic_ops_lock(__flags);				\
+		_prev_ = *_p_;						\
+		if (_prev_ == _o_)					\
+			*_p_ = _n_;					\
+		atomic_ops_unlock(__flags);				\
+		break;							\
+	default:							\
+		BUILD_BUG();						\
+	}								\
 									\
-	/*								\
-	 * spin lock/unlock provide the needed smp_mb() before/after	\
-	 */								\
-	atomic_ops_lock(__flags);					\
-	_prev_ = *_p_;							\
-	if (_prev_ == _o_)						\
-		*_p_ = _n_;						\
-	atomic_ops_unlock(__flags);					\
 	_prev_;								\
 })