
[1/2] powerpc/locking: implement this_cpu_cmpxchg local API

Message ID 0EFBD0242622180B+20231204022303.528-1-luming.yu@shingroup.cn (mailing list archive)
State New
Series [1/2] powerpc/locking: implement this_cpu_cmpxchg local API

Commit Message

Luming Yu Dec. 4, 2023, 2:23 a.m. UTC
ppc appears to already support the cmpxchg-local atomic semantics
defined by the kernel convention for this feature.
Add a ppc-local this_cpu_cmpxchg for the performance benefit of an
arch-specific implementation over the asm-generic C version of the
locking API.

Signed-off-by: Luming Yu <luming.yu@shingroup.cn>
---
 arch/powerpc/include/asm/percpu.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Michael Ellerman Dec. 11, 2023, 11:40 a.m. UTC | #1
Hi Luming Yu,

Luming Yu <luming.yu@shingroup.cn> writes:
> ppc appears to already support the cmpxchg-local atomic semantics
> defined by the kernel convention for this feature.
> Add a ppc-local this_cpu_cmpxchg for the performance benefit of an
> arch-specific implementation over the asm-generic C version of the
> locking API.

Implementing this has been suggested before but it wasn't clear that it
was actually going to perform better than the generic version.

On 64-bit we have interrupt soft masking, so disabling interrupts is
relatively cheap. So the generic this_cpu_cmpxchg using irq disable just
becomes a few loads & stores, no atomic ops required.
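
For reference, the generic fallback is roughly equivalent to the simplified
sketch below (a paraphrase of the this_cpu_generic_cmpxchg() pattern in
include/asm-generic/percpu.h, not the exact kernel code):

	#define this_cpu_cmpxchg_sketch(pcp, oval, nval)		\
	({								\
		typeof(pcp) __ret;					\
		unsigned long __flags;					\
		/* soft-masked irq disable is cheap on ppc64 */		\
		raw_local_irq_save(__flags);				\
		__ret = raw_cpu_read(pcp);				\
		if (__ret == (oval))					\
			raw_cpu_write(pcp, nval);			\
		raw_local_irq_restore(__flags);				\
		__ret;							\
	})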

In contrast implementing it using __cmpxchg_local() will use
ldarx/stdcx etc. which will be more expensive.
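
By contrast, a local cmpxchg built on ll/sc comes down to a retry loop along
these lines (an illustrative sketch of the ldarx/stdcx. pattern, not the exact
__cmpxchg_u64_local() code):

	static inline unsigned long cmpxchg_u64_local_sketch(volatile unsigned long *p,
							     unsigned long old,
							     unsigned long new)
	{
		unsigned long prev;

		__asm__ __volatile__(
	"1:	ldarx	%0,0,%2		# load and reserve\n"
	"	cmpd	0,%0,%3		# does it still match 'old'?\n"
	"	bne-	2f\n"
	"	stdcx.	%4,0,%2		# store conditionally\n"
	"	bne-	1b		# retry if the reservation was lost\n"
	"2:"
		: "=&r" (prev), "+m" (*p)
		: "r" (p), "r" (old), "r" (new)
		: "cc", "memory");

		return prev;
	}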

Have you done any performance measurements?

It probably is a bit fishy that we don't mask PMU interrupts vs
this_cpu_cmpxchg() etc., but I don't think it's actually a bug given the
few places using this_cpu_cmpxchg(). Though I could be wrong about that.

> diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c..ceab5df6e7ab 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -18,5 +18,22 @@
>  #include <asm-generic/percpu.h>
>  
>  #include <asm/paca.h>
> +#include <asm/cmpxchg.h>
> +#ifdef this_cpu_cmpxchg_1
> +#undef this_cpu_cmpxchg_1
 
If we need to undef then I think something has gone wrong with the
header inclusion order, the model should be that the arch defines what
it has and the generic code provides fallbacks if the arch didn't define
anything.
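
In other words, the intended shape is roughly the following (illustrative
only; the arch-specific helper is a placeholder, and asm-generic/percpu.h
only defines the operations the arch left undefined):

	/* sketch of the intended ordering in asm/percpu.h */
	#include <asm/cmpxchg.h>

	/* arch-specific definitions first ... */
	#define this_cpu_cmpxchg_8(pcp, oval, nval)	\
		arch_specific_cmpxchg_local(pcp, oval, nval)	/* placeholder */

	/*
	 * ... then the generic layer, whose #ifndef this_cpu_cmpxchg_*
	 * fallbacks only apply when the arch did not define them.
	 */
	#include <asm-generic/percpu.h>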

> +#define this_cpu_cmpxchg_1(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)

I think this is unsafe vs preemption. The raw_cpu_ptr() can generate the
per-cpu address, but then the task can be preempted and moved to a
different CPU before the ldarx/stdcx do the cmpxchg.

The arm64 implementation had the same bug until they fixed it.
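
One way to close that window is to wrap the whole operation in a
preempt_disable()/preempt_enable() pair, roughly as arm64 now does; a sketch
(the _notrace variants and the cast are illustrative choices, not a tested
implementation):

	#define this_cpu_cmpxchg_8(pcp, oval, nval)				\
	({									\
		typeof(pcp) __ret;						\
		preempt_disable_notrace();					\
		__ret = (typeof(pcp))__cmpxchg_local(raw_cpu_ptr(&(pcp)),	\
						     oval, nval, 8);		\
		preempt_enable_notrace();					\
		__ret;								\
	})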

> +#endif 
> +#ifdef this_cpu_cmpxchg_2
> +#undef this_cpu_cmpxchg_2
> +#define this_cpu_cmpxchg_2(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
> +#endif
> +#ifdef this_cpu_cmpxchg_4
> +#undef this_cpu_cmpxchg_4
> +#define this_cpu_cmpxchg_4(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
> +#endif
> +#ifdef this_cpu_cmpxchg_8
> +#undef this_cpu_cmpxchg_8
> +#define this_cpu_cmpxchg_8(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
> +#endif
>  
>  #endif /* _ASM_POWERPC_PERCPU_H_ */

cheers
Luming Yu Dec. 15, 2023, 8:50 a.m. UTC | #2
On Mon, Dec 11, 2023 at 10:40:38PM +1100, Michael Ellerman wrote:
> Hi Luming Yu,
> 
> Luming Yu <luming.yu@shingroup.cn> writes:
> > ppc appears to already support the cmpxchg-local atomic semantics
> > defined by the kernel convention for this feature.
> > Add a ppc-local this_cpu_cmpxchg for the performance benefit of an
> > arch-specific implementation over the asm-generic C version of the
> > locking API.
> 
> Implementing this has been suggested before but it wasn't clear that it
> was actually going to perform better than the generic version.
Thanks for the info, that's news to me. :-)
I will check whether a web search can turn up the earlier discussion.
> 
> On 64-bit we have interrupt soft masking, so disabling interrupts is
> relatively cheap. So the generic this_cpu_cmpxchg using irq disable just
> becomes a few loads & stores, no atomic ops required.

Something like the following just popped up in my first try with a P8 test KVM
guest on a C1000 powernv8 platform.

I'm not sure the soft lockup is related to the interrupt soft masking,
but I will find out for sure. :-)

[  460.217669] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[  460.217742] Modules linked in:
[  460.217828] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W    L   N 6.7.0-rc5+ #5
[  460.217915] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf000005 of:SLOF,git-6b6c16 pSeries
[  460.217999] NIP:  c00000000003e0ec LR: c00000000003e414 CTR: 0000000000000000
[  460.218074] REGS: c000000004797788 TRAP: 0900   Tainted: G        W    L   N  (6.7.0-rc5+)
[  460.218151] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24042442  XER: 00000000
[  460.218342] CFAR: 0000000000000000 IRQMASK: 0
[  460.218342] GPR00: c00000000003e414 c000000004797760 c000000001583b00 c000000004797758
[  460.218342] GPR04: 0000000000000000 0000000000000004 c000000004ccf51c c00000000224e510
[  460.218342] GPR08: 4000000000000002 0000000000000049 c00000000457b280 0015000b00170038
[  460.218342] GPR12: 00340030003a0010 c000000002f40000 0000000000000008 c000000004ccf4fc
[  460.218342] GPR16: 0000000000007586 c0000000040f45c0 c000000004ddd080 c0000000040f45c0
[  460.218342] GPR20: 0000000000000008 0000000000000024 0000000000000004 c000000004ccf4fc
[  460.218342] GPR24: 000000000000003f 0000000000000001 0000000000000001 c000000004cc6e7e
[  460.218342] GPR28: fcffffffffffffff 0000000000000002 fcffffffffffffff 0000000000000003
[  460.219187] NIP [c00000000003e0ec] __replay_soft_interrupts+0x3c/0x160
[  460.219286] LR [c00000000003e414] arch_local_irq_restore+0x174/0x1a0
[  460.219365] Call Trace:
[  460.219400] [c000000004797760] [c00000000003e150] __replay_soft_interrupts+0xa0/0x160 (unreliable)
[  460.219515] [c000000004797910] [c00000000003e414] arch_local_irq_restore+0x174/0x1a0
[  460.219612] [c000000004797950] [c000000000a155c4] get_random_u32+0xb4/0x140
[  460.219699] [c0000000047979a0] [c0000000008b1ae0] get_rcw_we+0x138/0x274
[  460.219781] [c000000004797a30] [c00000000208d4bc] test_rslib_init+0x8b8/0xb70
[  460.219877] [c000000004797c40] [c00000000000fb80] do_one_initcall+0x60/0x390
[  460.219965] [c000000004797d10] [c000000002004a18] kernel_init_freeable+0x32c/0x3dc
[  460.220059] [c000000004797de0] [c0000000000102a4] kernel_init+0x34/0x1b0
[  460.220141] [c000000004797e50] [c00000000000cf14] ret_from_kernel_user_thread+0x14/0x1c
[  460.220229] --- interrupt: 0 at 0x0
[  460.220291] Code: 60000000 7c0802a6 f8010010 f821fe51 e92d0c78 f92101a8 39200000 38610028 892d0933 61290040 992d0933 48043a3d <60000000> 39200000 e9410130 f9210160
[  460.955369] Testing (23,15)_64 code...

> 
> In contrast implementing it using __cmpxchg_local() will use
> ldarx/stdcx etc. which will be more expensive.
> 
> Have you done any performance measurements?
Yes, I'm looking for resources to track the performance changes (positive or
negative) in this area as this patch set is being proposed again.

> 
> It probably is a bit fishy that we don't mask PMU interrupts vs
> this_cpu_cmpxchg() etc., but I don't think it's actually a bug given the
> few places using this_cpu_cmpxchg(). Though I could be wrong about that.
I will try to understand the concern and come up with a patch update,
if the performance numbers from the change look reasonable and promising.
> 
> > diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> > index 8e5b7d0b851c..ceab5df6e7ab 100644
> > --- a/arch/powerpc/include/asm/percpu.h
> > +++ b/arch/powerpc/include/asm/percpu.h
> > @@ -18,5 +18,22 @@
> >  #include <asm-generic/percpu.h>
> >  
> >  #include <asm/paca.h>
> > +#include <asm/cmpxchg.h>
> > +#ifdef this_cpu_cmpxchg_1
> > +#undef this_cpu_cmpxchg_1
>  
> If we need to undef then I think something has gone wrong with the
> header inclusion order, the model should be that the arch defines what
> it has and the generic code provides fallbacks if the arch didn't define
> anything.
> 
> > +#define this_cpu_cmpxchg_1(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)
> 
> I think this is unsafe vs preemption. The raw_cpu_ptr() can generate the
> per-cpu address, but then the task can be preempted and moved to a
> different CPU before the ldarx/stdcx do the cmpxchg.
> 
> The arm64 implementation had the same bug until they fixed it.
Thanks for the review, I will look deeper into this spot. I had assumed that a
per-cpu API at this level runs with preemption disabled, but that may be a
misunderstanding on my part, and I can certainly be wrong given how quickly the
kernel keeps changing. :-(
> 
> > +#endif 
> > +#ifdef this_cpu_cmpxchg_2
> > +#undef this_cpu_cmpxchg_2
> > +#define this_cpu_cmpxchg_2(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
> > +#endif
> > +#ifdef this_cpu_cmpxchg_4
> > +#undef this_cpu_cmpxchg_4
> > +#define this_cpu_cmpxchg_4(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
> > +#endif
> > +#ifdef this_cpu_cmpxchg_8
> > +#undef this_cpu_cmpxchg_8
> > +#define this_cpu_cmpxchg_8(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
> > +#endif
> >  
> >  #endif /* _ASM_POWERPC_PERCPU_H_ */
> 
> cheers
> 
best regards

Patch

diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c..ceab5df6e7ab 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -18,5 +18,22 @@ 
 #include <asm-generic/percpu.h>
 
 #include <asm/paca.h>
+#include <asm/cmpxchg.h>
+#ifdef this_cpu_cmpxchg_1
+#undef this_cpu_cmpxchg_1
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)
+#endif 
+#ifdef this_cpu_cmpxchg_2
+#undef this_cpu_cmpxchg_2
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
+#endif
+#ifdef this_cpu_cmpxchg_4
+#undef this_cpu_cmpxchg_4
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
+#endif
+#ifdef this_cpu_cmpxchg_8
+#undef this_cpu_cmpxchg_8
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	__cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
+#endif
 
 #endif /* _ASM_POWERPC_PERCPU_H_ */