Patchwork [1/2] SPARC32: implement SMP IPIs using the generic functions

Submitter Daniel Hellstrom
Date Jan. 26, 2011, 4:38 p.m.
Message ID <1296059911-22861-1-git-send-email-daniel@gaisler.com>
Permalink /patch/80525/
State Changes Requested
Delegated to: David Miller

Comments

Daniel Hellstrom - Jan. 26, 2011, 4:38 p.m.
The current SPARC32 SMP IPI generation is implemented using the
cross call function. The cross call function uses IRQ15, the
NMI, which has the effect that IPIs will interrupt IRQ-critical
areas and hang the system. Typically, code on or after
spin_lock_irqsave calls can be aborted.

The cross call functionality must still exist to flush
caches/TLBs.

This patch provides CPU models with a custom way to implement
IPI generation at the generic code's request. The typical
approach is to generate an IRQ for each IPI case.

After this patch each SPARC32 SMP CPU model needs to
implement IPIs in order to function properly.

Signed-off-by: Daniel Hellstrom <daniel@gaisler.com>
---
 arch/sparc/Kconfig              |    1 +
 arch/sparc/include/asm/smp_32.h |   15 ++-------------
 arch/sparc/kernel/smp_32.c      |   34 +++++++++++++++++++++++++++++++++-
 3 files changed, 36 insertions(+), 14 deletions(-)
David Miller - Jan. 28, 2011, 11:33 p.m.
From: Daniel Hellstrom <daniel@gaisler.com>
Date: Wed, 26 Jan 2011 17:38:30 +0100

> The current SPARC32 SMP IPI generation is implemented using the
> cross call function. The cross call function uses IRQ15, the
> NMI, which has the effect that IPIs will interrupt IRQ-critical
> areas and hang the system. Typically, code on or after
> spin_lock_irqsave calls can be aborted.
> 
> The cross call functionality must still exist to flush
> caches/TLBs.
> 
> This patch provides CPU models with a custom way to implement
> IPI generation at the generic code's request. The typical
> approach is to generate an IRQ for each IPI case.
> 
> After this patch each SPARC32 SMP CPU model needs to
> implement IPIs in order to function properly.
> 
> Signed-off-by: Daniel Hellstrom <daniel@gaisler.com>

Overall this looks fine, but there are some things we need to
fix up before we can consider applying this.

First off, since you do the:

	select USE_GENERIC_SMP_HELPERS if SMP

under "SPARC", you can remove the select line for this that
exists under "SPARC64" since that is no longer needed.

As for the implementation, I think there are implicit races,
which show up more explicitly in your LEON implementation.

You can't have a per-cpu blob of status and modify remote cpu
values the way you do, it's racy.

Let's say cpu 1 sends to cpu 2, and cpu 3 tries to send to cpu 2
at the same time.  It's possible for events to be lost since the
setting and clearing of the per-cpu masks are done non-atomically.

The solution, I think, is to use multiple software interrupt vectors
to distinguish the various cases.

I think we need 3, plus IRQ 15 for the cache/tlb flush IPIs.

I tried to figure out if we have enough on LEON, but because the
per-cpu timer is variable, I can't figure that out.  Does that per-cpu
timer use IRQ 14?

I think we can make this work on sun4m/sun4c/sun4d, which have several
software interrupt vectors available.

sun4m has 15 soft interrupts in the sun4m_irq_percpu->{pending,clear,set}
registers; these live sequentially starting at bit 16, as per the definition
of the SUN4M_SOFT_INT() macro.

The only catch is that we'll need to peek at the ->pending register(s) to
determine if we have a hardware or software interrupt pending at a given
PIL level (or both).

I'm not sure how the software interrupt triggering works on sun4c, but that
doesn't matter since we only need this on SMP.

The sun4d code has a sun4d_send_ipi() interface from which the necessary
code can be constructed.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Hellstrom - Jan. 31, 2011, 5 p.m.
David Miller wrote:

>From: Daniel Hellstrom <daniel@gaisler.com>
>Date: Wed, 26 Jan 2011 17:38:30 +0100
>
>>The current SPARC32 SMP IPI generation is implemented using the
>>cross call function. The cross call function uses IRQ15, the
>>NMI, which has the effect that IPIs will interrupt IRQ-critical
>>areas and hang the system. Typically, code on or after
>>spin_lock_irqsave calls can be aborted.
>>
>>The cross call functionality must still exist to flush
>>caches/TLBs.
>>
>>This patch provides CPU models with a custom way to implement
>>IPI generation at the generic code's request. The typical
>>approach is to generate an IRQ for each IPI case.
>>
>>After this patch each SPARC32 SMP CPU model needs to
>>implement IPIs in order to function properly.
>>
>>Signed-off-by: Daniel Hellstrom <daniel@gaisler.com>
>
>Overall this looks fine, but there are some things we need to
>fix up before we can consider applying this.
>
>First off, since you do the:
>
>	select USE_GENERIC_SMP_HELPERS if SMP
>
>under "SPARC", you can remove the select line for this that
>exists under "SPARC64" since that is no longer needed.
Of course, will change before submitting next time.

>As for the implementation, I think there are implicit races,
>which show up more explicitly in your LEON implementation.
>
>You can't have a per-cpu blob of status and modify remote cpu
>values the way you do, it's racy.
>
>Let's say cpu 1 sends to cpu 2, and cpu 3 tries to send to cpu 2
>at the same time.  It's possible for events to be lost since the
>setting and clearing of the per-cpu masks are done non-atomically.
>
>The solution, I think, is to use multiple software interrupt vectors
>to distinguish the various cases.
For now I have changed the code to use a spinlock around the
"work->msk=1 and IRQ generation" and the clearing of the "work->msk" in
the interrupt handler.

But I disagree for the moment. I used three IRQs to implement it the
very first time, because all other architectures do so and I also
suspected races at first; however, I think using memory barriers this
way is enough. The clearing of the work->msk flag is always done after
it has been checked and always results in a call to the generic
handler, so if a write to it is lost due to a race it does not really
matter, since the CPU will reach the generic handler anyway (which is
why it cleared the flag).

Note that events will be lost when using 3 separate IRQs as well: when
two CPUs write the soft-IRQ generate register while the target CPU is
already in an interrupt context, only one IRQ will be generated. But I
don't think the number of events is the important thing here; rather,
one must make sure to enter the generic IPI handler as long as there is
something in the generic IPI queue.

Anyway, I think using 3 IRQs is not a good option for the LEON
architecture, since there is a shortage of IRQs, and different LEON
chips have different IRQ source assignments, making it impossible to
use the same kernel on multiple chips. If 3 IRQs are required... so be
it.

>I think we need 3, plus IRQ 15 for the cache/tlb flush IPIs.
>
>I tried to figure out if we have enough on LEON, but because the
>per-cpu timer is variable, I can't figure that out.  Does that per-cpu
>timer use IRQ 14?
This varies slightly from design/chip to design, unfortunately. I will
probably need to make this configurable from the GUI.

There are no per-cpu timers on the LEON; however, there are multiple
"global general purpose timers" which can generate IRQs that the IRQ
controller can broadcast to all CPUs and/or individual CPUs via a
mask setting. That is why I can use only one timer for both the system
clock and cpu profiling.

>I think we can make this work on sun4m/sun4c/sun4d, which have several
>software interrupt vectors available.
Ok, good.

>sun4m has 15 soft interrupts, in the sun4m_irq_percpu->{pending,clear,set}
>registers, these live sequentially starting at bit 16, as per the definition
>of the SUN4M_SOFT_INT() macro.
>
>The only catch is that we'll need to peek at the ->pending register(s) to
>determine if we have a hardware or software interrupt pending at a given
>PIL level (or both).
Sounds nice.

>I'm not sure how the software interrupt triggering works on sun4c, but that
>doesn't matter since we only need this on SMP.
>
>The sun4d code has a sun4d_send_ipi() interface from which the necessary
>code can be constructed.
David Miller - Jan. 31, 2011, 8:49 p.m.
From: Daniel Hellstrom <daniel@gaisler.com>
Date: Mon, 31 Jan 2011 18:00:21 +0100

> For now I have changed the code to use a spinlock around the
> "work->msk=1 and IRQ generation" and the clearing of the "work->msk"
> in the interrupt handler.
> 
> But I disagree for the moment. I used three IRQs to implement it the
> very first time, because all other architectures do so and I also
> suspected races at first; however, I think using memory barriers this
> way is enough. The clearing of the work->msk flag is always done
> after it has been checked and always results in a call to the
> generic handler, so if a write to it is lost due to a race it does
> not really matter, since the CPU will reach the generic handler
> anyway (which is why it cleared the flag).

Ok, I tried to construct a case where it will fail, but I could not.
Part of the key is also that we clear the interrupt status in the
receiver before testing the mask value.

So feel free to stay with your version without spinlocks.

> Note that events will be lost when using 3 separate IRQs as well:
> when two CPUs write the soft-IRQ generate register while the target
> CPU is already in an interrupt context, only one IRQ will be
> generated. But I don't think the number of events is the important
> thing here; rather, one must make sure to enter the generic
> IPI handler as long as there is something in the generic IPI queue.

Yes, and this race is unique to how sparc32 remote IRQ sending works.
On sparc64, for example, we send IRQs to remote cpus by writing only
to local CPU registers.  So events cannot be lost and there are no
chip register programming races.

The key here is visibility.

>>I think we need 3, plus IRQ 15 for the cache/tlb flush IPIs.
>>
>>I tried to figure out if we have enough on LEON, but because the
>>per-cpu timer is variable, I can't figure that out.  Does that per-cpu
>>timer use IRQ 14?
>>  
> This varies slightly from design/chip to design, unfortunately. I
> will probably need to make this configurable from the GUI.
> 
> There are no per-cpu timers on the LEON; however, there are multiple
> "global general purpose timers" which can generate IRQs that the IRQ
> controller can broadcast to all CPUs and/or individual CPUs via a
> mask setting. That is why I can use only one timer for both the
> system clock and cpu profiling.

Please no more hacked settings in the kernel Kconfig for LEON. I
already think the U-Boot stuff is unreasonable, but I let it in anyway.

Put device and misc config values where they belong, in the device
tree.

And make it so that U-Boot can be used without having to tweak the
kernel config in special ways.  The kernel really should not care where
it is loaded, and should be able to remap itself from any location
during the head.S startup sequence.
Daniel Hellstrom - Feb. 1, 2011, 7:41 a.m.
David Miller wrote:

>From: Daniel Hellstrom <daniel@gaisler.com>
>Date: Mon, 31 Jan 2011 18:00:21 +0100
>
>>For now I have changed the code to use a spinlock around the
>>"work->msk=1 and IRQ generation" and the clearing of the "work->msk"
>>in the interrupt handler.
>>
>>But I disagree for the moment. I used three IRQs to implement it the
>>very first time, because all other architectures do so and I also
>>suspected races at first; however, I think using memory barriers this
>>way is enough. The clearing of the work->msk flag is always done
>>after it has been checked and always results in a call to the
>>generic handler, so if a write to it is lost due to a race it does
>>not really matter, since the CPU will reach the generic handler
>>anyway (which is why it cleared the flag).
>
>Ok, I tried to construct a case where it will fail, but I could not.
>Part of the key is also that we clear the interrupt status in the
>receiver before testing the mask value.
>
>So feel free to stay with your version without spinlocks.
Thanks.

>>Note that events will be lost when using 3 separate IRQs as well:
>>when two CPUs write the soft-IRQ generate register while the target
>>CPU is already in an interrupt context, only one IRQ will be
>>generated. But I don't think the number of events is the important
>>thing here; rather, one must make sure to enter the generic
>>IPI handler as long as there is something in the generic IPI queue.
>
>Yes, and this race is unique to how sparc32 remote IRQ sending works.
>On sparc64, for example, we send IRQs to remote cpus by writing only
>to local CPU registers.  So events cannot be lost and there are no
>chip register programming races.
But if the same CPU sends an IPI to the same receiver CPU twice while
the receiver has interrupts off, shouldn't that also lead to a missed
IPI? Perhaps sparc64 has a FIFO of IPIs.

>The key here is visibility.
>
>>>I think we need 3, plus IRQ 15 for the cache/tlb flush IPIs.
>>>
>>>I tried to figure out if we have enough on LEON, but because the
>>>per-cpu timer is variable, I can't figure that out.  Does that per-cpu
>>>timer use IRQ 14?
>>This varies slightly from design/chip to design, unfortunately. I
>>will probably need to make this configurable from the GUI.
>>
>>There are no per-cpu timers on the LEON; however, there are multiple
>>"global general purpose timers" which can generate IRQs that the IRQ
>>controller can broadcast to all CPUs and/or individual CPUs via a
>>mask setting. That is why I can use only one timer for both the
>>system clock and cpu profiling.
>
>Please no more hacked settings in the kernel Kconfig for LEON, I already
>think the U-Boot stuff is unreasonable but I let it in anyways.
>
>Put device and misc config values where they belong, in the device
>tree.
The U-Boot stuff cannot be located in the device tree; it does not
affect the kernel in any way. I agree with putting as much as possible
in the device tree.

>And make it so that U-Boot can be used without having to tweak the
>kernel config in special ways.  The kernel really should not care where
>it is loaded, and should be able to remap itself from any location
>during the head.S startup sequence.
This is already true; it can be located on any 256 MByte boundary. The
Kconfig options are only used to create the U-Boot image; they do not
change arch/sparc/boot/image in any way. It is just a matter of
convenience when creating the uImage, similar to other architectures
which support U-Boot.

U-Boot for SPARC constructs a device tree on the fly using plug and
play; it uses environment variables to remove devices or add
properties to a device node, and it also sets PROM stuff such as the
Ethernet MAC address, number of CPUs, etc. Actually it is quite nice:
setting the MAC address in U-Boot affects the MAC address of both
U-Boot and the Linux kernel.

Daniel

David Miller - Feb. 1, 2011, 8:34 p.m.
From: Daniel Hellstrom <daniel@gaisler.com>
Date: Tue, 01 Feb 2011 08:41:51 +0100

> David Miller wrote:
> 
>>Yes, and this race is unique to how sparc32 remote IRQ sending works.
>>On sparc64, for example, we send IRQs to remote cpus by writing only
>>to local CPU registers.  So events cannot be lost and there are no
>>chip register programming races.
>>  
> But if the same CPU sends an IPI to the same receiver CPU twice while
> the receiver has interrupts off, shouldn't that also lead to a missed
> IPI? Perhaps sparc64 has a FIFO of IPIs.

Inter-cpu interrupts have a queue at least two entries deep in each
cpu's incoming interrupt vector unit.

And if you overflow (which rarely happens except under extreme
load), the cpu sends a NACK packet back, which lets the sender
know it needs to resend.

These vectored interrupts are processed very fast, all they do for
the IPI case is either a cache/TLB flush or reschedule to a normal
PIL based IRQ with a local cpu register write.

Vectored interrupts on sparc64 are completely separate from normal PIL
level based interrupts.  When we do local_irq_disable() on sparc64,
vectored interrupts still can arrive and be processed.

They are therefore very low latency, and must either perform some
very quick task or reschedule to a local PIL interrupt for more
involved processing.

> This is already true; it can be located on any 256 MByte boundary. The
> Kconfig options are only used to create the U-Boot image; they do not
> change arch/sparc/boot/image in any way. It is just a matter of
> convenience when creating the uImage, similar to other architectures
> which support U-Boot.

Ok, this is the part I didn't understand.
Julian Calaby - Feb. 1, 2011, 9:52 p.m.
On Wed, Feb 2, 2011 at 07:34, David Miller <davem@davemloft.net> wrote:
> From: Daniel Hellstrom <daniel@gaisler.com>
> Date: Tue, 01 Feb 2011 08:41:51 +0100
>> This is already true; it can be located on any 256 MByte boundary. The
>> Kconfig options are only used to create the U-Boot image; they do not
>> change arch/sparc/boot/image in any way. It is just a matter of
>> convenience when creating the uImage, similar to other architectures
>> which support U-Boot.
>
> Ok, this is the part I didn't understand.

U-boot (I know because I'm grappling with it on an embedded ARM system
at the moment) is a boot loader that's essentially OF + GRUB in one
package. Modern versions of it peek into the disk and load Linux into
memory from some pre-determined file(s) then boot it in the normal
way. To do this it needs some more info than just the file names, e.g.
how big it is, where to load it and where to jump to, so it only loads
files that have a particular format, which is essentially the kernel /
initrd with an extra header.

The uImage target (if it's anything like the ARM one) sticks that
header on, nothing else, so the numbers that are being set in Kconfig
are only changing variables in that header, not how the kernel is
compiled.

Thanks,
Daniel Hellstrom - Feb. 2, 2011, 9:11 a.m.
David Miller wrote:

>From: Daniel Hellstrom <daniel@gaisler.com>
>Date: Tue, 01 Feb 2011 08:41:51 +0100
>
>>David Miller wrote:
>>
>>>Yes, and this race is unique to how sparc32 remote IRQ sending works.
>>>On sparc64, for example, we send IRQs to remote cpus by writing only
>>>to local CPU registers.  So events cannot be lost and there are no
>>>chip register programming races.
>>But if the same CPU sends an IPI to the same receiver CPU twice while
>>the receiver has interrupts off, shouldn't that also lead to a missed
>>IPI? Perhaps sparc64 has a FIFO of IPIs.
>
>Inter-cpu interrupts have a queue at least two entries deep in each
>cpu's incoming interrupt vector unit.
>
>And if you overflow (which rarely happens except under extreme
>load), the cpu sends a NACK packet back, which lets the sender
>know it needs to resend.
>
>These vectored interrupts are processed very fast, all they do for
>the IPI case is either a cache/TLB flush or reschedule to a normal
>PIL based IRQ with a local cpu register write.
>
>Vectored interrupts on sparc64 are completely separate from normal PIL
>level based interrupts.  When we do local_irq_disable() on sparc64,
>vectored interrupts still can arrive and be processed.
>
>They are therefore very low latency, and therefore must either perform
>some very quick task, or reschedule to a local PIL interrupt for
>more involved processing.
Ok now I understand the arch a little better, thanks.

>>This is already true; it can be located on any 256 MByte boundary. The
>>Kconfig options are only used to create the U-Boot image; they do not
>>change arch/sparc/boot/image in any way. It is just a matter of
>>convenience when creating the uImage, similar to other architectures
>>which support U-Boot.
>
>Ok, this is the part I didn't understand.
After the image is built and compressed, the U-Boot mkimage utility is
executed with some input parameters and the compressed image. The
parameters are used to build a header that U-Boot reads in order to
understand what kind of image it is (filesystem, kernel, etc.), the
kernel entry point, and the location in RAM the image needs to be
extracted to. So typing "make ARCH=sparc uImage" will create an image
that U-Boot will understand and be able to boot.

U-Boot communicates with Linux using the Linux kernel header (head_32.S
format), PROM emulation, the device tree and the kernel command line,
like any other sparc32 boot loader.

Daniel


Patch

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e48f471..0e7b361 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -25,6 +25,7 @@  config SPARC
 	select HAVE_DMA_ATTRS
 	select HAVE_DMA_API_DEBUG
 	select HAVE_ARCH_JUMP_LABEL
+	select USE_GENERIC_SMP_HELPERS if SMP
 
 config SPARC32
 	def_bool !64BIT
diff --git a/arch/sparc/include/asm/smp_32.h b/arch/sparc/include/asm/smp_32.h
index 841905c..a56961c 100644
--- a/arch/sparc/include/asm/smp_32.h
+++ b/arch/sparc/include/asm/smp_32.h
@@ -67,19 +67,8 @@  static inline void xc4(smpfunc_t func, unsigned long arg1, unsigned long arg2,
 			   unsigned long arg3, unsigned long arg4)
 { smp_cross_call(func, cpu_online_map, arg1, arg2, arg3, arg4); }
 
-static inline int smp_call_function(void (*func)(void *info), void *info, int wait)
-{
-	xc1((smpfunc_t)func, (unsigned long)info);
-	return 0;
-}
-
-static inline int smp_call_function_single(int cpuid, void (*func) (void *info),
-					   void *info, int wait)
-{
-	smp_cross_call((smpfunc_t)func, cpumask_of_cpu(cpuid),
-		       (unsigned long) info, 0, 0, 0);
-	return 0;
-}
+extern void arch_send_call_function_single_ipi(int cpu);
+extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
 
 static inline int cpu_logical_map(int cpu)
 {
diff --git a/arch/sparc/kernel/smp_32.c b/arch/sparc/kernel/smp_32.c
index 91c10fb..a1bb3a8 100644
--- a/arch/sparc/kernel/smp_32.c
+++ b/arch/sparc/kernel/smp_32.c
@@ -125,13 +125,45 @@  struct linux_prom_registers smp_penguin_ctable __cpuinitdata = { 0 };
 
 void smp_send_reschedule(int cpu)
 {
-	/* See sparc64 */
+	/*
+	 * CPU model dependent way of implementing IPI generation targeting
+	 * a single CPU. The trap handler needs only to do trap entry/return
+	 * to call schedule.
+	 */
+	switch (sparc_cpu_model) {
+	default:
+		BUG();
+	}
 }
 
 void smp_send_stop(void)
 {
 }
 
+void arch_send_call_function_single_ipi(int cpu)
+{
+	/*
+	 * CPU model dependent way of implementing IPI generation targeting
+	 * a single CPU
+	 */
+	switch (sparc_cpu_model) {
+	default:
+		BUG();
+	}
+}
+
+void arch_send_call_function_ipi_mask(const struct cpumask *mask)
+{
+	/*
+	 * CPU model dependent way of implementing IPI generation targeting
+	 * a set of CPUs
+	 */
+	switch (sparc_cpu_model) {
+	default:
+		BUG();
+	}
+}
+
 void smp_flush_cache_all(void)
 {
 	xc0((smpfunc_t) BTFIXUP_CALL(local_flush_cache_all));