Patchwork imx6q restart is broken

login
register
mail settings
Submitter Shawn Guo
Date Aug. 8, 2012, 10:18 a.m.
Message ID <20120808101817.GA14718@S2101-09.ap.freescale.net>
Download mbox | patch
Permalink /patch/175890/
State New
Headers show

Comments

Shawn Guo - Aug. 8, 2012, 10:18 a.m.
Thanks Dirk for reporting that imx6q restart (reboot command) is broken.

I tracked down the issue a little bit and found imx6q_restart hangs
on the of_iomap/ioremap call.  The following change, moving the call
somewhere else than imx6q_restart, will just fix the problem.

Does that mean ioremap call is not allowed in platform restart hook?
I'm not sure about that, because I found it works just fine if I build
imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7
only kernel - imx5 and imx6), which is the case how I tested imx6q
restart feature when I was adding it.

To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7
kernel, while it works fine on a V7 only image.  I need some help to
understand that.

Regards,
Shawn

--8<---
Jason Wang - Aug. 9, 2012, 3:18 a.m.
Dirk Behme wrote:
> On 08.08.2012 12:18, Shawn Guo wrote:
>> Thanks Dirk for reporting that imx6q restart (reboot command) is broken.
>>
>> I tracked down the issue a little bit and found imx6q_restart hangs
>> on the of_iomap/ioremap call. The following change, moving the call
>> somewhere else than imx6q_restart, will just fix the problem.
>>
>> Does that mean ioremap call is not allowed in platform restart hook?
>> I'm not sure about that, because I found it works just fine if I build
>> imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7
>> only kernel - imx5 and imx6), which is the case how I tested imx6q
>> restart feature when I was adding it.
>>
>> To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7
>> kernel, while it works fine on a V7 only image. I need some help to
>> understand that.
>
> Some additional information from my debugging:
>
> a) Having a JTAG debugger attached to the i.MX6 SabreLite board I use 
> (kernel built with imx_v6_v7_defconfig) the reboot does work. No hang. 
> This does mean I can't debug the reboot hang with a JTAG debugger. 
> Therefore I added some printk debugging:
>
> b) Adding some printk statements [1] in the of_iomap/ioremap call, it 
> looks to me that the system hangs in
>
> of_iomap() -> ... -> set_pte_at() -> set_pte_ext() / 
> cpu_v7_set_pte_ext() <= hang
>
I noticed this problem several weeks ago, and did some debug, what i 
found is following:

- at the last stage of reset, all non-boot cpus will call 
ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for 
V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")

- the boot cpu will call cpu_v7_set_pte_ext() in the proc-v7-2level.S, 
at the end of this function, boot cpu will call flush_pte (mcr p15, 0, 
r0, c7, c10, 1), after executing this function, the system will hang. 
That is to say, all non-boot cpus repeatedly call ("mcr p15, 0, %0, c7, 
c10, 5"), if boot cpu call flush_pte (mcr p15, 0, r0, c7, c10, 1), the 
system will hang, this occurs on the cortexa9 r2p10 cpus (freescale 
i.MX6Q), i don't know the root cause for this problem, but if we split 
imx_v6_v7_defconfig into v7_only and use v7_only to build kernel image 
for imx6q, this problem will disappear, since the cpu_relax() will be 
defined to barrier().

- i think we should split imx_v6_v7_defconfig into v6_only and v7_only, 
since when we use imx_v6_v7_defconfig, we will introduce 
"-D__LINUX_ARM_ARCH__=6" unconditionally, this macro is OK for V6 
platforms (imx3), but it is not good for V7 platforms (imx5, imx6). lets 
grep "__LINUX_ARM_ARCH__" occurrence under the arch/arm, all (#if 
__LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7) occurrence 
will affect V7 platforms.

Regards,
Hui.
Shawn Guo - Aug. 9, 2012, 4:41 a.m.
On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> Dirk Behme wrote:
> >On 08.08.2012 12:18, Shawn Guo wrote:
> >>Thanks Dirk for reporting that imx6q restart (reboot command) is broken.
> >>
> >>I tracked down the issue a little bit and found imx6q_restart hangs
> >>on the of_iomap/ioremap call. The following change, moving the call
> >>somewhere else than imx6q_restart, will just fix the problem.
> >>
> >>Does that mean ioremap call is not allowed in platform restart hook?
> >>I'm not sure about that, because I found it works just fine if I build
> >>imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7
> >>only kernel - imx5 and imx6), which is the case how I tested imx6q
> >>restart feature when I was adding it.
> >>
> >>To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7
> >>kernel, while it works fine on a V7 only image. I need some help to
> >>understand that.
> >
> >Some additional information from my debugging:
> >
> >a) Having a JTAG debugger attached to the i.MX6 SabreLite board I
> >use (kernel built with imx_v6_v7_defconfig) the reboot does work.
> >No hang. This does mean I can't debug the reboot hang with a JTAG
> >debugger. Therefore I added some printk debugging:
> >
> >b) Adding some printk statements [1] in the of_iomap/ioremap call,
> >it looks to me that the system hangs in
> >
> >of_iomap() -> ... -> set_pte_at() -> set_pte_ext() /
> >cpu_v7_set_pte_ext() <= hang
> >
> I noticed this problem several weeks ago, and did some debug, what i
> found is following:
> 
Thanks for these great findings.  I also just tracked it down to that
cpu_v7_set_pte_ext hangs on

	mcr   p15, 0, r0, c7, c10, 1          @ flush_pte

> - at the last stage of reset, all non-boot cpus will call
> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb()
> for V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10,
> 5")
> 
> - the boot cpu will call cpu_v7_set_pte_ext() in the
> proc-v7-2level.S, at the end of this function, boot cpu will call
> flush_pte (mcr p15, 0, r0, c7, c10, 1), after executing this
> function, the system will hang. That is to say, all non-boot cpus
> repeatedly call ("mcr p15, 0, %0, c7, c10, 5"), if boot cpu call
> flush_pte (mcr p15, 0, r0, c7, c10, 1), the system will hang, this
> occurs on the cortexa9 r2p10 cpus (freescale i.MX6Q), i don't know
> the root cause for this problem, but if we split imx_v6_v7_defconfig
> into v7_only and use v7_only to build kernel image for imx6q, this
> problem will disappear, since the cpu_relax() will be defined to
> barrier().
> 
> - i think we should split imx_v6_v7_defconfig into v6_only and
> v7_only, since when we use imx_v6_v7_defconfig, we will introduce
> "-D__LINUX_ARM_ARCH__=6" unconditionally, this macro is OK for V6
> platforms (imx3), but it is not good for V7 platforms (imx5, imx6).
> lets grep "__LINUX_ARM_ARCH__" occurrence under the arch/arm, all
> (#if __LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7)
> occurrence will affect V7 platforms.

I just copied a few more people.  Let's hear their comments.

Regards,
Shawn
Sascha Hauer - Aug. 9, 2012, 6:32 a.m.
On Thu, Aug 09, 2012 at 12:41:38PM +0800, Shawn Guo wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> > 
> > - i think we should split imx_v6_v7_defconfig into v6_only and
> > v7_only, since when we use imx_v6_v7_defconfig, we will introduce
> > "-D__LINUX_ARM_ARCH__=6" unconditionally, this macro is OK for V6
> > platforms (imx3), but it is not good for V7 platforms (imx5, imx6).
> > lets grep "__LINUX_ARM_ARCH__" occurrence under the arch/arm, all
> > (#if __LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7)
> > occurrence will affect V7 platforms.
> 
> I just copied a few more people.  Let's hear their comments.

As a last resort we could split the defconfigs, but before we should
try to find a solution for this problem.

Sascha
Catalin Marinas - Aug. 9, 2012, 8:06 a.m.
On Thu, Aug 09, 2012 at 05:41:38AM +0100, Shawn Guo wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> > Dirk Behme wrote:
> > >On 08.08.2012 12:18, Shawn Guo wrote:
> > >>Thanks Dirk for reporting that imx6q restart (reboot command) is broken.
> > >>
> > >>I tracked down the issue a little bit and found imx6q_restart hangs
> > >>on the of_iomap/ioremap call. The following change, moving the call
> > >>somewhere else than imx6q_restart, will just fix the problem.
> > >>
> > >>Does that mean ioremap call is not allowed in platform restart hook?
> > >>I'm not sure about that, because I found it works just fine if I build
> > >>imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7
> > >>only kernel - imx5 and imx6), which is the case how I tested imx6q
> > >>restart feature when I was adding it.
> > >>
> > >>To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7
> > >>kernel, while it works fine on a V7 only image. I need some help to
> > >>understand that.
> > >
> > >Some additional information from my debugging:
> > >
> > >a) Having a JTAG debugger attached to the i.MX6 SabreLite board I
> > >use (kernel built with imx_v6_v7_defconfig) the reboot does work.
> > >No hang. This does mean I can't debug the reboot hang with a JTAG
> > >debugger. Therefore I added some printk debugging:
> > >
> > >b) Adding some printk statements [1] in the of_iomap/ioremap call,
> > >it looks to me that the system hangs in
> > >
> > >of_iomap() -> ... -> set_pte_at() -> set_pte_ext() /
> > >cpu_v7_set_pte_ext() <= hang
> > >
> > I noticed this problem several weeks ago, and did some debug, what i
> > found is following:
> > 
> Thanks for these great findings.  I also just tracked it down to that
> cpu_v7_set_pte_ext hangs on
> 
> 	mcr   p15, 0, r0, c7, c10, 1          @ flush_pte
> 
> > - at the last stage of reset, all non-boot cpus will call
> > ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb()
> > for V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10,
> > 5")
> > 
> > - the boot cpu will call cpu_v7_set_pte_ext() in the
> > proc-v7-2level.S, at the end of this function, boot cpu will call
> > flush_pte (mcr p15, 0, r0, c7, c10, 1), after executing this
> > function, the system will hang. That is to say, all non-boot cpus
> > repeatedly call ("mcr p15, 0, %0, c7, c10, 5"), if boot cpu call
> > flush_pte (mcr p15, 0, r0, c7, c10, 1), the system will hang, this
> > occurs on the cortexa9 r2p10 cpus (freescale i.MX6Q), i don't know
> > the root cause for this problem, but if we split imx_v6_v7_defconfig
> > into v7_only and use v7_only to build kernel image for imx6q, this
> > problem will disappear, since the cpu_relax() will be defined to
> > barrier().

Does this work if you always define cpu_relax() to smp_mb() (even on
ARMv7) and compile the kernel to v7-only? In this case, the smp_mb()
would use the DMB instruction rather than the CP15 equivalent. It's not
a solution, just trying to understand the problem better.

Thanks.
Shawn Guo - Aug. 9, 2012, 8:18 a.m.
On Thu, Aug 09, 2012 at 09:06:48AM +0100, Catalin Marinas wrote:
> Does this work if you always define cpu_relax() to smp_mb() (even on
> ARMv7) and compile the kernel to v7-only? In this case, the smp_mb()
> would use the DMB instruction rather than the CP15 equivalent. It's not
> a solution, just trying to understand the problem better.
> 
This is a test I already did a couple of hours ago.  No, it does not
work.  The kernel hangs at the same place.
Russell King - ARM Linux - Aug. 9, 2012, 9:20 a.m.
On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> - at the last stage of reset, all non-boot cpus will call  
> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for  
> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")

I suspect having this dmb inside cpu_relax() is flooding the
interconnects with traffic, which then prevents other CPUs getting
a look-in (maybe there's no fairness when it comes to dmb's.

If I'm right, you'll find is that even converting this to the ARMv7
DMB instruction won't fix the problem.  It does, however, point
towards a more serious problem - it means that any tight loop using
dmb is detremental.  I have heard some people mention that even on
various ARM SMP platforms, they have see quite an amount of
interaction between the individual CPU cores, and I'm beginning
to wonder whether this is why.

I think a useful test would be to only execute the DMB maybe once
in 50 or 100 loops - the DMB is there to work around a different
problem with the temporal locality of stores on the local CPU.  So,
the only requirement is that we issue a DMB at some point while
spinning waiting for another CPU to respond to our previous writes.
Shawn Guo - Aug. 9, 2012, 12:22 p.m.
On Thu, Aug 09, 2012 at 08:01:47PM +0800, Shawn Guo wrote:
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index 99afa74..1282b61 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -80,11 +80,18 @@ extern void release_thread(struct task_struct *);
>  unsigned long get_wchan(struct task_struct *p);
> 
>  #if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327)
> -#define cpu_relax()                    smp_mb()
> +#define __cpu_relax()                  smp_mb()
>  #else
> -#define cpu_relax()                    barrier()
> +#define __cpu_relax()                  barrier()
>  #endif
> 
> +#define cpu_relax()                                                    \
> +({                                                                     \
> +       static int i;                                                   \
> +       if (i++ % 100)                                                  \

This should be 

	if (i++ % 100 == 0)

But either way it fixes the problem.

> +               __cpu_relax();                                          \
> +})
> +
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Russell King - ARM Linux - Aug. 9, 2012, 1:57 p.m.
On Thu, Aug 09, 2012 at 08:22:22PM +0800, Shawn Guo wrote:
> On Thu, Aug 09, 2012 at 08:01:47PM +0800, Shawn Guo wrote:
> > diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> > index 99afa74..1282b61 100644
> > --- a/arch/arm/include/asm/processor.h
> > +++ b/arch/arm/include/asm/processor.h
> > @@ -80,11 +80,18 @@ extern void release_thread(struct task_struct *);
> >  unsigned long get_wchan(struct task_struct *p);
> > 
> >  #if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327)
> > -#define cpu_relax()                    smp_mb()
> > +#define __cpu_relax()                  smp_mb()
> >  #else
> > -#define cpu_relax()                    barrier()
> > +#define __cpu_relax()                  barrier()
> >  #endif
> > 
> > +#define cpu_relax()                                                    \
> > +({                                                                     \
> > +       static int i;                                                   \
> > +       if (i++ % 100)                                                  \
> 
> This should be 
> 
> 	if (i++ % 100 == 0)
> 
> But either way it fixes the problem.

Right, so it does sound like issuing dmb instructions in a tight loop
can bring a system to its knees.

While we can address it in the above way in the kernel, it points to
a much bigger problem - userspace can issue dmb instructions itself,
which means any user program can effectively take a system down.  Can
you confirm this by building and running the following program:

int main()
{
	while (1)
		asm("dmb");
}

Then try running some other programs.  If I'm correct, this should
result in the system becoming rather unresponsive.

Thanks.
Russell King - ARM Linux - Aug. 9, 2012, 2:01 p.m.
On Thu, Aug 09, 2012 at 02:57:55PM +0100, Russell King - ARM Linux wrote:
> Then try running some other programs.  If I'm correct, this should
> result in the system becoming rather unresponsive.

I should better describe what I mean by "unresponsive" - it won't lock
the system up (because the dmb loop will be interrupted from time to
time) but it should make it rather sluggish to deal with page faults on
other CPUs.
Shawn Guo - Aug. 9, 2012, 2:24 p.m.
On Thu, Aug 09, 2012 at 03:01:37PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 09, 2012 at 02:57:55PM +0100, Russell King - ARM Linux wrote:
> > Then try running some other programs.  If I'm correct, this should
> > result in the system becoming rather unresponsive.
> 
> I should better describe what I mean by "unresponsive" - it won't lock
> the system up (because the dmb loop will be interrupted from time to
> time) but it should make it rather sluggish to deal with page faults on
> other CPUs.

You are right.  I'm seeing exactly the "unresponsive" you describe here
after launching that little program.
Matt Sealey - Aug. 9, 2012, 7:03 p.m.
On Thu, Aug 9, 2012 at 4:20 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
>> - at the last stage of reset, all non-boot cpus will call
>> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for
>> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")
>
> I suspect having this dmb inside cpu_relax() is flooding the
> interconnects with traffic, which then prevents other CPUs getting
> a look-in (maybe there's no fairness when it comes to dmb's.
>
> If I'm right, you'll find is that even converting this to the ARMv7
> DMB instruction won't fix the problem.  It does, however, point
> towards a more serious problem - it means that any tight loop using
> dmb is detremental.  I have heard some people mention that even on
> various ARM SMP platforms, they have see quite an amount of
> interaction between the individual CPU cores, and I'm beginning
> to wonder whether this is why.

I have an irrelevant but possibly related question here; in
mm/proc-v7.S there's this snip of code;

#ifdef CONFIG_SMP
        ALT_SMP(mrc     p15, 0, r0, c1, c0, 1)
        ALT_UP(mov      r0, #(1 << 6))          @ fake it for UP
        tst     r0, #(1 << 6)                   @ SMP/nAMP mode enabled?
        orreq   r0, r0, #(1 << 6)               @ Enable SMP/nAMP mode
        orreq   r0, r0, r10                     @ Enable CPU-specific SMP bits
        mcreq   p15, 0, r0, c1, c0, 1
#endif

Which I am reading as, read the SMP bit from cp15 and see if it's
enabled, or on UP set the SMP bit and then write it back
regardless.

On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP
method will get used and the SMP bit will get
set regardless. No other cores will be enabled (at least the i.MX6Q
you need to turn them on using the System Reset
Controller..) but it's still going to end up involving the processors
in SMP activity on the bus. Or.. actually it just sets the SMP
bit *regardless* on any platform. I don't have my assembly hat on
today. Is this correct? Is this actually architecturally correct?

Does SMP bit get set regardless on any system that has multiple CPU
cores but they're all brought down except the
primary (or only-left-executing) core?

Once you've got all the CPUs down, in my mind, SMP should be unset and
in theory any cache management broadcasts
between the CPUs should stop architecturally (with the caveat that on
CPU hotplug, you need to flush the bejesus out
of the primary CPU cache and invalidate the cache for one you just
brought up). So I assume this is some kind of
performance optimization to enable CPUs to be unplugged and replugged
without expensive cache management? Aren't
caches invalidated/flushed by the kernel anyway at this point, or is
the expectation that the SCU/PL310 or so would
manage this underneath as long as the primary CPU is involved in SMP?

Above that TLB ops broadcasting is turned on for A9 and A15 too. Even
in UP? Are these operations simply essential
to the SCU since it might not have any idea how many CPUs are running,
and just broadcasts changes on the AXI bus
anyway for the one CPU that's left to pick up?

The docs don't make it too clear what's going on here, and the code
doesn't enlighten me. I would think that on a non-SMP
system you'd want to turn all of this off, not "fake it for UP"..?

We think we've had some serious performance problems here which point
to a significant loss of performance on the AXI
bus going to DDR memory. The only thing we can attribute it to is some
misconfiguration of the SCU/PL310 or on MX51
maybe the "M4IF" unit with regards to AXI caching or priority for
different bus masters, and that SMP bus traffic is causing
a bottleneck of cache management operations. Otherwise we'd expect
several gigabytes per second to memory in a
streaming situation and we can architect it to only manage a couple of
hundred, and that doesn't seem at all right (since
beyond using misaligned accesses or plain kooky instruction sequences,
everything gets aligned by the time it gets to
DDR, and the DDR memory controller should be doing huge bursts which
get cached in the bus arbiter..). I can only look
at this from a naive point of view because there's no way to look at
it with the bare minimum code running, we need the
OS around to do what we're doing. Also in other tests (cortex-strings)
we get ridiculously good performance, but that's
mostly because it's single threaded.. no other CPU does anything and
the cache is basically useless with very large
transfers.

Now we know dmb makes things explode, you have to wonder are things
being needlessly involved here or turned on
when they needn't be, just to make the code easier to write?

We also saw a performance decrease by some hard to understand amount
in the lpj calculation on i.MX51 between
a CPU_V7-only kernel based on 2.6.31 and Freescale's BSP, versus a
V6/V7 defconfig kernel direct from mainline.
I wonder if the less efficient V6 stuff is causing that.. it would
really, really bite if that was it... we're still running tests
though. We like to enable errata fixes regardless of whether it's on
our processor since the vast majority of them are
only selected at runtime depending on whether the core or unit
revision affected is present, however...

#if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327)
#define cpu_relax()                    smp_mb()
#else
#define cpu_relax()                    barrier()
#endif

This dirt-old (pre r2p0) Cortex-A9 erratum being selected will force a
V7-only kernel to use smp_mb rather than
barrier regardless of what the processor type is.. that means every
one-image-for-ARMv7 kernel which included
Tegra2, the appropriate above erratum and a few others would hit this
problem regardless. I assume if looping
and comparing a value of "i" and then calling smp_mb can't break the
procedure being expected then reading out
the CPU core revision from a cached value in memory somewhere and
comparing it and OPTIONALLY running
smp_mb (which is then defined as dmb) would not be bad either?

Maybe I am getting this backwards, but I really want someone to tell me that :)
Russell King - ARM Linux - Aug. 9, 2012, 9:07 p.m.
On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote:
> I have an irrelevant but possibly related question here; in
> mm/proc-v7.S there's this snip of code;
> 
> #ifdef CONFIG_SMP
>         ALT_SMP(mrc     p15, 0, r0, c1, c0, 1)
>         ALT_UP(mov      r0, #(1 << 6))          @ fake it for UP
>         tst     r0, #(1 << 6)                   @ SMP/nAMP mode enabled?
>         orreq   r0, r0, #(1 << 6)               @ Enable SMP/nAMP mode
>         orreq   r0, r0, r10                     @ Enable CPU-specific SMP bits
>         mcreq   p15, 0, r0, c1, c0, 1
> #endif
> 
> Which I am reading as, read the SMP bit from cp15 and see if it's
> enabled, or on UP set the SMP bit and then write it back
> regardless.
> 
> On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP
> method will get used and the SMP bit will get
> set regardless.

I got here and stopped reading any further.  No, it doesn't work like
that; you don't understand the assembly.

Yes, you're right that the ALT_UP version will be used.  This sets r0
to 1 << 6.  The very next instruction tests whether bit 6 is set in r0.
The following three instructions will only be executed _if_ that bit
was zero.

I'm assuming that the rest of your email is therefore irrelevant; I
have no desire to continue reading such a long message when the premise
it's (presumably) based upon is false.  If you have any further questions
please restate them succinctly after taking on board the above.  Thanks.
Matt Sealey - Aug. 10, 2012, 1:33 p.m.
On Thu, Aug 9, 2012 at 4:07 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote:
>> I have an irrelevant but possibly related question here; in
>> mm/proc-v7.S there's this snip of code;
>>
>> #ifdef CONFIG_SMP
>>         ALT_SMP(mrc     p15, 0, r0, c1, c0, 1)
>>         ALT_UP(mov      r0, #(1 << 6))          @ fake it for UP
>>         tst     r0, #(1 << 6)                   @ SMP/nAMP mode enabled?
>>         orreq   r0, r0, #(1 << 6)               @ Enable SMP/nAMP mode
>>         orreq   r0, r0, r10                     @ Enable CPU-specific SMP bits
>>         mcreq   p15, 0, r0, c1, c0, 1
>> #endif
>>
>> Which I am reading as, read the SMP bit from cp15 and see if it's
>> enabled, or on UP set the SMP bit and then write it back
>> regardless.
>>
>> On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP
>> method will get used and the SMP bit will get
>> set regardless.
>
> I got here and stopped reading any further.  No, it doesn't work like
> that; you don't understand the assembly.
>
> Yes, you're right that the ALT_UP version will be used.  This sets r0
> to 1 << 6.  The very next instruction tests whether bit 6 is set in r0.
> The following three instructions will only be executed _if_ that bit
> was zero.

Right but, as I read it...

if (CONFIG_SMP)
   read cp15
else
   set bit to one to hack it
if (!smp)
  set smp
  write it back

So who unsets the SMP bit? When the core goes down does it clear it,
or are we just
being safe for CPU startup? Does it still have an effect in the chip
logic even if the core
is 'not plugged'?

What happens if you're in CONFIG_SMP and you say maxcpus=1?

> I'm assuming that the rest of your email is therefore irrelevant; I
> have no desire to continue reading such a long message when the premise
> it's (presumably) based upon is false.  If you have any further questions
> please restate them succinctly after taking on board the above.  Thanks.

Please read the rest... I just didn't write first my question properly.
Russell King - ARM Linux - Aug. 10, 2012, 1:53 p.m.
On Fri, Aug 10, 2012 at 08:33:06AM -0500, Matt Sealey wrote:
> On Thu, Aug 9, 2012 at 4:07 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote:
> >> I have an irrelevant but possibly related question here; in
> >> mm/proc-v7.S there's this snip of code;
> >>
> >> #ifdef CONFIG_SMP
> >>         ALT_SMP(mrc     p15, 0, r0, c1, c0, 1)
> >>         ALT_UP(mov      r0, #(1 << 6))          @ fake it for UP
> >>         tst     r0, #(1 << 6)                   @ SMP/nAMP mode enabled?
> >>         orreq   r0, r0, #(1 << 6)               @ Enable SMP/nAMP mode
> >>         orreq   r0, r0, r10                     @ Enable CPU-specific SMP bits
> >>         mcreq   p15, 0, r0, c1, c0, 1
> >> #endif
> >>
> >> Which I am reading as, read the SMP bit from cp15 and see if it's
> >> enabled, or on UP set the SMP bit and then write it back
> >> regardless.
> >>
> >> On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP
> >> method will get used and the SMP bit will get
> >> set regardless.
> >
> > I got here and stopped reading any further.  No, it doesn't work like
> > that; you don't understand the assembly.
> >
> > Yes, you're right that the ALT_UP version will be used.  This sets r0
> > to 1 << 6.  The very next instruction tests whether bit 6 is set in r0.
> > The following three instructions will only be executed _if_ that bit
> > was zero.
> 
> Right but, as I read it...
> 
> if (CONFIG_SMP)
>    read cp15
> else
>    set bit to one to hack it
> if (!smp)
>   set smp
>   write it back

Thank you for not reading what I wrote (or maybe you can't write back
what I said correctly, but either way, it's just making me think that
replying is a waste of time.)

> So who unsets the SMP bit? When the core goes down does it clear it,
> or are we just
> being safe for CPU startup? Does it still have an effect in the chip
> logic even if the core
> is 'not plugged'?

Generally we don't.  Because on secure mode devices, we can't write the
register, and if we tried to we'd get an abort.

So, if we detect a SMP capable CPU, we will enable the SMP bit if it
wasn't already enabled, and never ever attempt to disable it.

> What happens if you're in CONFIG_SMP and you say maxcpus=1?
> 
> > I'm assuming that the rest of your email is therefore irrelevant; I
> > have no desire to continue reading such a long message when the premise
> > it's (presumably) based upon is false.  If you have any further questions
> > please restate them succinctly after taking on board the above.  Thanks.
> 
> Please read the rest... I just didn't write first my question properly.

No, unless you can say what you want more succinctly, I'm not going
to bother reading your (apparantly regular) massive emails.
Rob Herring - Aug. 15, 2012, 3:07 p.m.
On 08/09/2012 04:20 AM, Russell King - ARM Linux wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
>> - at the last stage of reset, all non-boot cpus will call  
>> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for  
>> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")
> 
> I suspect having this dmb inside cpu_relax() is flooding the
> interconnects with traffic, which then prevents other CPUs getting
> a look-in (maybe there's no fairness when it comes to dmb's.
> 
> If I'm right, you'll find is that even converting this to the ARMv7
> DMB instruction won't fix the problem.  It does, however, point
> towards a more serious problem - it means that any tight loop using
> dmb is detremental.  I have heard some people mention that even on
> various ARM SMP platforms, they have see quite an amount of
> interaction between the individual CPU cores, and I'm beginning
> to wonder whether this is why.
> 
> I think a useful test would be to only execute the DMB maybe once
> in 50 or 100 loops - the DMB is there to work around a different
> problem with the temporal locality of stores on the local CPU.  So,
> the only requirement is that we issue a DMB at some point while
> spinning waiting for another CPU to respond to our previous writes.

I think I am seeing a similar problem on highbank with a v7 only build.

>From what I've debugged, restart hangs for me on the L2x0 spinlock
during a writel. Changing the writel to writel_relaxed in the restart
hook fixes the problem. This skips barriers in the writel and for the
spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
should not be doing a dmb in my case on a v7 only build.

Rob

> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Russell King - ARM Linux - Aug. 15, 2012, 9:44 p.m.
On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote:
> I think I am seeing a similar problem on highbank with a v7 only build.
> 
> >From what I've debugged, restart hangs for me on the L2x0 spinlock
> during a writel. Changing the writel to writel_relaxed in the restart
> hook fixes the problem. This skips barriers in the writel and for the
> spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
> should not be doing a dmb in my case on a v7 only build.

Well, I had the idea of only doing a dmb() once every N loops, but I don't
think we can sensibly introduce such a change into the mainline kernel.
(How would cpu_relax() know it's being used in a loop?)

Remember that the dmb() is in cpu_relax() as a work-around for the lack
of temporal flushing of pending stores, and is needed to make various bits
of the kernel work.

So at the moment, there is no solution for this - and as I've pointed out
it can be trivially exploited in userspace on the affected CPUs.  So
really a kernel work-around isn't going to sort it.
Shawn Guo - Aug. 16, 2012, 2:31 a.m.
On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote:
> On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote:
> > I think I am seeing a similar problem on highbank with a v7 only build.
> > 
> > >From what I've debugged, restart hangs for me on the L2x0 spinlock
> > during a writel. Changing the writel to writel_relaxed in the restart
> > hook fixes the problem. This skips barriers in the writel and for the
> > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
> > should not be doing a dmb in my case on a v7 only build.
> 
> Well, I had the idea of only doing a dmb() once every N loops, but I don't
> think we can sensibly introduce such a change into the mainline kernel.
> (How would cpu_relax() know it's being used in a loop?)
> 
> Remember that the dmb() is in cpu_relax() as a work-around for the lack
> of temporal flushing of pending stores, and is needed to make various bits
> of the kernel work.
> 
The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 ||
defined(CONFIG_ARM_ERRATA_754327).  Otherwise, it's just a barrier()
call.  I guess Rob's puzzle is since the cpu_relax on the stopped
cores does not do dmb, why a wmb/mb call on the running cpu would hang
it.

One thing I note is that mb() will do a outer_sync() call.  Since the
issue is around L2x0 operation, not sure if they are related ...

> So at the moment, there is no solution for this - and as I've pointed out
> it can be trivially exploited in userspace on the affected CPUs.  So
> really a kernel work-around isn't going to sort it.

So, Sascha, it seems we get another good reason to split
imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig?
I have seen Matt Sealey and Hui Wang suggested doing this already.
Catalin Marinas - Aug. 16, 2012, 8:41 a.m.
On 15 August 2012 22:44, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote:
>> I think I am seeing a similar problem on highbank with a v7 only build.
>>
>> >From what I've debugged, restart hangs for me on the L2x0 spinlock
>> during a writel. Changing the writel to writel_relaxed in the restart
>> hook fixes the problem. This skips barriers in the writel and for the
>> spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
>> should not be doing a dmb in my case on a v7 only build.
>
> Well, I had the idea of only doing a dmb() once every N loops, but I don't
> think we can sensibly introduce such a change into the mainline kernel.
> (How would cpu_relax() know it's being used in a loop?)

It doesn't really need to know. We can say that every X calls to
cpu_relax() should result in a dmb and we can make the counter
per-cpu. This would ensure the write-buffer draining which was the aim
of this dmb.
Matt Sealey - Aug. 16, 2012, 5:21 p.m.
On Wed, Aug 15, 2012 at 9:31 PM, Shawn Guo <shawn.guo@linaro.org> wrote:
> On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote:
>> On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote:
>> > I think I am seeing a similar problem on highbank with a v7 only build.
>> >
>> > >From what I've debugged, restart hangs for me on the L2x0 spinlock
>> > during a writel. Changing the writel to writel_relaxed in the restart
>> > hook fixes the problem. This skips barriers in the writel and for the
>> > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
>> > should not be doing a dmb in my case on a v7 only build.
>>
>> Well, I had the idea of only doing a dmb() once every N loops, but I don't
>> think we can sensibly introduce such a change into the mainline kernel.
>> (How would cpu_relax() know it's being used in a loop?)
>>
>> Remember that the dmb() is in cpu_relax() as a work-around for the lack
>> of temporal flushing of pending stores, and is needed to make various bits
>> of the kernel work.
>>
> The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 ||
> defined(CONFIG_ARM_ERRATA_754327).  Otherwise, it's just a barrier()
> call.  I guess Rob's puzzle is since the cpu_relax on the stopped
> cores does not do dmb, why a wmb/mb call on the running cpu would hang
> it.
>
> One thing I note is that mb() will do a outer_sync() call.  Since the
> issue is around L2x0 operation, not sure if they are related ...
>
>> So at the moment, there is no solution for this - and as I've pointed out
>> it can be trivially exploited in userspace on the affected CPUs.  So
>> really a kernel work-around isn't going to sort it.
>
> So, Sascha, it seems we get another good reason to split
> imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig?
> I have seen Matt Sealey and Hui Wang suggested doing this already.

Only because Hui suggested it. I figured it would be better from the
standpoint of allowing
proper compiler-generated or more
efficient/architecture-revision-specific code to be
generated. It was mostly because I also have a side nit about
individual erratum being enabled,
in that by default in imx_v6_v7_defconfig they aren't; that's not nice
and it's true of other ARM
platforms too (Tegra notwithstanding). It's unlikely you'd want to run
your chip without these fixes
enabled, but in any kernel built from imx_v6_v7_defconfig it's likely
MX51, MX53 aren't going to
run particularly reliably.

There are a couple options to make it more likely to run with a
working configuration, namely
turning every single one on regardless, or alternatively every mach
defined (SOC_IMX51,
SOC_OMAP or so) or so depends on the errata necessary for correct
operation (just as
ARCH_TEGRA_2x_SOC does right now) so you can still whittle down the number of
compiled-in errata to the CPUs and cache controllers needed to boot
that particular kernel
on the particular boards you chose.

The vast majority are only done on cores that need it, but this
particular one, and a couple
others I noticed, will be enabled in the future on "v7-pure"
multi-platform configs if you just enable
Tegra support with the above solution (since it's not v6, but it is
affected by erratum 754327).

Is there any way to fix cpu_relax(), smp_mb() or whatever to
specifically fix that particular core
or particular A9 revision at runtime rather than having it be
compile-time if we're doing it every few loops,
per-cpu like Catalin suggested? I assume it's not dangerous to do so
(and would mean we don't
run into it on the systems it's not booted on, rather than running
into it by virtue of compiling it in).

I'm not suggesting it fixes the dmb issue Russell described
(especially the userspace part),
but it would mean the kernel won't be issuing them itself under any
circumstance when it
doesn't need to.
Russell King - ARM Linux - Aug. 16, 2012, 10:34 p.m.
On Thu, Aug 16, 2012 at 10:31:11AM +0800, Shawn Guo wrote:
> On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote:
> > On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote:
> > > I think I am seeing a similar problem on highbank with a v7 only build.
> > > 
> > > >From what I've debugged, restart hangs for me on the L2x0 spinlock
> > > during a writel. Changing the writel to writel_relaxed in the restart
> > > hook fixes the problem. This skips barriers in the writel and for the
> > > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores
> > > should not be doing a dmb in my case on a v7 only build.
> > 
> > Well, I had the idea of only doing a dmb() once every N loops, but I don't
> > think we can sensibly introduce such a change into the mainline kernel.
> > (How would cpu_relax() know it's being used in a loop?)
> > 
> > Remember that the dmb() is in cpu_relax() as a work-around for the lack
> > of temporal flushing of pending stores, and is needed to make various bits
> > of the kernel work.
> > 
> The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 ||
> defined(CONFIG_ARM_ERRATA_754327).  Otherwise, it's just a barrier()
> call.  I guess Rob's puzzle is since the cpu_relax on the stopped
> cores does not do dmb, why a wmb/mb call on the running cpu would hang
> it.
> 
> One thing I note is that mb() will do a outer_sync() call.  Since the
> issue is around L2x0 operation, not sure if they are related ...

This doesn't get around the problem that userspace can still effectively
issue a DoS against the system by just running a dmb in a tight loop.
Or maybe this would have a much more dramatic effect:

	while (1) {
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
		asm("dmb");
	}

and make that 3 seconds to get a ps listing turn into something much
longer?

> > So at the moment, there is no solution for this - and as I've pointed out
> > it can be trivially exploited in userspace on the affected CPUs.  So
> > really a kernel work-around isn't going to sort it.
> 
> So, Sascha, it seems we get another good reason to split
> imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig?
> I have seen Matt Sealey and Hui Wang suggested doing this already.

The last thing we need are more defconfigs.

I think what needs to happen here (while we wait) is someone _with_ the
problem needs to experiment, and find out how many nops are needed for
the DMB not to have much effect in cpu_relax().  If it turns out we just
need to put one nop in, then that's not _too_ bad.
Shawn Guo - Aug. 19, 2012, 3:26 p.m.
On Fri, Aug 17, 2012 at 11:48:56AM +0800, Shawn Guo wrote:
> > I think what needs to happen here (while we wait) is someone _with_ the
> > problem needs to experiment, and find out how many nops are needed for
> > the DMB not to have much effect in cpu_relax().  If it turns out we just
> > need to put one nop in, then that's not _too_ bad.
> 
> At least, I need to have 5 nops to get rid of the issue, something like
> below.
> 
Russell,

What's your take on this then?

Patch

diff --git a/arch/arm/mach-imx/mach-imx6q.c b/arch/arm/mach-imx/mach-imx6q.c
index 5ec0608..01e7489 100644
--- a/arch/arm/mach-imx/mach-imx6q.c
+++ b/arch/arm/mach-imx/mach-imx6q.c
@@ -37,14 +37,10 @@ 
 #include <mach/cpuidle.h>
 #include <mach/hardware.h>

+static void __iomem *wdog_base;

 void imx6q_restart(char mode, const char *cmd)
 {
-       struct device_node *np;
-       void __iomem *wdog_base;
-
-       np = of_find_compatible_node(NULL, NULL, "fsl,imx6q-wdt");
-       wdog_base = of_iomap(np, 0);
        if (!wdog_base)
                goto soft;

@@ -159,6 +155,11 @@  static void __init imx6q_usb_init(void)

 static void __init imx6q_init_machine(void)
 {
+       struct device_node *np;
+
+       np = of_find_compatible_node(NULL, NULL, "fsl,imx6q-wdt");
+       wdog_base = of_iomap(np, 0);
+
        /*
         * This should be removed when all imx6q boards have pinctrl
         * states for devices defined in device tree.