
[PATCHv2 for soc 3/4] arm: Add v7_invalidate_l1 to cache-v7.S

Message ID 1359651943-21752-4-git-send-email-dinguyen@altera.com
State New

Commit Message

dinguyen@altera.com Jan. 31, 2013, 5:05 p.m. UTC
From: Dinh Nguyen <dinguyen@altera.com>

mach-socfpga is another platform that needs to use
v7_invalidate_l1 to bring up additional cores. There was a comment that
the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.

Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Tested-by: Pavel Machek <pavel@denx.de>
Reviewed-by: Pavel Machek <pavel@denx.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Olof Johansson <olof@lixom.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Stephen Warren <swarren@wwwdotorg.org>
---
 arch/arm/mach-imx/headsmp.S      |   47 -------------------------------------
 arch/arm/mach-shmobile/headsmp.S |   48 --------------------------------------
 arch/arm/mach-tegra/headsmp.S    |   43 ----------------------------------
 arch/arm/mm/cache-v7.S           |   46 ++++++++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 138 deletions(-)
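
For context, the socfpga caller that motivates this move is not part of this
patch; its secondary-startup trampoline would presumably mirror the existing
callers this series keeps (a sketch only; socfpga_secondary_startup is an
assumed name, not code from the patch):

ENTRY(socfpga_secondary_startup)
	bl	v7_invalidate_l1	@ L1 comes out of reset in an undefined state
	b	secondary_startup	@ continue the generic ARM secondary boot path
ENDPROC(socfpga_secondary_startup)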

Comments

Stephen Warren Jan. 31, 2013, 6:11 p.m. UTC | #1
On 01/31/2013 10:05 AM, dinguyen@altera.com wrote:
> From: Dinh Nguyen <dinguyen@altera.com>
> 
> mach-socfpga is another platform that needs to use
> v7_invalidate_l1 to bring up additional cores. There was a comment that
> the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.

Tested-by: Stephen Warren <swarren@nvidia.com>
Acked-by: Stephen Warren <swarren@nvidia.com>

(on Tegra30 Cardhu, with this patch applied on top of Tegra's for-next
branch, played audio and did repeated CPU hotplugs)

There will be a minor merge conflict when this patch is merged with the
Tegra tree changes, but it's trivial to resolve; some other code was
removed from this file right next to this function.
Olof Johansson Feb. 1, 2013, 3:47 a.m. UTC | #2
Hi,

On Thu, Jan 31, 2013 at 11:05:42AM -0600, dinguyen@altera.com wrote:
> From: Dinh Nguyen <dinguyen@altera.com>
> 
> mach-socfpga is another platform that needs to use
> v7_invalidate_l1 to bring up additional cores. There was a comment that
> the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.
> 
> Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
> Acked-by: Simon Horman <horms+renesas@verge.net.au>
> Tested-by: Pavel Machek <pavel@denx.de>
> Reviewed-by: Pavel Machek <pavel@denx.de>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Olof Johansson <olof@lixom.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Sascha Hauer <kernel@pengutronix.de>
> Cc: Magnus Damm <magnus.damm@gmail.com>
> Cc: Stephen Warren <swarren@wwwdotorg.org>

This should be merged through Russell's tree. Please feed it to his patch
tracker.


-Olof
Santosh Shilimkar Feb. 1, 2013, 11:29 a.m. UTC | #3
+ Lorenzo,

On Thursday 31 January 2013 10:35 PM, dinguyen@altera.com wrote:
> From: Dinh Nguyen <dinguyen@altera.com>
>
> mach-socfpga is another platform that needs to use
> v7_invalidate_l1 to bring up additional cores. There was a comment that
> the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.
>
> Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
> Acked-by: Simon Horman <horms+renesas@verge.net.au>
> Tested-by: Pavel Machek <pavel@denx.de>
> Reviewed-by: Pavel Machek <pavel@denx.de>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Olof Johansson <olof@lixom.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rob Herring <rob.herring@calxeda.com>
> Cc: Sascha Hauer <kernel@pengutronix.de>
> Cc: Magnus Damm <magnus.damm@gmail.com>
> Cc: Stephen Warren <swarren@wwwdotorg.org>
> ---
>   arch/arm/mach-imx/headsmp.S      |   47 -------------------------------------
>   arch/arm/mach-shmobile/headsmp.S |   48 --------------------------------------
>   arch/arm/mach-tegra/headsmp.S    |   43 ----------------------------------
>   arch/arm/mm/cache-v7.S           |   46 ++++++++++++++++++++++++++++++++++++
>   4 files changed, 46 insertions(+), 138 deletions(-)
>
[..]

> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 7539ec2..15451ee 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -19,6 +19,52 @@
>   #include "proc-macros.S"
>
>   /*
> + * The secondary kernel init calls v7_flush_dcache_all before it enables
> + * the L1; however, the L1 comes out of reset in an undefined state, so
> + * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
> + * of cache lines with uninitialized data and uninitialized tags to get
> + * written out to memory, which does really unpleasant things to the main
> + * processor.  We fix this by performing an invalidate, rather than a
> + * clean + invalidate, before jumping into the kernel.
> + *
> + * This function is cloned from arch/arm/mach-tegra/headsmp.S, and needs
> + * to be called for both secondary cores startup and primary core resume
> + * procedures.
> + */
> +ENTRY(v7_invalidate_l1)
Now since we are moving the code to a common place, we should probably
update this function a bit so that it invalidates the CPU cache up to the
Level of Unification, just to be consistent with the other flush APIs.

Lorenzo,
Does that make sense?

Regards,
Santosh
Russell King - ARM Linux Feb. 1, 2013, 11:32 a.m. UTC | #4
On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
> Now since we are moving the code to a common place, we should probably
> update this function a bit so that it invalidates the CPU cache up to the
> Level of Unification, just to be consistent with the other flush APIs.

Hmm.  Do you really want a CPU being brought up to do that to the PoU,
every time that it is brought up?  I thought you wanted to get rid of
that kind of stuff from the hotplug paths so that a CPU being brought
up/taken down doesn't affect the caches for the other CPUs within the
inner sharable domain.
Santosh Shilimkar Feb. 1, 2013, 11:44 a.m. UTC | #5
On Friday 01 February 2013 05:02 PM, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
>> Now since we are moving the code to a common place, we should probably
>> update this function a bit so that it invalidates the CPU cache up to the
>> Level of Unification, just to be consistent with the other flush APIs.
>
> Hmm.  Do you really want a CPU being brought up to do that to the PoU,
> every time that it is brought up?  I thought you wanted to get rid of
> that kind of stuff from the hotplug paths so that a CPU being brought
> up/taken down doesn't affect the caches for the other CPUs within the
> inner sharable domain.
>
You are right. We already got rid of the flush of all cache levels
in the hotplug and wakeup paths, and now it is restricted to the PoU.

Assuming that on current v7 machines the PoU is L2, invalidating the cache
*up to* the PoU means only the CPU-local cache. So the API will in effect
invalidate only the local cache.

Maybe I am missing your point here.

Regards,
Santosh
Lorenzo Pieralisi Feb. 1, 2013, 12:11 p.m. UTC | #6
On Fri, Feb 01, 2013 at 11:29:44AM +0000, Santosh Shilimkar wrote:
> + Lorenzo,
> 
> On Thursday 31 January 2013 10:35 PM, dinguyen@altera.com wrote:
> > From: Dinh Nguyen <dinguyen@altera.com>
> >
> > mach-socfpga is another platform that needs to use
> > v7_invalidate_l1 to bring up additional cores. There was a comment that
> > the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.
> >
> > Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
> > Acked-by: Simon Horman <horms+renesas@verge.net.au>
> > Tested-by: Pavel Machek <pavel@denx.de>
> > Reviewed-by: Pavel Machek <pavel@denx.de>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Russell King <linux@arm.linux.org.uk>
> > Cc: Olof Johansson <olof@lixom.net>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Rob Herring <rob.herring@calxeda.com>
> > Cc: Sascha Hauer <kernel@pengutronix.de>
> > Cc: Magnus Damm <magnus.damm@gmail.com>
> > Cc: Stephen Warren <swarren@wwwdotorg.org>
> > ---
> >   arch/arm/mach-imx/headsmp.S      |   47 -------------------------------------
> >   arch/arm/mach-shmobile/headsmp.S |   48 --------------------------------------
> >   arch/arm/mach-tegra/headsmp.S    |   43 ----------------------------------
> >   arch/arm/mm/cache-v7.S           |   46 ++++++++++++++++++++++++++++++++++++
> >   4 files changed, 46 insertions(+), 138 deletions(-)
> >
> [..]
> 
> > diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> > index 7539ec2..15451ee 100644
> > --- a/arch/arm/mm/cache-v7.S
> > +++ b/arch/arm/mm/cache-v7.S
> > @@ -19,6 +19,52 @@
> >   #include "proc-macros.S"
> >
> >   /*
> > + * The secondary kernel init calls v7_flush_dcache_all before it enables
> > + * the L1; however, the L1 comes out of reset in an undefined state, so
> > + * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
> > + * of cache lines with uninitialized data and uninitialized tags to get
> > + * written out to memory, which does really unpleasant things to the main
> > + * processor.  We fix this by performing an invalidate, rather than a
> > + * clean + invalidate, before jumping into the kernel.
> > + *
> > + * This function is cloned from arch/arm/mach-tegra/headsmp.S, and needs
> > + * to be called for both secondary cores startup and primary core resume
> > + * procedures.
> > + */
> > +ENTRY(v7_invalidate_l1)
> Now since we are moving the code to a common place, we should probably
> update this function a bit so that it invalidates the CPU cache up to the
> Level of Unification, just to be consistent with the other flush APIs.
> 
> Lorenzo,
> Does that make sense?

Well, on latest processors (A15, A7) caches are invalidated on reset unless
the chip is programmed to skip that on reset (ie L2 retained).

But it makes sense, for sure it should not be called v7_invalidate_l1,
but invalidate_louis, and instead of forcing level 0 we should be
reading LoUIS and invalidating up to that level, as we do in the clean and
invalidate function.
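
For reference, the extraction itself is only a couple of instructions; a
minimal sketch, along the lines of what v7_flush_dcache_louis does (register
choices here are illustrative):

	mrc	p15, 1, r0, c0, c0, 1	@ read CLIDR into r0
	ands	r3, r0, #(7 << 21)	@ extract LoUIS, CLIDR bits [23:21]
	mov	r3, r3, lsr #20		@ r3 = LoUIS * 2, the level limit a set/way loop would use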

Is it worth adding a v7 cache function where the only difference wrt the clean
and invalidate operation is a coprocessor opcode? Probably not, IMHO: why add
the set/way loop again?

It is never called from C code, so I do not think there is a point in
adding a C API either.

Lorenzo
Santosh Shilimkar Feb. 1, 2013, 12:24 p.m. UTC | #7
On Friday 01 February 2013 05:41 PM, Lorenzo Pieralisi wrote:
> On Fri, Feb 01, 2013 at 11:29:44AM +0000, Santosh Shilimkar wrote:
>> + Lorenzo,
>>
>> On Thursday 31 January 2013 10:35 PM, dinguyen@altera.com wrote:
>>> From: Dinh Nguyen <dinguyen@altera.com>
>>>
>>> mach-socfpga is another platform that needs to use
>>> v7_invalidate_l1 to bring up additional cores. There was a comment that
>>> the ideal place for v7_invalidate_l1 should be in arch/arm/mm/cache-v7.S.
>>>
>>> Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
>>> Acked-by: Simon Horman <horms+renesas@verge.net.au>
>>> Tested-by: Pavel Machek <pavel@denx.de>
>>> Reviewed-by: Pavel Machek <pavel@denx.de>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Russell King <linux@arm.linux.org.uk>
>>> Cc: Olof Johansson <olof@lixom.net>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Rob Herring <rob.herring@calxeda.com>
>>> Cc: Sascha Hauer <kernel@pengutronix.de>
>>> Cc: Magnus Damm <magnus.damm@gmail.com>
>>> Cc: Stephen Warren <swarren@wwwdotorg.org>
>>> ---
>>>    arch/arm/mach-imx/headsmp.S      |   47 -------------------------------------
>>>    arch/arm/mach-shmobile/headsmp.S |   48 --------------------------------------
>>>    arch/arm/mach-tegra/headsmp.S    |   43 ----------------------------------
>>>    arch/arm/mm/cache-v7.S           |   46 ++++++++++++++++++++++++++++++++++++
>>>    4 files changed, 46 insertions(+), 138 deletions(-)
>>>
>> [..]
>>
>>> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
>>> index 7539ec2..15451ee 100644
>>> --- a/arch/arm/mm/cache-v7.S
>>> +++ b/arch/arm/mm/cache-v7.S
>>> @@ -19,6 +19,52 @@
>>>    #include "proc-macros.S"
>>>
>>>    /*
>>> + * The secondary kernel init calls v7_flush_dcache_all before it enables
>>> + * the L1; however, the L1 comes out of reset in an undefined state, so
>>> + * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
>>> + * of cache lines with uninitialized data and uninitialized tags to get
>>> + * written out to memory, which does really unpleasant things to the main
>>> + * processor.  We fix this by performing an invalidate, rather than a
>>> + * clean + invalidate, before jumping into the kernel.
>>> + *
>>> + * This function is cloned from arch/arm/mach-tegra/headsmp.S, and needs
>>> + * to be called for both secondary cores startup and primary core resume
>>> + * procedures.
>>> + */
>>> +ENTRY(v7_invalidate_l1)
>> Now since we are moving the code to a common place, we should probably
>> update this function a bit so that it invalidates the CPU cache up to the
>> Level of Unification, just to be consistent with the other flush APIs.
>>
>> Lorenzo,
>> Does that make sense?
>
> Well, on latest processors (A15, A7) caches are invalidated on reset unless
> the chip is programmed to skip that on reset (ie L2 retained).
>
> But it makes sense, for sure it should not be called v7_invalidate_l1,
> but invalidate_louis, and instead of forcing level 0 we should be
> reading LoUIS and invalidating up to that level, as we do in the clean and
> invalidate function.
>
That is exactly what I was thinking.

> Is it worth adding a v7 cache function where the only difference wrt the clean
> and invalidate operation is a coprocessor opcode? Probably not, IMHO: why add
> the set/way loop again?
>
Probably the same function could be reused, with a parameter passed in to
differentiate the invalidate from the flush.
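
A hypothetical shape for that is sketched below: the existing loop body with
the CRm operand made a macro parameter (the macro name and structure are
illustrative, not from the patch):

.macro	v7_l1_way_op, crm		@ crm: c6 = invalidate, c14 = clean+invalidate
	mov	r0, #0
	mcr	p15, 2, r0, c0, c0, 0	@ CSSELR: select the level 1 data cache
	mrc	p15, 1, r0, c0, c0, 0	@ CCSIDR: read its geometry
	ldr	r1, =0x7fff
	and	r2, r1, r0, lsr #13	@ NumSets - 1
	ldr	r1, =0x3ff
	and	r3, r1, r0, lsr #3	@ NumWays - 1
	add	r2, r2, #1		@ NumSets
	and	r0, r0, #0x7
	add	r0, r0, #4		@ SetShift
	clz	r1, r3			@ WayShift
	add	r4, r3, #1		@ NumWays
1:	sub	r2, r2, #1		@ NumSets--
	mov	r3, r4			@ Temp = NumWays
2:	subs	r3, r3, #1		@ Temp--
	mov	r5, r3, lsl r1
	mov	r6, r2, lsl r0
	orr	r5, r5, r6		@ (Temp << WayShift) | (NumSets << SetShift)
	mcr	p15, 0, r5, c7, \crm, 2	@ the only instruction that differs
	bgt	2b
	cmp	r2, #0
	bgt	1b
	dsb
	isb
	mov	pc, lr
.endm

ENTRY(v7_invalidate_l1)
	v7_l1_way_op c6
ENDPROC(v7_invalidate_l1)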

> It is never called from C code, so I do not think there is a point in
> adding a C API either.
>
I agree. A C API isn't needed.

Regards,
Santosh
Russell King - ARM Linux Feb. 1, 2013, 12:48 p.m. UTC | #8
On Fri, Feb 01, 2013 at 05:14:11PM +0530, Santosh Shilimkar wrote:
> On Friday 01 February 2013 05:02 PM, Russell King - ARM Linux wrote:
>> On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
>>> Now since we are moving the code to a common place, we should probably
>>> update this function a bit so that it invalidates the CPU cache up to the
>>> Level of Unification, just to be consistent with the other flush APIs.
>>
>> Hmm.  Do you really want a CPU being brought up to do that to the PoU,
>> every time that it is brought up?  I thought you wanted to get rid of
>> that kind of stuff from the hotplug paths so that a CPU being brought
>> up/taken down doesn't affect the caches for the other CPUs within the
>> inner sharable domain.
>>
> You are right. We already got rid of the flush of all cache levels
> in the hotplug and wakeup paths, and now it is restricted to the PoU.
>
> Assuming that on current v7 machines the PoU is L2, invalidating the cache
> *up to* the PoU means only the CPU-local cache. So the API will in effect
> invalidate only the local cache.

Err, you want to _invalidate_ the caches down to the point of I/D/TLB
unification?  Are you really sure you want to do that on a system
where other CPUs are running?

Even going down to the LoUIS, that point is the point at which the
_other_ CPUs may be sharing caches.

And invalidating those caches while the other CPUs are running on
secondary CPU startup will be VERY VERY VERY bad.
Russell King - ARM Linux Feb. 1, 2013, 12:54 p.m. UTC | #9
On Fri, Feb 01, 2013 at 12:11:38PM +0000, Lorenzo Pieralisi wrote:
> Well, on latest processors (A15, A7) caches are invalidated on reset unless
> the chip is programmed to skip that on reset (ie L2 retained).
> 
> But it makes sense, for sure it should not be called v7_invalidate_l1,
> but invalidate_louis, and instead of forcing level 0 we should be
> reading LoUIS and invalidating up to that level, as we do in the clean and
> invalidate function.

No.  Think about it.  c7, c6, 2 _invalidates_ the cache.  That means any
data contained within the cache is discarded.  Data is not written back.

If you do this down to the LoUIS, that includes all cache levels in the
inner sharable domain.  The inner sharable domain includes the other CPUs
in the system which may already be running (certainly the boot CPU will
be running).

Are you _really_ sure you want to be invalidating _valid_ data held in
caches in the inner sharable domain by other CPUs, rather than just the
cache associated with the CPU which is being brought online?
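
For reference, the three ARMv7 set/way data cache maintenance encodings differ
only in CRm, and choosing between them is exactly the discard-versus-writeback
distinction made above (sketch; r5 holds a set/way value as in the patch):

	mcr	p15, 0, r5, c7, c6, 2	@ DCISW: invalidate by set/way, data discarded
	mcr	p15, 0, r5, c7, c10, 2	@ DCCSW: clean by set/way, dirty data written back
	mcr	p15, 0, r5, c7, c14, 2	@ DCCISW: clean, then invalidate, by set/way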
Santosh Shilimkar Feb. 1, 2013, 1:04 p.m. UTC | #10
On Friday 01 February 2013 06:18 PM, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 05:14:11PM +0530, Santosh Shilimkar wrote:
>> On Friday 01 February 2013 05:02 PM, Russell King - ARM Linux wrote:
>>> On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
>>>> Now since we are moving the code to a common place, we should probably
>>>> update this function a bit so that it invalidates the CPU cache up to the
>>>> Level of Unification, just to be consistent with the other flush APIs.
>>>
>>> Hmm.  Do you really want a CPU being brought up to do that to the PoU,
>>> every time that it is brought up?  I thought you wanted to get rid of
>>> that kind of stuff from the hotplug paths so that a CPU being brought
>>> up/taken down doesn't affect the caches for the other CPUs within the
>>> inner sharable domain.
>>>
>> You are right. We already got rid of the flush of all cache levels
>> in the hotplug and wakeup paths, and now it is restricted to the PoU.
>>
>> Assuming that on current v7 machines the PoU is L2, invalidating the cache
>> *up to* the PoU means only the CPU-local cache. So the API will in effect
>> invalidate only the local cache.
>
> Err, you want to _invalidate_ the caches down to the point of I/D/TLB
> unification?  Are you really sure you want to do that on a system
> where other CPUs are running?
>
> Even going down to the LoUIS, that point is the point at which the
> _other_ CPUs may be sharing caches.
>
> And invalidating those caches while the other CPUs are running on
> secondary CPU startup will be VERY VERY VERY bad.
>
Absolutely, and my intention was never to invalidate all the cache
levels. When I said lous, I meant up to that point, not including
that level and the next cache levels. Maybe my terminology isn't accurate.

Regards,
Santosh
Russell King - ARM Linux Feb. 1, 2013, 1:20 p.m. UTC | #11
On Fri, Feb 01, 2013 at 06:34:06PM +0530, Santosh Shilimkar wrote:
> On Friday 01 February 2013 06:18 PM, Russell King - ARM Linux wrote:
>> On Fri, Feb 01, 2013 at 05:14:11PM +0530, Santosh Shilimkar wrote:
>>> On Friday 01 February 2013 05:02 PM, Russell King - ARM Linux wrote:
>>>> On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
>>>>> Now since we are moving the code to a common place, we should probably
>>>>> update this function a bit so that it invalidates the CPU cache up to the
>>>>> Level of Unification, just to be consistent with the other flush APIs.
>>>>
>>>> Hmm.  Do you really want a CPU being brought up to do that to the PoU,
>>>> every time that it is brought up?  I thought you wanted to get rid of
>>>> that kind of stuff from the hotplug paths so that a CPU being brought
>>>> up/taken down doesn't affect the caches for the other CPUs within the
>>>> inner sharable domain.
>>>>
>>> You are right. We already got rid of the flush of all cache levels
>>> in the hotplug and wakeup paths, and now it is restricted to the PoU.
>>>
>>> Assuming that on current v7 machines the PoU is L2, invalidating the cache
>>> *up to* the PoU means only the CPU-local cache. So the API will in effect
>>> invalidate only the local cache.
>>
>> Err, you want to _invalidate_ the caches down to the point of I/D/TLB
>> unification?  Are you really sure you want to do that on a system
>> where other CPUs are running?
>>
>> Even going down to the LoUIS, that point is the point at which the
>> _other_ CPUs may be sharing caches.
>>
>> And invalidating those caches while the other CPUs are running on
>> secondary CPU startup will be VERY VERY VERY bad.
>>
> Absolutely, and my intention was never to invalidate all the cache
> levels. When I said lous, I meant up to that point, not including
> that level and the next cache levels. Maybe my terminology isn't accurate.

Confused.  Can't find the term "lous" in your previous mails.  I'll
assume you mean LoUIS.  I'm saying that's totally wrong because going
down to that point _includes_ the other CPUs in the system.

What we should be doing on secondary CPU bringup for CPUs where the
caches are in an unknown state is invalidating those caches which are
local to _that_ _CPU_ _only_.  That is not "all cache levels down to
LoUIS".

Here's an example.  CPU0-3 are in the inner sharable domain.

+----------+ +----------+ +----------+ +----------+
|   CPU0   | |   CPU1   | |   CPU2   | |   CPU3   |
+----------+ +----------+ +----------+ +----------+
   |    |       |    |       |    |       |    |
+--v-+--v-+  +--v-+--v-+  +--v-+--v-+  +--v-+--v-+
|CL0I|CL0D|  |CL0I|CL0D|  |CL0I|CL0D|  |CL0I|CL0D| <-- cache level 0
+----+----+  +----+----+  +----+----+  +----+----+
   |    |       |    |       |    |       |    |
   |    |   +---+    +--+    |    |       |    +------------------+
   |    |   |           |    |    +-------|--------------+        |
   |    |   |        +-------+            |              |        |
   |    |   |        |  +-----------------|-----+        |        |
   |    |   |        |        +-----------+     |        |        |
   |    +---|--------|--------|--------+        |        |        |
+--v--------v--------v--------v--+  +--v--------v--------v--------v--+
|            CL1I                |  |            CL1D                | level 1
+--------------------------------+  +--------------------------------+
                 |                                   |            
+----------------v-----------------------------------v---------------+
|                                 CL2                                | level 2
+--------------------------------------------------------------------+

Therefore, because the point of unification for the inner sharable domain
is defined as the point at which the I, D and TLB streams (TLB not shown)
are combined, this happens at CL2, and does *not* include CL2.  CL2 is
where the two paths see the same data.

So, the LoUIS includes caches in level 0 and level 1.

However, CL1 _is_ shared between CPU0-3 - it is part of the inner sharable
domain.  If you invalidate CL1, then you're destroying data held there by
the other CPUs.  So, invalidating down to the LoUIS on secondary CPU
bringup is _wrong_.
Santosh Shilimkar Feb. 1, 2013, 2:09 p.m. UTC | #12
On Friday 01 February 2013 06:50 PM, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 06:34:06PM +0530, Santosh Shilimkar wrote:
>> On Friday 01 February 2013 06:18 PM, Russell King - ARM Linux wrote:
>>> On Fri, Feb 01, 2013 at 05:14:11PM +0530, Santosh Shilimkar wrote:
>>>> On Friday 01 February 2013 05:02 PM, Russell King - ARM Linux wrote:
>>>>> On Fri, Feb 01, 2013 at 04:59:44PM +0530, Santosh Shilimkar wrote:
>>>>>> Now since we are moving the code to a common place, we should probably
>>>>>> update this function a bit so that it invalidates the CPU cache up to the
>>>>>> Level of Unification, just to be consistent with the other flush APIs.
>>>>>
>>>>> Hmm.  Do you really want a CPU being brought up to do that to the PoU,
>>>>> every time that it is brought up?  I thought you wanted to get rid of
>>>>> that kind of stuff from the hotplug paths so that a CPU being brought
>>>>> up/taken down doesn't affect the caches for the other CPUs within the
>>>>> inner sharable domain.
>>>>>
>>>> You are right. We already got rid of the flush of all cache levels
>>>> in the hotplug and wakeup paths, and now it is restricted to the PoU.
>>>>
>>>> Assuming that on current v7 machines the PoU is L2, invalidating the cache
>>>> *up to* the PoU means only the CPU-local cache. So the API will in effect
>>>> invalidate only the local cache.
>>>
>>> Err, you want to _invalidate_ the caches down to the point of I/D/TLB
>>> unification?  Are you really sure you want to do that on a system
>>> where other CPUs are running?
>>>
>>> Even going down to the LoUIS, that point is the point at which the
>>> _other_ CPUs may be sharing caches.
>>>
>>> And invalidating those caches while the other CPUs are running on
>>> secondary CPU startup will be VERY VERY VERY bad.
>>>
>> Absolutely, and my intention was never to invalidate all the cache
>> levels. When I said lous, I meant up to that point, not including
>> that level and the next cache levels. Maybe my terminology isn't accurate.
>
> Confused.  Can't find the term "lous" in your previous mails.  I'll
> assume you mean LoUIS.  I'm saying that's totally wrong because going
> down to that point _includes_ the other CPUs in the system.
>
Sorry for the typo. I meant LoUIS.

> What we should be doing on secondary CPU bringup for CPUs where the
> caches are in an unknown state is invalidating those caches which are
> local to _that_ _CPU_ _only_.  That is not "all cache levels down to
> LoUIS".
>
Restricting it to the local CPU cache is also my point. My example was a bit
narrow, based on current A9 and A15 designs where there are only L1 and L2
caches, and hence I had not considered the case below.

> Here's an example.  CPU0-3 are in the inner sharable domain.
>
> +----------+ +----------+ +----------+ +----------+
> |   CPU0   | |   CPU1   | |   CPU2   | |   CPU3   |
> +----------+ +----------+ +----------+ +----------+
>     |    |       |    |       |    |       |    |
> +--v-+--v-+  +--v-+--v-+  +--v-+--v-+  +--v-+--v-+
> |CL0I|CL0D|  |CL0I|CL0D|  |CL0I|CL0D|  |CL0I|CL0D| <-- cache level 0
> +----+----+  +----+----+  +----+----+  +----+----+
>     |    |       |    |       |    |       |    |
>     |    |   +---+    +--+    |    |       |    +------------------+
>     |    |   |           |    |    +-------|--------------+        |
>     |    |   |        +-------+            |              |        |
>     |    |   |        |  +-----------------|-----+        |        |
>     |    |   |        |        +-----------+     |        |        |
>     |    +---|--------|--------|--------+        |        |        |
> +--v--------v--------v--------v--+  +--v--------v--------v--------v--+
> |            CL1I                |  |            CL1D                | level 1
> +--------------------------------+  +--------------------------------+
>                   |                                   |
> +----------------v-----------------------------------v---------------+
> |                                 CL2                                | level 2
> +--------------------------------------------------------------------+
>
> Therefore, because the point of unification for the inner sharable domain
> is defined as the point at which the I, D and TLB streams (TLB not shown)
> are combined, this happens at CL2, and does *not* include CL2.  CL2 is
> where the two paths see the same data.
>
> So, the LoUIS includes caches in level 0 and level 1.
>
That's right for the above example. I haven't seen such a design so far,
hence my narrow view.

> However, CL1 _is_ shared between CPU0-3 - it is part of the inner sharable
> domain.  If you invalidate CL1, then you're destroying data held there by
> the other CPUs.  So, invalidating down to the LoUIS on secondary CPU
> bringup is _wrong_.
>
As I said above, I hadn't considered the case you mention here. Thanks
for bringing up this example. I am aligned with you on _NO_ to invalidating,
from one CPU, a cache level which is shared across multiple CPUs.

Regards,
Santosh
Lorenzo Pieralisi Feb. 1, 2013, 2:10 p.m. UTC | #13
On Fri, Feb 01, 2013 at 12:54:17PM +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 12:11:38PM +0000, Lorenzo Pieralisi wrote:
> > Well, on latest processors (A15, A7) caches are invalidated on reset unless
> > the chip is programmed to skip that on reset (ie L2 retained).
> > 
> > But it makes sense, for sure it should not be called v7_invalidate_l1,
> > but invalidate_louis, and instead of forcing level 0 we should be
> > reading LoUIS and invalidating up to that level, as we do in the clean and
> > invalidate function.
> 
> No.  Think about it.  c7, c6, 2 _invalidates_ the cache.  That means any
> data contained within the cache is discarded.  Data is not written back.
> 
> If you do this down to the LoUIS, that includes all cache levels in the
> inner sharable domain.  The inner sharable domain includes the other CPUs
> in the system which may already be running (certainly the boot CPU will
> be running).

On all v7 ARM systems I know of, the LoUIS corresponds to L1, and using LoUIS is
a plaster that works for current systems (where by LoUIS I mean all cache
levels down to, but not inclusive of, the LoUIS).

What you are saying is true for the cpu_suspend code and hotplug as well; there
we are cleaning and invalidating down to the LoUIS, which might not be needed,
since LoUIS does not mean that all the cache levels included are _local_ to the
CPU.

There is no way on ARM systems to detect what cache levels are local to
a CPU, so we decided to use LoUIS for that, as a temporary solution to
avoid cluttering the code with DT bindings that link cache topology and
CPU topology.

And on all ARM systems in mainline, "operating on all levels down to
but not inclusive of LoUIS" is equivalent to the cache levels local to a CPU;
if it is not, I am missing something and I apologize.

> Are you _really_ sure you want to be invalidating _valid_ data held in
> caches in the inner sharable domain by other CPUs, rather than just the
> cache associated with the CPU which is being brought online?

We know that LoUIS corresponds to cache levels local to a CPU for now and we
know that it will fail as soon as there are caches in the inner shareability
domain that belong to multiple CPUs. Again, that's valid for the clean
and invalidate functions as well (though there it is an optimization
problem; here, I agree, it is a validity issue and more serious).

There is no way to detect the cache levels local to a CPU; I tried to do
that before posting the new LoUIS API, and all I can say is that we need
device tree bindings to do it cleanly. Architecturally, it is just not
detectable.

I am happy to discuss a way forward, Russell; your concerns are correct,
we are aware of them, and we are happy to improve the code to make it work
consistently.

Thanks,
Lorenzo
Russell King - ARM Linux Feb. 1, 2013, 2:19 p.m. UTC | #14
On Fri, Feb 01, 2013 at 02:10:52PM +0000, Lorenzo Pieralisi wrote:
> On Fri, Feb 01, 2013 at 12:54:17PM +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 01, 2013 at 12:11:38PM +0000, Lorenzo Pieralisi wrote:
> > > Well, on latest processors (A15, A7) caches are invalidated on reset unless
> > > the chip is programmed to skip that on reset (ie L2 retained).
> > > 
> > > But it makes sense, for sure it should not be called v7_invalidate_l1,
> > > but invalidate_louis, and instead of forcing level 0 we should be
> > > reading LoUIS and invalidating up to that level, as we do in the clean and
> > > invalidate function.
> > 
> > No.  Think about it.  c7, c6, 2 _invalidates_ the cache.  That means any
> > data contained within the cache is discarded.  Data is not written back.
> > 
> > If you do this down to the LoUIS, that includes all cache levels in the
> > inner sharable domain.  The inner sharable domain includes the other CPUs
> > in the system which may already be running (certainly the boot CPU will
> > be running).
> 
> On all v7 ARM systems I know of, the LoUIS corresponds to L1, and using LoUIS is
> a plaster that works for current systems (where by LoUIS I mean all cache
> levels down to, but not inclusive of, the LoUIS).

All that I'm saying is that suggesting that v7_invalidate_l1 should go
down to LoUIS is wrong, in much the same way as _invalidating_ only the
first level of cache.

However, invalidating the first level of cache only is safer than
invalidating down to LoUIS.

The only path which needs this is the secondary CPU bringup path; that's
the only path we have where some platforms give us a CPU which has only
just come out of reset, and so the caches for that CPU may be in an
unknown state.  And it only happens to a small subset of platforms.

Currently, that small subset of platforms only needs the first level of
cache invalidated.  (Most platforms don't need this; on most platforms
this would be a waste of time - and people seem to care about hotplug
times.)

So, that's what we should do.  And v7_invalidate_l1 is a perfectly
reasonable name for a function to do that for the V7 architecture.

As has been pointed out, there are several duplications of that.  That's
fine, let's move them into the v7 cache code.  But... let's not change
how they work and go through a pointless design exercise for
invalidating more levels of cache when we know that no platform needs
it.

If/when we do end up with a platform which requires it, we can deal
with it then.  But let's not lead people down the route of thinking
that LoUIS is suitable here when it isn't.
Russell King - ARM Linux Feb. 1, 2013, 2:31 p.m. UTC | #15
On Fri, Feb 01, 2013 at 02:19:14PM +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 02:10:52PM +0000, Lorenzo Pieralisi wrote:
> > On Fri, Feb 01, 2013 at 12:54:17PM +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 01, 2013 at 12:11:38PM +0000, Lorenzo Pieralisi wrote:
> > > > Well, on latest processors (A15, A7) caches are invalidated on reset unless
> > > > the chip is programmed to skip that on reset (ie L2 retained).
> > > > 
> > > > But it makes sense, for sure it should not be called v7_invalidate_l1,
> > > > but invalidate_louis, and instead of forcing level 0 we should be
> > > > reading LoUIS and invalidating up to that level, as we do in the clean and
> > > > invalidate function.
> > > 
> > > No.  Think about it.  c7, c6, 2 _invalidates_ the cache.  That means any
> > > data contained within the cache is discarded.  Data is not written back.
> > > 
> > > If you do this down to the LoUIS, that includes all cache levels in the
> > > inner sharable domain.  The inner sharable domain includes the other CPUs
> > > in the system which may already be running (certainly the boot CPU will
> > > be running).
> > 
> > On all v7 ARM systems I know of, the LoUIS corresponds to L1, and using LoUIS is
> > a plaster that works for current systems (where by LoUIS I mean all cache
> > levels down to, but not inclusive of, the LoUIS).
> 
> All that I'm saying is that suggesting that v7_invalidate_l1 should go
> down to LoUIS is wrong, in much the same way as _invalidating_ only the
> first level of cache.
> 
> However, invalidating the first level of cache only is safer than
> invalidating down to LoUIS.
> 
> The only path which needs this is the secondary CPU bringup path; that's
> the only path we have where some platforms give us a CPU which has only
> just come out of reset, and so the caches for that CPU may be in an
> unknown state.  And it only happens to a small subset of platforms.
> 
> Currently, that small subset of platforms only needs the first level of
> cache invalidated.  (Most platforms don't need this; on most platforms
> this would be a waste of time - and people seem to care about hotplug
> times.)
> 
> So, that's what we should do.  And v7_invalidate_l1 is a perfectly
> reasonable name for a function to do that for the V7 architecture.
> 
> As has been pointed out, there are several duplications of that.  That's
> fine, let's move them into the v7 cache code.  But... let's not change
> how they work and go through a pointless design exercise for
> invalidating more levels of cache when we know that no platform needs
> it.
> 
> If/when we do end up with a platform which requires it, we can deal
> with it then.  But let's not lead people down the route of thinking
> that LoUIS is suitable here when it isn't.

Just to further provide some insight into the reasoning:

Invalidating data out of a working cache risks data corruption; maybe
the data being invalidated is filesystem metadata which was about to
be cleaned and written back to storage.  That risks filesystem
corruption.

Invalidating fewer levels than are actually required is different: we
may leave dirty cache lines behind which may be evicted, but there's
also the chance that the CPU will end up _reading_ from its
uninitialized caches and may crash before that happens.

So, the risks are:
1. invalidate more levels than are necessary and risk discarding data
   which other CPUs are using, which may be important data.
2. invalidate fewer levels than are necessary and risk writing out
   data from the CPU cache, which may or may not happen _before_ the
   CPU crashes due to reading invalid data.

Out of those two, (2) sounds to me to be the safer approach.

Plus, I can't think of a reason why you'd want to put on an SMP system
more than one layer of CPU local caches... to do so would seem to me to
be an exercise in coherency complexity...  So, I suspect that in the
real world, we will _never_ see any system which has more than one
layer of caches local to the CPU.  But we may see a system with a
cache architecture similar to the one I drew in my email to Santosh.
Lorenzo Pieralisi Feb. 1, 2013, 2:34 p.m. UTC | #16
On Fri, Feb 01, 2013 at 02:19:14PM +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 02:10:52PM +0000, Lorenzo Pieralisi wrote:
> > On Fri, Feb 01, 2013 at 12:54:17PM +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 01, 2013 at 12:11:38PM +0000, Lorenzo Pieralisi wrote:
> > > > Well, on latest processors (A15, A7) caches are invalidated on reset unless
> > > > the chip is programmed to skip that on reset (ie L2 retained).
> > > > 
> > > > But it makes sense, for sure it should not be called v7_invalidate_l1,
> > > > but invalidate_louis, and instead of forcing level 0 we should be
> > > > reading LoUIS and invalidating up to that level, as we do in the clean and
> > > > invalidate function.
> > > 
> > > No.  Think about it.  c7, c6, 2 _invalidates_ the cache.  That means any
> > > data contained within the cache is discarded.  Data is not written back.
> > > 
> > > If you do this down to the LoUIS, that includes all cache levels in the
> > > inner sharable domain.  The inner sharable domain includes the other CPUs
> > > in the system which may already be running (certainly the boot CPU will
> > > be running).
> > 
> > On all v7 ARM systems I know of, the LoUIS corresponds to L1, and using LoUIS is
> > a plaster that works for current systems (where by LoUIS I mean all cache
> > levels down to, but not inclusive of, the LoUIS).
> 
> All that I'm saying is that suggesting that v7_invalidate_l1 should go
> down to LoUIS is wrong, in much the same way as _invalidating_ only the
> first level of cache.
> 
> However, invalidating the first level of cache only is safer than
> invalidating down to LoUIS.
> 
> The only path which needs this is the secondary CPU bringup path; that's
> the only path we have where some platforms give us a CPU which has only
> just come out of reset, and so the caches for that CPU may be in an
> unknown state.  And it only happens to a small subset of platforms.
> 
> Currently, that small subset of platforms only needs the first level of
> cache invalidated.  (Most platforms don't need this; on most platforms
> this would be a waste of time - and people seem to care about hotplug
> times.)
> 
> So, that's what we should do.  And v7_invalidate_l1 is a perfectly
> reasonable name for a function to do that for the V7 architecture.
> 
> As has been pointed out, there are several duplications of that.  That's
> fine, let's move them into the v7 cache code.  But... let's not change
> how they work and go through a pointless design exercise for
> invalidating more levels of cache when we know that no platform needs
> it.
> 
> If/when we do end up with a platform which requires it, we can deal
> with it then.  But let's not lead people down the route of thinking
> that LoUIS is suitable here when it isn't.

Your point is fair, Russell, and I agree with it in the context you
provided. The last thing I want to mention is that duplicating code that
loops through sets/ways is not that great, but I also understand that
refactoring the code to share the loop in v7_flush_dcache_all might
do more harm than good and introduce bugs, so let's merge the
invalidate_l1 function as it stands.

Thank you,
Lorenzo
Santosh Shilimkar Feb. 1, 2013, 2:43 p.m. UTC | #17
On Friday 01 February 2013 08:01 PM, Russell King - ARM Linux wrote:
> Just to further provide some insight into the reasoning:
>
> Invalidating data out of a working cache risks data corruption; maybe
> the data being invalidated is filesystem metadata which was about to
> be cleaned and written back to storage.  That risks filesystem
> corruption.
>
> Invalidating fewer levels than are actually required is different: we
> may leave dirty cache lines behind which may be evicted, but there's
> also the chance that the CPU will end up _reading_ from its
> uninitialized caches and may crash before that happens.
>
> So, the risks are:
> 1. invalidate more levels than are necessary and risk discarding data
>     which other CPUs are using, which may be important data.
> 2. invalidate fewer levels than are necessary and risk writing out
>     data from the CPU cache, which may or may not happen _before_ the
>     CPU crashes due to reading invalid data.
>
> Out of those two, (2) sounds to me to be the safer approach.
>
> Plus, I can't think of a reason why you'd want to put on an SMP system
> more than one layer of CPU local caches... to do so would seem to me to
> be an exercise in coherency complexity...  So, I suspect that in the
> real world, we will _never_ see any system which has more than one
> layer of caches local to the CPU.  But we may see a system with a
> cache architecture similar to the one I drew in my email to Santosh.
>
I am still scratching my head over why you would even have a CPU design
with two shared L2 caches for a 4-CPU system.

If you ever design such a system, you need to ensure that

1. Both L2s are used in exclusive mode.
2. Both L2 caches have coherency hardware connected, to keep them in sync
for shared data.

For 1, one would just increase the size of the L2 and have only one memory.

2 doesn't bring much advantage unless and until your L3 is too far
away in terms of CPU access cycles.

Regards
Santosh
Russell King - ARM Linux Feb. 1, 2013, 2:49 p.m. UTC | #18
On Fri, Feb 01, 2013 at 08:13:34PM +0530, Santosh Shilimkar wrote:
> On Friday 01 February 2013 08:01 PM, Russell King - ARM Linux wrote:
>> Just to further provide some insight into the reasoning:
>>
>> Invalidating data out of a working cache risks data corruption; maybe
>> the data being invalidated is filesystem metadata which was about to
>> be cleaned and written back to storage.  That risks filesystem
>> corruption.
>>
>> Invalidating fewer levels than are actually required is different: we
>> may leave dirty cache lines behind which may be evicted, but there's
>> also the chance that the CPU will end up _reading_ from its
>> uninitialized caches and may crash before that happens.
>>
>> So, the risks are:
>> 1. invalidate more levels than are necessary and risk discarding data
>>     which other CPUs are using, which may be important data.
>> 2. invalidate fewer levels than are necessary and risk writing out
>>     data from the CPU cache, which may or may not happen _before_ the
>>     CPU crashes due to reading invalid data.
>>
>> Out of those two, (2) sounds to me to be the safer approach.
>>
>> Plus, I can't think of a reason why you'd want to put on an SMP system
>> more than one layer of CPU local caches... to do so would seem to me to
>> be an exercise in coherency complexity...  So, I suspect that in the
>> real world, we will _never_ see any system which has more than one
>> layer of caches local to the CPU.  But we may see a system with a
>> cache architecture similar to the one I drew in my email to Santosh.
>>
> I am still scratching my head over why you would even have a CPU design
> with two shared L2 caches for a 4-CPU system.
>
> If you ever design such a system, you need to ensure that
>
> 1. Both L2s are used in exclusive mode.
> 2. Both L2 caches have coherency hardware connected, to keep them in sync
> for shared data.
>
> For 1, one would just increase the size of the L2 and have only one memory.
>
> 2 doesn't bring much advantage unless and until your L3 is too far
> away in terms of CPU access cycles.

I don't think you quite understood my diagram.  There aren't two separate
L2 data caches (CL1I and CL1D).  I'm showing the L2 cache as having a
Harvard structure (separate instruction and data) with no coherency
between them - and because they're Harvard structured, that means the
unification level must be _below_ that point.
Santosh Shilimkar Feb. 1, 2013, 2:53 p.m. UTC | #19
On Friday 01 February 2013 08:19 PM, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 08:13:34PM +0530, Santosh Shilimkar wrote:
>> On Friday 01 February 2013 08:01 PM, Russell King - ARM Linux wrote:
>>> Just to further provide some insight into the reasoning:
>>>
>>> Invalidating data out of a working cache risks data corruption; maybe
>>> the data being invalidated is filesystem metadata which was about to
>>> be cleaned and written back to storage.  That risks filesystem
>>> corruption.
>>>
>>> Invalidating fewer levels than are actually required is different: we
>>> may leave dirty cache lines behind which may be evicted, but there's
>>> also the chance that the CPU will end up _reading_ from its
>>> uninitialized caches and may crash before that happens.
>>>
>>> So, the risks are:
>>> 1. invalidate more levels than are necessary and risk discarding data
>>>      which other CPUs are using, which may be important data.
>>> 2. invalidate fewer levels than are necessary and risk writing out
>>>      data from the CPU cache, which may or may not happen _before_ the
>>>      CPU crashes due to reading invalid data.
>>>
>>> Out of those two, (2) sounds to me to be the safer approach.
>>>
>>> Plus, I can't think of a reason why you'd want to put on an SMP system
>>> more than one layer of CPU local caches... to do so would seem to me to
>>> be an exercise in coherency complexity...  So, I suspect that in the
>>> real world, we will _never_ see any system which has more than one
>>> layer of caches local to the CPU.  But we may see a system with a
>>> cache architecture similar to the one I drew in my email to Santosh.
>>>
>> I am still scratching my head over why you would even have a CPU design
>> with two shared L2 caches for a 4-CPU system.
>>
>> If you ever design such a system, you need to ensure that
>>
>> 1. Both L2s are used in exclusive mode.
>> 2. Both L2 caches have coherency hardware connected, to keep them in sync
>> for shared data.
>>
>> For 1, one would just increase the size of the L2 and have only one memory.
>>
>> 2 doesn't bring much advantage unless and until your L3 is too far
>> away in terms of CPU access cycles.
>
> I don't think you quite understood my diagram.  There aren't two separate
> L2 data caches (CL1I and CL1D).  I'm showing the L2 cache as having a
> Harvard structure (separate instruction and data) with no coherency
> between them - and because they're Harvard structured, that means the
> unification level must be _below_ that point.
>
Now I get it. Yes, I missed the I and D separation in the diagram.
Thanks a lot for drawing it.

Regards,
Santosh

Patch

diff --git a/arch/arm/mach-imx/headsmp.S b/arch/arm/mach-imx/headsmp.S
index 7e49deb..921fc15 100644
--- a/arch/arm/mach-imx/headsmp.S
+++ b/arch/arm/mach-imx/headsmp.S
@@ -17,53 +17,6 @@ 
 
 	.section ".text.head", "ax"
 
-/*
- * The secondary kernel init calls v7_flush_dcache_all before it enables
- * the L1; however, the L1 comes out of reset in an undefined state, so
- * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
- * of cache lines with uninitialized data and uninitialized tags to get
- * written out to memory, which does really unpleasant things to the main
- * processor.  We fix this by performing an invalidate, rather than a
- * clean + invalidate, before jumping into the kernel.
- *
- * This funciton is cloned from arch/arm/mach-tegra/headsmp.S, and needs
- * to be called for both secondary cores startup and primary core resume
- * procedures.  Ideally, it should be moved into arch/arm/mm/cache-v7.S.
- */
-ENTRY(v7_invalidate_l1)
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c5, 0	@ invalidate I cache
-	mcr	p15, 2, r0, c0, c0, 0
-	mrc	p15, 1, r0, c0, c0, 0
-
-	ldr	r1, =0x7fff
-	and	r2, r1, r0, lsr #13
-
-	ldr	r1, =0x3ff
-
-	and	r3, r1, r0, lsr #3	@ NumWays - 1
-	add	r2, r2, #1		@ NumSets
-
-	and	r0, r0, #0x7
-	add	r0, r0, #4	@ SetShift
-
-	clz	r1, r3		@ WayShift
-	add	r4, r3, #1	@ NumWays
-1:	sub	r2, r2, #1	@ NumSets--
-	mov	r3, r4		@ Temp = NumWays
-2:	subs	r3, r3, #1	@ Temp--
-	mov	r5, r3, lsl r1
-	mov	r6, r2, lsl r0
-	orr	r5, r5, r6	@ Reg = (Temp<<WayShift)|(NumSets<<SetShift)
-	mcr	p15, 0, r5, c7, c6, 2
-	bgt	2b
-	cmp	r2, #0
-	bgt	1b
-	dsb
-	isb
-	mov	pc, lr
-ENDPROC(v7_invalidate_l1)
-
 #ifdef CONFIG_SMP
 ENTRY(v7_secondary_startup)
 	bl	v7_invalidate_l1
diff --git a/arch/arm/mach-shmobile/headsmp.S b/arch/arm/mach-shmobile/headsmp.S
index b202c12..96001fd 100644
--- a/arch/arm/mach-shmobile/headsmp.S
+++ b/arch/arm/mach-shmobile/headsmp.S
@@ -16,54 +16,6 @@ 
 
 	__CPUINIT
 
-/* Cache invalidation nicked from arch/arm/mach-imx/head-v7.S, thanks!
- *
- * The secondary kernel init calls v7_flush_dcache_all before it enables
- * the L1; however, the L1 comes out of reset in an undefined state, so
- * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
- * of cache lines with uninitialized data and uninitialized tags to get
- * written out to memory, which does really unpleasant things to the main
- * processor.  We fix this by performing an invalidate, rather than a
- * clean + invalidate, before jumping into the kernel.
- *
- * This funciton is cloned from arch/arm/mach-tegra/headsmp.S, and needs
- * to be called for both secondary cores startup and primary core resume
- * procedures.  Ideally, it should be moved into arch/arm/mm/cache-v7.S.
- */
-ENTRY(v7_invalidate_l1)
-	mov	r0, #0
-	mcr	p15, 0, r0, c7, c5, 0	@ invalidate I cache
-	mcr	p15, 2, r0, c0, c0, 0
-	mrc	p15, 1, r0, c0, c0, 0
-
-	ldr	r1, =0x7fff
-	and	r2, r1, r0, lsr #13
-
-	ldr	r1, =0x3ff
-
-	and	r3, r1, r0, lsr #3	@ NumWays - 1
-	add	r2, r2, #1		@ NumSets
-
-	and	r0, r0, #0x7
-	add	r0, r0, #4	@ SetShift
-
-	clz	r1, r3		@ WayShift
-	add	r4, r3, #1	@ NumWays
-1:	sub	r2, r2, #1	@ NumSets--
-	mov	r3, r4		@ Temp = NumWays
-2:	subs	r3, r3, #1	@ Temp--
-	mov	r5, r3, lsl r1
-	mov	r6, r2, lsl r0
-	orr	r5, r5, r6	@ Reg = (Temp<<WayShift)|(NumSets<<SetShift)
-	mcr	p15, 0, r5, c7, c6, 2
-	bgt	2b
-	cmp	r2, #0
-	bgt	1b
-	dsb
-	isb
-	mov	pc, lr
-ENDPROC(v7_invalidate_l1)
-
 ENTRY(shmobile_invalidate_start)
 	bl	v7_invalidate_l1
 	b	secondary_startup
diff --git a/arch/arm/mach-tegra/headsmp.S b/arch/arm/mach-tegra/headsmp.S
index 4a317fa..fb082c4 100644
--- a/arch/arm/mach-tegra/headsmp.S
+++ b/arch/arm/mach-tegra/headsmp.S
@@ -18,49 +18,6 @@ 
         .section ".text.head", "ax"
 	__CPUINIT
 
-/*
- * Tegra specific entry point for secondary CPUs.
- *   The secondary kernel init calls v7_flush_dcache_all before it enables
- *   the L1; however, the L1 comes out of reset in an undefined state, so
- *   the clean + invalidate performed by v7_flush_dcache_all causes a bunch
- *   of cache lines with uninitialized data and uninitialized tags to get
- *   written out to memory, which does really unpleasant things to the main
- *   processor.  We fix this by performing an invalidate, rather than a
- *   clean + invalidate, before jumping into the kernel.
- */
-ENTRY(v7_invalidate_l1)
-        mov     r0, #0
-        mcr     p15, 2, r0, c0, c0, 0
-        mrc     p15, 1, r0, c0, c0, 0
-
-        ldr     r1, =0x7fff
-        and     r2, r1, r0, lsr #13
-
-        ldr     r1, =0x3ff
-
-        and     r3, r1, r0, lsr #3  @ NumWays - 1
-        add     r2, r2, #1          @ NumSets
-
-        and     r0, r0, #0x7
-        add     r0, r0, #4          @ SetShift
-
-        clz     r1, r3              @ WayShift
-        add     r4, r3, #1          @ NumWays
-1:      sub     r2, r2, #1          @ NumSets--
-        mov     r3, r4              @ Temp = NumWays
-2:      subs    r3, r3, #1          @ Temp--
-        mov     r5, r3, lsl r1
-        mov     r6, r2, lsl r0
-        orr     r5, r5, r6          @ Reg = (Temp<<WayShift)|(NumSets<<SetShift)
-        mcr     p15, 0, r5, c7, c6, 2
-        bgt     2b
-        cmp     r2, #0
-        bgt     1b
-        dsb
-        isb
-        mov     pc, lr
-ENDPROC(v7_invalidate_l1)
-
 
 ENTRY(tegra_secondary_startup)
         bl      v7_invalidate_l1
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 7539ec2..15451ee 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -19,6 +19,52 @@ 
 #include "proc-macros.S"
 
 /*
+ * The secondary kernel init calls v7_flush_dcache_all before it enables
+ * the L1; however, the L1 comes out of reset in an undefined state, so
+ * the clean + invalidate performed by v7_flush_dcache_all causes a bunch
+ * of cache lines with uninitialized data and uninitialized tags to get
+ * written out to memory, which does really unpleasant things to the main
+ * processor.  We fix this by performing an invalidate, rather than a
+ * clean + invalidate, before jumping into the kernel.
+ *
+ * This function is cloned from arch/arm/mach-tegra/headsmp.S, and needs
+ * to be called for both secondary cores startup and primary core resume
+ * procedures.
+ */
+ENTRY(v7_invalidate_l1)
+       mov     r0, #0
+       mcr     p15, 2, r0, c0, c0, 0
+       mrc     p15, 1, r0, c0, c0, 0
+
+       ldr     r1, =0x7fff
+       and     r2, r1, r0, lsr #13
+
+       ldr     r1, =0x3ff
+
+       and     r3, r1, r0, lsr #3      @ NumWays - 1
+       add     r2, r2, #1              @ NumSets
+
+       and     r0, r0, #0x7
+       add     r0, r0, #4      @ SetShift
+
+       clz     r1, r3          @ WayShift
+       add     r4, r3, #1      @ NumWays
+1:     sub     r2, r2, #1      @ NumSets--
+       mov     r3, r4          @ Temp = NumWays
+2:     subs    r3, r3, #1      @ Temp--
+       mov     r5, r3, lsl r1
+       mov     r6, r2, lsl r0
+       orr     r5, r5, r6      @ Reg = (Temp<<WayShift)|(NumSets<<SetShift)
+       mcr     p15, 0, r5, c7, c6, 2
+       bgt     2b
+       cmp     r2, #0
+       bgt     1b
+       dsb
+       isb
+       mov     pc, lr
+ENDPROC(v7_invalidate_l1)
+
+/*
  *	v7_flush_icache_all()
  *
  *	Flush the whole I-cache.