
[U-Boot,Resend,RFC,1/2] armv8: Fix dcache disable function

Message ID 1476476277-10527-2-git-send-email-york.sun@nxp.com
State Not Applicable
Delegated to: York Sun

Commit Message

York Sun Oct. 14, 2016, 8:17 p.m. UTC
Current code turns off the d-cache first, then flushes all levels of cache.
This results in data loss. As soon as the d-cache is off, the dirty cache
data is discarded, according to tests on LS2080A. This issue was not
seen as long as the external L3 cache was flushed to push the data to
main memory. However, the external L3 cache is not guaranteed to hold
the data. To fix this, flush the d-cache by way/set first to make sure
the cache is clean before turning it off.

Signed-off-by: York Sun <york.sun@nxp.com>
CC: David Feng <fenghua@phytium.com.cn>
---

 arch/arm/cpu/armv8/cache_v8.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stephen Warren Oct. 14, 2016, 8:23 p.m. UTC | #1
On 10/14/2016 02:17 PM, York Sun wrote:
> [...]

> diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
> index cd3f6c1..92d6277 100644
> --- a/arch/arm/cpu/armv8/cache_v8.c
> +++ b/arch/arm/cpu/armv8/cache_v8.c
> @@ -478,9 +478,9 @@ void dcache_disable(void)
>  	if (!(sctlr & CR_C))
>  		return;
>
> +	flush_dcache_all();
>  	set_sctlr(sctlr & ~(CR_C|CR_M));
>
> -	flush_dcache_all();
>  	__asm_invalidate_tlb_all();
>  }

This one makes sense. I'll try and test it soon.
Stephen Warren Oct. 17, 2016, 9:22 p.m. UTC | #2
On 10/14/2016 02:17 PM, York Sun wrote:
> [...]

This patch,
Tested-by: Stephen Warren <swarren@nvidia.com>
Stephen Warren Oct. 19, 2016, 3:25 p.m. UTC | #3
On 10/14/2016 02:17 PM, York Sun wrote:
> Current code turns off the d-cache first, then flushes all levels of cache.
> This results in data loss. As soon as the d-cache is off, the dirty cache
> data is discarded, according to tests on LS2080A. This issue was not
> seen as long as the external L3 cache was flushed to push the data to
> main memory. However, the external L3 cache is not guaranteed to hold
> the data. To fix this, flush the d-cache by way/set first to make sure
> the cache is clean before turning it off.

> diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c

> @@ -478,9 +478,9 @@ void dcache_disable(void)

> +	flush_dcache_all();
>  	set_sctlr(sctlr & ~(CR_C|CR_M));
>
> -	flush_dcache_all();
>  	__asm_invalidate_tlb_all();

I talked to Mark Rutland at ARM, and I believe the current code is 
correct. Here's my interpretation of what he said:

The dcache must be disabled first. This prevents allocation of new 
entries in the cache during the flush operation, which prevents the race 
conditions that were mentioned in the other thread.

Then, the flush operation must be invoked. Since the cache is now 
disabled, this can fully flush the cache without worrying about racing 
with things being added to the cache.

This all implies that the implementation of dcache_disable(), 
set_sctlr(), flush_dcache_all(), and any code they call must not access 
data in DRAM at all: since the dcache is off, any DRAM access
will[1] read potentially stale data from DRAM, rather than any dirty 
data that might be in the cache.

[1] I'm not sure if that's "will" or "may", i.e. whether this is 
architecturally guaranteed in ARMv8 or is implementation defined. At 
least the Cortex A72 TRM says "will" for that CPU; not sure about others.

Perhaps the most obvious upshot of this is that the stack can't be used. 
This implies to me that we need to recode all those functions purely in 
assembly, or just possibly play some tricks to 100% force gcc not to 
touch memory anywhere inside dcache_disable() or the functions it calls. 
We're just getting lucky here right now since everything happens to be 
inlined, but I don't think we're doing anything to 100% guarantee this.
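
For illustration, a minimal sketch of what forcing this might look like,
assuming EL1 and GCC-style inline assembly; the function name is
hypothetical and this is not U-Boot's actual implementation:

/*
 * Hypothetical sketch: clear SCTLR_EL1.C and SCTLR_EL1.M using only
 * registers, so the compiler cannot emit stack or global accesses
 * around the SCTLR write.
 */
static inline void sctlr_clear_c_m_sketch(void)
{
	unsigned long tmp;

	asm volatile(
		"mrs	%0, sctlr_el1\n\t"
		"bic	%0, %0, #(1 << 2)\n\t"	/* clear C: no new dcache allocation */
		"bic	%0, %0, #(1 << 0)\n\t"	/* clear M: MMU off */
		"msr	sctlr_el1, %0\n\t"
		"isb"
		: "=&r"(tmp) : : "memory");
}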

What worries me here is that at least on Tegra, a "flush the entire 
dcache" operation requires an SMC call to the secure monitor. That will 
certainly access DRAM when the secure monitor runs, but perhaps this 
doesn't matter since that's at a different exception level, and we know 
the secure monitor accesses DRAM regions that are separate from U-Boot's 
DRAM? I suspect life isn't that convenient. I'm wondering if this all 
implies that, like patch 2 in this series, we need to get 100% away from 
flush-by-set/way, even with SoC-specific hooks to make that work 
reliably, and just flush everything by VA, which IIRC is architecturally 
guaranteed to work without SoC-specific logic. That way, we can 
encapsulate everything into an assembly function without worrying about 
calling SMCs or SoC-specific hook functions without using DRAM. Of 
course, how that assembly function knows which VAs to flush without 
decoding the page tables or other data structure is another matter:-(
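
For reference, a minimal sketch of a clean+invalidate by VA to the PoC,
assuming the line size is read from CTR_EL0's DminLine field; the
function name is hypothetical, and in a real disable path this loop
would itself have to avoid compiler-generated stack accesses:

/*
 * Hypothetical sketch: clean+invalidate [start, end) to the Point of
 * Coherency using "dc civac". DminLine (CTR_EL0 bits [19:16]) is the
 * log2 of the smallest dcache line size, in 4-byte words.
 */
static void flush_range_to_poc_sketch(unsigned long start, unsigned long end)
{
	unsigned long ctr, line, va;

	asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
	line = 4UL << ((ctr >> 16) & 0xf);

	for (va = start & ~(line - 1); va < end; va += line)
		asm volatile("dc civac, %0" : : "r"(va) : "memory");
	asm volatile("dsb sy" : : : "memory");
}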
Stephen Warren Oct. 19, 2016, 5:18 p.m. UTC | #4
On 10/19/2016 09:25 AM, Stephen Warren wrote:
> [...]
>
> What worries me here is that at least on Tegra, a "flush the entire
> dcache" operation requires an SMC call to the secure monitor. That will
> certainly access DRAM when the secure monitor runs, but perhaps this
> doesn't matter since that's at a different exception level, and we know
> the secure monitor accesses DRAM regions that are separate from U-Boot's
> DRAM? I suspect life isn't that convenient. I'm wondering if this all
> implies that, like patch 2 in this series, we need to get 100% away from
> flush-by-set/way, even with SoC-specific hooks to make that work
> reliably, and just flush everything by VA, which IIRC is architecturally
> guaranteed to work without SoC-specific logic. That way, we can
> encapsulate everything into an assembly function without worrying about
> calling SMCs or SoC-specific hook functions without using DRAM. Of
> course, how that assembly function knows which VAs to flush without
> decoding the page tables or other data structure is another matter:-(

Re: the last paragraph there:

After reading the ARMv8 ARM, I see that EL1, EL2, and EL3 all have 
separate cache enable bits in SCTLR_ELx. I believe U-Boot only needs to 
flush/... its own RAM out of the dcache, since that's all that's 
relevant at the EL U-Boot is running at, and doesn't care about anything 
EL3 might have mapped cached. So, it's safe to invoke SMCs from the 
cache flush code in U-Boot even if the EL3 code touches its own DRAM. 
There might be a corner case where this isn't true if EL3 has some 
EL1/EL2-owned RAM mapped, but I don't expect that to be the case here.

Re: my other series to add more cache hooks: I'll re-implement the Tegra
hook in assembly so it's guaranteed not to touch RAM, retest, and resubmit.

If it turns out that dcache_disable() ever starts touching DRAM at the 
wrong time, we can deal with that then; it doesn't now at least.
York Sun Oct. 19, 2016, 10:32 p.m. UTC | #5
On 10/19/2016 12:18 PM, Stephen Warren wrote:
> [...]

Stephen,

I think one of our differences is whether the data is returned correctly
with the dirty dcache disabled. With the current code flow, the dcache is
disabled, followed by flushing/invalidating by way/set, then the L3 cache
is flushed. I see a correct stack after these steps. During my recent
attempt to remove the L3 flush, I added code to flush the stack by VA
(which I had already confirmed pushes the data to main memory) to replace
the L3 flush. I found that as soon as the dcache is disabled, the dirty
cache data is lost when accessed from the core (debugged with a JTAG
tool). This makes me think the order may be wrong. By reversing the
order, i.e. flushing the dcache first and then disabling it, I can see
correct data in main memory. I understand the proposed new code requires
not touching the stack or any data after the first flush. I would prefer
not to make this change if we can find an explanation for what I saw.

York
Stephen Warren Oct. 19, 2016, 11:01 p.m. UTC | #6
On 10/19/2016 04:32 PM, york sun wrote:
> [...]
>
> Stephen,
>
> I think one of our differences is whether the data is returned correctly
> with the dirty dcache disabled. With the current code flow, the dcache is
> disabled, followed by flushing/invalidating by way/set, then the L3 cache
> is flushed. I see a correct stack after these steps. During my recent
> attempt to remove the L3 flush, I added code to flush the stack by VA
> (which I had already confirmed pushes the data to main memory) to replace
> the L3 flush. I found that as soon as the dcache is disabled, the dirty
> cache data is lost when accessed from the core (debugged with a JTAG
> tool). This makes me think the order may be wrong. By reversing the
> order, i.e. flushing the dcache first and then disabling it, I can see
> correct data in main memory. I understand the proposed new code requires
> not touching the stack or any data after the first flush. I would prefer
> not to make this change if we can find an explanation for what I saw.

I think the confusion is: what does "lost" mean?

On ARMv8 at least:

Starting off with dcache enabled and containing some dirty data:

CPU reads will read the dirty data from the cache.

Now if dcache is disabled:

CPU reads will be of the uncached/uncachable type since dcache is off. 
The dcache will not respond. So, the CPU will receive (potentially) 
stale data directly from RAM.

In this state, it is not possible to access the dirty data in the 
caches. However, it's not (permanently) lost...

Now if the dcache is fully flushed:

All dirty data that was in the dcache will be written to RAM. After this 
process is complete, any reads from RAM will return the most up-to-date 
possible data; the temporarily "lost" dirty data is now no longer lost.


The cache disable process must proceed as above; we can't flush the 
dcache and then disable it. Attempting to do things in that order would 
introduce race conditions. At all times before the dcache is disabled, 
any CPU writes to memory will allocate new lines in the dcache. Also, 
the dcache is fully responsive to the cache coherency protocol, so some 
other agent on the bus (another CPU, cluster, DMA agent, ...) could 
cause a (clean or dirty) line to be brought into the cache. This might 
happen after that part of the cache would have been flushed. In other 
words, it's impossible to fully flush the cache (by set/way) while it's 
enabled.
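
In code terms, that order is the existing flow from the patch context
above; as a sketch (with the caveat, discussed earlier, that nothing in
this path may touch DRAM):

/*
 * Sketch of the disable-then-flush order, using the names from the
 * patch context. Between set_sctlr() and the end of flush_dcache_all()
 * there must be no DRAM access (no stack, no globals).
 */
void dcache_disable(void)
{
	unsigned long sctlr = get_sctlr();

	if (!(sctlr & CR_C))
		return;

	set_sctlr(sctlr & ~(CR_C | CR_M));	/* stop new allocations first */
	flush_dcache_all();			/* then clean+invalidate by set/way */
	__asm_invalidate_tlb_all();
}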


I'm not sure that flushing by VA instead of by set/way solves anything
much here. I believe flushing by VA runs less risk of interference from
the coherency protocol bringing lines back into the cache, since a flush
by VA will affect other caches too. However, this still doesn't allow
use of the CPU's own stack or other DRAM writes; if we:

a) Flush by VA
(this step is assumed to perform some function calls/returns, or other 
DRAM writes)

b) Disable cache

... then irrespective of how step (a) was performed, the DRAM
writes performed in that step are likely still in the cache as dirty
data, so step (b) will cause them to be "lost" until a full flush
operation is performed. Thus we must always flush after disabling the
cache, so there's no point also flushing before.
York Sun Oct. 20, 2016, 5:06 a.m. UTC | #7
On 10/19/2016 06:01 PM, Stephen Warren wrote:
> [...]

Stephen,

Sorry for the late response. I am still traveling...
I understand the data in the dirty cache is not lost when the dcache is
disabled; it is just not accessible. In my case, after flushing L1/L2 by
way/set, the data is probably in the L3 cache. Without flushing L3, I have
stalled data (including the stack) in main memory.

My previous test was trying to prove I can skip flushing L3 if I flush
the necessary VAs. Now that I recall it carefully, I think I made a
mistake by flushing by VA _after_ flushing the cache by way/set. I might
have had a positive result if I had flushed the cache by VA first. I will
repeat this test when I get back to prove this theory.

That being said, do you or anyone else see any value in skipping the L3
flush? My target was to avoid making an SMC call for EL2/EL1, and to
avoid implementing an L3 flush function in U-Boot (even though we already
have one for the CCN-504). I understand this is different from what
dcache_disable() implies.

York
Stephen Warren Oct. 20, 2016, 6:34 p.m. UTC | #8
On 10/19/2016 11:06 PM, york sun wrote:
> [...]
> Stephen,
>
> Sorry for the late response. I am still traveling...

I didn't think your response was slow:-)

> I understand the data in the dirty cache is not lost when the dcache is
> disabled; it is just not accessible. In my case, after flushing L1/L2 by
> way/set, the data is probably in the L3 cache. Without flushing L3, I have
> stalled data (including the stack) in main memory.

I assume "stale" not "stalled".

Yes, I agree: If you have only flushed L1/L2 by set/way (or attempted to 
do everything, but the L3 cache doesn't implement by set/way 
operations), then L3 can indeed still contain dirty data, and hence main 
memory can be stale.

> My previous test was trying to prove I can skip flushing L3 if I flush
> the necessary VAs.

It depends whether your L3 cache is before/after the Level of 
Unification and/or the Level of Coherency for your HW, and whether you 
flush by VA to the PoU or PoC.

Looking at your "[PATCH 2/2] armv8: Fix flush_dcache_all function", it 
uses __asm_flush_dcache_range() which cleans and invalidates to the 
point of coherency (it invokes the dc civac instruction).

> Now that I recall it carefully, I think I made a
> mistake by flushing by VA _after_ flushing the cache by way/set. I might
> have had a positive result if I had flushed the cache by VA first. I will
> repeat this test when I get back to prove this theory.

I'll assume your L3 cache is before the Point of Coherency. If so, then
performing a clean/invalidate by VA to the PoC will affect all of L1, 
L2, and L3 caches. As such, you need only perform the c/i by VA, and you 
can entirely skip the c/i by set/way; I believe it would be redundant, 
assuming that the by VA operations cover all relevant VAs.

> That being said, do you or anyone else see any value in skipping the L3
> flush? My target was to avoid making an SMC call for EL2/EL1, and to
> avoid implementing an L3 flush function in U-Boot (even though we already
> have one for the CCN-504). I understand this is different from what
> dcache_disable() implies.

I /think/ that if we can modify U-Boot to do all cache-wide operations 
by VA to the PoC, then we can indeed do that instead of any by-set/way 
operations, and in turn that will mean U-Boot doesn't need to invoke any 
SMC calls or SoC-specific logic when enabling/disabling the cache. My 
remaining questions would be:

a) How can we determine the set of VA ranges we need to operate on?

In the case where U-Boot is invalidating the dcache prior to enabling 
it, I think the set of VA ranges to operate on is determined by the MMU 
configuration of the SW that ran before U-Boot, since that defines what 
dirty data might be present in the dcache. U-Boot won't be able to 
determine that.

Perhaps we only care about the set of VAs that U-Boot will actually 
enable in its own MMU configuration, since those will be the only VAs 
the CPU will be allowed to architecturally access. So, this might 
actually be a solvable problem.

Perhaps U-Boot requires that the previous SW has already cleaned all 
relevant parts of the dcache (the fact U-Boot attempts to invalidate the 
entire dcache before enabling it implies this is not the case though).

In the case where U-Boot has already programmed its own MMU
configuration, then that MMU configuration can determine the set of VAs 
to operate on. This case is easy (easier) since we definitely have all 
the required configuration knowledge.

b) How can we implement the by VA code in a way that doesn't touch DRAM?

Implementing by-set/way is fairly constrained in that all the 
configuration data is in a few registers, and the algorithm just 
requires a few nested loops.

A generic by VA implementation seems like it would require walking 
U-Boot's struct mm_region *mem_map data structure (or the CPU's 
translation tables) in RAM. Perhaps that's OK since it's read-only...
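
A rough sketch of such a read-only walk, assuming U-Boot's struct
mm_region layout (virt/size/attrs) and the existing
__asm_flush_dcache_range() helper; the cacheability test here is an
assumption and would need checking against the real attribute encoding:

#include <asm/armv8/mmu.h>
#include <asm/system.h>

/*
 * Hypothetical sketch: clean+invalidate every region U-Boot maps as
 * cacheable normal memory, by VA to the PoC. The walk only reads
 * mem_map, though compiler-generated stack use remains a concern.
 */
void flush_all_by_va_sketch(void)
{
	struct mm_region *r;

	for (r = mem_map; r->size; r++) {
		if ((r->attrs & PMD_ATTRINDX_MASK) == PMD_ATTRINDX(MT_NORMAL))
			__asm_flush_dcache_range(r->virt, r->virt + r->size);
	}
}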
York Sun Oct. 21, 2016, 7:31 p.m. UTC | #9
On 10/20/2016 01:34 PM, Stephen Warren wrote:
> [...]
>> Stephen,
>>
>> Sorry for the late response. I am still traveling...
>
> I didn't think your response was slow:-)
>
>> I understand the data in the dirty cache is not lost when the dcache is
>> disabled; it is just not accessible. In my case, after flushing L1/L2 by
>> way/set, the data is probably in the L3 cache. Without flushing L3, I have
>> stalled data (including the stack) in main memory.
>
> I assume "stale" not "stalled".
>
> Yes, I agree: If you have only flushed L1/L2 by set/way (or attempted to
> do everything, but the L3 cache doesn't implement by set/way
> operations), then L3 can indeed still contain dirty data, and hence main
> memory can be stale.
>
>> My previous test was trying to prove I can skip flushing L3 if I flush
>> the necessary VAs.
>
> It depends whether your L3 cache is before/after the Level of
> Unification and/or the Level of Coherency for your HW, and whether you
> flush by VA to the PoU or PoC.
>
> Looking at your "[PATCH 2/2] armv8: Fix flush_dcache_all function", it
> uses __asm_flush_dcache_range() which cleans and invalidates to the
> point of coherency (it invokes the dc civac instruction).
>
>> Now that I recall it carefully, I think I made a
>> mistake by flushing by VA _after_ flushing the cache by way/set. I might
>> have had a positive result if I had flushed the cache by VA first. I will
>> repeat this test when I get back to prove this theory.
>
> I'll assume your L3 cache is before the Point of Coherency. If so, then
> performing a clean/invalidate by VA to the PoC will affect all of L1,
> L2, and L3 caches. As such, you need only perform the c/i by VA, and you
> can entirely skip the c/i by set/way; I believe it would be redundant,
> assuming that the by VA operations cover all relevant VAs.

I believe the PoC and PoU are before L3 for me. I can clean/invalidate by
VA, but it may not cover all the cache lines, so by set/way is still needed.

>
>> That being said, do you or anyone else see any value in skipping the L3
>> flush? My target was to avoid making an SMC call for EL2/EL1, and to
>> avoid implementing an L3 flush function in U-Boot (even though we already
>> have one for the CCN-504). I understand this is different from what
>> dcache_disable() implies.
>
> I /think/ that if we can modify U-Boot to do all cache-wide operations
> by VA to the PoC, then we can indeed do that instead of any by-set/way
> operations, and in turn that will mean U-Boot doesn't need to invoke any
> SMC calls or SoC-specific logic when enabling/disabling the cache. My
> remaining questions would be:
>
> a) How can we determine the set of VA ranges we need to operate on?
>
> In the case where U-Boot is invalidating the dcache prior to enabling
> it, I think the set of VA ranges to operate on is determined by the MMU
> configuration of the SW that ran before U-Boot, since that defines what
> dirty data might be present in the dcache. U-Boot won't be able to
> determine that.
>
> Perhaps we only care about the set of VAs that U-Boot will actually
> enable in its own MMU configuration, since those will be the only VAs
> the CPU will be allowed to architecturally access. So, this might
> actually be a solvable problem.
>
> Perhaps U-Boot requires that the previous SW has already cleaned all
> relevant parts of the dcache (the fact U-Boot attempts to invalidate the
> entire dcache before enabling it implies this is not the case though).
>
> In the case where U-Boot has already programmed its own MMU
> configuration, then that MMU configuration can determine the set of VAs
> to operate on. This case is easy (easier) since we definitely have all
> the required configuration knowledge.
>
> b) How can we implement the by VA code in a way that doesn't touch DRAM?
>
> Implementing by-set/way is fairly constrained in that all the
> configuration data is in a few registers, and the algorithm just
> requires a few nested loops.
>
> A generic by VA implementation seems like it would require walking
> U-Boot's struct mm_region *mem_map data structure (or the CPU's
> translation tables) in RAM. Perhaps that's OK since it's read-only...
>

I agree with your points in general, but it may not always be practical
to flush everything by VA. U-Boot may map a huge amount of memory, and
walking through the MMU table to flush it all would be too much. Without
flushing all memory, we really cannot say we have flushed the entire
dcache. On the other hand, for U-Boot itself to operate, we don't really
have to flush everything. I guess the key is whether we need to flush
everything. For Linux to boot, we don't: we can flush some memory
(including the U-Boot stack) and turn off the MMU, and as soon as the
kernel boots up, it enables the dcache and everything is back. One can
argue that if users put something in memory and it is needed before the
dcache is re-enabled, then it makes sense to flush the memory concerned,
unless it is not practical to do so.

For now, I can drop these two patches. I will still run the test as I
said earlier.

York
Mark Rutland Oct. 24, 2016, 10:44 a.m. UTC | #10
Hi,

Sorry for joining this a bit late; apologies if the below re-treads
ground already covered.

On Wed, Oct 19, 2016 at 09:25:02AM -0600, Stephen Warren wrote:
> On 10/14/2016 02:17 PM, York Sun wrote:
> >Current code turns off d-cache first, then flush all levels of cache.
> >This results data loss. As soon as d-cache is off, the dirty cache
> >is discarded according to the test on LS2080A. This issue was not
> >seen as long as external L3 cache was flushed to push the data to
> >main memory. However, external L3 cache is not guaranteed to have
> >the data. To fix this, flush the d-cache by way/set first to make
> >sure cache is clean before turning it off.
> 
> >diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
> 
> >@@ -478,9 +478,9 @@ void dcache_disable(void)
> 
> >+	flush_dcache_all();
> > 	set_sctlr(sctlr & ~(CR_C|CR_M));
> >
> >-	flush_dcache_all();
> > 	__asm_invalidate_tlb_all();
> 
> I talked to Mark Rutland at ARM, and I believe the current code is
> correct.

Well, almost, but not quite. It's a long story... ;)

I gave a primer [1,2] on the details at ELC earlier this year, which may
or may not be useful.

The big details are:

* Generally "Flush" is ambiguous/meaningless. Here you seem to want
  clean+invalidate.

* Set/Way operations are for IMPLEMENTATION DEFINED (i.e. SoC-specific)
  cache maintenance sequences, and are not truly portable (e.g. not
  affecting system caches).
  
  I assume that an earlier boot stage initialised the caches prior to
  U-Boot. Given that, you *only* need to perform maintenance for the
  memory you have (at any point) mapped with cacheable attributes, which
  should be a small subset of the PA space. With ARMv8-A, broadcast
  maintenance to the PoC should affect all relevant caches (assuming you
  use the correct shareability attributes).

* You *cannot* write a dcache disable routine in C, as the compiler can
  perform a number of implicit memory accesses (e.g. stack, globals,
  GOT). For that alone, I do not believe the code above is correct.

  Note that we have seen this being an issue in practice, before we got
  rid of Set/Way ops from arm64 Linux (see commit 5e051531447259e5).

* Your dcache disable code *must* be clean to the PoC, prior to
  execution, or instruction fetches could see stale data. You can first
  *clean* this to the PoC, which is sufficient to avoid the problems
  above.

* The SCTLR_ELx.{C,I} bits do not enable/disable caches; they merely
  activate/deactivate cacheable attributes on data/instruction fetches.

  Note that cacheable instruction fetches can allocate into unified/data
  caches.
  
  Also, note that the I bit is independent of the C bit, and the
  attributes it provides differ when the M bit is clear. Generally, I
  would advise that at all times M == C == I, as that leads to the least
  surprise.

Thanks,
Mark.

[1] http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf
[2] https://www.youtube.com/watch?v=F0SlIMHRnLk
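
Taken together, the bullets suggest a sequence like the following
sketch, using helpers U-Boot already has (get_sctlr()/set_sctlr(),
flush_dcache_all(), __asm_invalidate_tlb_all(),
invalidate_icache_all()). Per the bullet about implicit memory
accesses, the real implementation belongs in assembly; this C
rendering is illustrative only:

void caches_off_sketch(void)
{
	/* clean+invalidate to the PoC while translation is still on */
	flush_dcache_all();

	/* drop M, C and I together, per the M == C == I advice */
	set_sctlr(get_sctlr() & ~(CR_M | CR_C | CR_I));

	/* invalidate TLBs and the I-cache for the new regime */
	__asm_invalidate_tlb_all();
	invalidate_icache_all();
}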
Mark Rutland Oct. 24, 2016, 10:59 a.m. UTC | #11
On Fri, Oct 21, 2016 at 07:31:52PM +0000, york sun wrote:
> On 10/20/2016 01:34 PM, Stephen Warren wrote:
> > On 10/19/2016 11:06 PM, york sun wrote:
> >> I understand the data in dirty cache is not lost when the dcache is
> >> disabled. It is just not accessible. In my case, after flushing L1/L2 by
> >> way/set, the data is probably in L3 cache. Without flushing L3, I have
> >> stalled data (including the stack) in main memory.
> >
> > I assume "stale" not "stalled".
> >
> > Yes, I agree: If you have only flushed L1/L2 by set/way (or attempted to
> > do everything, but the L3 cache doesn't implement by set/way
> > operations), then L3 can indeed still contain dirty data, and hence main
> > memory can be stale.
> >
> >> My previous test was trying to prove I can skip flushing L3 if I flush
> >> the necessary VA.
> >
> > It depends whether your L3 cache is before/after the Level of
> > Unification and/or the Level of Coherency for your HW, and whether you
> > flush by VA to the PoU or PoC.
> >
> > Looking at your "[PATCH 2/2] armv8: Fix flush_dcache_all function", it
> > uses __asm_flush_dcache_range() which cleans and invalidates to the
> > point of coherency (it invokes the dc civac instruction).
> >
> >  > Now when I recall it carefully, I think I made a
> >> mistake by flushing by VA _after_ flushing the cache by way/set. I might
> >> have a positive result if I flushed the cache by VA first. I will repeat
> >> this test when I get back to prove this theory.
> >
> > I'll assume your L3 cache is before the Point of Coherency. If so, then
> > performing a clean/invalidate by VA to the PoC will affect all of L1,
> > L2, and L3 caches. As such, you need only perform the c/i by VA, and you
> > can entirely skip the c/i by set/way; I believe it would be redundant,
> > assuming that the by VA operations cover all relevant VAs.
> 
> I believe the PoC and PoU are before L3 for me.

If you are using CCN, then the PoC is beyond the L3.

Were the PoC before the L3, there would be no requirement to perform
maintenance on the L3. The PoC is the point at which *all* accesses
(cacheable or otherwise) see the same data.

Per the ARM ARM (for ARMv8-A), maintenance to the PoC *must* affect
system caches (including CCN).

> I can clean/invalidate by VA, but it may not cover all the cache lines.
> So by set/way is still needed.

The problem is figuring out which VA ranges require maintenance.

Do we not have an idea of the set of memory banks present in the SoC?
Like the memblock array in Linux?

> > b) How can we implement the by VA code in a way that doesn't touch DRAM?
> >
> > Implementing by-set/way is fairly constrained in that all the
> > configuration data is in a few registers, and the algorithm just
> > requires a few nested loops.
> >
> > A generic by VA implementation seems like it would require walking
> > U-Boot's struct mm_region *mem_map data structure (or the CPU's
> > translation tables) in RAM. Perhaps that's OK since it's read-only...

So long as you clean the structure by VA to the PoC first, you can
safely access it with non-cacheable accesses.

> I agree in general with your points, but it may not always be practical
> to flush all by VA. U-Boot may map a huge amount of memory, and walking
> through the MMU table to flush it all would be too much.

I would recommend that you benchmark that; from my own experiments, so
long as you only perform maintenance on the portions of the PA space you
care about (and amortize barriers), this can take surprisingly little
time.

> Without flushing all memory, we really cannot say we flushed all
> dcache. On the other hand, for U-Boot itself to operate, we don't
> really have to flush all. I guess the key is whether we need to flush
> all. For Linux to boot, we don't.

This depends; so long as you've *only* used Normal, Inner-Shareable,
Inner Write-Back, Outer Write-Back, you could omit some maintenance, but
you still need to clean the Linux image to the PoC.

Any memory mapped with other attributes *must* be invalidated (and
perhaps clean+invalidated) from the caches.

> We can flush some memory (including the U-Boot stack) and turn off the
> MMU. As soon as the kernel boots up, it enables dcache and everything
> is back.

If you have used memory attributes inconsistent with Linux, things will
end badly from here, due to potential loss of coherency resulting from
mismatched memory attributes.

Thanks,
Mark.
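
Mark's point about the structure itself could look like this as a
sketch, assuming U-Boot's zero-size-entry terminator convention for
mem_map:

static void clean_mem_map_to_poc(void)
{
	struct mm_region *r = mem_map;
	int n = 0;

	/* count entries up to (and including) the terminator */
	while (r[n].size)
		n++;

	/* clean+invalidate the array itself to the PoC so it can be
	 * read coherently via non-cacheable accesses afterwards */
	flush_dcache_range((unsigned long)r,
			   (unsigned long)(r + n + 1));
}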
Stephen Warren Oct. 26, 2016, 7:41 p.m. UTC | #12
On 10/24/2016 04:44 AM, Mark Rutland wrote:
> Hi,
>
> Sorry for joining this a bit late; apologies if the below re-treads
> ground already covered.
>
> On Wed, Oct 19, 2016 at 09:25:02AM -0600, Stephen Warren wrote:
>> On 10/14/2016 02:17 PM, York Sun wrote:
>>> Current code turns off d-cache first, then flush all levels of cache.
>>> This results data loss. As soon as d-cache is off, the dirty cache
>>> is discarded according to the test on LS2080A. This issue was not
>>> seen as long as external L3 cache was flushed to push the data to
>>> main memory. However, external L3 cache is not guaranteed to have
>>> the data. To fix this, flush the d-cache by way/set first to make
>>> sure cache is clean before turning it off.
>>
>>> diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
>>
>>> @@ -478,9 +478,9 @@ void dcache_disable(void)
>>
>>> +	flush_dcache_all();
>>> 	set_sctlr(sctlr & ~(CR_C|CR_M));
>>>
>>> -	flush_dcache_all();
>>> 	__asm_invalidate_tlb_all();
>>
>> I talked to Mark Rutland at ARM, and I believe the current code is
>> correct.
>
> Well, almost, but not quite. It's a long story... ;)

To be clear, I was referring specifically/only to the relative ordering 
of the call to flush_dcache_all() and the SCTLR register modification, 
rather than wider aspects of the code:-)

> I gave a primer [1,2] on the details at ELC earlier this year, which may
> or may not be useful.
>
> The big details are:
>
> * Generally "Flush" is ambiguous/meaningless. Here you seem to want
>   clean+invalidate.

Yes. I think the naming of this code is driven by U-Boot's
cross-architecture compatibility and history, which is why it doesn't
use the expected ARM terminology.

> * Set/Way operations are for IMPLEMENTATION DEFINED (i.e. SoC-specific)
>   cache maintenance sequences, and are not truly portable (e.g. not
>   affecting system caches).
>
>   I assume that an earlier boot stage initialised the caches prior to
>   U-Boot. Given that, you *only* need to perform maintenance for the
>   memory you have (at any point) mapped with cacheable attributes, which
>   should be a small subset of the PA space. With ARMv8-A, broadcast
>   maintenance to the PoC should affect all relevant caches (assuming you
>   use the correct shareability attributes).

There is another thread (or fork of this thread) where York suggests 
replacing this code with operations by VA instead.

> * You *cannot* write a dcache disable routine in C, as the compiler can
>   perform a number of implicit memory accesses (e.g. stack, globals,
>   GOT). For that alone, I do not believe the code above is correct.
>
>   Note that we have seen this being an issue in practice, before we got
>   rid of Set/Way ops from arm64 Linux (see commit 5e051531447259e5).

I agree. In practice, at least with the compiler I happen to be using, 
we are currently getting lucky and there aren't any RAM accesses emitted 
by the compiler in this part of the code. Admittedly that won't hold 
true for everyone or across all of time though, so this should certainly 
be fixed...

> * Your dcache disable code *must* be clean to the PoC, prior to
>   execution, or instruction fetches could see stale data. You can first
>   *clean* this to the PoC, which is sufficient to avoid the problems
>   above.

I'm not sure if we have any documented rules/assumptions regarding 
cache/... state upon U-Boot entry like the Linux kernel does. It would 
probably be a good idea to do so...

I believe that for this particular piece of code, since it's always
executed late in U-Boot's operation, the requirement is covered by
U-Boot's code relocation routine, which memcpy()s .text/.data to the
top of usable RAM and hence hopefully already fully cleans the dcache
and invalidates the icache afterwards.

> * The SCTLR_ELx.{C,I} bits do not enable/disable caches; they merely
>   activate/deactivate cacheable attributes on data/instruction fetches.
>
>   Note that cacheable instruction fetches can allocate into unified/data
>   caches.
>
>   Also, note that the I bit is independent of the C bit, and the
>   attributes it provides differ when the M bit is clear. Generally, I
>   would advise that at all times M == C == I, as that leads to the least
>   surprise.

I think M==C==I is generally true in U-Boot, except for a limited time 
right before it jumps to the OS, where icache and dcache "are 
disabled"[1] separately, but one right after the other.

[1] to abuse the terminology that you pointed out was incorrect:-)

> Thanks,
> Mark.
>
> [1] http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf
> [2] https://www.youtube.com/watch?v=F0SlIMHRnLk
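
The relocation-time guarantee Stephen describes boils down to roughly
the following, as a conceptual sketch using the global data fields
U-Boot already tracks (gd->relocaddr and gd->mon_len describe the
relocated image); this is not the literal relocation code:

#include <common.h>

DECLARE_GLOBAL_DATA_PTR;

static void clean_relocated_image(void)
{
	/* after .text/.data have been copied to the top of RAM */
	flush_dcache_range(gd->relocaddr, gd->relocaddr + gd->mon_len);
	invalidate_icache_all();
}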
Stephen Warren Oct. 26, 2016, 7:47 p.m. UTC | #13
On 10/24/2016 04:59 AM, Mark Rutland wrote:
> On Fri, Oct 21, 2016 at 07:31:52PM +0000, york sun wrote:
>> I believe the PoC and PoU are before L3 for me.
>
> If you are using CCN, then the PoC is beyond the L3.
>
> Were the PoC before the L3, there would be no requirement to perform
> maintenance on the L3. The PoC is the point at which *all* accesses
> (cacheable or otherwise) see the same data.
>
> Per the ARM ARM (for ARMv8-A), maintenance to the PoC *must* affect
> system caches (including CCN).
>
>> I can clean/invalidate by VA, but it may not cover all the cache lines.
>> So by set/way is still needed.
>
> The problem is figuring out which VA ranges require maintenance.
>
> Do we not have an idea of the set of memory banks present in the SoC?
> Like the memblock array in Linux?

There are two data structures in ARM U-Boot that describe memory layout:

1) A list of RAM memory regions. U-Boot uses these to know where to 
relocate itself to (it relocates itself to the top of RAM at boot), and 
to fill in the /memory node when booting an OS using DT (and other 
equivalent mechanisms when not using DT.)

2) For AArch64 specifically, there's a memory map table that defines all 
RAM and MMIO regions, and the translation table attributes required for 
those regions.

Judging by your comments later in the original message, it sounds like 
it'd be fine to read from these structures during any dcache clean 
routine provided the table has already been cleaned. That makes using 
the tables much more tractable:-)
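
For structure (1), cleaning each recorded RAM bank by VA could look
like this sketch; bi_dram and CONFIG_NR_DRAM_BANKS are the existing
U-Boot fields, and the identity (VA == PA) mapping is an assumption:

#include <common.h>

DECLARE_GLOBAL_DATA_PTR;

static void flush_dram_banks(void)
{
	int i;

	for (i = 0; i < CONFIG_NR_DRAM_BANKS; i++) {
		if (!gd->bd->bi_dram[i].size)
			continue;

		flush_dcache_range(gd->bd->bi_dram[i].start,
				   gd->bd->bi_dram[i].start +
				   gd->bd->bi_dram[i].size);
	}
}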
York Sun Oct. 26, 2016, 7:54 p.m. UTC | #14
On 10/26/2016 12:47 PM, Stephen Warren wrote:
>
> There are two data structures in ARM U-Boot that describe memory layout:
>
> 1) A list of RAM memory regions. U-Boot uses these to know where to
> relocate itself to (it relocates itself to the top of RAM at boot), and
> to fill in the /memory node when booting an OS using DT (and other
> equivalent mechanisms when not using DT.)
>
> 2) For AArch64 specifically, there's a memory map table that defines all
> RAM and MMIO regions, and the translation table attributes required for
> those regions.
>
> Judging by your comments later in the original message, it sounds like
> it'd be fine to read from these structures during any dcache clean
> routine provided the table has already been cleaned. That makes using
> the tables much more tractable:-)
>

I think we need to benchmark walking through the MMU tables. It can map
a huge amount of memory; for our case, it is more than 16GB. I have been
reluctant to do so because of the size. I am now back to testing a
revert of _this_ patch, hoping to confirm what I learned from this
discussion. After that, I will see how long it takes to flush all cached
addresses by VA.

York
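
A rough harness for that benchmark, using U-Boot's millisecond timer;
flush_all_by_va() stands for the hypothetical mem_map walk sketched
earlier in the thread, not an existing function:

static void time_flush_by_va(void)
{
	ulong t0 = get_timer(0);

	flush_all_by_va();	/* hypothetical by-VA walk */
	printf("by-VA clean+invalidate: %lu ms\n", get_timer(t0));
}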
Stephen Warren Oct. 26, 2016, 8:12 p.m. UTC | #15
On 10/26/2016 01:54 PM, york sun wrote:
> I think we need to benchmark walking through the MMU tables. It can map
> a huge amount of memory; for our case, it is more than 16GB. I have been
> reluctant to do so because of the size. I am now back to testing a
> revert of _this_ patch, hoping to confirm what I learned from this
> discussion. After that, I will see how long it takes to flush all cached
> addresses by VA.

Given the existence of the two data structures I mentioned above, I 
don't think we'd ever need to walk the MMU translation tables (assuming 
you mean the data structures U-Boot creates and configures into the 
CPU's MMU), rather than just walking over the tiny number of entries in 
those data structures?
York Sun Oct. 26, 2016, 8:29 p.m. UTC | #16
On 10/26/2016 01:12 PM, Stephen Warren wrote:
> Given the existence of the two data structures I mentioned above, I
> don't think we'd ever need to walk the MMU translation tables (assuming
> you mean the data structures U-Boot creates and configures into the
> CPU's MMU), rather than just walking over the tiny number of entries in
> those data structures?
>

The list of RAM used by U-Boot is different from the total memory.
U-Boot relocates itself to gd->ram_top. We may have memory in a second
region, or even a third region.

We also have cached regions other than RAM. I know for Layerscape
SoCs, the QMan portal has 64MB cached. If we consider flushing _all_
addresses, they should be included.

I think the 2nd data structure you mentioned is the MMU table for 
aarch64, isn't it?

York
Stephen Warren Oct. 26, 2016, 9 p.m. UTC | #17
On 10/26/2016 02:29 PM, york sun wrote:
> On 10/26/2016 01:12 PM, Stephen Warren wrote:
>> Given the existence of the two data structures I mentioned above, I
>> don't think we'd ever need to walk the MMU translation tables (assuming
>> you mean the data structures U-Boot creates and configures into the
>> CPU's MMU), rather than just walking over the tiny number of entries in
>> those data structures?
>>
>
> The list of RAM used by U-Boot is different from the total memory.
> U-Boot relocates itself to gd->ram_top. We may have memory in a second
> region, or even a third region.
>
> We also have cached regions other than RAM. I know for Layerscape
> SoCs, the QMan portal has 64MB cached. If we consider flushing _all_
> addresses, they should be included.
>
> I think the 2nd data structure you mentioned is the MMU table for
> aarch64, isn't it?

The 2nd data structure I mentioned is what U-Boot uses to create the 
translation tables, but isn't the translation table itself.
York Sun Oct. 26, 2016, 9:02 p.m. UTC | #18
On 10/26/2016 12:54 PM, york.sun@nxp.com wrote:
> I think we need to benchmark walking through the MMU tables. It can map
> a huge amount of memory; for our case, it is more than 16GB. I have been
> reluctant to do so because of the size. I am now back to testing a
> revert of _this_ patch, hoping to confirm what I learned from this
> discussion.

I came back from my testing and I have more questions than answers.

For _this_ patch, I proposed to flush the cache before disabling it,
noting that once the dcache is disabled, the dirty data in the cache is
not visible to the core. My argument was that if we flush L1/L2, the
dirty lines could end up in L3 (I don't know for sure). If I want to
skip flushing L3, I have to fix that.

During this discussion, I thought I had made a mistake by flushing
L1/L2 by way/set first, then flushing by VA. Actually I didn't. I
flushed by VA first.

With today's test, the baseline (working in the sense of booting
Linux) is

PATCH 1/2 armv8: Fix dcache disable function
PATCH 2/2 armv8: Fix flush_dcache_all function

With these two patches, I flush the stack up to the top of U-Boot by
VA, followed by a flush by set/way. L3 is not flushed. Then the d-cache
is disabled. I know this is not a real "flush all" procedure. With this
modified procedure, I can continue to boot Linux.

If I revert patch 1, i.e. disable the dcache before flushing, I can see
the data is not visible from the core (debugging with a JTAG tool). My
hope was that the stale data would be flushed to main memory when
flushed by VA. That's not the case. The main memory doesn't have the
correct data. So my new question is: why doesn't flushing by VA flush
the data to main memory? Do I need to flush the cache while the cache
is enabled?

York
York Sun Oct. 26, 2016, 9:04 p.m. UTC | #19
On 10/26/2016 02:00 PM, Stephen Warren wrote:
> The 2nd data structure I mentioned is what U-Boot uses to create the
> translation tables, but isn't the translation table itself.
>

I see. You were referring to struct mm_region. It does simplify
walking the addresses. Your presumption is that we don't update the MMU
table after creating it. That is mostly true, but the current MMU code
allows updating.

York
York Sun Oct. 28, 2016, 5:38 p.m. UTC | #20
On 10/26/2016 02:02 PM, york.sun@nxp.com wrote:
> If I revert patch 1, i.e. disable the dcache before flushing, I can see
> the data is not visible from the core (debugging with a JTAG tool). My
> hope was that the stale data would be flushed to main memory when
> flushed by VA. That's not the case. The main memory doesn't have the
> correct data. So my new question is: why doesn't flushing by VA flush
> the data to main memory? Do I need to flush the cache while the cache
> is enabled?
>

Guys,

I think I found the root cause of my data loss.

Current code disables D-cache and MMU before flushing the cache. I think 
the problem is turning off MMU. MMU should stay on when we flush 
D-cache. We can turn it off after the flushing. Once I make this change, 
I can see the correct data in memory after flushing (by VA).

Do you agree we should leave MMU on during flushing?

York
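
In code, the ordering York argues for is the hunk under review: all
by-VA maintenance runs while SCTLR.M is still set, so translation and
cacheability are still in force. A sketch using the helpers already in
cache_v8.c:

void dcache_disable(void)
{
	uint32_t sctlr = get_sctlr();

	if (!(sctlr & CR_C))
		return;

	/* MMU and D-cache still on: by-VA ops translate and apply
	 * cacheable attributes, so dirty lines reach main memory */
	flush_dcache_all();

	/* only now drop the C and M bits together */
	set_sctlr(sctlr & ~(CR_C | CR_M));

	__asm_invalidate_tlb_all();
}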
Stephen Warren Oct. 28, 2016, 5:56 p.m. UTC | #21
On 10/28/2016 11:38 AM, york sun wrote:
>
> Guys,
>
> I think I found the root cause of my data loss.
>
> Current code disables D-cache and MMU before flushing the cache. I think
> the problem is turning off MMU. MMU should stay on when we flush
> D-cache. We can turn it off after the flushing. Once I make this change,
> I can see the correct data in memory after flushing (by VA).
>
> Do you agree we should leave MMU on during flushing?

If you're "flushing" by VA, then I'm not surprised since the MMU is what 
defines the VA->PA mapping, and perhaps you have some physically tagged 
caches.

However, I believe U-Boot mainline currently "flushes" by set/way, which 
I wouldn't expect MMU status to influence at all.
York Sun Oct. 28, 2016, 6:17 p.m. UTC | #22
On 10/28/2016 10:57 AM, Stephen Warren wrote:
> On 10/28/2016 11:38 AM, york sun wrote:
>>
>> Guys,
>>
>> I think I found the root cause of my data loss.
>>
>> Current code disables D-cache and MMU before flushing the cache. I think
>> the problem is turning off MMU. MMU should stay on when we flush
>> D-cache. We can turn it off after the flushing. Once I make this change,
>> I can see the correct data in memory after flushing (by VA).
>>
>> Do you agree we should leave MMU on during flushing?
>
> If you're "flushing" by VA, then I'm not surprised since the MMU is what
> defines the VA->PA mapping, and perhaps you have some physically tagged
> caches.
>
> However, I believe U-Boot mainline currently "flushes" by set/way, which
> I wouldn't expect MMU status to influence at all.

Flushing by set/way (only) is what I am trying to change. It would be 
better if we don't have to flush L3. Do you agree?

York
Stephen Warren Oct. 28, 2016, 6:32 p.m. UTC | #23
On 10/28/2016 12:17 PM, york sun wrote:
> On 10/28/2016 10:57 AM, Stephen Warren wrote:
>> If you're "flushing" by VA, then I'm not surprised since the MMU is what
>> defines the VA->PA mapping, and perhaps you have some physically tagged
>> caches.
>>
>> However, I believe U-Boot mainline currently "flushes" by set/way, which
>> I wouldn't expect MMU status to influence at all.
>
> Flushing by set/way (only) is what I am trying to change. It would be
> better if we don't have to flush L3. Do you agree?

It depends on whether the L3 is before or after the Point of Coherency. 
If it's before, then it needs to be cleaned. If it's after, then I 
believe it's irrelevant and can be skipped. I don't believe there's any 
other factor that will allow/prevent you from skipping operations on 
your L3; there's no wiggle-room or leeway.

Related, consider the following from the Linux kernel's 
Documentation/arm64/booting.txt:

> - Caches, MMUs
>   The MMU must be off.
>   Instruction cache may be on or off.
>   The address range corresponding to the loaded kernel image must be
>   cleaned to the PoC.

(That only applies to the kernel image specifically, but doing the same 
for the entire cache content seems reasonable, perhaps even required for 
other reasons?)
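
In U-Boot terms, that requirement is a by-VA clean of the image range
before the jump; as a sketch with hypothetical parameters (kernel_addr
and kernel_size are illustrative, not U-Boot APIs):

static void clean_kernel_to_poc(ulong kernel_addr, ulong kernel_size)
{
	/* dc civac to the PoC over the loaded image; U-Boot's boot
	 * path does the equivalent for the kernel it loads */
	flush_dcache_range(kernel_addr, kernel_addr + kernel_size);
}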
York Sun Oct. 28, 2016, 9:35 p.m. UTC | #24
On 10/28/2016 11:32 AM, Stephen Warren wrote:
> On 10/28/2016 12:17 PM, york sun wrote:
>> Flushing by set/way (only) is what I am trying to change. It would be
>> better if we don't have to flush L3. Do you agree?
>
> It depends on whether the L3 is before or after the Point of Coherency.
> If it's before, then it needs to be cleaned. If it's after, then I
> believe it's irrelevant and can be skipped. I don't believe there's any
> other factor that will allow/prevent you from skipping operations on
> your L3; there's no wiggle-room or leeway.

As Mark pointed out, my L3 is before the PoC. Flushing by set/way only
cleans the L1/L2 caches. Without flushing L3, or flushing by VA, my
stack is corrupted.

>
> Related, consider the following from the Linux kernel's
> Documentation/arm64/booting.txt:
>
>> - Caches, MMUs
>>   The MMU must be off.
>>   Instruction cache may be on or off.
>>   The address range corresponding to the loaded kernel image must be
>>   cleaned to the PoC.
>
> (That only applies to the kernel image specifically, but doing the same
> for the entire cache content seems reasonable, perhaps even required for
> other reasons?)

Booting Linux is not an issue here. The kernel image is flushed by VA.

I am struggling with dcache_disable(), which implies all dcache is
flushed. I don't have a reasonable way to flush all if I want to skip
L3. I tried to benchmark flushing by VA to cover my entire 16GB memory.
It took 30+ seconds. On the other hand, flushing by set/way and flushing
L3 together took 7 ms. If I only flush the U-Boot stack in this
function, it can run really fast, but that defeats the purpose of
flushing all cache.

I thought of parsing each set/way to find the address of each cache
line (I don't know how to do that yet), but the tag only contains the
physical address, not the VA.

The ARM document shows example code to clean the entire data or unified
cache to the PoC, very similar to the code we have in U-Boot
armv8/cache.S. Unless there are other cache maintenance instructions I
am not aware of, I don't see how to flush to the PoC by set/way.

At this point, I don't see a reasonable way to implement flushing all
dcache without flushing L3.

York
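
Where the thread effectively lands for this class of SoC is the split
York describes: a fast set/way pass for the architected L1/L2 paired
with IMPLEMENTATION DEFINED maintenance for the system cache. As a
sketch, with flush_l3_cache() standing in for the SoC-specific (e.g.
CCN-504) hook York mentions rather than a confirmed U-Boot API:

void flush_dcache_all(void)
{
	/* clean+invalidate L1/L2 by set/way (cache.S helper) */
	__asm_flush_dcache_all();

	/* system cache (e.g. CCN-504) maintenance; SoC-specific,
	 * typically a weak hook overridden per platform */
	flush_l3_cache();
}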
Mark Rutland Nov. 7, 2016, 2:03 p.m. UTC | #25
On Fri, Oct 28, 2016 at 12:32:36PM -0600, Stephen Warren wrote:
> Related, consider the following from the Linux kernel's
> Documentation/arm64/booting.txt:
> 
> >- Caches, MMUs
> >  The MMU must be off.
> >  Instruction cache may be on or off.
> >  The address range corresponding to the loaded kernel image must be
> >  cleaned to the PoC.
> 
> (That only applies to the kernel image specifically, but doing the
> same for the entire cache content seems reasonable, perhaps even
> required for other reasons?)

It's certainly preferable.

The wording is somewhat poor too, and needs some fixing up.

If anything has been allocated into the cache which may conflict with
later use with Normal Inner-Shareable Inner-WB Outer-WB mappings, this
needs to be (Cleaned+)Invalidated from the caches.

Thanks,
Mark.
Mark Rutland Nov. 7, 2016, 2:11 p.m. UTC | #26
On Fri, Oct 28, 2016 at 09:35:37PM +0000, york sun wrote:
> I am struggling with dcache_disable(), which implies all dcache is
> flushed. I don't have a reasonable way to flush all if I want to skip
> L3. I tried to benchmark flushing by VA to cover my entire 16GB memory.
> It took 30+ seconds. On the other hand, flushing by set/way and flushing
> L3 together took 7 ms. If I only flush the U-Boot stack in this
> function, it can run really fast, but that defeats the purpose of
> flushing all cache.
> 
> I thought of parsing each set/way to find the address of each cache
> line (I don't know how to do that yet), but the tag only contains the
> physical address, not the VA.

With the MMU off, translation is an idmap (i.e. VA == PA), so if you
have physical addresses, you can use those directly.

That said, the presence and implementation of any mechanism to read
addresses from the cache is IMPLEMENTATION DEFINED, so this will not be
portable.

> The ARM document shows example code to clean the entire data or unified
> cache to the PoC, very similar to the code we have in U-Boot armv8/cache.S.

Do you mean the "Example code for cache maintenance instructions"?

In recent versions of the ARM ARM there's a large note explaining why
this only works in very restricted scenarios (and cannot be used to
affect system caches such as your L3).

In the latest ARM ARM ("ARM DDI 0487A.k"), see page D3-1710.

> Unless there are other cache maintenance instructions I am not aware of,
> I don't see how to flush to the PoC by set/way.

Architecturally, Set/Way operations are not guaranteed to affect all
caches prior to the PoC, and may require other IMPLEMENTATION DEFINED
maintenance (e.g. MMIO control of system-level caches).

Thanks,
Mark.
York Sun Nov. 7, 2016, 4:23 p.m. UTC | #27
On 11/07/2016 06:12 AM, Mark Rutland wrote:
> On Fri, Oct 28, 2016 at 09:35:37PM +0000, york sun wrote:
>> Unless there are other cache maintenance instructions I am not aware of,
>> I don't see how to flush to the PoC by set/way.
>
> Architecturally, Set/Way operations are not guaranteed to affect all
> caches prior to the PoC, and may require other IMPLEMENTATION DEFINED
> maintenance (e.g. MMIO control of system-level caches).
>

At this point, seeking alternative ways to clean the entire cache
without flushing L3 seems non-productive. I am going to stop here.
Thanks for the discussion.

York
diff mbox

Patch

diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
index cd3f6c1..92d6277 100644
--- a/arch/arm/cpu/armv8/cache_v8.c
+++ b/arch/arm/cpu/armv8/cache_v8.c
@@ -478,9 +478,9 @@  void dcache_disable(void)
 	if (!(sctlr & CR_C))
 		return;
 
+	flush_dcache_all();
 	set_sctlr(sctlr & ~(CR_C|CR_M));
 
-	flush_dcache_all();
 	__asm_invalidate_tlb_all();
 }