Patchwork Unnecessary cache-line flush on page table updates ?

login
register
mail settings
Submitter Catalin Marinas
Date July 1, 2011, 10:10 a.m.
Message ID <20110701101019.GA1723@e102109-lin.cambridge.arm.com>
Download mbox | patch
Permalink /patch/102891/
State New
Headers show

Comments

Catalin Marinas - July 1, 2011, 10:10 a.m.
On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
> Based on TRM of Cortex A9, the MMU reads page table entries from L1-D
> cache not from memory. Then I think we do not need to flush the cache
> line in the following code because MMU will always see up-to-date view
> of page table in both UP and SMP systems.
> 
> linux/arch/arm/mm/proc-v7.S
> 
> ENTRY(cpu_v7_set_pte_ext)
> 	...
>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
> D-cache // why we need this in A9?
>         …
> 
> If this is a necessary one, could you please explain the reason? Thanks.

No, it's not necessary, only that this file is used by other processors
as well. The solution below checks the ID_MMFR3[23:20] bits (coherent
walk) and avoid flushing if the value is 1. The same could be done for
PMD entries, though that's less critical than the PTEs.

Please note that the patch is not fully tested.

8<--------------------

From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Fri, 1 Jul 2011 10:57:07 +0100
Subject: [PATCH] ARMv7: Do not clean the PTE coherent page table walk is supported

This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
only cleans the D-cache corresponding to a PTE if coherent page table
walks are not supported.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/mm/proc-v7.S |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)
heechul Yun - July 1, 2011, 9:42 p.m.
Great.

Removing the PTE flush seems to have a noticeable performance
difference in my test. The followings are lmbench 3.0a performance
result measured on a Cortex A9 SMP platform. So far, I did not have
any problem while doing various test.

=========
mm-patch:
=========
Pagefaults on /tmp/XXX: 3.0759 microseconds
Process fork+exit: 464.5414 microseconds
Process fork+execve: 785.4944 microseconds
Process fork+/bin/sh -c: 488.6204 microseconds

=========
original:
=========
Pagefaults on /tmp/XXX: 3.6209 microseconds
Process fork+exit: 485.5236 microseconds
Process fork+execve: 820.0613 microseconds
Process fork+/bin/sh -c: 2966.3828 microseconds

Heechul

On Fri, Jul 1, 2011 at 3:10 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
>> Based on TRM of Cortex A9, the MMU reads page table entries from L1-D
>> cache not from memory. Then I think we do not need to flush the cache
>> line in the following code because MMU will always see up-to-date view
>> of page table in both UP and SMP systems.
>>
>> linux/arch/arm/mm/proc-v7.S
>>
>> ENTRY(cpu_v7_set_pte_ext)
>>       ...
>>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
>> D-cache // why we need this in A9?
>>         …
>>
>> If this is a necessary one, could you please explain the reason? Thanks.
>
> No, it's not necessary, only that this file is used by other processors
> as well. The solution below checks the ID_MMFR3[23:20] bits (coherent
> walk) and avoid flushing if the value is 1. The same could be done for
> PMD entries, though that's less critical than the PTEs.
>
> Please note that the patch is not fully tested.
>
> 8<--------------------
>
> From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Fri, 1 Jul 2011 10:57:07 +0100
> Subject: [PATCH] ARMv7: Do not clean the PTE coherent page table walk is supported
>
> This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
> only cleans the D-cache corresponding to a PTE if coherent page table
> walks are not supported.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
>  arch/arm/mm/proc-v7.S |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
> index 8013afc..fc5b36f 100644
> --- a/arch/arm/mm/proc-v7.S
> +++ b/arch/arm/mm/proc-v7.S
> @@ -166,7 +166,9 @@ ENTRY(cpu_v7_set_pte_ext)
>  ARM(  str     r3, [r0, #2048]! )
>  THUMB(        add     r0, r0, #2048 )
>  THUMB(        str     r3, [r0] )
> -       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
> +       mrc     p15, 0, r3, c0, c1, 7           @ read ID_MMFR3
> +       tst     r3, #0xf << 20                  @ check the coherent walk bits
> +       mcreq   p15, 0, r0, c7, c10, 1          @ flush_pte
>  #endif
>        mov     pc, lr
>  ENDPROC(cpu_v7_set_pte_ext)
>
> --
> Catalin
>
Catalin Marinas - July 4, 2011, 9:45 a.m.
On Fri, Jul 01, 2011 at 10:42:00PM +0100, heechul Yun wrote:
> Great.
> 
> Removing the PTE flush seems to have a noticeable performance
> difference in my test. The followings are lmbench 3.0a performance
> result measured on a Cortex A9 SMP platform. So far, I did not have
> any problem while doing various test.
> 
> =========
> mm-patch:
> =========
> Pagefaults on /tmp/XXX: 3.0759 microseconds
> Process fork+exit: 464.5414 microseconds
> Process fork+execve: 785.4944 microseconds
> Process fork+/bin/sh -c: 488.6204 microseconds
> 
> =========
> original:
> =========
> Pagefaults on /tmp/XXX: 3.6209 microseconds
> Process fork+exit: 485.5236 microseconds
> Process fork+execve: 820.0613 microseconds
> Process fork+/bin/sh -c: 2966.3828 microseconds

Given these results, I think it's worth merging the patch. Can I add
your Tested-by?

I think there can be a few other optimisations in the TLB area but it
needs some digging.

Thanks.
Catalin Marinas - July 4, 2011, 10:43 a.m.
On Mon, Jul 04, 2011 at 11:02:21AM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 10:45:32AM +0100, Catalin Marinas wrote:
> > Given these results, I think it's worth merging the patch. Can I add
> > your Tested-by?
> 
> If we're going to make this change to use the coherent information,
> let's address all the places in one go, which are:
> 
>         clean_pmd_entry
>         flush_pmd_entry
>         clean_pte_table
> 
> These require a little more thought as we aren't guaranteed to have
> ID_MMFR3 in place - maybe they should be callbacks into the per-CPU
> code.

For the first two, can we not clear the TLB_DCLEAN bit in
__cpu_tlb_flags (only a single check at boot time?).

For the last one, we could add a tlb_flags() check.

> It would also be a good idea to change the comment from "flush_pte"
> to "Clean data cache to PoU".

OK.

> > I think there can be a few other optimisations in the TLB area but it
> > needs some digging.
> 
> The single-TLB model works fairly well, but as I thought the lack of
> mcr%? processing by GCC makes the asm/tlbflush.h code fairly disgusting
> even for a v6+v7 kernel.  Luckily, we can play some tricks and sort
> some of that out.  The patch below is not complete (and can result in
> some rules of the architecture being violated - namely the requirement
> for an ISB after the BTB flush without a branch between) but it
> illustrates the idea:

I'm not sure about this rule, I can ask for some clarification (we are
not changing the memory map of the branch we execute).

According to the ARM ARM, the TLB invalidation sequence should be:

STR rx, [Translation table entry]          ; write new entry to the translation table
Clean cache line [Translation table entry] ; This operation is not required with the
                                           ; Multiprocessing Extensions.
DSB            ; ensures visibility of the data cleaned from the D Cache
Invalidate TLB entry by MVA (and ASID if non-global) [page address]
Invalidate BTC
DSB            ; ensure completion of the Invalidate TLB operation
ISB            ; ensure table changes visible to instruction fetch

So we have a DSB and ISB (when we don't return to user) unconditionally.

Starting with Cortex-A8 (well, unless you enable some errata
workarounds), the BTB is a no-op anyway, so maybe we could have the BTB
unconditionally as well (for ARMv6/v7). The only problem is the inner
shareable if we are on SMP - maybe we can do some run-time code patching
for it.

> diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
> index d2005de..252874c 100644
> --- a/arch/arm/include/asm/tlbflush.h
> +++ b/arch/arm/include/asm/tlbflush.h
> @@ -34,15 +34,15 @@
>  #define TLB_V6_D_ASID  (1 << 17)
>  #define TLB_V6_I_ASID  (1 << 18)
> 
> -#define TLB_BTB                (1 << 28)
> -
>  /* Unified Inner Shareable TLB operations (ARMv7 MP extensions) */
>  #define TLB_V7_UIS_PAGE        (1 << 19)
>  #define TLB_V7_UIS_FULL (1 << 20)
>  #define TLB_V7_UIS_ASID (1 << 21)
> 
>  /* Inner Shareable BTB operation (ARMv7 MP extensions) */
> -#define TLB_V7_IS_BTB  (1 << 22)
> +#define TLB_V7_IS_BTB  (1 << 26)
> +#define TLB_BTB                (1 << 27)
> +#define TLB_BTB_BARRIER        (1 << 28)

I won't comment on the BTB changes until we clarify the conditionality
of barriers and BTB.
Russell King - ARM Linux - July 4, 2011, 11:13 a.m.
On Mon, Jul 04, 2011 at 11:43:29AM +0100, Catalin Marinas wrote:
> According to the ARM ARM, the TLB invalidation sequence should be:
> 
> STR rx, [Translation table entry]          ; write new entry to the translation table
> Clean cache line [Translation table entry] ; This operation is not required with the
>                                            ; Multiprocessing Extensions.
> DSB            ; ensures visibility of the data cleaned from the D Cache
> Invalidate TLB entry by MVA (and ASID if non-global) [page address]
> Invalidate BTC
> DSB            ; ensure completion of the Invalidate TLB operation
> ISB            ; ensure table changes visible to instruction fetch
> 
> So we have a DSB and ISB (when we don't return to user) unconditionally.

That is the "cover all cases of following code" example sequence - it
is intended that at the end of the sequence, the CPU will be able to
execute whatever code irrespective of what it is.

The clarity of that comes from reading the actual rules given before
the example code:

| For TLB maintenance, the translation table walk is treated as a separate
| observer:
| • A write to the translation tables, after it has been cleaned from the
|   cache if appropriate, is only guaranteed to be seen by a translation
|   table walk caused by an explicit load or store after the execution of
|   both a DSB and an ISB.
|
|   However, it is guaranteed that any writes to the translation tables are
|   not seen by any explicit memory access that occurs in program order
|   before the write to the translation tables.

In other words, if we write to the translation tables to setup a new
mapping where one did not previously exist, we only need an intervening
DSB and ISB when we want to access data through that that mapping.  So,
ioremap(), vmalloc(), vmap(), module allocation, etc would require this.

If we are tearing down a mapping, then we don't need any barrier for
individual PTE entries, but we do if we remove a higher-level (PMD/PUD)
table.

Note that it is irrelevant whether or not we had to clean the cache.

That "However" clause is an interesting one though - it seems to imply
that no barrier is required when we zero out a new page table, before
we link the new page table into the higher order table.  The memory we
allocated for the page table doesn't become a page table until it is
linked into the page table tree.  It also raises the question about how
the system knows that a particular store is to something that's a page
table and something that isn't...  Given that normal memory accesses are
unordered, I think this paragraph is misleading and wrong.

So, I think we need a DSB+ISB in clean_pte_table() to ensure that the
zeroing of that page will be visible to everyone coincidentally or before
the table is linked into the page table tree.  Maybe the arch people can
clarify that?

| • For the base ARMv7 architecture and versions of the architecture before
|   ARMv7, if the translation tables are held in Write-Back Cacheable memory,
|   the caches must be cleaned to the point of unification after writing to
|   the translation tables and before the DSB instruction. This ensures that
|   the updated translation table are visible to a hardware translation
|   table walk.

IOW, if ID_MMFR3 is implemented on ARMv7+ and it indicates that coherent
walks are supported, we don't need to clean the cache entries for
PUD/PMD/PTE tables.

| • A write to the translation tables, after it has been cleaned from the
|   cache if appropriate, is only guaranteed to be seen by a translation
|   table walk caused by the instruction fetch of an instruction that
|   follows the write to the translation tables after both a DSB and an ISB.

So we also need a DSB and ISB if we are going to execute code from the
new mapping.  This applies to module_alloc() only.

As far as the BTB goes, I wonder if we can postpone that for user TLB
ops by setting a TIF_ flag and checking that before returning to userspace.
That would avoid having to needlessly destroy the cached branch information
for kernel space while looping over the page tables.  The only other place
that needs to worry about that is module_alloc() and vmap/vmalloc with
PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
Thoughts?
Russell King - ARM Linux - July 4, 2011, 1:05 p.m.
On Mon, Jul 04, 2011 at 11:43:29AM +0100, Catalin Marinas wrote:
> On Mon, Jul 04, 2011 at 11:02:21AM +0100, Russell King - ARM Linux wrote:
> > The single-TLB model works fairly well, but as I thought the lack of
> > mcr%? processing by GCC makes the asm/tlbflush.h code fairly disgusting
> > even for a v6+v7 kernel.  Luckily, we can play some tricks and sort
> > some of that out.  The patch below is not complete (and can result in
> > some rules of the architecture being violated - namely the requirement
> > for an ISB after the BTB flush without a branch between) but it
> > illustrates the idea:
> 
> I'm not sure about this rule, I can ask for some clarification (we are
> not changing the memory map of the branch we execute).

There's no need for clarification on the BTB and branch issue, the
ARM ARM is quite clear on this topic:

Branch predictor maintenance operations and the memory order model
The following rule describes the effect of the memory order model on
the branch predictor maintenance operations:
• Any invalidation of the branch predictor is guaranteed to take effect only
  after one of the following:
  ■ execution of a ISB instruction
  ■ taking an exception
  ■ return from an exception.
Therefore, if a branch instruction appears between an invalidate branch
prediction instruction and an ISB operation, exception entry or exception
return, it is UNPREDICTABLE whether the branch instruction is affected by
the invalidate. Software must avoid this ordering of instructions, because
it might lead to UNPREDICTABLE behavior.

The branch predictor maintenance operations must be used to invalidate
entries in the branch predictor after any of the following events:
• enabling or disabling the MMU
• writing new data to instruction locations
• writing new mappings to the translation tables
• changes to the TTBR0, TTBR1, or TTBCR registers, unless accompanied by
  a change to the ContextID or the FCSE ProcessID.

Failure to invalidate entries might give UNPREDICTABLE results, caused by
the execution of old branches.

So:

	mcr	p15, 0, %0, c7, c5, 6 @ BTC invalidate
	tst	%1, #value
	beq	no_dsb_isb
	dsb
	isb
no_dsb_isb:

is strictly not predictable whether the branch will be taken.

The only remaining question is: the operations prior to the BTC invalidate
will not have changed the code path which 'beq' is a part of, so is it
_really_ the case that a BTC invalidate will make such a branch
unpredictable?

I guess the reason for this is that if the BTC is half-way through an
invalidate, its state may not be determinable, and so it is not
determinable whether the branch will be taken irrespective of the
previous BTC state before the invalidate and the new state after the
isb.

Or, to put it another way, the BTC can return random results between
the BTC invalidate and the following ISB.
Catalin Marinas - July 4, 2011, 1:15 p.m.
On Mon, Jul 04, 2011 at 02:05:21PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 11:43:29AM +0100, Catalin Marinas wrote:
> > On Mon, Jul 04, 2011 at 11:02:21AM +0100, Russell King - ARM Linux wrote:
> > > The single-TLB model works fairly well, but as I thought the lack of
> > > mcr%? processing by GCC makes the asm/tlbflush.h code fairly disgusting
> > > even for a v6+v7 kernel.  Luckily, we can play some tricks and sort
> > > some of that out.  The patch below is not complete (and can result in
> > > some rules of the architecture being violated - namely the requirement
> > > for an ISB after the BTB flush without a branch between) but it
> > > illustrates the idea:
> > 
> > I'm not sure about this rule, I can ask for some clarification (we are
> > not changing the memory map of the branch we execute).
> 
> There's no need for clarification on the BTB and branch issue, the
> ARM ARM is quite clear on this topic:
> 
> Branch predictor maintenance operations and the memory order model
> The following rule describes the effect of the memory order model on
> the branch predictor maintenance operations:
> • Any invalidation of the branch predictor is guaranteed to take effect only
>   after one of the following:
>   ■ execution of a ISB instruction
>   ■ taking an exception
>   ■ return from an exception.
> Therefore, if a branch instruction appears between an invalidate branch
> prediction instruction and an ISB operation, exception entry or exception
> return, it is UNPREDICTABLE whether the branch instruction is affected by
> the invalidate. Software must avoid this ordering of instructions, because
> it might lead to UNPREDICTABLE behavior.

So my this rule should we always use an ISB even if we only flushed user
TLB entries? Otherwise we get some branches before the exception return.

My reading is that it only affects the branch in the TLB op function for
which we don't change any mapping, so there isn't any risk of
mis-prediction. That's why I think I should ask for clarification if we
can't avoid the branch.
heechul Yun - July 4, 2011, 2:59 p.m.
On Mon, Jul 4, 2011 at 2:45 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jul 01, 2011 at 10:42:00PM +0100, heechul Yun wrote:
>> Great.
>>
>> Removing the PTE flush seems to have a noticeable performance
>> difference in my test. The followings are lmbench 3.0a performance
>> result measured on a Cortex A9 SMP platform. So far, I did not have
>> any problem while doing various test.
>>
>> =========
>> mm-patch:
>> =========
>> Pagefaults on /tmp/XXX: 3.0759 microseconds
>> Process fork+exit: 464.5414 microseconds
>> Process fork+execve: 785.4944 microseconds
>> Process fork+/bin/sh -c: 488.6204 microseconds
>>
>> =========
>> original:
>> =========
>> Pagefaults on /tmp/XXX: 3.6209 microseconds
>> Process fork+exit: 485.5236 microseconds
>> Process fork+execve: 820.0613 microseconds
>> Process fork+/bin/sh -c: 2966.3828 microseconds
>
> Given these results, I think it's worth merging the patch. Can I add
> your Tested-by?

Yes you can.
I think this patch will be valuable for many applications (e.g.,
server applications).

Heechul

>
> I think there can be a few other optimisations in the TLB area but it
> needs some digging.
>
> Thanks.
>
> --
> Catalin
>
Catalin Marinas - July 4, 2011, 3:58 p.m.
On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 11:43:29AM +0100, Catalin Marinas wrote:
> > According to the ARM ARM, the TLB invalidation sequence should be:
> > 
> > STR rx, [Translation table entry]          ; write new entry to the translation table
> > Clean cache line [Translation table entry] ; This operation is not required with the
> >                                            ; Multiprocessing Extensions.
> > DSB            ; ensures visibility of the data cleaned from the D Cache
> > Invalidate TLB entry by MVA (and ASID if non-global) [page address]
> > Invalidate BTC
> > DSB            ; ensure completion of the Invalidate TLB operation
> > ISB            ; ensure table changes visible to instruction fetch
> > 
> > So we have a DSB and ISB (when we don't return to user) unconditionally.
> 
> That is the "cover all cases of following code" example sequence - it
> is intended that at the end of the sequence, the CPU will be able to
> execute whatever code irrespective of what it is.
> 
> The clarity of that comes from reading the actual rules given before
> the example code:
> 
> | For TLB maintenance, the translation table walk is treated as a separate
> | observer:
> | • A write to the translation tables, after it has been cleaned from the
> |   cache if appropriate, is only guaranteed to be seen by a translation
> |   table walk caused by an explicit load or store after the execution of
> |   both a DSB and an ISB.
> |
> |   However, it is guaranteed that any writes to the translation tables are
> |   not seen by any explicit memory access that occurs in program order
> |   before the write to the translation tables.
> 
> In other words, if we write to the translation tables to setup a new
> mapping where one did not previously exist, we only need an intervening
> DSB and ISB when we want to access data through that that mapping.  So,
> ioremap(), vmalloc(), vmap(), module allocation, etc would require this.

Yes.

> If we are tearing down a mapping, then we don't need any barrier for
> individual PTE entries, but we do if we remove a higher-level (PMD/PUD)
> table.

It depends on the use, I don't think we can generally assume that any
tearing down is fine since there are cases where we need a guaranteed
fault (zap_vma_ptes may only tear down PTE entries but the driver using
it expects a fault if something else tries to access that location). A
DSB would be enough.

> That "However" clause is an interesting one though - it seems to imply
> that no barrier is required when we zero out a new page table, before
> we link the new page table into the higher order table.  The memory we
> allocated for the page table doesn't become a page table until it is
> linked into the page table tree. 

That's my understanding of that clause as well.

> It also raises the question about how
> the system knows that a particular store is to something that's a page
> table and something that isn't...  Given that normal memory accesses are
> unordered, I think this paragraph is misleading and wrong.

Reordering of accesses can happen because of load speculation, store
reordering in the write buffer or delays on the bus outside the CPU. The
ARM processors do not issue stores speculatively. When a memory access
is issued, it checks the TLB and may perform a PTW (otherwise the
external bus wouldn't know the address. For explicit accesses, if the
PTW fails, it raises a fault (and we also need a precise abort).

So in case of load from memory vs store to page table, just because the
store is always issued after the load, the address translation for the
load is done before the changes to the page table.

In case of store vs store, AFAIK the stores in the write buffer already
have the physical address (hence the TLB look-up) and give that they are
issued in program order, there is no risk of one being visible before
the other (which should have gone through the write buffer).

> So, I think we need a DSB+ISB in clean_pte_table() to ensure that the
> zeroing of that page will be visible to everyone coincidentally or before
> the table is linked into the page table tree.  Maybe the arch people can
> clarify that?

I can ask if you need a better explanation than above.

> | • For the base ARMv7 architecture and versions of the architecture before
> |   ARMv7, if the translation tables are held in Write-Back Cacheable memory,
> |   the caches must be cleaned to the point of unification after writing to
> |   the translation tables and before the DSB instruction. This ensures that
> |   the updated translation table are visible to a hardware translation
> |   table walk.
> 
> IOW, if ID_MMFR3 is implemented on ARMv7+ and it indicates that coherent
> walks are supported, we don't need to clean the cache entries for
> PUD/PMD/PTE tables.

Yes.

> | • A write to the translation tables, after it has been cleaned from the
> |   cache if appropriate, is only guaranteed to be seen by a translation
> |   table walk caused by the instruction fetch of an instruction that
> |   follows the write to the translation tables after both a DSB and an ISB.
> 
> So we also need a DSB and ISB if we are going to execute code from the
> new mapping.  This applies to module_alloc() only.

Yes.

> As far as the BTB goes, I wonder if we can postpone that for user TLB
> ops by setting a TIF_ flag and checking that before returning to userspace.
> That would avoid having to needlessly destroy the cached branch information
> for kernel space while looping over the page tables.  The only other place
> that needs to worry about that is module_alloc() and vmap/vmalloc with
> PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().

With setting and checking the TIF_ flag we penalise newer hardware
(Cortex-A8 onwards) where the BTB invalidation is a no-op. But I'll
check with the people here if there are any implications with deferring
the BTB invalidation.
Russell King - ARM Linux - July 4, 2011, 7:58 p.m.
On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > If we are tearing down a mapping, then we don't need any barrier for
> > individual PTE entries, but we do if we remove a higher-level (PMD/PUD)
> > table.
> 
> It depends on the use, I don't think we can generally assume that any
> tearing down is fine since there are cases where we need a guaranteed
> fault (zap_vma_ptes may only tear down PTE entries but the driver using
> it expects a fault if something else tries to access that location). A
> DSB would be enough.

It depends whether we're unmapping due to munmap or for kernel internal
stuff, or whether we're doing it as a result of disposing with the
page tables.  In the latter case, avoiding the DSBs would be a good
trick to pull.

> > That "However" clause is an interesting one though - it seems to imply
> > that no barrier is required when we zero out a new page table, before
> > we link the new page table into the higher order table.  The memory we
> > allocated for the page table doesn't become a page table until it is
> > linked into the page table tree. 
> 
> That's my understanding of that clause as well.
> 
> > It also raises the question about how
> > the system knows that a particular store is to something that's a page
> > table and something that isn't...  Given that normal memory accesses are
> > unordered, I think this paragraph is misleading and wrong.
> 
> Reordering of accesses can happen because of load speculation, store
> reordering in the write buffer or delays on the bus outside the CPU. The
> ARM processors do not issue stores speculatively. When a memory access
> is issued, it checks the TLB and may perform a PTW (otherwise the
> external bus wouldn't know the address. For explicit accesses, if the
> PTW fails, it raises a fault (and we also need a precise abort).

I think you missed my point.  The order in which stores to normal memory
appear is not determinable.  In the case of writeback caches, it depends
on the cache replacement algorithm, the state of the cache and its
allocation behaviour.

It is entirely possible that the store to the page tables _could_ appear
in memory (and therefore visible to the MMU) before the stores to the
new page table initializing it to zeros.

For instance, lets say that the L1 writeback cache is read allocate only,
and does not read entries from the L1 cache.

The new page table happens to contain some L1 cache lines, but the upper
table entry is not in the L1 cache.  This means that stores to the new
page table hit the L1 cache lines, which become dirty.  The store to the
upper table entry bypasses the L1 cache and is immediately placed into
the write buffer.

This means in all probability that the MMU will see the new page table
before it is visibly initialized.

> > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > ops by setting a TIF_ flag and checking that before returning to userspace.
> > That would avoid having to needlessly destroy the cached branch information
> > for kernel space while looping over the page tables.  The only other place
> > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> 
> With setting and checking the TIF_ flag we penalise newer hardware
> (Cortex-A8 onwards) where the BTB invalidation is a no-op. But I'll
> check with the people here if there are any implications with deferring
> the BTB invalidation.

Thanks.
heechul Yun - July 4, 2011, 9:24 p.m.
On Fri, Jul 1, 2011 at 2:42 PM, heechul Yun <heechul@illinois.edu> wrote:
> Great.
>
> Removing the PTE flush seems to have a noticeable performance
> difference in my test. The followings are lmbench 3.0a performance
> result measured on a Cortex A9 SMP platform. So far, I did not have
> any problem while doing various test.
>
> =========
> mm-patch:
> =========
> Pagefaults on /tmp/XXX: 3.0759 microseconds
> Process fork+exit: 464.5414 microseconds
> Process fork+execve: 785.4944 microseconds
> Process fork+/bin/sh -c: 488.6204 microseconds

I realized that I made a big mistake on the  fork+/bin/sh data (other
numbers are fine). What happened was that there was no /bin/sh
(android platform has /system/bin/sh instead). When I did the first
test with the original kernel I made /bin as a symlink to /system/bin
which I forgot to do after flashing the system with the new, patched,
kernel.

The following is a result of the patched kernel after the correct symlink

Pagefaults on /tmp/XXX: 3.0991 microseconds
Process fork+exit: 465.1620 microseconds
Process fork+execve: 781.5077 microseconds
Process fork+/bin/sh -c: 2804.2023 microseconds

It's still noticeably better (about 5~15%) but not substantial.
Sorry for the confusion.

Heechul

>
> =========
> original:
> =========
> Pagefaults on /tmp/XXX: 3.6209 microseconds
> Process fork+exit: 485.5236 microseconds
> Process fork+execve: 820.0613 microseconds
> Process fork+/bin/sh -c: 2966.3828 microseconds
>
> Heechul
>
> On Fri, Jul 1, 2011 at 3:10 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
>>> Based on TRM of Cortex A9, the MMU reads page table entries from L1-D
>>> cache not from memory. Then I think we do not need to flush the cache
>>> line in the following code because MMU will always see up-to-date view
>>> of page table in both UP and SMP systems.
>>>
>>> linux/arch/arm/mm/proc-v7.S
>>>
>>> ENTRY(cpu_v7_set_pte_ext)
>>>       ...
>>>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
>>> D-cache // why we need this in A9?
>>>         …
>>>
>>> If this is a necessary one, could you please explain the reason? Thanks.
>>
>> No, it's not necessary, only that this file is used by other processors
>> as well. The solution below checks the ID_MMFR3[23:20] bits (coherent
>> walk) and avoid flushing if the value is 1. The same could be done for
>> PMD entries, though that's less critical than the PTEs.
>>
>> Please note that the patch is not fully tested.
>>
>> 8<--------------------
>>
>> From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
>> From: Catalin Marinas <catalin.marinas@arm.com>
>> Date: Fri, 1 Jul 2011 10:57:07 +0100
>> Subject: [PATCH] ARMv7: Do not clean the PTE coherent page table walk is supported
>>
>> This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
>> only cleans the D-cache corresponding to a PTE if coherent page table
>> walks are not supported.
>>
>> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>>  arch/arm/mm/proc-v7.S |    4 +++-
>>  1 files changed, 3 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
>> index 8013afc..fc5b36f 100644
>> --- a/arch/arm/mm/proc-v7.S
>> +++ b/arch/arm/mm/proc-v7.S
>> @@ -166,7 +166,9 @@ ENTRY(cpu_v7_set_pte_ext)
>>  ARM(  str     r3, [r0, #2048]! )
>>  THUMB(        add     r0, r0, #2048 )
>>  THUMB(        str     r3, [r0] )
>> -       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
>> +       mrc     p15, 0, r3, c0, c1, 7           @ read ID_MMFR3
>> +       tst     r3, #0xf << 20                  @ check the coherent walk bits
>> +       mcreq   p15, 0, r0, c7, c10, 1          @ flush_pte
>>  #endif
>>        mov     pc, lr
>>  ENDPROC(cpu_v7_set_pte_ext)
>>
>> --
>> Catalin
>>
>
Russell King - ARM Linux - July 4, 2011, 11:20 p.m.
On Mon, Jul 04, 2011 at 08:58:19PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> > With setting and checking the TIF_ flag we penalise newer hardware
> > (Cortex-A8 onwards) where the BTB invalidation is a no-op. But I'll
> > check with the people here if there are any implications with deferring
> > the BTB invalidation.
> 
> Thanks.

Please can you also check whether BTC invalidation is required when a
page is removed from the page tables for the purpose of swapping it
out (so the PTE entry becomes zero).

Also, is BTC invalidation required if the very same page that was
there before is later restored without any modification to the
instruction content of that page.  (I'm thinking page is aged to old,
and so unmapped, then a prefetch abort which reinstates the same page
that was there previously.)

Finally, is BTC invalidation required if a different physical page
containing the same instruction content as before is placed at that
location?

I expect all but the last case requires BTC invalidation.

Lastly, please check whether population and removal of page table
entries for NX pages require BTC invalidation.  I expect not.
Catalin Marinas - July 5, 2011, 9:26 a.m.
On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > ops by setting a TIF_ flag and checking that before returning to userspace.
> > That would avoid having to needlessly destroy the cached branch information
> > for kernel space while looping over the page tables.  The only other place
> > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> 
> Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> but we do need to do a dsb+isb.

Why would we need an ISB?

> Firstly, the majority of mappings are created with NX set, so the BTC
> can't be involved in anything created there (as we can't execute code
> there.)

OK.

> Secondly, if we're loading a code into kernel space to execute, then we
> need to ensure I/D coherency via flush_icache_range(), which has to
> happen after the mappings have been created, and this already does our
> BTC invalidate+dsb+isb.

OK.

> So, as far as kernel space TLB invalidation goes, my conclusion is that
> we do not have to touch the BTC at all there, and we can leave that to
> flush_icache_range() (and therefore cpu..coherent_kern_range) to deal
> with.
> 
> Practically, testing it out on Versatile Express loading/unloading a
> few modules shows no ill effects from dropping the BTC invalidates from
> the kernel TLB invalidate ops.

AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
maintenance are no-ops. You wouldn't notice any issues if you remove
them (you can check ID_MMFR1[31:28].

> 8<----------
> From: Russell King <rmk+kernel@arm.linux.org.uk>
> ARM: btc: avoid invalidating the branch target cache on kernel TLB maintanence
> 
> Kernel space needs very little in the way of BTC maintanence as most
> mappings which are created and destroyed are non-executable, and so
> could never enter the instruction stream.
> 
> The case which does warrant BTC maintanence is when a module is loaded.
> This creates a new executable mapping, but at that point the pages have
> not been initialized with code and data, so at that point they contain
> unpredictable information.  Invalidating the BTC at this stage serves
> little useful purpose.
> 
> Before we execute module code, we call flush_icache_range(), which deals
> with the BTC maintanence requirements.  This ensures that we have a BTC
> maintanence operation before we execute code via the newly created
> mapping.
> 
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

The patch looks fine (it would be good if we get some testing on an
ARM11 platform).
Russell King - ARM Linux - July 5, 2011, 9:46 a.m.
On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > That would avoid having to needlessly destroy the cached branch information
> > > for kernel space while looping over the page tables.  The only other place
> > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > 
> > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > but we do need to do a dsb+isb.
> 
> Why would we need an ISB?

Because the ARM ARM tells us that we must do a dsb+isb to ensure that
instructions are visible to the instruction stream after a change to
the page tables.

I've since added the dsb+isbs back to the TLB ops because the ARM ARM
is qutie explicit that both are required to ensure that explicit
accesses see the effect of the TLB operation.  To me it is rather
perverse that an ISB is required to ensure that this sequence works:

	write page table entry
	clean
	dsb
	flush tlb entry
	dsb
	isb
	read/write new page

but the ARM ARM is quite explicit that this sequence is insufficient:

	write page table entry
	clean
	dsb
	flush tlb entry
	dsb
	read/write new page

to ensure visibility of the new page.

I suspect that the ARM ARM is wrong here - and the ISB is only required
if you're going to be executing from the new page.  Maybe that's already
been fixed?  Is there an update to DDI0406B?
Catalin Marinas - July 5, 2011, 10:07 a.m.
On Mon, Jul 04, 2011 at 08:58:19PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > That "However" clause is an interesting one though - it seems to imply
> > > that no barrier is required when we zero out a new page table, before
> > > we link the new page table into the higher order table.  The memory we
> > > allocated for the page table doesn't become a page table until it is
> > > linked into the page table tree. 
> > > 
> > > It also raises the question about how
> > > the system knows that a particular store is to something that's a page
> > > table and something that isn't...  Given that normal memory accesses are
> > > unordered, I think this paragraph is misleading and wrong.
> > 
> > Reordering of accesses can happen because of load speculation, store
> > reordering in the write buffer or delays on the bus outside the CPU. The
> > ARM processors do not issue stores speculatively. When a memory access
> > is issued, it checks the TLB and may perform a PTW (otherwise the
> > external bus wouldn't know the address. For explicit accesses, if the
> > PTW fails, it raises a fault (and we also need a precise abort).
> 
> I think you missed my point.  The order in which stores to normal memory
> appear is not determinable.  In the case of writeback caches, it depends
> on the cache replacement algorithm, the state of the cache and its
> allocation behaviour.
> 
> It is entirely possible that the store to the page tables _could_ appear
> in memory (and therefore visible to the MMU) before the stores to the
> new page table initializing it to zeros.

I think I got your point now. Do you mean:

1. allocate a page for pte
2. zero the page
3. write the pmd entry pointing to the pte page

In this case we need a DSB between 2 and 3 (and maybe a cache flush as
well, which we already do) otherwise we could speculatively load garbage
into the TLB.

Going back to the "however" clause - "any writes to the translation
tables are not seen by any explicit memory access that occurs in program
order before the write to the translation tables" - the important issue
here is the understanding of "seen". An explicit memory access "sees" a
write to the translation table if it uses the new translation to
calculate the physical address. But in the case above, point 3 doesn't
need to see the page zeroing at point 3, they are independent, hence the
need for a DSB.

Maybe the clause could be clarified a bit - I'll ping RG.

> For instance, lets say that the L1 writeback cache is read allocate only,
> and does not read entries from the L1 cache.
> 
> The new page table happens to contain some L1 cache lines, but the upper
> table entry is not in the L1 cache.  This means that stores to the new
> page table hit the L1 cache lines, which become dirty.  The store to the
> upper table entry bypasses the L1 cache and is immediately placed into
> the write buffer.
> 
> This means in all probability that the MMU will see the new page table
> before it is visibly initialized.

If the MMU can read the L1 cache, than we need a DSB before the write to
the upper memory (I wonder whether a DMB would do, since the MMU is just
another observer). If the MMU can't read the L1 cache, than we need a
cache clean as well.
Russell King - ARM Linux - July 5, 2011, 10:48 a.m.
On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> maintenance are no-ops. You wouldn't notice any issues if you remove
> them (you can check ID_MMFR1[31:28].

It's not transparent:

CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111

0b0010 Branch predictor requires flushing on:
	• enabling or disabling the MMU
	• writing new data to instruction locations
	• writing new mappings to the translation tables
	• any change to the TTBR0, TTBR1, or TTBCR registers without a
	  corresponding change to the FCSE ProcessID or ContextID.
Catalin Marinas - July 5, 2011, 1:54 p.m.
On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > maintenance are no-ops. You wouldn't notice any issues if you remove
> > them (you can check ID_MMFR1[31:28].
> 
> It's not transparent:
> 
> CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> 
> 0b0010 Branch predictor requires flushing on:
> 	• enabling or disabling the MMU
> 	• writing new data to instruction locations
> 	• writing new mappings to the translation tables
> 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> 	  corresponding change to the FCSE ProcessID or ContextID.

OK, it makes sense. But what's really confusing - the A8 has the same
value for ID_MMFR1 (according to the TRM) but the BTB instructions are
no-ops (can can be enabled by the ACTLR.IBE bit). I suspect the ID_MMFR1
is wrong in this case.

Anyway, as long as we stick to the ARM ARM requirements we should be OK.
Russell King - ARM Linux - July 5, 2011, 2:15 p.m.
On Tue, Jul 05, 2011 at 02:54:24PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > > maintenance are no-ops. You wouldn't notice any issues if you remove
> > > them (you can check ID_MMFR1[31:28].
> > 
> > It's not transparent:
> > 
> > CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> > 
> > 0b0010 Branch predictor requires flushing on:
> > 	• enabling or disabling the MMU
> > 	• writing new data to instruction locations
> > 	• writing new mappings to the translation tables
> > 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> > 	  corresponding change to the FCSE ProcessID or ContextID.
> 
> OK, it makes sense. But what's really confusing - the A8 has the same
> value for ID_MMFR1 (according to the TRM) but the BTB instructions are
> no-ops (can can be enabled by the ACTLR.IBE bit). I suspect the ID_MMFR1
> is wrong in this case.

Grumble.  This kind of thing worries me.

If the ID registers are wrong, then they can't be relied upon by software
making decisions about what can be skipped.  So how can we be confident
that MMFR3 23:20 != 0 really does mean that we don't need to clean the
cache when writing the page tables, and is not a result of some bug?
Catalin Marinas - July 5, 2011, 2:40 p.m.
On Tue, Jul 05, 2011 at 03:15:10PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 02:54:24PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> > > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > > > maintenance are no-ops. You wouldn't notice any issues if you remove
> > > > them (you can check ID_MMFR1[31:28].
> > > 
> > > It's not transparent:
> > > 
> > > CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> > > 
> > > 0b0010 Branch predictor requires flushing on:
> > > 	• enabling or disabling the MMU
> > > 	• writing new data to instruction locations
> > > 	• writing new mappings to the translation tables
> > > 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> > > 	  corresponding change to the FCSE ProcessID or ContextID.
> > 
> > OK, it makes sense. But what's really confusing - the A8 has the same
> > value for ID_MMFR1 (according to the TRM) but the BTB instructions are
> > no-ops (can can be enabled by the ACTLR.IBE bit). I suspect the ID_MMFR1
> > is wrong in this case.
> 
> Grumble.  This kind of thing worries me.
> 
> If the ID registers are wrong, then they can't be relied upon by software
> making decisions about what can be skipped.  

It could be because the BTB operations are configurable via an ACTLR
bit, but the ID registers are fixed values.

> So how can we be confident
> that MMFR3 23:20 != 0 really does mean that we don't need to clean the
> cache when writing the page tables, and is not a result of some bug?

Well, you can get other bugs, you can never guarantee anything. But when
they are found, there are errata workarounds. This would actually be a
very easy to detect issue.
Catalin Marinas - July 6, 2011, 3:52 p.m.
On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > > That would avoid having to needlessly destroy the cached branch information
> > > > for kernel space while looping over the page tables.  The only other place
> > > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > > 
> > > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > > but we do need to do a dsb+isb.
> > 
> > Why would we need an ISB?
> 
> Because the ARM ARM tells us that we must do a dsb+isb to ensure that
> instructions are visible to the instruction stream after a change to
> the page tables.
> 
> I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> is qutie explicit that both are required to ensure that explicit
> accesses see the effect of the TLB operation.  To me it is rather
> perverse that an ISB is required to ensure that this sequence works:
> 
> 	write page table entry
> 	clean
> 	dsb
> 	flush tlb entry
> 	dsb
> 	isb
> 	read/write new page

The same requirement can be found in latest (internal only) ARM ARM as
well. I think the reason is that some memory access already in the
pipeline (but after the TLB flush) may have sampled the state of the
page table using an old TLB entry, even though the actual memory access
will happen after the DSB.
Russell King - ARM Linux - July 6, 2011, 3:55 p.m.
On Wed, Jul 06, 2011 at 04:52:14PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > > > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > > > That would avoid having to needlessly destroy the cached branch information
> > > > > for kernel space while looping over the page tables.  The only other place
> > > > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > > > 
> > > > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > > > but we do need to do a dsb+isb.
> > > 
> > > Why would we need an ISB?
> > 
> > Because the ARM ARM tells us that we must do a dsb+isb to ensure that
> > instructions are visible to the instruction stream after a change to
> > the page tables.
> > 
> > I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> > is qutie explicit that both are required to ensure that explicit
> > accesses see the effect of the TLB operation.  To me it is rather
> > perverse that an ISB is required to ensure that this sequence works:
> > 
> > 	write page table entry
> > 	clean
> > 	dsb
> > 	flush tlb entry
> > 	dsb
> > 	isb
> > 	read/write new page
> 
> The same requirement can be found in latest (internal only) ARM ARM as
> well. I think the reason is that some memory access already in the
> pipeline (but after the TLB flush) may have sampled the state of the
> page table using an old TLB entry, even though the actual memory access
> will happen after the DSB.

It would be useful to have a statement from RG on this.  Could you
pass this to him please?

Thanks.
Catalin Marinas - July 6, 2011, 4:05 p.m.
On Tue, Jul 05, 2011 at 12:20:19AM +0100, Russell King - ARM Linux wrote:
> Please can you also check whether BTC invalidation is required when a
> page is removed from the page tables for the purpose of swapping it
> out (so the PTE entry becomes zero).

Not required.

> Also, is BTC invalidation required if the very same page that was
> there before is later restored without any modification to the
> instruction content of that page.  (I'm thinking page is aged to old,
> and so unmapped, then a prefetch abort which reinstates the same page
> that was there previously.)

Not required.

> Finally, is BTC invalidation required if a different physical page
> containing the same instruction content as before is placed at that
> location?

Required (see the comment below).

> Lastly, please check whether population and removal of page table
> entries for NX pages require BTC invalidation.  I expect not.
 
Not required.

Basically the rules for BTC are similar to those for the I-cache, but it
can be either ASID-tagged VIVT or VIPT, independent of the type of the
I-cache. I think current implementations of the BTC are ASID-tagged VIVT
but the architecture doesn't mandate this.

So the mapping of a data page doesn't require BTC invalidation, even if
the page is not marked as XN.

Having a branch between BTC and ISB is OK as long as we don't change the
mapping of the branch.

Mapping a page in a previously unmapped location doesn't generally
require BTC invalidation if we guarantee that there are no BTC entries
for that location (IOW, we got a new ASID or unmapping the page
previously had a BTC invalidation).
Catalin Marinas - July 6, 2011, 4:15 p.m.
On Wed, Jul 06, 2011 at 04:55:25PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 06, 2011 at 04:52:14PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> > > I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> > > is qutie explicit that both are required to ensure that explicit
> > > accesses see the effect of the TLB operation.  To me it is rather
> > > perverse that an ISB is required to ensure that this sequence works:
> > > 
> > > 	write page table entry
> > > 	clean
> > > 	dsb
> > > 	flush tlb entry
> > > 	dsb
> > > 	isb
> > > 	read/write new page
> > 
> > The same requirement can be found in latest (internal only) ARM ARM as
> > well. I think the reason is that some memory access already in the
> > pipeline (but after the TLB flush) may have sampled the state of the
> > page table using an old TLB entry, even though the actual memory access
> > will happen after the DSB.
> 
> It would be useful to have a statement from RG on this.  Could you
> pass this to him please?

Just checked - it is required as per the ARM ARM and there are cores
that rely on the ISB. As I said, it depends on the pipeline
implementation.
Russell King - ARM Linux - July 6, 2011, 6:08 p.m.
On Wed, Jul 06, 2011 at 05:05:51PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 12:20:19AM +0100, Russell King - ARM Linux wrote:
> > Please can you also check whether BTC invalidation is required when a
> > page is removed from the page tables for the purpose of swapping it
> > out (so the PTE entry becomes zero).
> 
> Not required.
> 
> > Also, is BTC invalidation required if the very same page that was
> > there before is later restored without any modification to the
> > instruction content of that page.  (I'm thinking page is aged to old,
> > and so unmapped, then a prefetch abort which reinstates the same page
> > that was there previously.)
> 
> Not required.
> 
> > Finally, is BTC invalidation required if a different physical page
> > containing the same instruction content as before is placed at that
> > location?
> 
> Required (see the comment below).
> 
> > Lastly, please check whether population and removal of page table
> > entries for NX pages require BTC invalidation.  I expect not.
>  
> Not required.
> 
> Basically the rules for BTC are similar to those for the I-cache, but it
> can be either ASID-tagged VIVT or VIPT, independent of the type of the
> I-cache. I think current implementations of the BTC are ASID-tagged VIVT
> but the architecture doesn't mandate this.
> 
> So the mapping of a data page doesn't require BTC invalidation, even if
> the page is not marked as XN.
> 
> Having a branch between BTC and ISB is OK as long as we don't change the
> mapping of the branch.
> 
> Mapping a page in a previously unmapped location doesn't generally
> require BTC invalidation if we guarantee that there are no BTC entries
> for that location (IOW, we got a new ASID or unmapping the page
> previously had a BTC invalidation).

Okay, so I can say with confidence then that how things stand in my tree,
which limits BTC invalidation to:

1. For kernel mappings, flush_icache_range() which must be called prior
   to code placed in them being executed.

2. For user mappings, __sync_icache_dcache's icache flush, which is
   called before a non-zero user PTE is inserted.

together covers all the necessary points.

The area which needs more to focus some further work is
__sync_icache_dcache(), which is fairly over-zealous about all the
flushing.  In a boot of the Versatile Express platform, it inserts
193k PTEs.  Of that, 192k are for userspace.  For userspace:

87k are for new executable pages, 87 are for executable pages with
their attributes (eg, young-ness/read-only ness) changed.

54k are for new non-executable pages, 51k are for non-executable pages
with attributes changed.

__sync_icache_dcache() handles these userspace PTEs as follows:

0 satisfy the !pte_present_user() test.

105k are ignored due to the VIPT non-aliasing cache and the PTE being
non-executable.

0 have invalid PFNs.

964 result in a Dcache flush.  This number does not increment through
repeated monitoring of the stats via 'cat'.

87k result in an Icache invalidate with BTC invalidation.

Around 130 (of 193k) calls into this code can be saved by checking
whether the PFN has changed and the state of the young bit (no point
doing any flushing if the page is marked old as it won't be accessible
to the user in that state.)

130 may not sound like many, but that's just from a boot.  Over time,
pages will be aged, which will push that number up.  We really ought not
flush the I-cache and BTC just because we marked the page old, and then
again when it is marked young.

That's also something which won't really show up in benchmarks, as it
depends on pushing virtual memory out of the system which is not
something which benchmarks are particularly good at measuring.  It's
effect depends on overall system behaviour depends on the system setup
and memory pressure.
Russell King - ARM Linux - July 11, 2011, 5:01 p.m.
On Mon, Jul 11, 2011 at 05:49:20PM +0100, Catalin Marinas wrote:
> On Wed, Jul 06, 2011 at 07:08:14PM +0100, Russell King - ARM Linux wrote:
> > Okay, so I can say with confidence then that how things stand in my tree,
> > which limits BTC invalidation to:
> > 
> > 1. For kernel mappings, flush_icache_range() which must be called prior
> >    to code placed in them being executed.
> > 
> > 2. For user mappings, __sync_icache_dcache's icache flush, which is
> >    called before a non-zero user PTE is inserted.
> 
> What about:
> 
> flush_cache_user_range()
> flush_ptrace_access()
> 
> They are fine as long as you haven't removed the BTC invalidation from
> __cpuc_coherent_(kern|user)_range.

That's the basis of flush_icache_range(), so that still has the BTC
invalidate.

> > The area which needs more to focus some further work is
> > __sync_icache_dcache(), which is fairly over-zealous about all the
> > flushing.
> 
> Another thing that could be optimised is not to clean and invalidate the
> D-cache but only clean to the PoU. The only problem is that
> (flush|invalidate)_kernel_vmap_area, functions that seem to used only in
> a single place. The semantics in cachetlb.txt claim to be used for I/O,
> which means that they are already broken since we don't handle the L2
> cache.

Those are newly introduced to cope with XFS wanting DMA to vmap'd areas
to work.  They're there to handle the vmalloc-space alias of the pages.
The DMA API sorts out the kernel direct-mapped plus L2 for non-virtually
tagged L2 caches.

So they're just an additional pre-flush and post-invalidate calls around
the DMA API to cope with vmalloc space aliases.  So I don't think they're
broken.
Catalin Marinas - July 12, 2011, 1:09 p.m.
On Mon, Jul 11, 2011 at 06:01:41PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 11, 2011 at 05:49:20PM +0100, Catalin Marinas wrote:
> > On Wed, Jul 06, 2011 at 07:08:14PM +0100, Russell King - ARM Linux wrote:
> > > The area which needs more to focus some further work is
> > > __sync_icache_dcache(), which is fairly over-zealous about all the
> > > flushing.
> > 
> > Another thing that could be optimised is not to clean and invalidate the
> > D-cache but only clean to the PoU. The only problem is that
> > (flush|invalidate)_kernel_vmap_area, functions that seem to used only in
> > a single place. The semantics in cachetlb.txt claim to be used for I/O,
> > which means that they are already broken since we don't handle the L2
> > cache.
> 
> Those are newly introduced to cope with XFS wanting DMA to vmap'd areas
> to work.  They're there to handle the vmalloc-space alias of the pages.
> The DMA API sorts out the kernel direct-mapped plus L2 for non-virtually
> tagged L2 caches.
> 
> So they're just an additional pre-flush and post-invalidate calls around
> the DMA API to cope with vmalloc space aliases.  So I don't think they're
> broken.

OK, so in this case these functions need to go to the point of
coherency. We could also optimise them to do pre-cleaning and
post-invalidation rather than always clean&invalidate.

Can we not use dmac_flush_range() (or dma_clean_range and dma_inv_range
via dmac_(map|unmap)_area) instead of __cpuc_flush_dcache_area?
Russell King - ARM Linux - July 15, 2011, 4:24 p.m.
On Tue, Jul 12, 2011 at 02:09:07PM +0100, Catalin Marinas wrote:
> On Mon, Jul 11, 2011 at 06:01:41PM +0100, Russell King - ARM Linux wrote:
> > On Mon, Jul 11, 2011 at 05:49:20PM +0100, Catalin Marinas wrote:
> > > On Wed, Jul 06, 2011 at 07:08:14PM +0100, Russell King - ARM Linux wrote:
> > > > The area which needs more to focus some further work is
> > > > __sync_icache_dcache(), which is fairly over-zealous about all the
> > > > flushing.
> > > 
> > > Another thing that could be optimised is not to clean and invalidate the
> > > D-cache but only clean to the PoU. The only problem is that
> > > (flush|invalidate)_kernel_vmap_area, functions that seem to used only in
> > > a single place. The semantics in cachetlb.txt claim to be used for I/O,
> > > which means that they are already broken since we don't handle the L2
> > > cache.
> > 
> > Those are newly introduced to cope with XFS wanting DMA to vmap'd areas
> > to work.  They're there to handle the vmalloc-space alias of the pages.
> > The DMA API sorts out the kernel direct-mapped plus L2 for non-virtually
> > tagged L2 caches.
> > 
> > So they're just an additional pre-flush and post-invalidate calls around
> > the DMA API to cope with vmalloc space aliases.  So I don't think they're
> > broken.
> 
> OK, so in this case these functions need to go to the point of
> coherency. We could also optimise them to do pre-cleaning and
> post-invalidation rather than always clean&invalidate.
> 
> Can we not use dmac_flush_range() (or dma_clean_range and dma_inv_range
> via dmac_(map|unmap)_area) instead of __cpuc_flush_dcache_area?

We got rid of the clean and invalidate interfaces because they weren't
suitable for cross-CPU cache handling (they were defined by implementation
rather than purpose.)

We don't have enough information here to be able to call the map/unmap
functions (the DMA direction would be a nonsense) so I think these
should be new callbacks into the CPU cache code, and we do whatever
is necessary in the low level stuff, rather than trying to overload
existing functions.

Patch

diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index 8013afc..fc5b36f 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -166,7 +166,9 @@  ENTRY(cpu_v7_set_pte_ext)
  ARM(	str	r3, [r0, #2048]! )
  THUMB(	add	r0, r0, #2048 )
  THUMB(	str	r3, [r0] )
-	mcr	p15, 0, r0, c7, c10, 1		@ flush_pte
+	mrc	p15, 0, r3, c0, c1, 7		@ read ID_MMFR3
+	tst	r3, #0xf << 20			@ check the coherent walk bits
+	mcreq	p15, 0, r0, c7, c10, 1		@ flush_pte
 #endif
 	mov	pc, lr
 ENDPROC(cpu_v7_set_pte_ext)