[V3] arm64: Don't flush tlb while clearing the accessed bit

Message ID 1540805158-618-1-git-send-email-amhetre@nvidia.com
Commit Message

Ashish Mhetre Oct. 29, 2018, 9:25 a.m.
From: Alex Van Brunt <avanbrunt@nvidia.com>

The accessed bit is used to age a page, and the generic implementation
flushes the TLB when clearing it.
That flush is pure overhead on ARM64: translation table entries that would
cause an access flag fault are not cached in the TLB, so no stale entry can
linger and flushing the TLB is not necessary here. Clearing the accessed bit
without flushing the TLB does not cause data corruption on ARM64.
In our tests with this patch, the speed of reading from a fast NVMe SSD
through PCIe improved by 10% ~ 15% and writing improved by 20% ~ 40%.
So, as a performance optimisation, don't flush the TLB when clearing the
accessed bit on ARM64.
x86 made the same optimization even though their TLB invalidate is much
faster as it doesn't broadcast to other CPUs.
Please refer to:
'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
the accessed bit instead of flushing the TLB")'

Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
---
 arch/arm64/include/asm/pgtable.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Jon Hunter Oct. 29, 2018, 9:57 a.m. | #1
On 29/10/2018 09:25, Ashish Mhetre wrote:
> From: Alex Van Brunt <avanbrunt@nvidia.com>
> 
> Accessed bit is used to age a page and in generic implementation there is
> flush_tlb while clearing the accessed bit.
> Flushing a TLB is overhead on ARM64 as access flag faults don't get
> translation table entries cached into TLB's. Flushing TLB is not necessary
> for this. Clearing the accessed bit without flushing TLB doesn't cause data
> corruption on ARM64.
> In our case with this patch, speed of reading from fast NVMe/SSD through
> PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
> So for performance optimisation don't flush TLB when clearing the accessed
> bit on ARM64.
> x86 made the same optimization even though their TLB invalidate is much
> faster as it doesn't broadcast to other CPUs.
> Please refer to:
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
> 
> Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
> ---

Please make sure you state here, below the '---' line above, what has
changed between each version of the patch.

Thanks
Jon
Will Deacon Oct. 29, 2018, 10:55 a.m. | #2
On Mon, Oct 29, 2018 at 02:55:58PM +0530, Ashish Mhetre wrote:
> From: Alex Van Brunt <avanbrunt@nvidia.com>
> 
> Accessed bit is used to age a page and in generic implementation there is
> flush_tlb while clearing the accessed bit.
> Flushing a TLB is overhead on ARM64 as access flag faults don't get
> translation table entries cached into TLB's. Flushing TLB is not necessary
> for this. Clearing the accessed bit without flushing TLB doesn't cause data
> corruption on ARM64.
> In our case with this patch, speed of reading from fast NVMe/SSD through
> PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
> So for performance optimisation don't flush TLB when clearing the accessed
> bit on ARM64.
> x86 made the same optimization even though their TLB invalidate is much
> faster as it doesn't broadcast to other CPUs.

Ok, but they may end up using IPIs, so let's avoid these vague performance
claims in the log unless they're backed up with numbers.

> Please refer to:
> 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
> 
> Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2ab2031..080d842 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -652,6 +652,26 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  	return __ptep_test_and_clear_young(ptep);
>  }
>  
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> +					 unsigned long address, pte_t *ptep)
> +{
> +	/*
> +	 * On ARM64 CPUs, clearing the accessed bit without a TLB flush
> +	 * doesn't cause data corruption. [ It could cause incorrect
> +	 * page aging and the (mistaken) reclaim of hot pages, but the
> +	 * chance of that should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]

This is blindly copied from x86 and isn't true for us: we don't invalidate
the TLB on context switch. That means our window for keeping the stale
entries around is potentially much bigger and might not be a great idea.

If we roll a TLB invalidation routine without the trailing DSB, what sort of
performance does that get you?

Will
Alexander Van Brunt Oct. 29, 2018, 3:13 p.m. | #3
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?

We have been doing our testing on our Carmel CPUs. Carmel will effectively
ignore a TLB invalidate that doesn't have a DSB (until the invalidate buffer
overflows). So, I expect the performance to be the same as with no TLB
invalidate, but it will not be representative of other ARMv8 CPUs.


Will Deacon Oct. 30, 2018, 11:50 a.m. | #4
[Sorry to be "that person" but please can you use plain text for your mail?
 This is getting really hard to follow.]

On Tue, Oct 30, 2018 at 11:17:34AM +0530, Ashish Mhetre wrote:
> On 29/10/18 4:25 PM, Will Deacon wrote:
>     On Mon, Oct 29, 2018 at 02:55:58PM +0530, Ashish Mhetre wrote:
>         From: Alex Van Brunt <avanbrunt@nvidia.com>
> 
>         Accessed bit is used to age a page and in generic implementation there is
>         flush_tlb while clearing the accessed bit.
>         Flushing a TLB is overhead on ARM64 as access flag faults don't get
>         translation table entries cached into TLB's. Flushing TLB is not necessary
>         for this. Clearing the accessed bit without flushing TLB doesn't cause data
>         corruption on ARM64.
>         In our case with this patch, speed of reading from fast NVMe/SSD through
>         PCIe got improved by 10% ~ 15% and writing got improved by 20% ~ 40%.
>         So for performance optimisation don't flush TLB when clearing the accessed
>         bit on ARM64.
>         x86 made the same optimization even though their TLB invalidate is much
>         faster as it doesn't broadcast to other CPUs.
> 
>     Ok, but they may end up using IPIs so lets avoid these vague performance
>     claims in the log unless they're backed up with numbers.
> 
> By numbers do you mean the actual benchmark values?

What I mean is, if we're going to claim that x86 TLB invalidation "is much
faster" than arm64, I'd prefer that there was some science behind it.
However, I think in this case it's not even relevant, so we can just rewrite
the commit message.

How about the patch below -- does that work for you?

Will

--->8

From 1443d2dcfd66563127aa1b13d05eac7cd9fd8445 Mon Sep 17 00:00:00 2001
From: Alex Van Brunt <avanbrunt@nvidia.com>
Date: Mon, 29 Oct 2018 14:55:58 +0530
Subject: [PATCH] arm64: mm: Don't wait for completion of TLB invalidation when
 page aging

When transitioning a PTE from young to old as part of page aging, we
can avoid waiting for the TLB invalidation to complete and therefore
drop the subsequent DSB instruction. Whilst this opens up a race with
page reclaim, where a PTE in active use via a stale, young TLB entry
does not update the underlying descriptor, the worst thing that happens
is that the page is reclaimed and then immediately faulted back in.

Given that we have a DSB in our context-switch path, the window for a
spurious reclaim is fairly limited and eliding the barrier claims to
boost NVMe/SSD accesses by over 10% on some platforms.

A similar optimisation was made for x86 in commit b13b1d2d8692 ("x86/mm:
In the PTE swapout page reclaim case clear the accessed bit instead of
flushing the TLB").

Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
[will: rewrote patch]
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/pgtable.h  | 22 ++++++++++++++++++++++
 arch/arm64/include/asm/tlbflush.h | 11 +++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 50b1ef8584c0..5bbb59c81920 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -22,6 +22,7 @@
 #include <asm/memory.h>
 #include <asm/pgtable-hwdef.h>
 #include <asm/pgtable-prot.h>
+#include <asm/tlbflush.h>
 
 /*
  * VMALLOC range.
@@ -685,6 +686,27 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return __ptep_test_and_clear_young(ptep);
 }
 
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+					 unsigned long address, pte_t *ptep)
+{
+	int young = ptep_test_and_clear_young(vma, address, ptep);
+
+	if (young) {
+		/*
+		 * We can elide the trailing DSB here since the worst that can
+		 * happen is that a CPU continues to use the young entry in its
+		 * TLB and we mistakenly reclaim the associated page. The
+		 * window for such an event is bounded by the next
+		 * context-switch, which provides a DSB to complete the TLB
+		 * invalidation.
+		 */
+		flush_tlb_page_nosync(vma, address);
+	}
+
+	return young;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index c3c0387aee18..a629a4067aae 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -21,6 +21,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/mm_types.h>
 #include <linux/sched.h>
 #include <asm/cputype.h>
 #include <asm/mmu.h>
@@ -164,14 +165,20 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page(struct vm_area_struct *vma,
-				  unsigned long uaddr)
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
 {
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
 
 	dsb(ishst);
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+}
+
+static inline void flush_tlb_page(struct vm_area_struct *vma,
+				  unsigned long uaddr)
+{
+	flush_tlb_page_nosync(vma, uaddr);
 	dsb(ish);
 }
Alexander Van Brunt Dec. 3, 2018, 9:20 p.m. | #5
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?

It is not as good. In some cases, it is really bad. Skipping the invalidate was
the fastest and most consistent implementation.

Methodology:

We ran 6 tests on Jetson Xavier with three different implementations of
ptep_clear_flush_young: the existing version that does a TLB invalidate and a
DSB, our proposal to skip the TLB invalidate, and Will's suggestion to just skip
the DSB. The 6 tests are read and write versions of sequential access, random
access, and alternating between a fixed page and a random page. We ran each of
the (6 tests) * (3 configs) 31 times and measured the execution time.

The Jetson Xavier platform has 8 Carmel CPUs, 16 GB of DRAM, and an NVMe hard
drive. Carmel CPUs have a unique feature where they batch up TLB invalidates
until either the very large buffer overflows or it executes a DSB.

Below we report statistically significant (p < .01) differences in the mean
execution time or the variation in execution time. There are 36 comparisons
tested. Because of that, there is a 50% chance that at least one of the 36
comparisons would have a p <= 1 / 36. p = .01 should make false positives
unlikely.

Sequential Reads:

Executing a TLB invalidate but skipping the DSB had 3.5x more variation than an
invalidate and a DSB, and it had 12.3x more run-to-run variation than skipping
the TLB invalidate and the DSB. The run-to-run variation when skipping the DSB
was 38% of the execution time. This is likely because of Carmel's behaviour of
batching up TLB invalidates until it executes a DSB, combined with the need to
wait for the other 7 cores to complete the invalidate.

Skipping the TLB invalidate was 8% faster than executing an invalidate and a
DSB. It also had 3.5x less run-to-run variation. Because the run-to-run
variation with the implementation that executed a TLB invalidate but not a DSB
was so much higher, its execution time could not be estimated with enough
precision to say that it is statistically different from the other two
implementations.

Random Reads:

Executing a TLB invalidate but not a DSB was the fastest and had less run-to-run
variation than either of the other implementations. It is 8% faster and has ~3x
lower run-to-run variation than either alternative. The run-to-run variation
when skipping the DSB was 1.5% of the overall execution time.

Skipping the TLB invalidate was not statistically different from the existing
implementation that does a TLB invalidate and a DSB.

Alternating Random and Hot Page Reads:

In this test, executing a TLB invalidate but not a DSB was the fastest. It was
12% faster than an invalidate and a DSB. It was 9% faster than executing no
invalidate. Similarly, skipping the invalidate was 4% faster than executing an
invalidate and a DSB (9% + 4% != 12% because of rounding).

The run-to-run variation was the lowest when executing an invalidate and a DSB.
Its variation was 1% of the execution time. That is 64% of the variation when
skipping the DSB and 22% of the variation when executing no TLB invalidate
(which was 5% of the execution time).

This test was meant to measure the effects of a TLB not being updated with the
newly young PTE in memory. When no TLB invalidate is ever executed, the kernel
almost never gets a chance to take a page fault because the access flag is
clear. Executing an invalidate but not a DSB probably results in the TLB
usually being updated with the PTE value before the page falls off the LRU
list. So, it makes sense that skipping the DSB is the fastest. The cases where
the hot page was erroneously evicted are likely the reason why the variation
increases with looser TLB invalidate implementations.

Sequential Writes:

There were no statistically significant results in this test. That is likely
because IO was limiting the write speed. Also, the write tests had much more
run-to-run variation (about 10% of the execution time) than the read tests.

For openness, the existing implementation that executes an invalidate and DSB
was faster by 8% but didn't quite meet the requirements to be statistically
significant. Its p-value was .014. Since that is less than 1 / 36 = .028, it is
unlikely to be coincidental. But every other result reported has a p < .004.

Random Writes:

Skipping the invalidate was the fastest. It was 51% faster than executing an
invalidate and a DSB. It was 38% faster than executing an invalidate but not a
DSB. The run-to-run variations were not statistically different.

Alternating Random and Hot Page Writes:

Similar to random writes, skipping the invalidate was the fastest. It was 46%
faster than executing an invalidate and a DSB and 45% faster than executing an
invalidate without a DSB. The run-to-run variations were not statistically
different.

Conclusion:

There were no statistically significant results where executing a TLB invalidate
and a DSB was fastest. Except for the sequential write case where there were no
significant results, it was slower by 8-50% than the alternatives.

Executing a TLB invalidate but not a DSB was faster than not executing a TLB
invalidate in the two random read cases by about 8%. However, skipping the
invalidate was faster in the random write tests by about 40%.

The existing implementation that executes an invalidate and a DSB had 3-4x less
run-to-run variation than the alternatives in the one hot page read test. That
is the strongest reason to continue fully invalidating TLBs. However, at worst
its variation was 5% of the execution time. I think that going from 1% to 5% on
this test is more than made up for by reducing the variation in the sequential
read test from 12% to 4% by skipping the invalidates altogether.

Because these are microbenchmarks that represent small parts of real
applications, I think that we should use the worst case run-to-run variation to
choose the implementation that has the least variation. Using that metric,
skipping the invalidate has a worst case of 5% (random read), skipping just the
DSB has a worst case of 38% (sequential read), and executing an invalidate and a
DSB has a worst case of 12% (sequential read).
Will Deacon Dec. 6, 2018, 7:18 p.m. | #6
Hi Alex,

Thanks for running these tests and providing the in-depth analysis.

On Mon, Dec 03, 2018 at 09:20:25PM +0000, Alexander Van Brunt wrote:
> > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > performance does that get you?
> 
> It is not as good. In some cases, it is really bad. Skipping the invalidate was
> the most consistent and fast implementation.

My problem with that is it's not really much different to just skipping the
page table update entirely. Skipping the DSB is closer to what is done on
x86, where we bound the stale entry time to the next context-switch.

Given that I already queued the version without the DSB, we have the choice
to either continue with that or to revert it and go back to the previous
behaviour. Which would you prefer?

Will
Alexander Van Brunt Dec. 6, 2018, 8:42 p.m. | #7
> > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > performance does that get you?
> > 
> > It is not as good. In some cases, it is really bad. Skipping the invalidate was
> > the most consistent and fast implementation.

> My problem with that is it's not really much different to just skipping the
> page table update entirely. Skipping the DSB is closer to what is done on
> x86, where we bound the stale entry time to the next context-switch.

Which of the three implementations is the "that" and "it" in the first sentence?

> Given that I already queued the version without the DSB, we have the choice
> to either continue with that or to revert it and go back to the previous
> behaviour. Which would you prefer?

To me, skipping the DSB is a win over doing the invalidate and the DSB because
it is faster on average.

DSBs have a big impact on the performance of other CPUs in the inner shareable
domain because of the ordering requirements. For example, we have observed
Cortex A57s stalling all CPUs in the cluster until Device accesses complete.

Would you be open to a patch on top of the DSB skipping patch that skips the
whole invalidate?
Will Deacon Dec. 7, 2018, 5:53 p.m. | #8
On Thu, Dec 06, 2018 at 08:42:03PM +0000, Alexander Van Brunt wrote:
> > > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > > performance does that get you?
> > > 
> > > It is not as good. In some cases, it is really bad. Skipping the invalidate was
> > > the most consistent and fast implementation.
> 
> > My problem with that is it's not really much different to just skipping the
> > page table update entirely. Skipping the DSB is closer to what is done on
> > x86, where we bound the stale entry time to the next context-switch.
> 
> Which of the three implementations is the "that" and "it" in the first sentence?

that = it = skipping the whole invalidation + the DSB

> > Given that I already queued the version without the DSB, we have the choice
> > to either continue with that or to revert it and go back to the previous
> > behaviour. Which would you prefer?
> 
> To me, skipping the DSB is a win over doing the invalidate and the DSB because
> it is faster on average.
> 
> DSBs have a big impact on the performance of other CPUs in the inner shareable
> domain because of the ordering requirements. For example, we have observed
> Cortex A57s stalling all CPUs in the cluster until Device accesses complete.
> 
> Would you be open to a patch on top of the DSB skipping patch that skips the
> whole invalidate?

I don't think so; we don't have an upper bound on how long we'll have a
stale TLB entry if we remove the invalidation completely.

Will
Alexander Van Brunt Dec. 8, 2018, 12:05 a.m. | #9
> > > My problem with that is it's not really much different to just skipping the
> > > page table update entirely. Skipping the DSB is closer to what is done on
> > > x86, where we bound the stale entry time to the next context-switch.
> >  
> > Which of the three implementations is the "that" and "it" in the first sentence?
>
> that = it = skipping the whole invalidation + the DSB

The TLB is tiny compared to the size of the inactive list. Somehow a TLB entry
has to avoid eviction for the page's entire life on the inactive list. That is
not an easy feat except for the hottest of pages.

If there is a context switch, most of the original thread's TLB entries will be
evicted because TLBs have a hard time holding two threads' working sets. So, in
practice, that is almost the same as the x86 guarantee.

The worst case cannot have a large impact because the maximum number of pages
that will not have the TLB evicted is the number of pages in the TLB. For
example, a 1024 entry TLB can at worst result in 4 MB of pages erroneously
reclaimed. That is not bad on a system with 4+ GB of memory.

We did benchmark the extreme case where half the pages accessed were not
evicted from the TLB. In the read case, skipping the DSB was ~10% faster than
skipping the invalidate or doing the invalidate and the DSB.

Compared to the improvement in the average performance and variability in the
other cases we tested, the 10% loss in a carefully crafted test is not as
important.

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2ab2031..080d842 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -652,6 +652,26 @@  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return __ptep_test_and_clear_young(ptep);
 }
 
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+					 unsigned long address, pte_t *ptep)
+{
+	/*
+	 * On ARM64 CPUs, clearing the accessed bit without a TLB flush
+	 * doesn't cause data corruption. [ It could cause incorrect
+	 * page aging and the (mistaken) reclaim of hot pages, but the
+	 * chance of that should be relatively low. ]
+	 *
+	 * So as a performance optimization don't flush the TLB when
+	 * clearing the accessed bit, it will eventually be flushed by
+	 * a context switch or a VM operation anyway. [ In the rare
+	 * event of it not getting flushed for a long time the delay
+	 * shouldn't really matter because there's no real memory
+	 * pressure for swapout to react to. ]
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,