diff mbox series

[2/3] mm/cow: optimise pte dirty/accessed bits handling in fork

Message ID 20180828112034.30875-3-npiggin@gmail.com (mailing list archive)
State Superseded
Series mm: dirty/accessed pte optimisations

Checks

Context                        Check    Description
snowpatch_ozlabs/apply_patch   success  next/apply_patch Successfully applied
snowpatch_ozlabs/checkpatch    success  Test checkpatch on branch next

Commit Message

Nicholas Piggin Aug. 28, 2018, 11:20 a.m. UTC
fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes when
it may have been quite important. Today with physical based pte
scanning, there is less reason to clear these bits. Dirty bits are all
tested and cleared together and any dirty bit is the same as many
dirty bits. Any young bit is treated similarly to many young bits, but
not quite the same. A comment has been added where there is some
difference.

This eliminates a major source of faults powerpc/radix requires to set
dirty/accessed bits in ptes, speeding up a fork/exit microbenchmark by
about 5% on POWER9 (16600 -> 17500 fork/execs per second).

Skylake appears to have a micro-fault overhead too. The test allocates
4GB of anonymous memory, reads each page, then forks, and times the
child reading a byte from each page. The first pass over the pages
takes about 1000 cycles per page, the second pass takes about 27
cycles (TLB miss). With no additional minor faults measured due to
either child pass, and the page array well exceeding TLB capacity, the
large cost must come from micro-faults taken to set the accessed bit.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/huge_memory.c |  2 --
 mm/memory.c      | 10 +++++-----
 mm/vmscan.c      |  8 ++++++++
 3 files changed, 13 insertions(+), 7 deletions(-)
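
The Skylake test described in the commit message could be reproduced
along the following lines. This is only a sketch, not the author's
actual benchmark. The names fork_read_bench and read_pass, the pipe used
to report results, and the use of wall-clock nanoseconds rather than
cycles are all assumptions. It also assumes the parent writes every page
before reading it, so that real pages are mapped before fork, and it
asks for no transparent huge pages on a best-effort basis. The original
message states neither.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Read one byte from every page; return the average time per page in ns. */
static double read_pass(volatile unsigned char *buf, size_t len, size_t page)
{
	unsigned long sum = 0;
	double t0 = now_ns();

	for (size_t off = 0; off < len; off += page)
		sum += buf[off];
	(void)sum;
	return (now_ns() - t0) / (double)(len / page);
}

/*
 * Hypothetical helper: map len bytes (a multiple of the page size), touch
 * them in the parent, fork, and time two read passes in the child.
 * Returns 0 on success and fills in the per-page cost of each child pass.
 */
int fork_read_bench(size_t len, double *first_ns, double *second_ns)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	double res[2];
	int fds[2], status;
	ssize_t n;
	pid_t pid;

	unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
#ifdef MADV_NOHUGEPAGE
	(void)madvise(buf, len, MADV_NOHUGEPAGE); /* best effort only */
#endif
	memset(buf, 1, len);              /* assumption: populate real pages */
	(void)read_pass(buf, len, page);  /* parent reads each page */

	if (pipe(fds) < 0) {
		munmap(buf, len);
		return -1;
	}
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: the first pass may pay for setting accessed bits. */
		close(fds[0]);
		res[0] = read_pass(buf, len, page);
		res[1] = read_pass(buf, len, page);
		n = write(fds[1], res, sizeof(res));
		_exit(n == (ssize_t)sizeof(res) ? 0 : 1);
	}
	close(fds[1]);
	n = read(fds[0], res, sizeof(res));
	close(fds[0]);
	waitpid(pid, &status, 0);
	munmap(buf, len);
	if (n != (ssize_t)sizeof(res) || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0)
		return -1;
	*first_ns = res[0];
	*second_ns = res[1];
	return 0;
}
```

Calling fork_read_bench(4UL << 30, &first, &second) on a 64-bit system
matches the 4GB size from the commit message. A smaller length is enough
for a quick check, but the mapping should still exceed TLB reach if the
second pass is meant to show TLB-miss cost.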

Comments

Linus Torvalds Aug. 29, 2018, 3:42 p.m. UTC | #1
On Tue, Aug 28, 2018 at 4:20 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> fork clears dirty/accessed bits from new ptes in the child. This logic
> has existed since mapped page reclaim was done by scanning ptes when
> it may have been quite important. Today with physical based pte
> scanning, there is less reason to clear these bits.

Can you humor me, and make the dirty/accessed bit patches separate?

There is actually a difference wrt the dirty bit: if we unmap an area
with dirty pages, we have to do the special synchronous flush.

So a clean page in the virtual mapping is _literally_ cheaper to have.

> This eliminates a major source of faults powerpc/radix requires to set
> dirty/accessed bits in ptes, speeding up a fork/exit microbenchmark by
> about 5% on POWER9 (16600 -> 17500 fork/execs per second).

I don't think the dirty bit matters.

The accessed bit I think may be worth keeping, so by all means remove the mkold.

                  Linus

Nicholas Piggin Aug. 29, 2018, 11:12 p.m. UTC | #2
On Wed, 29 Aug 2018 08:42:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Aug 28, 2018 at 4:20 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > fork clears dirty/accessed bits from new ptes in the child. This logic
> > has existed since mapped page reclaim was done by scanning ptes when
> > it may have been quite important. Today with physical based pte
> > scanning, there is less reason to clear these bits.  
> 
> Can you humor me, and make the dirty/accessed bit patches separate?

Yeah sure.

> There is actually a difference wrt the dirty bit: if we unmap an area
> with dirty pages, we have to do the special synchronous flush.
> 
> So a clean page in the virtual mapping is _literally_ cheaper to have.

Oh yeah true, that blasted thing. Good point.

Dirty micro fault seems to be the big one for my Skylake, takes 300
nanoseconds per access. Accessed takes about 100. (I think, have to
go over my benchmark a bit more carefully and re-test).

Dirty will happen less often though, particularly as most places we
do write to (stack, heap, etc) will be write protected for COW anyway,
I think. Worst case might be a big shared shm segment like a database
buffer cache, but those kinds of forks should happen very
infrequently, I would hope.

Yes maybe we can do that. I'll split them up and try to get some
numbers for them individually.

Thanks,
Nick

Linus Torvalds Aug. 29, 2018, 11:15 p.m. UTC | #3
On Wed, Aug 29, 2018 at 4:12 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Dirty micro fault seems to be the big one for my Skylake, takes 300
> nanoseconds per access. Accessed takes about 100. (I think, have to
> go over my benchmark a bit more carefully and re-test).

Yeah, but they only happen for shared areas after fork, which sounds
like it shouldn't be a big deal in most cases.

And I'm not entirely objecting to your patch per se, I just would want
to keep the accessed bit changes separate from the dirty bit ones.

*If* somebody has bisectable issues with it (performance or not), it
will then be clearer what the exact issue is.

            Linus

Nicholas Piggin Aug. 29, 2018, 11:57 p.m. UTC | #4
On Wed, 29 Aug 2018 16:15:37 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Aug 29, 2018 at 4:12 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > Dirty micro fault seems to be the big one for my Skylake, takes 300
> > nanoseconds per access. Accessed takes about 100. (I think, have to
> > go over my benchmark a bit more carefully and re-test).  
> 
> Yeah, but they only happen for shared areas after fork, which sounds
> like it shouldn't be a big deal in most cases.

You might be right there.

> 
> And I'm not entirely objecting to your patch per se, I just would want
> to keep the accessed bit changes separate from the dirty bit ones.
> 
> *If* somebody has bisectable issues with it (performance or not), it
> will then be clearer what the exact issue is.

Yeah that makes a lot of sense. I'll do a bit more testing and send
Andrew a respin at least with those split (and a good comment for
the dirty bit vs unmap handling that you pointed out).

Thanks,
Nick

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9bae12978ef..5fb1a43e12e0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -977,7 +977,6 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmdp_set_wrprotect(src_mm, addr, src_pmd);
 		pmd = pmd_wrprotect(pmd);
 	}
-	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1071,7 +1070,6 @@  int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pudp_set_wrprotect(src_mm, addr, src_pud);
 		pud = pud_wrprotect(pud);
 	}
-	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index b616a69ad770..3d8bf8220bd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1038,12 +1038,12 @@  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
+	 * Child inherits dirty and young bits from parent. There is no
+	 * point clearing them because any cleaning or aging has to walk
+	 * all ptes anyway, and it will notice the bits set in the parent.
+	 * Leaving them set avoids stalls and even page faults on CPUs that
+	 * handle these bits in software.
 	 */
-	if (vm_flags & VM_SHARED)
-		pte = pte_mkclean(pte);
-	pte = pte_mkold(pte);
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7e7d25504651..52fe64af3d80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1021,6 +1021,14 @@  static enum page_references page_check_references(struct page *page,
 		 * to look twice if a mapped file page is used more
 		 * than once.
 		 *
+		 * fork() will set referenced bits in child ptes despite
+		 * not having been accessed, to avoid micro-faults of
+		 * setting accessed bits. This heuristic is not perfectly
+		 * accurate in other ways -- multiple map/unmap in the
+		 * same time window would be treated as multiple references
+		 * despite same number of actual memory accesses made by
+		 * the program.
+		 *
 		 * Mark it and spare it for another trip around the
 		 * inactive list.  Another page table reference will
 		 * lead to its activation.
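
The effect of the mm/memory.c and mm/vmscan.c hunks above can be
illustrated with a small userspace model. It is a toy sketch under
stated assumptions, not kernel code: toy_pte_t, the TOY_PTE_* flag
values and all three function names are hypothetical, real pte layouts
are per-architecture, and the COW write-protect step in copy_one_pte()
is left out. It shows that before the patch the child pte was always
old (and clean for shared mappings), while after the patch it inherits
the parent's bits, so a reclaim walk over both mappings can count two
young ptes for a page the program referenced once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for a toy pte. */
#define TOY_PTE_DIRTY 0x1u
#define TOY_PTE_YOUNG 0x2u

typedef unsigned int toy_pte_t;

/* Pre-patch behaviour: clean for shared mappings, always old. */
static toy_pte_t toy_copy_pte_old(toy_pte_t pte, bool vm_shared)
{
	if (vm_shared)
		pte &= ~TOY_PTE_DIRTY;  /* was pte_mkclean() */
	pte &= ~TOY_PTE_YOUNG;          /* was pte_mkold() */
	return pte;
}

/* Post-patch behaviour: the child inherits dirty and young bits. */
static toy_pte_t toy_copy_pte_new(toy_pte_t pte)
{
	return pte;
}

/* Count mappings of one page whose young bit is set. */
static int toy_referenced_ptes(const toy_pte_t *ptes, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (ptes[i] & TOY_PTE_YOUNG)
			count++;
	return count;
}
```

With a parent pte that is dirty and young, the pre-patch copy leaves one
young mapping (the parent's) and the post-patch copy leaves two. That
is the case the new page_check_references() comment warns about.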