
[2/4] mm: speed up mremap by 500x on large regions (v2)

Message ID 20181013013200.206928-3-joel@joelfernandes.org
State Not Applicable
Series Add support for fast mremap

Commit Message

Joel Fernandes Oct. 13, 2018, 1:31 a.m. UTC
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one pte at
a time, and can be really slow across a large map. Turning on THP may not
be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.

The speed-up is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the
TLB every time we do this optimization since I couldn't find a way to
determine if the low-level PTEs are dirty. The cost of doing so is small
compared to the improvement, on both x86-64 and arm64.
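
For reference, the numbers above were gathered with a simple userspace
loop along the following lines (an illustrative sketch, not the exact
test program):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define SIZE	(1UL << 30)	/* 1GB */

int main(void)
{
	struct timespec ts, te;
	void *old, *new;

	old = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED)
		return 1;
	/* Fault in every page so there are real PTEs/PMDs to move */
	memset(old, 1, SIZE);

	clock_gettime(CLOCK_MONOTONIC, &ts);
	/*
	 * Note: the PMD-level copy only triggers when both the old and
	 * the new address are PMD-aligned; a real test would arrange
	 * that with aligned hints or MREMAP_FIXED.
	 */
	new = mremap(old, SIZE, SIZE, MREMAP_MAYMOVE);
	clock_gettime(CLOCK_MONOTONIC, &te);
	if (new == MAP_FAILED)
		return 1;

	printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
	       (te.tv_sec - ts.tv_sec) * 1000000000L +
	       (te.tv_nsec - ts.tv_nsec));
	return 0;
}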

Cc: minchan@kernel.org
Cc: pantin@google.com
Cc: hughd@google.com
Cc: lokeshgidra@google.com
Cc: dancol@google.com
Cc: mhocko@kernel.org
Cc: kirill@shutemov.name
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig |  5 ++++
 mm/mremap.c  | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

Comments

Christoph Hellwig Oct. 15, 2018, 9:42 a.m. UTC | #1
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations.

Just curious: why?

> +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> +	    || old_end - old_addr < PMD_SIZE)

The || goes on the first line.

> +		} else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {

Overly long line.
Joel Fernandes Oct. 15, 2018, 10:33 p.m. UTC | #2
On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations.
> 
> Just curious: why?

In Android we have a requirement of moving a large (up to a GB now, but may
grow bigger in future) memory range from one location to another. This move
operation has to happen when the application threads are paused for this
operation. Therefore, an inefficient move like it is now (for example 250ms
on arm64) will cause response time issues for applications, which is not
acceptable. Huge pages cannot be used in such memory ranges to avoid this
inefficiency as (when the application threads are running) our fault handlers
are designed to process 4KB pages at a time, to keep response times low. So
using huge pages in this context can, again, cause response time issues.

Also, the mremap syscall waiting for a quarter of a second for a large mremap
is quite weird and we ought to improve it where possible.

> > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +	    || old_end - old_addr < PMD_SIZE)
> 
> The || goes on the first line.

Ok, fixed.

> > +		} else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
> 
> Overly long line.

Ok, fixed. Preview of updated patch is below.

thanks,

 - Joel

------8<---
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v3)

Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one pte at
a time, and can be really slow across a large map. Turning on THP may not
be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.

The speed-up is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is mostly skipped except in
certain situations. I also flush the TLB every time we do this
optimization since I couldn't find a way to determine if the low-level
PTEs are dirty. The cost of doing so is small compared to the
improvement, on both x86-64 and arm64.

Cc: minchan@kernel.org
Cc: pantin@google.com
Cc: hughd@google.com
Cc: lokeshgidra@google.com
Cc: dancol@google.com
Cc: mhocko@kernel.org
Cc: kirill@shutemov.name
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig |  5 ++++
 mm/mremap.c  | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..9724fe39884f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -518,6 +518,11 @@ config HAVE_IRQ_TIME_ACCOUNTING
 	  Archs need to ensure they use a high enough resolution clock to
 	  support irq time accounting and then call enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+	bool
+	help
+	  Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..a8dd98a59975 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+	spinlock_t *old_ptl, *new_ptl;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) ||
+	    old_end - old_addr < PMD_SIZE)
+		return false;
+
+	/*
+	 * The destination pmd shouldn't be established, free_pgtables()
+	 * should have release it.
+	 */
+	if (WARN_ON(!pmd_none(*new_pmd)))
+		return false;
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * ptlocks because exclusive mmap_sem prevents deadlock.
+	 */
+	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+	if (old_ptl) {
+		pmd_t pmd;
+
+		new_ptl = pmd_lockptr(mm, new_pmd);
+		if (new_ptl != old_ptl)
+			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+		/* Clear the pmd */
+		pmd = *old_pmd;
+		pmd_clear(old_pmd);
+
+		VM_BUG_ON(!pmd_none(*new_pmd));
+
+		/* Set the new pmd */
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
+		if (new_ptl != old_ptl)
+			spin_unlock(new_ptl);
+		spin_unlock(old_ptl);
+
+		*need_flush = true;
+		return true;
+	}
+	return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -239,7 +287,25 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
+		} else if (extent == PMD_SIZE &&
+			   IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
+			/*
+			 * If the extent is PMD-sized, try to speed the move by
+			 * moving at the PMD level if possible.
+			 */
+			bool moved;
+
+			if (need_rmap_locks)
+				take_rmap_locks(vma);
+			moved = move_normal_pmd(vma, old_addr, new_addr,
+						old_end, old_pmd, new_pmd,
+						&need_flush);
+			if (need_rmap_locks)
+				drop_rmap_locks(vma);
+			if (moved)
+				continue;
 		}
+
 		if (pte_alloc(new_vma->vm_mm, new_pmd))
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
Vlastimil Babka Oct. 16, 2018, 11:29 a.m. UTC | #3
On 10/16/18 12:33 AM, Joel Fernandes wrote:
> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>> Android needs to mremap large regions of memory during memory management
>>> related operations.
>>
>> Just curious: why?
> 
> In Android we have a requirement of moving a large (up to a GB now, but may
> grow bigger in future) memory range from one location to another.

I think Christoph's "why?" was about the requirement, not why it hurts
applications. I admit I'm now also curious :)

> This move
> operation has to happen when the application threads are paused for this
> operation. Therefore, an inefficient move like it is now (for example 250ms
> on arm64) will cause response time issues for applications, which is not
> acceptable. Huge pages cannot be used in such memory ranges to avoid this
> inefficiency as (when the application threads are running) our fault handlers
> are designed to process 4KB pages at a time, to keep response times low. So
> using huge pages in this context can, again, cause response time issues.
> 
> Also, the mremap syscall waiting for a quarter of a second for a large mremap
> is quite weird and we ought to improve it where possible.
Joel Fernandes Oct. 16, 2018, 7:43 p.m. UTC | #4
On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
> On 10/16/18 12:33 AM, Joel Fernandes wrote:
> > On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
> >> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> >>> Android needs to mremap large regions of memory during memory management
> >>> related operations.
> >>
> >> Just curious: why?
> > 
> > In Android we have a requirement of moving a large (up to a GB now, but may
> > grow bigger in future) memory range from one location to another.
> 
> I think Christoph's "why?" was about the requirement, not why it hurts
> applications. I admit I'm now also curious :)

This issue was discovered when we wanted to be able to move the physical
pages of a memory range to another location quickly so that, after the
application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
faults can be received on the original memory range. The actual operations
performed on the memory range are beyond the scope of this discussion. The
user threads continue to refer to the old address which will now fault. The
reason we want to retain the old memory range and receive faults there is to
avoid the need to fix the addresses all over the address space of the threads
after we finish performing operations on them in the fault handlers, so
we mremap it and receive faults at the old addresses.
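
Roughly, the userspace sequence looks like the following (an illustrative
sketch with error handling omitted; the function and parameter names are
hypothetical, not the actual Android code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int move_and_arm_uffd(void *old_addr, void *new_addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)old_addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	ioctl(uffd, UFFDIO_API, &api);

	/* Move the pages away; this is the mremap this patch speeds up */
	mremap(old_addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);

	/* Re-create an empty mapping at the old location ... */
	mmap(old_addr, len, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	/*
	 * ... and arm it: accesses by the resumed threads now raise
	 * MISSING faults that the fault handling thread services.
	 */
	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}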

Does that answer your question?

thanks,

- Joel
Vlastimil Babka Oct. 17, 2018, 7:38 a.m. UTC | #5
On 10/16/18 9:43 PM, Joel Fernandes wrote:
> On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
>> On 10/16/18 12:33 AM, Joel Fernandes wrote:
>>> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>>>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>>>> Android needs to mremap large regions of memory during memory management
>>>>> related operations.
>>>>
>>>> Just curious: why?
>>>
>>> In Android we have a requirement of moving a large (up to a GB now, but may
>>> grow bigger in future) memory range from one location to another.
>>
>> I think Christoph's "why?" was about the requirement, not why it hurts
>> applications. I admit I'm now also curious :)
> 
> This issue was discovered when we wanted to be able to move the physical
> pages of a memory range to another location quickly so that, after the
> application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
> faults can be received on the original memory range. The actual operations
> performed on the memory range are beyond the scope of this discussion. The
> user threads continue to refer to the old address which will now fault. The
> reason we want to retain the old memory range and receive faults there is to
> avoid the need to fix the addresses all over the address space of the threads
> after we finish performing operations on them in the fault handlers, so
> we mremap it and receive faults at the old addresses.
> 
> Does that answer your question?

Yes, interesting, thanks!

Vlastimil

> thanks,
> 
> - Joel
>
Kirill A. Shutemov Oct. 24, 2018, 10:12 a.m. UTC | #6
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		drop_rmap_locks(vma);
>  }
>  
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> +		  unsigned long new_addr, unsigned long old_end,
> +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> +	spinlock_t *old_ptl, *new_ptl;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> +	    || old_end - old_addr < PMD_SIZE)
> +		return false;
> +
> +	/*
> +	 * The destination pmd shouldn't be established, free_pgtables()
> +	 * should have release it.
> +	 */
> +	if (WARN_ON(!pmd_none(*new_pmd)))
> +		return false;
> +
> +	/*
> +	 * We don't have to worry about the ordering of src and dst
> +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> +	 */
> +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> +	if (old_ptl) {

How can it ever be false?

> +		pmd_t pmd;
> +
> +		new_ptl = pmd_lockptr(mm, new_pmd);
> +		if (new_ptl != old_ptl)
> +			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> +		/* Clear the pmd */
> +		pmd = *old_pmd;
> +		pmd_clear(old_pmd);
> +
> +		VM_BUG_ON(!pmd_none(*new_pmd));
> +
> +		/* Set the new pmd */
> +		set_pmd_at(mm, new_addr, new_pmd, pmd);
> +		if (new_ptl != old_ptl)
> +			spin_unlock(new_ptl);
> +		spin_unlock(old_ptl);
> +
> +		*need_flush = true;
> +		return true;
> +	}
> +	return false;
> +}
> +
Balbir Singh Oct. 24, 2018, 11:57 a.m. UTC | #7
On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..2fd163cff406 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> >  		drop_rmap_locks(vma);
> >  }
> >  
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > +		  unsigned long new_addr, unsigned long old_end,
> > +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +	spinlock_t *old_ptl, *new_ptl;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +
> > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +	    || old_end - old_addr < PMD_SIZE)
> > +		return false;
> > +
> > +	/*
> > +	 * The destination pmd shouldn't be established, free_pgtables()
> > +	 * should have release it.
> > +	 */
> > +	if (WARN_ON(!pmd_none(*new_pmd)))
> > +		return false;
> > +
> > +	/*
> > +	 * We don't have to worry about the ordering of src and dst
> > +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> > +	 */
> > +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +	if (old_ptl) {
> 
> How can it ever be false?
> 
> > +		pmd_t pmd;
> > +
> > +		new_ptl = pmd_lockptr(mm, new_pmd);


Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
the code applies, why not just reuse as much as possible? The same comments
w.r.t mmap_sem helping protect against lock order issues applies as well.

> > +		if (new_ptl != old_ptl)
> > +			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +		/* Clear the pmd */
> > +		pmd = *old_pmd;
> > +		pmd_clear(old_pmd);
> > +
> > +		VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +		/* Set the new pmd */
> > +		set_pmd_at(mm, new_addr, new_pmd, pmd);
> > +		if (new_ptl != old_ptl)
> > +			spin_unlock(new_ptl);
> > +		spin_unlock(old_ptl);
> > +
> > +		*need_flush = true;
> > +		return true;
> > +	}
> > +	return false;
> > +}
> > +
> -- 
>  Kirill A. Shutemov
>
Kirill A. Shutemov Oct. 24, 2018, 12:57 p.m. UTC | #8
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > >  		drop_rmap_locks(vma);
> > >  }
> > >  
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > +		  unsigned long new_addr, unsigned long old_end,
> > > +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > +	spinlock_t *old_ptl, *new_ptl;
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +
> > > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > +	    || old_end - old_addr < PMD_SIZE)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * The destination pmd shouldn't be established, free_pgtables()
> > > +	 * should have release it.
> > > +	 */
> > > +	if (WARN_ON(!pmd_none(*new_pmd)))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * We don't have to worry about the ordering of src and dst
> > > +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +	 */
> > > +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > +	if (old_ptl) {
> > 
> > How can it ever be false?
> > 
> > > +		pmd_t pmd;
> > > +
> > > +		new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not
copy the code blindly.
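
For reference, the two helpers differ roughly as follows (a simplified
sketch; details vary across kernel versions):

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
	spinlock_t *ptl = pmd_lockptr(mm, pmd);

	spin_lock(ptl);
	return ptl;		/* always returns with the lock held */
}

/* Returns the held lock only if *pmd is a huge (or swap/devmap) entry */
spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
		   pmd_devmap(*pmd)))
		return ptl;
	spin_unlock(ptl);
	return NULL;		/* caller must fall back to the PTE path */
}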
Joel Fernandes Oct. 25, 2018, 2:09 a.m. UTC | #9
On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > --- a/mm/mremap.c
> > > > +++ b/mm/mremap.c
> > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > >  		drop_rmap_locks(vma);
> > > >  }
> > > >  
> > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > +		  unsigned long new_addr, unsigned long old_end,
> > > > +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > +{
> > > > +	spinlock_t *old_ptl, *new_ptl;
> > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > +
> > > > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > +	    || old_end - old_addr < PMD_SIZE)
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * The destination pmd shouldn't be established, free_pgtables()
> > > > +	 * should have release it.
> > > > +	 */
> > > > +	if (WARN_ON(!pmd_none(*new_pmd)))
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * We don't have to worry about the ordering of src and dst
> > > > +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > +	 */
> > > > +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > +	if (old_ptl) {
> > > 
> > > How can it ever be false?

Kirill,
It cannot, you are right. I'll remove the test.

By the way, there are new changes upstream by Linus which flush the TLB
before releasing the ptlock instead of after. I'm guessing that patch came
about because of reviews of this patch and someone spotted an issue in the
existing code :)

Anyway the patch in concern is:
eb66ae030829 ("mremap: properly flush TLB before releasing the page")

I need to rebase on top of that with appropriate modifications, but I worry
that this patch will slow down performance since we have to flush at every
PMD/PTE move before releasing the ptlock. Whereas with my patch, the
intention is to flush only once at the end of move_page_tables. When I
tried to flush the TLB on every PMD move, it was quite slow on my arm64
device [2].

A further observation [1] is that the move_huge_pmd and move_ptes code is
a bit suboptimal in the sense that we are acquiring and releasing the same
ptlock for a bunch of PMDs if the said PMDs are on the same page-table
page, right? Instead we can do better by acquiring and releasing the
ptlock less often.

I think this observation [1] and the frequent TLB flush issue [2] can be
solved by acquiring the ptlock once for a bunch of PMDs, moving them all,
then flushing the TLB and releasing the ptlock, and then proceeding to do
the same thing for the PMDs in the next page-table page. What do you think?
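
Very roughly, something along these lines (an untested sketch that ignores
THP entries, pte-mapped ranges and the destination-side lock, just to
illustrate the batching idea):

/* Keep one page-table page's ptlock across all the PMDs it covers */
unsigned long batch_start = old_addr;
spinlock_t *ptl = pmd_lock(mm, old_pmd);

do {
	pmd_t pmd = *old_pmd;

	pmd_clear(old_pmd);
	set_pmd_at(mm, new_addr, new_pmd, pmd);

	old_addr += PMD_SIZE;
	new_addr += PMD_SIZE;
	old_pmd++;
	new_pmd++;
} while (old_addr < old_end && pmd_lockptr(mm, old_pmd) == ptl);

/* One flush and one unlock per page-table page instead of per PMD */
flush_tlb_range(vma, batch_start, old_addr);
spin_unlock(ptl);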

- Joel
Joel Fernandes Oct. 25, 2018, 2:13 a.m. UTC | #10
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
[...]
> > > +		pmd_t pmd;
> > > +
> > > +		new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

I thought about this and when I looked into it, it seemed there are subtle
differences that make such sharing not worth it (or not possible).

 - Joel
Kirill A. Shutemov Oct. 25, 2018, 10:19 a.m. UTC | #11
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > > >  		drop_rmap_locks(vma);
> > > > >  }
> > > > >  
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > > +		  unsigned long new_addr, unsigned long old_end,
> > > > > +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > +	spinlock_t *old_ptl, *new_ptl;
> > > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > +	    || old_end - old_addr < PMD_SIZE)
> > > > > +		return false;
> > > > > +
> > > > > +	/*
> > > > > +	 * The destination pmd shouldn't be established, free_pgtables()
> > > > > +	 * should have release it.
> > > > > +	 */
> > > > > +	if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > +		return false;
> > > > > +
> > > > > +	/*
> > > > > +	 * We don't have to worry about the ordering of src and dst
> > > > > +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > +	 */
> > > > > +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > +	if (old_ptl) {
> > > > 
> > > > How can it ever be false?
> 
> Kirill,
> It cannot, you are right. I'll remove the test.
> 
> By the way, there are new changes upstream by Linus which flush the TLB
> before releasing the ptlock instead of after. I'm guessing that patch came
> about because of reviews of this patch and someone spotted an issue in the
> existing code :)
> 
> Anyway the patch in concern is:
> eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> 
> I need to rebase on top of that with appropriate modifications, but I worry
> that this patch will slow down performance since we have to flush at every
> PMD/PTE move before releasing the ptlock. Whereas with my patch, the
> intention is to flush only once at the end of move_page_tables. When I
> tried to flush the TLB on every PMD move, it was quite slow on my arm64
> device [2].
> 
> A further observation [1] is that the move_huge_pmd and move_ptes code is
> a bit suboptimal in the sense that we are acquiring and releasing the same
> ptlock for a bunch of PMDs if the said PMDs are on the same page-table
> page, right? Instead we can do better by acquiring and releasing the
> ptlock less often.
> 
> I think this observation [1] and the frequent TLB flush issue [2] can be
> solved by acquiring the ptlock once for a bunch of PMDs, moving them all,
> then flushing the TLB and releasing the ptlock, and then proceeding to do
> the same thing for the PMDs in the next page-table page. What do you think?

Yeah, that's viable optimization.

The tricky part is that one PMD page table can have PMD entires of
different types: THP, page table that you can move as whole and the one
that you cannot (for any reason).

If we cannot move the PMD entry as a whole and must go to PTE page table
we would need to drop PMD ptl and take PTE ptl (it might be the same lock
in some configuations).

Also we don't want to take PMD lock unless it's required.

I expect it to be not very trivial to get everything right. But take a
shot :)
Joel Fernandes Oct. 26, 2018, 9:11 p.m. UTC | #12
On Thu, Oct 25, 2018 at 01:19:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > > --- a/mm/mremap.c
> > > > > > +++ b/mm/mremap.c
> > > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> > > > > >  		drop_rmap_locks(vma);
> > > > > >  }
> > > > > >  
> > > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > > > > +		  unsigned long new_addr, unsigned long old_end,
> > > > > > +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > > +{
> > > > > > +	spinlock_t *old_ptl, *new_ptl;
> > > > > > +	struct mm_struct *mm = vma->vm_mm;
> > > > > > +
> > > > > > +	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > > +	    || old_end - old_addr < PMD_SIZE)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * The destination pmd shouldn't be established, free_pgtables()
> > > > > > +	 * should have release it.
> > > > > > +	 */
> > > > > > +	if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > > +		return false;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * We don't have to worry about the ordering of src and dst
> > > > > > +	 * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > > +	 */
> > > > > > +	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > > +	if (old_ptl) {
> > > > > 
> > > > > How can it ever be false?
> > 
> > Kirill,
> > It cannot, you are right. I'll remove the test.
> > 
> > By the way, there are new changes upstream by Linus which flush the TLB
> > before releasing the ptlock instead of after. I'm guessing that patch came
> > about because of reviews of this patch and someone spotted an issue in the
> > existing code :)
> > 
> > Anyway the patch in concern is:
> > eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> > 
> > I need to rebase on top of that with appropriate modifications, but I worry
> > that this patch will slow down performance since we have to flush at every
> > PMD/PTE move before releasing the ptlock. Whereas with my patch, the
> > intention is to flush only once at the end of move_page_tables. When I
> > tried to flush the TLB on every PMD move, it was quite slow on my arm64
> > device [2].
> > 
> > A further observation [1] is that the move_huge_pmd and move_ptes code is
> > a bit suboptimal in the sense that we are acquiring and releasing the same
> > ptlock for a bunch of PMDs if the said PMDs are on the same page-table
> > page, right? Instead we can do better by acquiring and releasing the
> > ptlock less often.
> > 
> > I think this observation [1] and the frequent TLB flush issue [2] can be
> > solved by acquiring the ptlock once for a bunch of PMDs, moving them all,
> > then flushing the TLB and releasing the ptlock, and then proceeding to do
> > the same thing for the PMDs in the next page-table page. What do you think?
> 
> Yeah, that's viable optimization.
> 
> The tricky part is that one PMD page table can have PMD entires of
> different types: THP, page table that you can move as whole and the one
> that you cannot (for any reason).
> 
> If we cannot move the PMD entry as a whole and must go to PTE page table
> we would need to drop PMD ptl and take PTE ptl (it might be the same lock
> in some configuations).

> Also we don't want to take PMD lock unless it's required.
> 
> I expect it to be not very trivial to get everything right. But take a
> shot :)

Yes, that is exactly the issue I hit when I attempted it. :) The locks need
to be released if we do something different on the next loop iteration. It
complicates the code and I am not sure it is worth it in the long run. On
x86 at least, I don't see any perf issues with the TLB flush per PMD move,
so the patch is OK there. On arm64, it negates the performance benefit even
though it's not any worse than what we are doing currently at the PTE level.

My thinking is to take it slow and get the patch in, in its current state,
since it improves x86. Then, as a next step, look into why the arm64 TLB
flushes are that expensive and look into optimizing that. On arm64 I am
testing on a 4.9 kernel, so I'm wondering if there are any optimizations
since 4.9 that can help speed it up there. After that, if all else fails at
speeding up arm64, I'll look into developing the cleanest possible solution
where we can keep the lock held for longer and flush less.

thanks,

 - Joel
Balbir Singh Oct. 27, 2018, 10:21 a.m. UTC | #13
On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> [...]
> > > > +		pmd_t pmd;
> > > > +
> > > > +		new_ptl = pmd_lockptr(mm, new_pmd);
> > 
> > 
> > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > the code applies, why not just reuse as much as possible? The same comments
> > w.r.t mmap_sem helping protect against lock order issues applies as well.
> 
> I thought about this and when I looked into it, it seemed there are subtle
> differences that make such sharing not worth it (or not possible).
>

Could you elaborate on them?

Thanks,
Balbir Singh.
Joel Fernandes Oct. 27, 2018, 7:39 p.m. UTC | #14
Hi Balbir,

On Sat, Oct 27, 2018 at 09:21:02PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > [...]
> > > > > +		pmd_t pmd;
> > > > > +
> > > > > +		new_ptl = pmd_lockptr(mm, new_pmd);
> > > 
> > > 
> > > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > > the code applies, why not just reuse as much as possible? The same comments
> > > w.r.t mmap_sem helping protect against lock order issues applies as well.
> > 
> > I thought about this and when I looked into it, it seemed there are subtle
> > differences that make such sharing not worth it (or not possible).
> >
> 
> Could you elaborate on them?

The move_huge_pmd function is defined only for CONFIG_TRANSPARENT_HUGEPAGE,
so we cannot reuse it to begin with, since we have it disabled on our
systems. I am not sure if it is a good idea to split that out and refactor it
for reuse, especially since our case is quite simple compared to huge pages.

There are also a couple of subtle differences between move_normal_pmd and
move_huge_pmd. At least two of them are:

1. We don't concern ourselves with the PMD dirty bit, since the pages being
moved are normal pages and the soft-dirty bit accounting is at the PTE
level; since we are not moving PTEs, we don't need to do that.

2. The locking is simpler, as Kirill pointed out: pmd_lock cannot fail,
whereas __pmd_trans_huge_lock can.

I feel it is not super useful to refactor move_huge_pmd to support our case,
especially since move_normal_pmd is quite small, so IMHO the benefit of
code reuse isn't really there.

Do let me know your thoughts and thanks for your interest in this.

thanks,

 - Joel
Balbir Singh Oct. 28, 2018, 10:40 p.m. UTC | #15
On Sat, Oct 27, 2018 at 12:39:17PM -0700, Joel Fernandes wrote:
> Hi Balbir,
> 
> On Sat, Oct 27, 2018 at 09:21:02PM +1100, Balbir Singh wrote:
> > On Wed, Oct 24, 2018 at 07:13:50PM -0700, Joel Fernandes wrote:
> > > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > [...]
> > > > > > +		pmd_t pmd;
> > > > > > +
> > > > > > +		new_ptl = pmd_lockptr(mm, new_pmd);
> > > > 
> > > > 
> > > > Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> > > > the code applies, why not just reuse as much as possible? The same comments
> > > > w.r.t mmap_sem helping protect against lock order issues applies as well.
> > > 
> > > I thought about this and when I looked into it, it seemed there are subtle
> > > differences that make such sharing not worth it (or not possible).
> > >
> > 
> > Could you elaborate on them?
> 
> The move_huge_pmd function is defined only for CONFIG_TRANSPARENT_HUGEPAGE,
> so we cannot reuse it to begin with, since we have it disabled on our
> systems. I am not sure if it is a good idea to split that out and refactor it
> for reuse, especially since our case is quite simple compared to huge pages.
> 
> There are also a couple of subtle differences between move_normal_pmd and
> move_huge_pmd. At least two of them are:
> 
> 1. We don't concern ourselves with the PMD dirty bit, since the pages being
> moved are normal pages and the soft-dirty bit accounting is at the PTE
> level; since we are not moving PTEs, we don't need to do that.
> 
> 2. The locking is simpler, as Kirill pointed out: pmd_lock cannot fail,
> whereas __pmd_trans_huge_lock can.
> 
> I feel it is not super useful to refactor move_huge_pmd to support our case,
> especially since move_normal_pmd is quite small, so IMHO the benefit of
> code reuse isn't really there.
>

My big concern is that any bug fixes will need to monitor both paths.
Do you see a big overhead in checking the soft dirty bit? The locking is
a little different. Having said that, I am not strictly opposed to the
extra code, just concerned about missing fixes/updates as we find them.
 
> Do let me know your thoughts and thanks for your interest in this.
> 
>

Balbir Singh.
Will Deacon Oct. 29, 2018, 10:28 a.m. UTC | #16
On Fri, Oct 26, 2018 at 02:11:48PM -0700, Joel Fernandes wrote:
> My thinking is to take it slow and get the patch in, in its current state,
> since it improves x86. Then, as a next step, look into why the arm64 TLB
> flushes are that expensive and look into optimizing that. On arm64 I am
> testing on a 4.9 kernel, so I'm wondering if there are any optimizations
> since 4.9 that can help speed it up there. After that, if all else fails at
> speeding up arm64, I'll look into developing the cleanest possible solution
> where we can keep the lock held for longer and flush less.

We rewrote a good chunk of the arm64 TLB invalidation and core mmu_gather
code this merge window, so please do have another look at -rc1!

Will

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..9724fe39884f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -518,6 +518,11 @@  config HAVE_IRQ_TIME_ACCOUNTING
 	  Archs need to ensure they use a high enough resolution clock to
 	  support irq time accounting and then call enable_sched_clock_irqtime().
 
+config HAVE_MOVE_PMD
+	bool
+	help
+	  Archs that select this are able to move page tables at the PMD level.
+
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 9e68a02a52b1..2fd163cff406 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		drop_rmap_locks(vma);
 }
 
+static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+	spinlock_t *old_ptl, *new_ptl;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+	    || old_end - old_addr < PMD_SIZE)
+		return false;
+
+	/*
+	 * The destination pmd shouldn't be established, free_pgtables()
+	 * should have release it.
+	 */
+	if (WARN_ON(!pmd_none(*new_pmd)))
+		return false;
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * ptlocks because exclusive mmap_sem prevents deadlock.
+	 */
+	old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+	if (old_ptl) {
+		pmd_t pmd;
+
+		new_ptl = pmd_lockptr(mm, new_pmd);
+		if (new_ptl != old_ptl)
+			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+		/* Clear the pmd */
+		pmd = *old_pmd;
+		pmd_clear(old_pmd);
+
+		VM_BUG_ON(!pmd_none(*new_pmd));
+
+		/* Set the new pmd */
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
+		if (new_ptl != old_ptl)
+			spin_unlock(new_ptl);
+		spin_unlock(old_ptl);
+
+		*need_flush = true;
+		return true;
+	}
+	return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -239,7 +287,24 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
+		} else if (extent == PMD_SIZE && IS_ENABLED(CONFIG_HAVE_MOVE_PMD)) {
+			/*
+			 * If the extent is PMD-sized, try to speed the move by
+			 * moving at the PMD level if possible.
+			 */
+			bool moved;
+
+			if (need_rmap_locks)
+				take_rmap_locks(vma);
+			moved = move_normal_pmd(vma, old_addr, new_addr,
+					old_end, old_pmd, new_pmd,
+					&need_flush);
+			if (need_rmap_locks)
+				drop_rmap_locks(vma);
+			if (moved)
+				continue;
 		}
+
 		if (pte_alloc(new_vma->vm_mm, new_pmd))
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;