diff mbox series

[mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Message ID 4d31abf5-56c0-9f3d-d12f-c9317936691@google.com
State New
Headers show
Series [mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd | expand

Commit Message

Hugh Dickins Aug. 21, 2023, 7:51 p.m. UTC
Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed().  Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Just for this case, take the pmd_lock() two steps earlier: not because
it gives any protection against this case itself, but because ptlock
nests inside it, and it's the dropping of ptlock which let the bug in.
In other cases, continue to minimize the pmd_lock() hold time.

Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

Comments

Peter Xu Aug. 21, 2023, 9:26 p.m. UTC | #1
On Mon, Aug 21, 2023 at 12:51:20PM -0700, Hugh Dickins wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> 
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
> 
> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins <hughd@google.com>

The locking is indeed slightly complicated.. but I didn't spot anything
wrong.

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,
Jann Horn Aug. 21, 2023, 9:59 p.m. UTC | #2
On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.

I think we couldn't rely on anon_vma here anyway, since holding the
mmap_lock in read mode doesn't prevent concurrent creation of an
anon_vma?

> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.

Special-casing userfaultfd like this makes me a bit uncomfortable; but
I also can't find anything other than userfaultfd that would insert
pages into regions that are khugepaged-compatible, so I guess this
works?

I guess an alternative would be to use a spin_trylock() instead of the
current pmd_lock(), and if that fails, temporarily drop the page table
lock and then restart from step 2 with both locks held - and at that
point the page table scan should be fast since we expect it to usually
be empty.
Hugh Dickins Aug. 22, 2023, 2:51 a.m. UTC | #3
On Mon, 21 Aug 2023, Jann Horn wrote:
> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > thought it had emptied: page lock on the huge page is enough to protect
> > against WP faults (which find the PTE has been cleared), but not enough
> > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> >
> > retract_page_tables() protects against this by checking !vma->anon_vma;
> > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > mappings, even those with an anon_vma prepared for another part of the
> > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > mappings which are userfaultfd_armed().  Whether it needs to work on
> > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > but assume that it does.
> 
> I think we couldn't rely on anon_vma here anyway, since holding the
> mmap_lock in read mode doesn't prevent concurrent creation of an
> anon_vma?

We would have had to do the same as in retract_page_tables() (which
doesn't even have mmap_lock for read): recheck !vma->anon_vma after
finally acquiring ptlock.  But the !anon_vma limitation is certainly
not acceptable here anyway.

> 
> > Just for this case, take the pmd_lock() two steps earlier: not because
> > it gives any protection against this case itself, but because ptlock
> > nests inside it, and it's the dropping of ptlock which let the bug in.
> > In other cases, continue to minimize the pmd_lock() hold time.
> 
> Special-casing userfaultfd like this makes me a bit uncomfortable; but
> I also can't find anything other than userfaultfd that would insert
> pages into regions that are khugepaged-compatible, so I guess this
> works?

I'm as sure as I can be that it's solely because userfaultfd breaks
the usual rules here (and in fairness, IIRC Andrea did ask my permission
before making it behave that way on shmem, COWing without a source page).

Perhaps something else will want that same behaviour in future (it's
tempting, but difficult to guarantee correctness); for now, it is just
userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
expecting uffd to add more such exceptional modes in future).

> 
> I guess an alternative would be to use a spin_trylock() instead of the
> current pmd_lock(), and if that fails, temporarily drop the page table
> lock and then restart from step 2 with both locks held - and at that
> point the page table scan should be fast since we expect it to usually
> be empty.

That's certainly a good idea, if collapse on userfaultfd_armed private
is anything of a common case (I doubt, but I don't know).  It may be a
better idea anyway (saving a drop and retake of ptlock).

I gave it a try, expecting to end up with something that would lead
me to say "I tried it, but it didn't work out well"; but actually it
looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
and no more complicated (as Peter rightly observes) than the original.

It's up to you and Peter, and whoever has strong feelings about it,
to choose between them: I don't mind (but I shall be sad if someone
demands that I indent that comment deeper - I'm not a fan of long
multi-line comments near column 80).


[PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed().  Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Now trylock pmd lock without dropping ptlock (suggested by jannh): if
that fails, drop and retake ptlock around taking pmd lock, and just in
the uffd private case, go back to recheck and empty the page table.

Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..ad1c571772fe 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)		/* mmap_lock + page lock should prevent this */
-		goto abort;
+	spin_lock(ptl);
+recheck:
+	start_pte = pte_offset_map(pmd, haddr);
+	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
 		page_ref_sub(hpage, nr_ptes);
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+		nr_ptes = 0;
 	}
 
-	/* step 4: remove page table */
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lockptr(mm, pmd);
+		if (pml != ptl && !spin_trylock(pml)) {
+			spin_unlock(ptl);
+			spin_lock(pml);
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
+	 */
+			if (userfaultfd_armed(vma) &&
+			    !(vma->vm_flags & VM_SHARED))
+				goto recheck;
+		}
+	}
 
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
Peter Xu Aug. 22, 2023, 2:31 p.m. UTC | #4
Hi, Hugh, Jann,

On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > > thought it had emptied: page lock on the huge page is enough to protect
> > > against WP faults (which find the PTE has been cleared), but not enough
> > > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> > >
> > > retract_page_tables() protects against this by checking !vma->anon_vma;
> > > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > > mappings, even those with an anon_vma prepared for another part of the
> > > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > > mappings which are userfaultfd_armed().  Whether it needs to work on
> > > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > > but assume that it does.
> > 
> > I think we couldn't rely on anon_vma here anyway, since holding the
> > mmap_lock in read mode doesn't prevent concurrent creation of an
> > anon_vma?
> 
> We would have had to do the same as in retract_page_tables() (which
> doesn't even have mmap_lock for read): recheck !vma->anon_vma after
> finally acquiring ptlock.  But the !anon_vma limitation is certainly
> not acceptable here anyway.
> 
> > 
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> > 
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
> 
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
> 
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).
> 
> > 
> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
> 
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know).  It may be a
> better idea anyway (saving a drop and retake of ptlock).
> 
> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
> 
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

No strong opinion here, either.  Just one trivial comment/question below on
the new patch (if that will be preferred)..

> 
> 
> [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
> 
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> 
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
> 
> Now trylock pmd lock without dropping ptlock (suggested by jannh): if
> that fails, drop and retake ptlock around taking pmd lock, and just in
> the uffd private case, go back to recheck and empty the page table.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 40d43eccdee8..ad1c571772fe 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	struct page *hpage;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
> -	spinlock_t *pml, *ptl;
> +	spinlock_t *pml = NULL, *ptl;
>  	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>  
> @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  				haddr, haddr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>  	notified = true;
> -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> -		goto abort;
> +	spin_lock(ptl);

.. here will the ptl always be valid?

That comes from the previous round of pte_offset_map_lock(), and I assume
after this whole "thp collapse without write lock" work landed, it has the
same lifecycle with the *pte pointer, so can be invalid right after the rcu
read lock released; mmap read lock isn't strong enough to protect the ptl,
not anymore.

Maybe it's all fine because the thp collapse path is the solo path(s) that
will release the pte pgtable page without write mmap lock (so as to release
the ptl too when doing so), and we at least still hold the page lock, so
the worst case is the other concurrent "thp collapse" will still serialize
with this one on the huge page lock. But that doesn't look as solid as
fetching again the ptl from another pte_offset_map_nolock().  So still just
raise this question up.  It's possible I just missed something.

> +recheck:
> +	start_pte = pte_offset_map(pmd, haddr);
> +	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
>  
>  	/* step 2: clear page table and adjust rmap */
>  	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		nr_ptes++;
>  	}
>  
> -	pte_unmap_unlock(start_pte, ptl);
> +	pte_unmap(start_pte);
>  
>  	/* step 3: set proper refcount and mm_counters. */
>  	if (nr_ptes) {
>  		page_ref_sub(hpage, nr_ptes);
>  		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +		nr_ptes = 0;
>  	}
>  
> -	/* step 4: remove page table */
> +	/* step 4: remove empty page table */
> +	if (!pml) {
> +		pml = pmd_lockptr(mm, pmd);
> +		if (pml != ptl && !spin_trylock(pml)) {
> +			spin_unlock(ptl);
> +			spin_lock(pml);
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +	/*
> +	 * pmd_lock covers a wider range than ptl, and (if split from mm's
> +	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
> +	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
> +	 * inserts a valid as-if-COWed PTE without even looking up page cache.
> +	 * So page lock of hpage does not protect from it, so we must not drop
> +	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
> +	 */
> +			if (userfaultfd_armed(vma) &&
> +			    !(vma->vm_flags & VM_SHARED))
> +				goto recheck;
> +		}
> +	}
>  
> -	/* Huge page lock is still held, so page table must remain empty */
> -	pml = pmd_lock(mm, pmd);
> -	if (ptl != pml)
> -		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>  	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>  	pmdp_get_lockless_sync();
>  	if (ptl != pml)
> @@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	}
>  	if (start_pte)
>  		pte_unmap_unlock(start_pte, ptl);
> +	if (pml && pml != ptl)
> +		spin_unlock(pml);
>  	if (notified)
>  		mmu_notifier_invalidate_range_end(&range);
>  drop_hpage:
> -- 
> 2.35.3
Jann Horn Aug. 22, 2023, 2:39 p.m. UTC | #5
On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> >
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
>
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
>
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).

Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
wanted to make it possible to reliably install PTE markers with
madvise() or something like that, which might be nice for allowing
userspace to create guard pages without unnecessary extra VMAs...)

> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
>
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know).  It may be a
> better idea anyway (saving a drop and retake of ptlock).

I was thinking it also has the advantage that it would still perform
okay if we got rid of the userfaultfd_armed() condition at some point
- though I realize that designing too much for hypothetical future
features is an antipattern.

> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
>
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

I prefer this version because it would make it easier to remove the
"userfaultfd_armed()" check in the future if we have to, but I guess
we could also always change it later if that becomes necessary, so I
don't really have strong feelings on it at this point.
Matthew Wilcox Aug. 22, 2023, 3:22 p.m. UTC | #6
On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
> 
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I don't know what a userspace API for this would look like, but I have
a dream of creating guard VMAs which only live in the maple tree and
don't require the allocation of a struct VMA.  Use some magic reserved
pointer value like XA_ZERO_ENTRY to represent them ... seems more
robust than putting a PTE marker in the page tables?
David Hildenbrand Aug. 22, 2023, 3:28 p.m. UTC | #7
On 22.08.23 16:39, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
>> On Mon, 21 Aug 2023, Jann Horn wrote:
>>> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
>>>> Just for this case, take the pmd_lock() two steps earlier: not because
>>>> it gives any protection against this case itself, but because ptlock
>>>> nests inside it, and it's the dropping of ptlock which let the bug in.
>>>> In other cases, continue to minimize the pmd_lock() hold time.
>>>
>>> Special-casing userfaultfd like this makes me a bit uncomfortable; but
>>> I also can't find anything other than userfaultfd that would insert
>>> pages into regions that are khugepaged-compatible, so I guess this
>>> works?
>>
>> I'm as sure as I can be that it's solely because userfaultfd breaks
>> the usual rules here (and in fairness, IIRC Andrea did ask my permission
>> before making it behave that way on shmem, COWing without a source page).
>>
>> Perhaps something else will want that same behaviour in future (it's
>> tempting, but difficult to guarantee correctness); for now, it is just
>> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
>> expecting uffd to add more such exceptional modes in future).
> 
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I'm working on something similar that goes a bit further than just guard 
pages. It also installs PTE markers into page tables, inside existing 
large VMAs.

Initially, I'll only tackle anon VMAs, though.
Jann Horn Aug. 22, 2023, 3:30 p.m. UTC | #8
On Tue, Aug 22, 2023 at 5:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
> > > Perhaps something else will want that same behaviour in future (it's
> > > tempting, but difficult to guarantee correctness); for now, it is just
> > > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > > expecting uffd to add more such exceptional modes in future).
> >
> > Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> > wanted to make it possible to reliably install PTE markers with
> > madvise() or something like that, which might be nice for allowing
> > userspace to create guard pages without unnecessary extra VMAs...)
>
> I don't know what a userspace API for this would look like, but I have
> a dream of creating guard VMAs which only live in the maple tree and
> don't require the allocation of a struct VMA.  Use some magic reserved
> pointer value like XA_ZERO_ENTRY to represent them ... seems more
> robust than putting a PTE marker in the page tables?

Chrome currently uses a lot of VMAs for its heap, which I think are
basically alternating PROT_NONE and PROT_READ|PROT_WRITE anonymous
VMAs. Like this:

[...]
3a10002cf000-3a10002d0000 ---p 00000000 00:00 0
3a10002d0000-3a10002e6000 rw-p 00000000 00:00 0
3a10002e6000-3a10002e8000 ---p 00000000 00:00 0
3a10002e8000-3a10002f2000 rw-p 00000000 00:00 0
3a10002f2000-3a10002f4000 ---p 00000000 00:00 0
3a10002f4000-3a10002fb000 rw-p 00000000 00:00 0
3a10002fb000-3a10002fc000 ---p 00000000 00:00 0
3a10002fc000-3a1000303000 rw-p 00000000 00:00 0
3a1000303000-3a1000304000 ---p 00000000 00:00 0
3a1000304000-3a100031b000 rw-p 00000000 00:00 0
3a100031b000-3a100031c000 ---p 00000000 00:00 0
3a100031c000-3a1000326000 rw-p 00000000 00:00 0
3a1000326000-3a1000328000 ---p 00000000 00:00 0
3a1000328000-3a100033a000 rw-p 00000000 00:00 0
3a100033a000-3a100033c000 ---p 00000000 00:00 0
3a100033c000-3a100038b000 rw-p 00000000 00:00 0
3a100038b000-3a100038c000 ---p 00000000 00:00 0
3a100038c000-3a100039b000 rw-p 00000000 00:00 0
3a100039b000-3a100039c000 ---p 00000000 00:00 0
3a100039c000-3a10003af000 rw-p 00000000 00:00 0
3a10003af000-3a10003b0000 ---p 00000000 00:00 0
3a10003b0000-3a10003e8000 rw-p 00000000 00:00 0
3a10003e8000-3a1000401000 ---p 00000000 00:00 0
3a1000401000-3a1000402000 rw-p 00000000 00:00 0
3a1000402000-3a100040c000 ---p 00000000 00:00 0
3a100040c000-3a100046f000 rw-p 00000000 00:00 0
3a100046f000-3a1000470000 ---p 00000000 00:00 0
3a1000470000-3a100047a000 rw-p 00000000 00:00 0
3a100047a000-3a100047c000 ---p 00000000 00:00 0
3a100047c000-3a1000492000 rw-p 00000000 00:00 0
3a1000492000-3a1000494000 ---p 00000000 00:00 0
3a1000494000-3a10004a2000 rw-p 00000000 00:00 0
3a10004a2000-3a10004a4000 ---p 00000000 00:00 0
3a10004a4000-3a10004b6000 rw-p 00000000 00:00 0
3a10004b6000-3a10004b8000 ---p 00000000 00:00 0
3a10004b8000-3a10004ea000 rw-p 00000000 00:00 0
3a10004ea000-3a10004ec000 ---p 00000000 00:00 0
3a10004ec000-3a10005f4000 rw-p 00000000 00:00 0
3a10005f4000-3a1000601000 ---p 00000000 00:00 0
3a1000601000-3a1000602000 rw-p 00000000 00:00 0
3a1000602000-3a1000604000 ---p 00000000 00:00 0
3a1000604000-3a100062b000 rw-p 00000000 00:00 0
3a100062b000-3a1000801000 ---p 00000000 00:00 0
[...]

I was thinking if you used PTE markers as guards, you could maybe turn
all that into more or less a single VMA?
David Hildenbrand Aug. 22, 2023, 3:33 p.m. UTC | #9
On 22.08.23 17:30, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 5:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>> On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
>>>> Perhaps something else will want that same behaviour in future (it's
>>>> tempting, but difficult to guarantee correctness); for now, it is just
>>>> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
>>>> expecting uffd to add more such exceptional modes in future).
>>>
>>> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
>>> wanted to make it possible to reliably install PTE markers with
>>> madvise() or something like that, which might be nice for allowing
>>> userspace to create guard pages without unnecessary extra VMAs...)
>>
>> I don't know what a userspace API for this would look like, but I have
>> a dream of creating guard VMAs which only live in the maple tree and
>> don't require the allocation of a struct VMA.  Use some magic reserved
>> pointer value like XA_ZERO_ENTRY to represent them ... seems more
>> robust than putting a PTE marker in the page tables?
> 
> Chrome currently uses a lot of VMAs for its heap, which I think are
> basically alternating PROT_NONE and PROT_READ|PROT_WRITE anonymous
> VMAs. Like this:
> 
> [...]
> 3a10002cf000-3a10002d0000 ---p 00000000 00:00 0
> 3a10002d0000-3a10002e6000 rw-p 00000000 00:00 0
> 3a10002e6000-3a10002e8000 ---p 00000000 00:00 0
> 3a10002e8000-3a10002f2000 rw-p 00000000 00:00 0
> 3a10002f2000-3a10002f4000 ---p 00000000 00:00 0
> 3a10002f4000-3a10002fb000 rw-p 00000000 00:00 0
> 3a10002fb000-3a10002fc000 ---p 00000000 00:00 0
> 3a10002fc000-3a1000303000 rw-p 00000000 00:00 0
> 3a1000303000-3a1000304000 ---p 00000000 00:00 0
> 3a1000304000-3a100031b000 rw-p 00000000 00:00 0
> 3a100031b000-3a100031c000 ---p 00000000 00:00 0
> 3a100031c000-3a1000326000 rw-p 00000000 00:00 0
> 3a1000326000-3a1000328000 ---p 00000000 00:00 0
> 3a1000328000-3a100033a000 rw-p 00000000 00:00 0
> 3a100033a000-3a100033c000 ---p 00000000 00:00 0
> 3a100033c000-3a100038b000 rw-p 00000000 00:00 0
> 3a100038b000-3a100038c000 ---p 00000000 00:00 0
> 3a100038c000-3a100039b000 rw-p 00000000 00:00 0
> 3a100039b000-3a100039c000 ---p 00000000 00:00 0
> 3a100039c000-3a10003af000 rw-p 00000000 00:00 0
> 3a10003af000-3a10003b0000 ---p 00000000 00:00 0
> 3a10003b0000-3a10003e8000 rw-p 00000000 00:00 0
> 3a10003e8000-3a1000401000 ---p 00000000 00:00 0
> 3a1000401000-3a1000402000 rw-p 00000000 00:00 0
> 3a1000402000-3a100040c000 ---p 00000000 00:00 0
> 3a100040c000-3a100046f000 rw-p 00000000 00:00 0
> 3a100046f000-3a1000470000 ---p 00000000 00:00 0
> 3a1000470000-3a100047a000 rw-p 00000000 00:00 0
> 3a100047a000-3a100047c000 ---p 00000000 00:00 0
> 3a100047c000-3a1000492000 rw-p 00000000 00:00 0
> 3a1000492000-3a1000494000 ---p 00000000 00:00 0
> 3a1000494000-3a10004a2000 rw-p 00000000 00:00 0
> 3a10004a2000-3a10004a4000 ---p 00000000 00:00 0
> 3a10004a4000-3a10004b6000 rw-p 00000000 00:00 0
> 3a10004b6000-3a10004b8000 ---p 00000000 00:00 0
> 3a10004b8000-3a10004ea000 rw-p 00000000 00:00 0
> 3a10004ea000-3a10004ec000 ---p 00000000 00:00 0
> 3a10004ec000-3a10005f4000 rw-p 00000000 00:00 0
> 3a10005f4000-3a1000601000 ---p 00000000 00:00 0
> 3a1000601000-3a1000602000 rw-p 00000000 00:00 0
> 3a1000602000-3a1000604000 ---p 00000000 00:00 0
> 3a1000604000-3a100062b000 rw-p 00000000 00:00 0
> 3a100062b000-3a1000801000 ---p 00000000 00:00 0
> [...]
> 
> I was thinking if you used PTE markers as guards, you could maybe turn
> all that into more or less a single VMA?

I proposed the topic "A proper API for sparse memory mappings" for the 
bi-weekly MM meeting on September 20, that would also cover exactly that 
use case. :)
Hugh Dickins Aug. 22, 2023, 6:34 p.m. UTC | #10
On Tue, 22 Aug 2023, Peter Xu wrote:
> On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
...
> > > 
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> > 
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know).  It may be a
> > better idea anyway (saving a drop and retake of ptlock).
> > 
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> > 
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
> 
> No strong opinion here, either.  Just one trivial comment/question below on
> the new patch (if that will be preferred)..

I'm going to settle for the original v1 for now (I'll explain why in reply
to Jann next) - which already has the blessing of your Acked-by, thanks.

(Yes, the locking is a bit confusing: but mainly for the unrelated reason,
that with the split locking configs, we never quite know whether this lock
is the same as that lock or not, and so have to be rather careful.)

> > [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
...
> > @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >  				haddr, haddr + HPAGE_PMD_SIZE);
> >  	mmu_notifier_invalidate_range_start(&range);
> >  	notified = true;
> > -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > -	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> > -		goto abort;
> > +	spin_lock(ptl);
> 
> .. here will the ptl always be valid?
> 
> That comes from the previous round of pte_offset_map_lock(), and I assume
> after this whole "thp collapse without write lock" work landed, it has the
> same lifecycle with the *pte pointer, so can be invalid right after the rcu
> read lock released; mmap read lock isn't strong enough to protect the ptl,
> not anymore.
> 
> Maybe it's all fine because the thp collapse path is the solo path(s) that
> will release the pte pgtable page without write mmap lock (so as to release
> the ptl too when doing so), and we at least still hold the page lock, so
> the worst case is the other concurrent "thp collapse" will still serialize
> with this one on the huge page lock. But that doesn't look as solid as
> fetching again the ptl from another pte_offset_map_nolock().  So still just
> raise this question up.  It's possible I just missed something.

It is safe, as you say because of us holding the hpage lock, which stops
any racing callers of collapse_pte_mapped_thp() or retract_page_tables():
and these are the functions which (currently) make the *pmd transition
which pte_offset_map_lock() etc. are being careful to guard against.

[In future we can imagine empty page table removal making that transition
too: and that wouldn't even have any hpage to lock.  Will it rely on
mmap_lock for write? or pmd_lock? probably both, but no need to design
for it now.]

But I agree that it does *look* more questionable in this patch: there was
a reassuring pte_offset_map_lock() there before, and now I rely more on
the assumptions and just use the "previous" ptl (and that's why I chose
to make the !start_pte case a VM_BUG_ON a few lines later).

I expect, with more time spent, I could cast it back into more reassuring
form: but it's all a bit of a con trick - if you look further down (even
before v2 or v1 fixes) to "step 4", there we have "if (ptl != pml)" which
is also relying on the fact that ptl cannot have changed.  And no doubt
that too could be recast into more reassuring-looking form, but it
wouldn't actually be worthwhile.

Thanks for considering these, Peter: I'll recommend v1 to Andrew.

Hugh
Matthew Wilcox Aug. 22, 2023, 6:45 p.m. UTC | #11
On Tue, Aug 22, 2023 at 11:34:19AM -0700, Hugh Dickins wrote:
> (Yes, the locking is a bit confusing: but mainly for the unrelated reason,
> that with the split locking configs, we never quite know whether this lock
> is the same as that lock or not, and so have to be rather careful.)

Is it time to remove the PTE split locking config option?  I believe all
supported architectures have at least two levels of page tables, so if we
have split ptlocks, ptl and pml are always different from each other (it's
just that on two level machines, pmd == pud == p4d == pgd).  With huge
thread counts now being the norm, it's hard to see why anybody would want
to support SMP and !SPLIT_PTE_PTLOCKS.  To quote the documentation ...

  Split page table lock for PTE tables is enabled compile-time if
  CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less or equal to NR_CPUS.
  If split lock is disabled, all tables are guarded by mm->page_table_lock.

You can barely buy a wrist-watch without eight CPUs these days.
Hugh Dickins Aug. 22, 2023, 6:54 p.m. UTC | #12
On Tue, 22 Aug 2023, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
> > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > > it gives any protection against this case itself, but because ptlock
> > > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > > In other cases, continue to minimize the pmd_lock() hold time.
> > >
> > > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > > I also can't find anything other than userfaultfd that would insert
> > > pages into regions that are khugepaged-compatible, so I guess this
> > > works?
> >
> > I'm as sure as I can be that it's solely because userfaultfd breaks
> > the usual rules here (and in fairness, IIRC Andrea did ask my permission
> > before making it behave that way on shmem, COWing without a source page).
> >
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
> 
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I see the mailthread has taken inspiration from your comment there,
and veered off in that direction: but I'll ignore those futures.

> 
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know).  It may be a
> > better idea anyway (saving a drop and retake of ptlock).
> 
> I was thinking it also has the advantage that it would still perform
> okay if we got rid of the userfaultfd_armed() condition at some point
> - though I realize that designing too much for hypothetical future
> features is an antipattern.
> 
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
> 
> I prefer this version because it would make it easier to remove the
> "userfaultfd_armed()" check in the future if we have to, but I guess
> we could also always change it later if that becomes necessary, so I
> don't really have strong feelings on it at this point.

Thanks for considering them both, Jann.  I do think your trylock way,
as in v2, is in principle superior, and we may well have good reason
to switch over to it in future; but I find it slightly more confusing,
so will follow your and Peter's "no strong feelings" for now, and ask
Andrew please to take the original (implicit v1).

Overriding reason: I realized overnight that v2 is not quite correct:
I was clever enough to realize that nr_ptes needed to be reset to 0
to get the accounting right with a recheck pass, but not clever enough
to realize that resetting it to 0 there would likely skip the abort
path's flush_tlb_mm(mm), when we actually had cleared entries on the
first pass.  It needs a separate bool to decide the flush_tlb_mm(mm),
or it needs that (ridiculously minor!) step 3 to be moved down.

But rather than reworking it, please let's just go with v1 for now.

Thanks,
Hugh
Jann Horn Aug. 22, 2023, 7:07 p.m. UTC | #13
On Tue, Aug 22, 2023 at 8:54 PM Hugh Dickins <hughd@google.com> wrote:
> But rather than reworking it, please let's just go with v1 for now.

Sounds good to me.
Hugh Dickins Aug. 22, 2023, 7:10 p.m. UTC | #14
On Tue, 22 Aug 2023, Matthew Wilcox wrote:
> On Tue, Aug 22, 2023 at 11:34:19AM -0700, Hugh Dickins wrote:
> > (Yes, the locking is a bit confusing: but mainly for the unrelated reason,
> > that with the split locking configs, we never quite know whether this lock
> > is the same as that lock or not, and so have to be rather careful.)
> 
> Is it time to remove the PTE split locking config option?  I believe all
> supported architectures have at least two levels of page tables, so if we
> have split ptlocks, ptl and pml are always different from each other (it's
> just that on two level machines, pmd == pud == p4d == pgd).  With huge
> thread counts now being the norm, it's hard to see why anybody would want
> to support SMP and !SPLIT_PTE_PTLOCKS.  To quote the documentation ...
> 
>   Split page table lock for PTE tables is enabled compile-time if
>   CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less or equal to NR_CPUS.
>   If split lock is disabled, all tables are guarded by mm->page_table_lock.
> 
> You can barely buy a wrist-watch without eight CPUs these days.

Whilst I'm still happy with my 0-CPU wrist-watch, I do think you're right:
that SPLIT_PTLOCK_CPUS business was really just a safety-valve for when
introducing split ptlock in the first place, 4 pulled out of a hat, and
the unsplit ptlock path quite under-tested.

But I'll leave it to someone else do the job of removing it whenever.

Hugh
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..d5650541083a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,25 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs pml taken now.
+	 */
+	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
+		pml = pmd_lock(mm, pmd);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
+	if (!pml)
+		spin_lock(ptl);
+	else if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,7 +1624,9 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
+	if (!pml)
+		spin_unlock(ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
@@ -1616,12 +1634,12 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove page table */
-
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lock(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1666,8 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage: