diff mbox series

[v5,07/22] mm: Protect VMA modifications using VMA sequence count

Message ID 1507729966-10660-8-git-send-email-ldufour@linux.vnet.ibm.com (mailing list archive)
State Not Applicable
Headers show
Series Speculative page faults | expand

Commit Message

Laurent Dufour Oct. 11, 2017, 1:52 p.m. UTC
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modification done in :
	- madvise()
	- mremap()
	- mpol_rebind_policy()
	- vma_replace_policy()
	- change_prot_numa()
	- mlock(), munlock()
	- mprotect()
	- mmap_region()
	- collapse_huge_page()
	- userfaultd registering services

In addition, VMA fields which will be read during the speculative fault
path needs to be written using WRITE_ONCE to prevent write to be split
and intermediate values to be pushed to other CPUs.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  5 ++++-
 fs/userfaultfd.c   | 17 +++++++++++++----
 mm/khugepaged.c    |  3 +++
 mm/madvise.c       |  6 +++++-
 mm/mempolicy.c     | 51 ++++++++++++++++++++++++++++++++++-----------------
 mm/mlock.c         | 13 ++++++++-----
 mm/mmap.c          | 17 ++++++++++-------
 mm/mprotect.c      |  4 +++-
 mm/mremap.c        |  6 ++++++
 9 files changed, 86 insertions(+), 36 deletions(-)

Comments

Andrea Arcangeli Oct. 26, 2017, 10:18 a.m. UTC | #1
Hello Laurent,

Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows
significant slowdown even for brk/malloc ops both single and
multi threaded.

The single threaded case I think is the most important because it has
zero chance of getting back any benefit later during page faults.

Could you check if:

1. it's possible change vm_write_begin to be a noop if mm->mm_count is
   <= 1? Hint: clone() will run single threaded so there's no way it can run
   in the middle of a being/end critical section (clone could set an
   MMF flag to possibly keep the sequence counter activated if a child
   thread exits and mm_count drops to 1 while the other cpu is in the
   middle of a critical section in the other thread).

2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
   freeing happened only once a MMF flag is set? That will at least
   reduce the risk of temporary memory waste until the next RCU grace
   period. The read of the MMF will scale fine. Of course to allow
   point 1 and 2 then the page fault should also take the mmap_sem
   until the MMF flag is set.

Could you also investigate a much bigger change: I wonder if it's
possible to drop the sequence number entirely from the vma and stop
using sequence numbers entirely (which is likely the source of the
single threaded regression in point 1 that may explain the report in
the above message-id), and just call the vma rbtree lookup once again
and check that everything is still the same in the vma and the PT lock
obtained is still a match to finish the anon page fault and fill the
pte?

Then of course we also need to add a method to the read-write
semaphore so it tells us if there's already one user holding the read
mmap_sem and we're the second one.  If we're the second one (or more
than second) only then we should skip taking the down_read mmap_sem.
Even a multithreaded app won't ever skip taking the mmap_sem until
there's sign of runtime contention, and it won't have to run the way
more expensive sequence number-less revalidation during page faults,
unless we get an immediate scalability payoff because we already know
the mmap_sem is already contended and there are multiple nested
threads in the page fault handler of the same mm.

Perhaps we'd need something more advanced than a
down_read_trylock_if_not_hold() (which has to guaranteed not to write
to any cacheline) and we'll have to count the per-thread exponential
backoff of mmap_sem frequency, but starting with
down_read_trylock_if_not_hold() would be good I think.

This is not how the current patch works, the current patch uses a
sequence number because it pretends to go lockless always and in turn
has to slow down all vma updates fast paths or the revalidation
slowsdown performance for page fault too much (as it always
revalidates).

I think it would be much better to go speculative only when there's
"detected" runtime contention on the mmap_sem with
down_read_trylock_if_not_hold() and that will make the revalidation
cost not an issue to worry about because normally we won't have to
revalidate the vma at all during page fault. In turn by making the
revalidation more expensive by starting a vma rbtree lookup from
scratch, we can drop the sequence number entirely and that should
simplify the patch tremendously because all vm_write_begin/end would
disappear from the patch and in turn the mmap/brk slowdown measured by
the message-id above, should disappear as well.
   
Thanks,
Andrea
Laurent Dufour Nov. 2, 2017, 3:16 p.m. UTC | #2
Hi Andrea,

Thanks for reviewing this series, and sorry for the late answer, I took few
days off...

On 26/10/2017 12:18, Andrea Arcangeli wrote:
> Hello Laurent,
> 
> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows
> significant slowdown even for brk/malloc ops both single and
> multi threaded.
> 
> The single threaded case I think is the most important because it has
> zero chance of getting back any benefit later during page faults.
> 
> Could you check if:
> 
> 1. it's possible change vm_write_begin to be a noop if mm->mm_count is
>    <= 1? Hint: clone() will run single threaded so there's no way it can run
>    in the middle of a being/end critical section (clone could set an
>    MMF flag to possibly keep the sequence counter activated if a child
>    thread exits and mm_count drops to 1 while the other cpu is in the
>    middle of a critical section in the other thread).

This sounds to be a good idea, I'll dig on that.
The major risk here is to have a thread calling vm_*_begin() with
mm->mm_count > 1 and later calling vm_*_end() with mm->mm_count <= 1, but
as you mentioned we should find a way to work around this.

> 
> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
>    freeing happened only once a MMF flag is set? That will at least
>    reduce the risk of temporary memory waste until the next RCU grace
>    period. The read of the MMF will scale fine. Of course to allow
>    point 1 and 2 then the page fault should also take the mmap_sem
>    until the MMF flag is set.
> 

I think we could also deal with the mm->mm_count value here, if there is
only one thread, no need to postpone the VMA's free operation. Isn't it ?
Also, if mm->mm_count <= 1, there is no need to try the speculative path.

> Could you also investigate a much bigger change: I wonder if it's
> possible to drop the sequence number entirely from the vma and stop
> using sequence numbers entirely (which is likely the source of the
> single threaded regression in point 1 that may explain the report in
> the above message-id), and just call the vma rbtree lookup once again
> and check that everything is still the same in the vma and the PT lock
> obtained is still a match to finish the anon page fault and fill the
> pte?

That's an interesting idea. The big deal here would be to detect that the
VMA has been touched in our back, but there are not so much VMA's fields
involved in the speculative path so that sounds reasonable. The other point
is to identify the impact of the vma rbtree lookup, it's also a known
order, but there is the vma_srcu's lock involved.
> 
> Then of course we also need to add a method to the read-write
> semaphore so it tells us if there's already one user holding the read
> mmap_sem and we're the second one.  If we're the second one (or more
> than second) only then we should skip taking the down_read mmap_sem.
> Even a multithreaded app won't ever skip taking the mmap_sem until
> there's sign of runtime contention, and it won't have to run the way
> more expensive sequence number-less revalidation during page faults,
> unless we get an immediate scalability payoff because we already know
> the mmap_sem is already contended and there are multiple nested
> threads in the page fault handler of the same mm.

The problem is that we may have a thread entering the page fault path,
seeing that the mmap_sem is free, grab it and continue processing the page
fault. Then another thread is entering mprotect or any other mm service
which grab the mmap_sem and it will be blocked until the page fault is
done. The idea with the speculative page fault is also to not block the
other thread which may need to grab the mmap_sem.

> 
> Perhaps we'd need something more advanced than a
> down_read_trylock_if_not_hold() (which has to guaranteed not to write
> to any cacheline) and we'll have to count the per-thread exponential
> backoff of mmap_sem frequency, but starting with
> down_read_trylock_if_not_hold() would be good I think.
> 
> This is not how the current patch works, the current patch uses a
> sequence number because it pretends to go lockless always and in turn
> has to slow down all vma updates fast paths or the revalidation
> slowsdown performance for page fault too much (as it always
> revalidates).
> 
> I think it would be much better to go speculative only when there's
> "detected" runtime contention on the mmap_sem with
> down_read_trylock_if_not_hold() and that will make the revalidation
> cost not an issue to worry about because normally we won't have to
> revalidate the vma at all during page fault. In turn by making the
> revalidation more expensive by starting a vma rbtree lookup from
> scratch, we can drop the sequence number entirely and that should
> simplify the patch tremendously because all vm_write_begin/end would
> disappear from the patch and in turn the mmap/brk slowdown measured by
> the message-id above, should disappear as well.

As I mentioned above, I'm not sure about checking the lock contention when
entering the page fault path, checking for the mm->mm_count or a dedicated
mm flags should be enough, but removing the sequence lock would be a very
good simplification. I'll dig further here, and come back soon.

Thanks a lot,
Laurent.
Laurent Dufour Nov. 2, 2017, 5:25 p.m. UTC | #3
On 02/11/2017 16:16, Laurent Dufour wrote:
> Hi Andrea,
> 
> Thanks for reviewing this series, and sorry for the late answer, I took few
> days off...
> 
> On 26/10/2017 12:18, Andrea Arcangeli wrote:
>> Hello Laurent,
>>
>> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com> shows
>> significant slowdown even for brk/malloc ops both single and
>> multi threaded.
>>
>> The single threaded case I think is the most important because it has
>> zero chance of getting back any benefit later during page faults.
>>
>> Could you check if:
>>
>> 1. it's possible change vm_write_begin to be a noop if mm->mm_count is
>>    <= 1? Hint: clone() will run single threaded so there's no way it can run
>>    in the middle of a being/end critical section (clone could set an
>>    MMF flag to possibly keep the sequence counter activated if a child
>>    thread exits and mm_count drops to 1 while the other cpu is in the
>>    middle of a critical section in the other thread).
> 
> This sounds to be a good idea, I'll dig on that.
> The major risk here is to have a thread calling vm_*_begin() with
> mm->mm_count > 1 and later calling vm_*_end() with mm->mm_count <= 1, but
> as you mentioned we should find a way to work around this.
> 
>>
>> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
>>    freeing happened only once a MMF flag is set? That will at least
>>    reduce the risk of temporary memory waste until the next RCU grace
>>    period. The read of the MMF will scale fine. Of course to allow
>>    point 1 and 2 then the page fault should also take the mmap_sem
>>    until the MMF flag is set.
>>
> 
> I think we could also deal with the mm->mm_count value here, if there is
> only one thread, no need to postpone the VMA's free operation. Isn't it ?
> Also, if mm->mm_count <= 1, there is no need to try the speculative path.
> 
>> Could you also investigate a much bigger change: I wonder if it's
>> possible to drop the sequence number entirely from the vma and stop
>> using sequence numbers entirely (which is likely the source of the
>> single threaded regression in point 1 that may explain the report in
>> the above message-id), and just call the vma rbtree lookup once again
>> and check that everything is still the same in the vma and the PT lock
>> obtained is still a match to finish the anon page fault and fill the
>> pte?
> 
> That's an interesting idea. The big deal here would be to detect that the
> VMA has been touched in our back, but there are not so much VMA's fields
> involved in the speculative path so that sounds reasonable. The other point
> is to identify the impact of the vma rbtree lookup, it's also a known
> order, but there is the vma_srcu's lock involved.

I think there is some memory barrier missing when the VMA is modified so
currently the modifications done in the VMA structure may not be written
down at the time the pte is locked. So doing that change will also requires
to call smp_wmb() before locking the page tables. In the current patch this
is ensured by the call to write_seqcount_end().
Doing so will still require to have a memory barrier when touching the VMA.
Not sure we get far better performance compared to the sequence count
change. But I'll give it a try anyway ;)

>>
>> Then of course we also need to add a method to the read-write
>> semaphore so it tells us if there's already one user holding the read
>> mmap_sem and we're the second one.  If we're the second one (or more
>> than second) only then we should skip taking the down_read mmap_sem.
>> Even a multithreaded app won't ever skip taking the mmap_sem until
>> there's sign of runtime contention, and it won't have to run the way
>> more expensive sequence number-less revalidation during page faults,
>> unless we get an immediate scalability payoff because we already know
>> the mmap_sem is already contended and there are multiple nested
>> threads in the page fault handler of the same mm.
> 
> The problem is that we may have a thread entering the page fault path,
> seeing that the mmap_sem is free, grab it and continue processing the page
> fault. Then another thread is entering mprotect or any other mm service
> which grab the mmap_sem and it will be blocked until the page fault is
> done. The idea with the speculative page fault is also to not block the
> other thread which may need to grab the mmap_sem.
> 
>>
>> Perhaps we'd need something more advanced than a
>> down_read_trylock_if_not_hold() (which has to guaranteed not to write
>> to any cacheline) and we'll have to count the per-thread exponential
>> backoff of mmap_sem frequency, but starting with
>> down_read_trylock_if_not_hold() would be good I think.
>>
>> This is not how the current patch works, the current patch uses a
>> sequence number because it pretends to go lockless always and in turn
>> has to slow down all vma updates fast paths or the revalidation
>> slowsdown performance for page fault too much (as it always
>> revalidates).
>>
>> I think it would be much better to go speculative only when there's
>> "detected" runtime contention on the mmap_sem with
>> down_read_trylock_if_not_hold() and that will make the revalidation
>> cost not an issue to worry about because normally we won't have to
>> revalidate the vma at all during page fault. In turn by making the
>> revalidation more expensive by starting a vma rbtree lookup from
>> scratch, we can drop the sequence number entirely and that should
>> simplify the patch tremendously because all vm_write_begin/end would
>> disappear from the patch and in turn the mmap/brk slowdown measured by
>> the message-id above, should disappear as well.
> 
> As I mentioned above, I'm not sure about checking the lock contention when
> entering the page fault path, checking for the mm->mm_count or a dedicated
> mm flags should be enough, but removing the sequence lock would be a very
> good simplification. I'll dig further here, and come back soon.
> 
> Thanks a lot,
> Laurent.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
Andrea Arcangeli Nov. 2, 2017, 8:08 p.m. UTC | #4
On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
> I think there is some memory barrier missing when the VMA is modified so
> currently the modifications done in the VMA structure may not be written
> down at the time the pte is locked. So doing that change will also requires
> to call smp_wmb() before locking the page tables. In the current patch this
> is ensured by the call to write_seqcount_end().
> Doing so will still require to have a memory barrier when touching the VMA.
> Not sure we get far better performance compared to the sequence count
> change. But I'll give it a try anyway ;)

Luckily smp_wmb is a noop on x86. I would suggest to ignore the above
issue completely if you give it a try, and then if this performs, we
can just embed a smp_wmb() before spin_lock() somewhere in
pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
spin_lock isn't a smp_wmb() equivalent. I would focus at flushing
writes before every pagetable spin_lock for non-x86 archs, rather than
after all vma modifications. That should be easier to keep under
control and it's going to be more efficient too as if something there
are fewer spin locks than vma modifications.

For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
looks more self contained than surrounding all vma modifications and
it's a noop on x86 anyway.

I thought about the contention detection logic too yesterday: to
detect contention we could have a mm->mmap_sem_contention_jiffies and
if down_read_trylock_exclusive() [same as down_read_if_not_hold in
prev mail] fails (and it'll fail if either read or write mmap_sem is
hold, so also convering mremap/mprotect etc..) we set
mm->mmap_sem_contention_jiffies = jiffies and then to know if you must
not touch the mmap_sem at all, you compare jiffies against
mmap_sem_contention_jiffies, if it's equal we go speculative. If
that's not enough we can just keep going speculative for a few more
jiffies with time_before(). The srcu lock is non concerning because the
inc/dec of the fast path is in per-cpu cacheline of course, no false
sharing possible there or it wouldn't be any better than a normal lock.

The vma revalidation is already done by khugepaged and mm/userfaultfd,
both need to drop the mmap_sem and continue working on the pagetables,
so we already know it's workable and not too slow.

Summarizing.. by using a runtime contention triggered speculative
design that goes speculative only when contention is runtime-detected
using the above logic (or equivalent), and by having to revalidate the
vma by hand with find_vma without knowing instantly if the vma become
stale, we will run with a substantially slower speculative page fault
than with your current speculative always-on design, but the slower
speculative page fault runtime will still scale 100% in SMP so it
should still be faster on large SMP systems. The pros is that it won't
regress the mmap/brk vma modifications. The whole complexity of
tracking the vma modifications should also go away and the resulting
code should be more maintainable and less risky to break in subtle
ways impossible to reproduce.

Thanks!
Andrea
Laurent Dufour Nov. 6, 2017, 9:47 a.m. UTC | #5
Hi Andrea,

On 02/11/2017 21:08, Andrea Arcangeli wrote:
> On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
>> I think there is some memory barrier missing when the VMA is modified so
>> currently the modifications done in the VMA structure may not be written
>> down at the time the pte is locked. So doing that change will also requires
>> to call smp_wmb() before locking the page tables. In the current patch this
>> is ensured by the call to write_seqcount_end().
>> Doing so will still require to have a memory barrier when touching the VMA.
>> Not sure we get far better performance compared to the sequence count
>> change. But I'll give it a try anyway ;)
> 
> Luckily smp_wmb is a noop on x86. I would suggest to ignore the above
> issue completely if you give it a try, and then if this performs, we
> can just embed a smp_wmb() before spin_lock() somewhere in
> pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
> spin_lock isn't a smp_wmb() equivalent. I would focus at flushing
> writes before every pagetable spin_lock for non-x86 archs, rather than
> after all vma modifications. That should be easier to keep under
> control and it's going to be more efficient too as if something there
> are fewer spin locks than vma modifications.

I do agree that would simplify the patch series a lot.
I'll double check that pte lock is not done in a loop other wise having
smp_wmb() there will be bad.

Another point I'm trying to double check is that we may have inconsistency
while reading the vma's flags in the page fault path until the memory
barrier got it in the VMA's changing path. Especially we may have vm_flags
and vm_page_prot not matching at all, which couldn't happen when checking
for the vm_sequence count.

> 
> For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
> looks more self contained than surrounding all vma modifications and
> it's a noop on x86 anyway.
> 
> I thought about the contention detection logic too yesterday: to
> detect contention we could have a mm->mmap_sem_contention_jiffies and
> if down_read_trylock_exclusive() [same as down_read_if_not_hold in
> prev mail] fails (and it'll fail if either read or write mmap_sem is
> hold, so also convering mremap/mprotect etc..) we set
> mm->mmap_sem_contention_jiffies = jiffies and then to know if you must
> not touch the mmap_sem at all, you compare jiffies against
> mmap_sem_contention_jiffies, if it's equal we go speculative. If
> that's not enough we can just keep going speculative for a few more
> jiffies with time_before(). The srcu lock is non concerning because the
> inc/dec of the fast path is in per-cpu cacheline of course, no false
> sharing possible there or it wouldn't be any better than a normal lock.

I'm sorry, I should have missed something here. I can't see how this would
help fixing the case where a thread is entering the page fault handler
seeing that no one else has the mmap_sem and then grab it. While it is
processing the page fault another thread is entering mprotect for instance
and thus will wait for the mmap_sem to be released by the thread processing
the page fault.

Cheers,
Laurent.

> The vma revalidation is already done by khugepaged and mm/userfaultfd,
> both need to drop the mmap_sem and continue working on the pagetables,
> so we already know it's workable and not too slow.
> 
> Summarizing.. by using a runtime contention triggered speculative
> design that goes speculative only when contention is runtime-detected
> using the above logic (or equivalent), and by having to revalidate the
> vma by hand with find_vma without knowing instantly if the vma become
> stale, we will run with a substantially slower speculative page fault
> than with your current speculative always-on design, but the slower
> speculative page fault runtime will still scale 100% in SMP so it
> should still be faster on large SMP systems. The pros is that it won't
> regress the mmap/brk vma modifications. The whole complexity of
> tracking the vma modifications should also go away and the resulting
> code should be more maintainable and less risky to break in subtle
> ways impossible to reproduce.
> 
> Thanks!
> Andrea
>
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0bf9e423aa99..4fc37f88437c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1155,8 +1155,11 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
+					vm_write_begin(vma);
+					WRITE_ONCE(vma->vm_flags,
+						   vma->vm_flags & ~VM_SOFTDIRTY);
 					vma_set_page_prot(vma);
+					vm_write_end(vma);
 				}
 				downgrade_write(&mm->mmap_sem);
 				break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c713fd5b3e6..426af4fd407c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -640,8 +640,11 @@  int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vm_write_begin(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		WRITE_ONCE(vma->vm_flags,
+			   vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+		vm_write_end(vma);
 		return 0;
 	}
 
@@ -866,8 +869,10 @@  static int userfaultfd_release(struct inode *inode, struct file *file)
 			vma = prev;
 		else
 			prev = vma;
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 	}
 	up_write(&mm->mmap_sem);
 	mmput(mm);
@@ -1425,8 +1430,10 @@  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx.ctx = ctx;
+		vm_write_end(vma);
 
 	skip:
 		prev = vma;
@@ -1583,8 +1590,10 @@  static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 
 	skip:
 		prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c01f177a1120..f723d42140db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	if (mm_find_pmd(mm, address) != pmd)
 		goto out;
 
+	vm_write_begin(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
@@ -1040,6 +1041,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
+		vm_write_end(vma);
 		result = SCAN_FAIL;
 		goto out;
 	}
@@ -1074,6 +1076,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
+	vm_write_end(vma);
 
 	*hpage = NULL;
 
diff --git a/mm/madvise.c b/mm/madvise.c
index acdd39fb1f5d..707e0657b33f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -183,7 +183,9 @@  static long madvise_behavior(struct vm_area_struct *vma,
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 */
-	vma->vm_flags = new_flags;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, new_flags);
+	vm_write_end(vma);
 out:
 	return error;
 }
@@ -451,9 +453,11 @@  static void madvise_free_page_range(struct mmu_gather *tlb,
 		.private = tlb,
 	};
 
+	vm_write_begin(vma);
 	tlb_start_vma(tlb, vma);
 	walk_page_range(addr, end, &free_walk);
 	tlb_end_vma(tlb, vma);
+	vm_write_end(vma);
 }
 
 static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a2af6d58a68f..72645928daa0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -379,8 +379,11 @@  void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		vm_write_begin(vma);
 		mpol_rebind_policy(vma->vm_policy, new);
+		vm_write_end(vma);
+	}
 	up_write(&mm->mmap_sem);
 }
 
@@ -578,9 +581,11 @@  unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
+	vm_write_begin(vma);
 	nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+	vm_write_end(vma);
 
 	return nr_updated;
 }
@@ -681,6 +686,7 @@  static int vma_replace_policy(struct vm_area_struct *vma,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	vm_write_begin(vma);
 	if (vma->vm_ops && vma->vm_ops->set_policy) {
 		err = vma->vm_ops->set_policy(vma, new);
 		if (err)
@@ -688,11 +694,17 @@  static int vma_replace_policy(struct vm_area_struct *vma,
 	}
 
 	old = vma->vm_policy;
-	vma->vm_policy = new; /* protected by mmap_sem */
+	/*
+	 * The speculative page fault handler access this field without
+	 * hodling the mmap_sem.
+	 */
+	WRITE_ONCE(vma->vm_policy,  new);
+	vm_write_end(vma);
 	mpol_put(old);
 
 	return 0;
  err_out:
+	vm_write_end(vma);
 	mpol_put(new);
 	return err;
 }
@@ -1562,23 +1574,28 @@  COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
-	struct mempolicy *pol = NULL;
+	struct mempolicy *pol;
 
-	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
-			pol = vma->vm_ops->get_policy(vma, addr);
-		} else if (vma->vm_policy) {
-			pol = vma->vm_policy;
+	if (!vma)
+		return NULL;
 
-			/*
-			 * shmem_alloc_page() passes MPOL_F_SHARED policy with
-			 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-			 * count on these policies which will be dropped by
-			 * mpol_cond_put() later
-			 */
-			if (mpol_needs_cond_ref(pol))
-				mpol_get(pol);
-		}
+	if (vma->vm_ops && vma->vm_ops->get_policy)
+		return vma->vm_ops->get_policy(vma, addr);
+
+	/*
+	 * This could be called without holding the mmap_sem in the
+	 * speculative page fault handler's path.
+	 */
+	pol = READ_ONCE(vma->vm_policy);
+	if (pol) {
+		/*
+		 * shmem_alloc_page() passes MPOL_F_SHARED policy with
+		 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
+		 * count on these policies which will be dropped by
+		 * mpol_cond_put() later
+		 */
+		if (mpol_needs_cond_ref(pol))
+			mpol_get(pol);
 	}
 
 	return pol;
diff --git a/mm/mlock.c b/mm/mlock.c
index 4d009350893f..83e49f99ad38 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -438,7 +438,9 @@  static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK);
+	vm_write_end(vma);
 
 	while (start < end) {
 		struct page *page;
@@ -562,10 +564,11 @@  static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
-	if (lock)
-		vma->vm_flags = newflags;
-	else
+	if (lock) {
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, newflags);
+		vm_write_end(vma);
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
 out:
diff --git a/mm/mmap.c b/mm/mmap.c
index 0e90b469fd97..e28136cd3275 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -847,17 +847,18 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (start != vma->vm_start) {
-		vma->vm_start = start;
+		WRITE_ONCE(vma->vm_start, start);
 		start_changed = true;
 	}
 	if (end != vma->vm_end) {
-		vma->vm_end = end;
+		WRITE_ONCE(vma->vm_end, end);
 		end_changed = true;
 	}
-	vma->vm_pgoff = pgoff;
+	WRITE_ONCE(vma->vm_pgoff, pgoff);
 	if (adjust_next) {
-		next->vm_start += adjust_next << PAGE_SHIFT;
-		next->vm_pgoff += adjust_next;
+		WRITE_ONCE(next->vm_start,
+			   next->vm_start + (adjust_next << PAGE_SHIFT));
+		WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
 	}
 
 	if (root) {
@@ -1754,6 +1755,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 out:
 	perf_event_mmap(vma);
 
+	vm_write_begin(vma);
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
@@ -1777,6 +1779,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	vm_write_end(vma);
 
 	return addr;
 
@@ -2405,8 +2408,8 @@  int expand_downwards(struct vm_area_struct *vma,
 					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
-				vma->vm_start = address;
-				vma->vm_pgoff -= grow;
+				WRITE_ONCE(vma->vm_start, address);
+				WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
 				anon_vma_interval_tree_post_update_vma(vma);
 				vma_gap_update(vma);
 				spin_unlock(&mm->page_table_lock);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d3e2f082290..ce93c4f6af70 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -358,7 +358,8 @@  mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
-	vma->vm_flags = newflags;
+	vm_write_begin(vma);
+	WRITE_ONCE(vma->vm_flags, newflags);
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
 
@@ -373,6 +374,7 @@  mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			(newflags & VM_WRITE)) {
 		populate_vma_page_range(vma, start, end, NULL);
 	}
+	vm_write_end(vma);
 
 	vm_stat_account(mm, oldflags, -nrpages);
 	vm_stat_account(mm, newflags, nrpages);
diff --git a/mm/mremap.c b/mm/mremap.c
index cfec004c4ff9..ca0b5cb6ed4d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -301,6 +301,9 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	vm_write_begin(vma);
+	vm_write_begin_nested(new_vma, SINGLE_DEPTH_NESTING);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -317,6 +320,7 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		vm_write_end(vma);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -325,7 +329,9 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		vm_write_end(vma);
 	}
+	vm_write_end(new_vma);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {