[1/2] mmap locking API: Order lock of nascent mm outside lock of live mm

Message ID CAG48ez1kMuPUW8VKp=9=KDLVisa-zuqp+DbYjc=A-kGUi_ik3A@mail.gmail.com
State Not Applicable
Series Broad write-locking of nascent mm in execve

Commit Message

Jann Horn Oct. 2, 2020, 1:24 a.m. UTC
Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
of the old mm (in dup_mmap() and in UML's activate_mm()).
A following patch will change the exec path to very broadly lock the
nascent mm, but fine-grained locking should still work at the same time for
the new mm.
To do this in a way that lockdep is happy about, let's turn around the lock
ordering in both places that currently nest the locks.
Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
instead.

The added locking calls in exec_mmap() are temporary; the following patch
will move the locking out of exec_mmap().

Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/um/include/asm/mmu_context.h |  3 +--
 fs/exec.c                         |  4 ++++
 include/linux/mmap_lock.h         | 23 +++++++++++++++++++++--
 kernel/fork.c                     |  7 ++-----
 4 files changed, 28 insertions(+), 9 deletions(-)
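The ordering change described in the commit message can be illustrated with a small userspace model. The sketch below is not kernel code and is not how lockdep is implemented; `order_checker`, `checker_acquire()`, `checker_release()` and the `SUBCLASS_*` constants are hypothetical names invented for illustration. It only records which subclass was taken while another was held, and rejects a later acquisition that inverts a recorded order, which is roughly the kind of complaint one would expect from lockdep if a nascent mm were locked inside a live mm after this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the two mmap_lock subclasses in the patch. */
enum {
	SUBCLASS_NORMAL = 0,
	SUBCLASS_NASCENT,
	NR_SUBCLASSES
};

struct order_checker {
	bool held[NR_SUBCLASSES];                /* subclasses currently held */
	bool seen[NR_SUBCLASSES][NR_SUBCLASSES]; /* seen[a][b]: b was taken while a was held */
};

static void checker_init(struct order_checker *c)
{
	memset(c, 0, sizeof(*c));
}

/*
 * Returns true if the acquisition is consistent with every order recorded
 * so far, false if it recurses on a held subclass or inverts a known order.
 * A rejected acquisition leaves the checker state unchanged.
 */
static bool checker_acquire(struct order_checker *c, int sub)
{
	assert(sub >= 0 && sub < NR_SUBCLASSES);

	for (int h = 0; h < NR_SUBCLASSES; h++) {
		if (c->held[h] && (h == sub || c->seen[sub][h]))
			return false;
	}
	/* Record "held subclass -> newly taken subclass" dependencies. */
	for (int h = 0; h < NR_SUBCLASSES; h++) {
		if (c->held[h])
			c->seen[h][sub] = true;
	}
	c->held[sub] = true;
	return true;
}

static void checker_release(struct order_checker *c, int sub)
{
	assert(sub >= 0 && sub < NR_SUBCLASSES);
	c->held[sub] = false;
}
```

In this model, taking `SUBCLASS_NASCENT` and then `SUBCLASS_NORMAL` is accepted. Once that order has been recorded, taking `SUBCLASS_NORMAL` first and `SUBCLASS_NASCENT` second is rejected.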

Comments

Michel Lespinasse Oct. 2, 2020, 9:17 a.m. UTC | #1
On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> of the old mm (in dup_mmap() and in UML's activate_mm()).
> A following patch will change the exec path to very broadly lock the
> nascent mm, but fine-grained locking should still work at the same time for
> the new mm.
> To do this in a way that lockdep is happy about, let's turn around the lock
> ordering in both places that currently nest the locks.
> Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> instead.
>
> The added locking calls in exec_mmap() are temporary; the following patch
> will move the locking out of exec_mmap().

Thanks for doing this.

This is probably a silly question, but I am not sure exactly where we
lock the old MM while bprm is creating the new MM ? I am guessing this
would be only in setup_arg_pages(), copying the args and environment
from the old to the new MM ? If that is correct, then wouldn't it be
sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
the issue that we'd prefer to have a killable version of it there ?

Also FYI I was going to play with these patches a bit to help answer
these questions on my own, but wasn't able to easily apply them as
they came lightly mangled (whitespace issues) when I saved them.

Jason Gunthorpe Oct. 2, 2020, 11:39 a.m. UTC | #2
On Fri, Oct 02, 2020 at 02:17:49AM -0700, Michel Lespinasse wrote:
> Also FYI I was going to play with these patches a bit to help answer
> these questions on my own, but wasn't able to easily apply them as
> they came lightly mangled (whitespace issues) when I saved them.

Me too

It seems OK, you've created sort of a SINGLE_DEPTH_NESTING but in
reverse - instead of marking the child of the nest it marks the
parent.

It would be nice to add a note in the commit message where the nesting
happens on this path.

Thanks,
Jason

Jann Horn Oct. 2, 2020, 4:33 p.m. UTC | #3
On Fri, Oct 2, 2020 at 11:18 AM Michel Lespinasse <walken@google.com> wrote:
> On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> > Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > A following patch will change the exec path to very broadly lock the
> > nascent mm, but fine-grained locking should still work at the same time for
> > the new mm.
> > To do this in a way that lockdep is happy about, let's turn around the lock
> > ordering in both places that currently nest the locks.
> > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > instead.
> >
> > The added locking calls in exec_mmap() are temporary; the following patch
> > will move the locking out of exec_mmap().
>
> Thanks for doing this.
>
> This is probably a silly question, but I am not sure exactly where we
> lock the old MM while bprm is creating the new MM ? I am guessing this
> would be only in setup_arg_pages(), copying the args and environment
> from the old to the new MM ? If that is correct, then wouldn't it be
> sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
> the issue that we'd prefer to have a killable version of it there ?

We're also implicitly locking the old MM anytime we take page faults
before exec_mmap(), which basically means the various userspace memory
accesses in do_execveat_common(). This happens after bprm_mm_init(),
so we've already set bprm->vma at that point.

> Also FYI I was going to play with these patches a bit to help answer
> these questions on my own, but wasn't able to easily apply them as
> they came lightly mangled (whitespace issues) when I saved them.

Uuugh, dammit, I see what happened. Sorry about the trouble. Thanks
for telling me, guess I'll go back to sending patches the way I did it
before. :/

I guess I'll go make a v2 of this with some extra comment about where
the old MM is accessed, as Jason suggested, and without the whitespace
issues?

Michel Lespinasse Oct. 3, 2020, 9:30 p.m. UTC | #4
On Fri, Oct 2, 2020 at 9:33 AM Jann Horn <jannh@google.com> wrote:
> On Fri, Oct 2, 2020 at 11:18 AM Michel Lespinasse <walken@google.com> wrote:
> > On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> > > Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> > > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > > A following patch will change the exec path to very broadly lock the
> > > nascent mm, but fine-grained locking should still work at the same time for
> > > the new mm.
> > > To do this in a way that lockdep is happy about, let's turn around the lock
> > > ordering in both places that currently nest the locks.
> > > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > > instead.
> > >
> > > The added locking calls in exec_mmap() are temporary; the following patch
> > > will move the locking out of exec_mmap().
> >
> > Thanks for doing this.
> >
> > This is probably a silly question, but I am not sure exactly where we
> > lock the old MM while bprm is creating the new MM ? I am guessing this
> > would be only in setup_arg_pages(), copying the args and environment
> > from the old to the new MM ? If that is correct, then wouldn't it be
> > sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
> > the issue that we'd prefer to have a killable version of it there ?
>
> We're also implicitly locking the old MM anytime we take page faults
> before exec_mmap(), which basically means the various userspace memory
> accesses in do_execveat_common(). This happens after bprm_mm_init(),
> so we've already set bprm->vma at that point.

Ah yes, I see the issue now. It would be much nicer if copy_strings
could coax copy_from_user into taking a nested lock, but of course
there is no way to do that.

I'm not sure if it'd be reasonable to kmap the source pages like we do
for the destination pages ?

Adding a nascent lock instead of a nested lock, as you propose, seems
to work, but it also looks quite unusual. Not that I have anything
better to propose at this point though...


Unrelated to the above: copy_from_user and copy_to_user should not be
called with mmap_lock held; it may be worth adding these assertions
too (probably in separate patches) ?


> Uuugh, dammit, I see what happened. Sorry about the trouble. Thanks
> for telling me, guess I'll go back to sending patches the way I did it
> before. :/

Yeah, I've hit such issues with gmail before too :/

Jann Horn Oct. 5, 2020, 1:30 a.m. UTC | #5
On Sat, Oct 3, 2020 at 11:30 PM Michel Lespinasse <walken@google.com> wrote:
> Unrelated to the above: copy_from_user and copy_to_user should not be
> called with mmap_lock held; it may be worth adding these assertions
> too (probably in separate patches) ?

We already have that: All (hopefully?) the userspace accessors call
might_fault(), and that does might_lock_read(&current->mm->mmap_lock)
(if we're not running in a lazytlb kernel thread or KERNEL_DS is on or
we're in IRQ context or page faults have explicitly been disabled).
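The conditions listed above can be written down as a small predicate. This is only a userspace sketch of the description in this message, not the kernel's might_fault() implementation; `struct fault_ctx`, its fields, and `annotates_mmap_lock_read()` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the context a userspace accessor runs in. */
struct fault_ctx {
	bool has_mm;             /* false for a lazytlb kernel thread */
	bool kernel_ds;          /* KERNEL_DS is on */
	bool in_irq;             /* running in IRQ context */
	bool pagefault_disabled; /* page faults explicitly disabled */
};

/*
 * Returns true when, per the description above, the accessor would tell
 * lockdep that it may take the mmap_lock for reading.
 */
static bool annotates_mmap_lock_read(const struct fault_ctx *ctx)
{
	assert(ctx != NULL);

	if (!ctx->has_mm)
		return false;
	if (ctx->kernel_ds)
		return false;
	if (ctx->in_irq)
		return false;
	if (ctx->pagefault_disabled)
		return false;
	return true;
}
```

Under this model an ordinary syscall path (has an mm, none of the exclusions set) is annotated, so a userspace access made while already holding the mmap_lock would be flagged.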


But another place where lockdep asserts should be added is find_vma();
there are currently several architectures that sometimes improperly
call that with no lock held:

SPARC's arch_validate_prot():
https://lore.kernel.org/linux-mm/CAG48ez3YsfTfOFKa-Po58e4PNp7FK54MFbkK3aUPSRt3LWtxQA@mail.gmail.com/

nios2 sys_cacheflush():
https://lore.kernel.org/linux-mm/CAG48ez3hxeXU29UGWRH-gRXX2jb5Lc==npbXFt8UDrWO4eHZdQ@mail.gmail.com/

nds32 sys_cacheflush():
https://lore.kernel.org/linux-mm/CAG48ez1UnQEMok9rqFQC4XHBaMmBe=eaedu8Z_RXdjFHTna_LA@mail.gmail.com/

Jason Gunthorpe Oct. 5, 2020, 12:52 p.m. UTC | #6
On Mon, Oct 05, 2020 at 03:30:43AM +0200, Jann Horn wrote:
> But another place where lockdep asserts should be added is find_vma();
> there are currently several architectures that sometimes improperly
> call that with no lock held:

Yes, I've seen several cases of this mis-use in drivers too

Jason

Patch

diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 17ddd4edf875..c13bc5150607 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -48,9 +48,8 @@  static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 	 * when the new ->mm is used for the first time.
 	 */
 	__switch_mm(&new->context.id);
-	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
+	mmap_assert_write_locked(new);
 	uml_setup_stubs(new);
-	mmap_write_unlock(new);
 }

 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..229dbc7aa61a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1114,6 +1114,8 @@  static int exec_mmap(struct mm_struct *mm)
 	if (ret)
 		return ret;

+	mmap_write_lock_nascent(mm);
+
 	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
@@ -1125,6 +1127,7 @@  static int exec_mmap(struct mm_struct *mm)
 		if (unlikely(old_mm->core_state)) {
 			mmap_read_unlock(old_mm);
 			mutex_unlock(&tsk->signal->exec_update_mutex);
+			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	}
@@ -1138,6 +1141,7 @@  static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
+	mmap_write_unlock(mm);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0707671851a8..24de1fe99ee4 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -3,6 +3,18 @@ 

 #include <linux/mmdebug.h>

+/*
+ * Lock subclasses for the mmap_lock.
+ *
+ * MMAP_LOCK_SUBCLASS_NASCENT is for core kernel code that wants to lock an mm
+ * that is still being constructed and wants to be able to access the active mm
+ * normally at the same time. It nests outside MMAP_LOCK_SUBCLASS_NORMAL.
+ */
+enum {
+	MMAP_LOCK_SUBCLASS_NORMAL = 0,
+	MMAP_LOCK_SUBCLASS_NASCENT
+};
+
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),

@@ -16,9 +28,16 @@  static inline void mmap_write_lock(struct mm_struct *mm)
 	down_write(&mm->mmap_lock);
 }

-static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
+/*
+ * Lock an mm_struct that is still being set up (during fork or exec).
+ * This nests outside the mmap locks of live mm_struct instances.
+ * No interruptible/killable versions exist because at the points where you're
+ * supposed to use this helper, the mm isn't visible to anything else, so we
+ * expect the mmap_lock to be uncontended.
+ */
+static inline void mmap_write_lock_nascent(struct mm_struct *mm)
 {
-	down_write_nested(&mm->mmap_lock, subclass);
+	down_write_nested(&mm->mmap_lock, MMAP_LOCK_SUBCLASS_NASCENT);
 }

 static inline int mmap_write_lock_killable(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..db67eb4ac7bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	unsigned long charge;
 	LIST_HEAD(uf);

+	mmap_write_lock_nascent(mm);
 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
 		retval = -EINTR;
@@ -481,10 +482,6 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	}
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
-	/*
-	 * Not linked in yet - no deadlock potential:
-	 */
-	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);

 	/* No ordering required: file already has been exposed. */
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
@@ -600,12 +597,12 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
 out:
-	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
 	mmap_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
+	mmap_write_unlock(mm);
 	return retval;
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
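
The control flow that the dup_mmap() hunks above aim for (lock the nascent mm first, take the old mm's lock killably, and release the nascent mm last on both the success and the failure path) can be sketched in userspace as follows. This is a toy model under stated assumptions, not the kernel function: `struct toy_mm`, `struct trace`, `emit()`, `toy_dup_mmap()` and `TOY_EINTR` are hypothetical, and the `killed` flag stands in for mmap_write_lock_killable() failing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_EINTR 4 /* hypothetical stand-in for EINTR */

struct toy_mm {
	bool write_locked;
};

/* Records lock/unlock events so the resulting order can be inspected. */
struct trace {
	const char *ev[8];
	size_t n;
};

static void emit(struct trace *t, const char *event)
{
	assert(t != NULL);
	if (t->n < sizeof(t->ev) / sizeof(t->ev[0]))
		t->ev[t->n++] = event;
}

static int toy_dup_mmap(struct toy_mm *newmm, struct toy_mm *oldmm,
			bool killed, struct trace *t)
{
	int retval = 0;

	/* Outer lock: the nascent mm, taken before the live mm. */
	newmm->write_locked = true;
	emit(t, "lock new");

	/* Inner lock: the live mm; 'killed' models a fatal signal. */
	if (killed) {
		retval = -TOY_EINTR;
		goto fail;
	}
	oldmm->write_locked = true;
	emit(t, "lock old");

	/* ... the VMA copying would happen here ... */

	oldmm->write_locked = false;
	emit(t, "unlock old");
fail:
	/* Both paths drop the nascent mm's lock last. */
	newmm->write_locked = false;
	emit(t, "unlock new");
	return retval;
}
```

On the success path the recorded events are "lock new", "lock old", "unlock old", "unlock new"; when `killed` is set only "lock new" and "unlock new" are recorded and the function returns `-TOY_EINTR`.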