[v17,10/13] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

Message ID 20191117011713.13032-11-cyphar@cyphar.com
State New
Series open: introduce openat2(2) syscall

Commit Message

Aleksa Sarai Nov. 17, 2019, 1:17 a.m. UTC
Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(with LOOKUP_BENEATH the resolution will still fail if ".." would take
the path outside of the root, while LOOKUP_IN_ROOT will scope it
chroot(2)-style). Magic-link jumps are still disallowed entirely[*].

As Jann explains[1,2], the need for this patch (and the original
no-".." restriction) stems from a fairly easy-to-exploit race condition
with chroot(2) (and thus, by extension, with LOOKUP_IN_ROOT and
LOOKUP_BENEATH if ".." is allowed), where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

The primary reason for deferring to userspace with -EAGAIN is that an
in-kernel retry loop (or doing a path_is_under() check after re-taking
the relevant seqlocks) can become unreasonably expensive on machines
with lots of VFS activity (nfsd can cause lots of rename_lock updates).
Thus it should be up to userspace to decide how many times to retry the
lookup -- the selftests for this attack indicate that there is a ~35%
chance of the lookup succeeding on the first try even with an attacker
thrashing rename_lock.
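
For illustration, a (hypothetical) userspace helper that retries on
-EAGAIN might look roughly like the following sketch -- there is no libc
wrapper for openat2(2) yet, so this assumes a raw syscall(2) invocation
and the struct open_how / RESOLVE_IN_ROOT / __NR_openat2 definitions
from this series:

  /* Needs <fcntl.h>, <errno.h>, <unistd.h>, <sys/syscall.h>. */
  int open_in_root(int dirfd, const char *path, unsigned int max_retries)
  {
      struct open_how how = {
          .flags   = O_PATH,
          .resolve = RESOLVE_IN_ROOT,
      };

      for (unsigned int i = 0; i < max_retries; i++) {
          int fd = syscall(__NR_openat2, dirfd, path, &how, sizeof(how));
          if (fd >= 0 || errno != EAGAIN)
              return fd;
          /* A racing rename(2) or mount was detected -- try again. */
      }
      errno = EAGAIN;
      return -1;
  }

Given the ~35% first-try success rate mentioned above, a small retry
budget should be sufficient in practice even against an active attacker.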

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check for
    magic-links after they are resolved. However this seems unlikely to
    be a feature that people *really* need -- it can be added later if
    it turns out a lot of people want it.

[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

Comments

Aleksa Sarai Nov. 25, 2019, 1:21 p.m. UTC | #1
On 2019-11-25, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Nov 17, 2019 at 12:17:10PM +1100, Aleksa Sarai wrote:
> > +		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> > +			/*
> > +			 * If there was a racing rename or mount along our
> > +			 * path, then we can't be sure that ".." hasn't jumped
> > +			 * above nd->root (and so userspace should retry or use
> > +			 * some fallback).
> > +			 */
> > +			if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
> > +				return -EAGAIN;
> > +			if (unlikely(read_seqretry(&rename_lock, nd->r_seq)))
> > +				return -EAGAIN;
> > +		}
> 
> Looks like excessive barriers to me - it's
> 	rmb
> 	check mount_lock.sequence
> 	rmb
> 	check rename_lock.sequence

If you like, I can switch this to

	smp_rmb();
	if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq)))
		return -EAGAIN;
	if (unlikely(__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq)))
		return -EAGAIN;

Though I think it makes it more noisy (and this code-path will only be
hit for ".." and LOOKUP_IS_SCOPED).

> > @@ -2266,6 +2274,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  	nd->last_type = LAST_ROOT; /* if there are only slashes... */
> >  	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> >  	nd->depth = 0;
> > +
> > +	nd->m_seq = read_seqbegin(&mount_lock);
> > +	nd->r_seq = read_seqbegin(&rename_lock);
> 
> Same here, pretty much (fetch/rmb/fetch/rmb)

Unless I'm mistaken, wouldn't we have to do
seqcount_lockdep_reader_access() explicitly -- so it would end up
looking something like:

	seqcount_lockdep_reader_access(&mount_lock.seqcount);
	nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
	seqcount_lockdep_reader_access(&rename_lock.seqcount);
	nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
	smp_rmb();

Given this code only runs once at the start of each lookup, I'm not sure
it makes much sense to expand it like that and make it look uglier.

If you really want to avoid the duplicate memory barriers in the common
case, I could instead gate the rename_lock grab behind LOOKUP_IS_SCOPED
(since that's the only time it's used).
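
Something like this (completely untested sketch) in path_init(), keeping
read_seqbegin() so we don't lose the lockdep annotations:

	nd->m_seq = read_seqbegin(&mount_lock);
	/* rename_lock is only checked for ".." under LOOKUP_IS_SCOPED,
	 * so nd->r_seq can stay unset for ordinary lookups. */
	if (unlikely(flags & LOOKUP_IS_SCOPED))
		nd->r_seq = read_seqbegin(&rename_lock);

That way the common case only pays for the mount_lock fetch+rmb, and
rename_lock is only sampled for scoped lookups.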
Aleksa Sarai Nov. 28, 2019, 10:10 a.m. UTC | #2
On 2019-11-26, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-11-25, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Sun, Nov 17, 2019 at 12:17:10PM +1100, Aleksa Sarai wrote:
> > > +		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> > > +			/*
> > > +			 * If there was a racing rename or mount along our
> > > +			 * path, then we can't be sure that ".." hasn't jumped
> > > +			 * above nd->root (and so userspace should retry or use
> > > +			 * some fallback).
> > > +			 */
> > > +			if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
> > > +				return -EAGAIN;
> > > +			if (unlikely(read_seqretry(&rename_lock, nd->r_seq)))
> > > +				return -EAGAIN;
> > > +		}
> > 
> > Looks like excessive barriers to me - it's
> > 	rmb
> > 	check mount_lock.sequence
> > 	rmb
> > 	check rename_lock.sequence
> 
> If you like, I can switch this to
> 
> 	smp_rmb();
> 	if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq)))
> 		return -EAGAIN;
> 	if (unlikely(__read_seqcount_retry(&rename_lock.seqcount, nd->r_seq)))
> 		return -EAGAIN;
> 
> Though I think it makes it more noisy (and this code-path will only be
> hit for ".." and LOOKUP_IS_SCOPED).
> 
> > > @@ -2266,6 +2274,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  	nd->last_type = LAST_ROOT; /* if there are only slashes... */
> > >  	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
> > >  	nd->depth = 0;
> > > +
> > > +	nd->m_seq = read_seqbegin(&mount_lock);
> > > +	nd->r_seq = read_seqbegin(&rename_lock);
> > 
> > Same here, pretty much (fetch/rmb/fetch/rmb)
> 
> Unless I'm mistaken, wouldn't we have to do
> seqcount_lockdep_reader_access() explicitly -- so it would end up
> looking something like:
> 
> 	seqcount_lockdep_reader_access(&mount_lock.seqcount);
> 	nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
> 	seqcount_lockdep_reader_access(&rename_lock.seqcount);
> 	nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
> 	smp_rmb();

Actually, looking at it again (unless I'm mistaken) the following should be
acceptable and it also avoids the extra fetch+rmb of mount_lock for
LOOKUP_ROOT. The only downside is that we don't get lockdep information
but path_init() already ignores lockdep when grabbing d_seq.

I will include the following in v18, but let me know if I'm missing
something obvious:

 >>	nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
 >>	nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
 >>	smp_rmb();

	if (flags & LOOKUP_ROOT) {
		struct dentry *root = nd->root.dentry;
		struct inode *inode = root->d_inode;
		if (*s && unlikely(!d_can_lookup(root)))
			return ERR_PTR(-ENOTDIR);
		nd->path = nd->root;
		nd->inode = inode;
		if (flags & LOOKUP_RCU) {
 >>			nd->seq = raw_read_seqcount_begin(&nd->path.dentry->d_seq);
			nd->root_seq = nd->seq;
		} else {
			path_get(&nd->path);
		}
		return s;
	}

I could also move the smp_rmb() to after LOOKUP_ROOT (and add an
smp_rmb() at the end of LOOKUP_ROOT) which would avoid a double-rmb for
LOOKUP_ROOT -- but it makes it harder to read IMHO.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index a6196786db13..88c706e459f6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@  struct nameidata {
 	struct path	root;
 	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
-	unsigned	seq, m_seq;
+	unsigned	seq, m_seq, r_seq;
 	int		last_type;
 	unsigned	depth;
 	int		total_link_count;
@@ -1781,22 +1781,30 @@  static inline int handle_dots(struct nameidata *nd, int type)
 	if (type == LAST_DOTDOT) {
 		int error = 0;
 
-		/*
-		 * Scoped-lookup flags resolving ".." is not currently safe --
-		 * races can cause our parent to have moved outside of the root
-		 * and us to skip over it.
-		 */
-		if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-			return -EXDEV;
 		if (!nd->root.mnt) {
 			error = set_root(nd);
 			if (error)
 				return error;
 		}
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+			/*
+			 * If there was a racing rename or mount along our
+			 * path, then we can't be sure that ".." hasn't jumped
+			 * above nd->root (and so userspace should retry or use
+			 * some fallback).
+			 */
+			if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
+				return -EAGAIN;
+			if (unlikely(read_seqretry(&rename_lock, nd->r_seq)))
+				return -EAGAIN;
+		}
 	}
 	return 0;
 }
@@ -2266,6 +2274,10 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2287,7 +2299,6 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
 
 	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
 	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {