clone3: validate stack arguments
diff mbox series

Message ID 20191031113608.20713-1-christian.brauner@ubuntu.com
State New
Headers show
Series
  • clone3: validate stack arguments
Related show

Commit Message

Christian Brauner Oct. 31, 2019, 11:36 a.m. UTC
Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.

Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:

 #define __STACK_SIZE (8 * 1024 * 1024)
 pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
 {
         pid_t ret;
         void *stack;

         stack = malloc(__STACK_SIZE);
         if (!stack)
                 return -ENOMEM;

 #ifdef __ia64__
         ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #elif defined(__parisc__) /* stack grows up */
         ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
 #else
         ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #endif
         return ret;
 }

or even crazier variants such as [3].

With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.

/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().

[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd316 ("fork: add clone3")
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha@sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/uapi/linux/sched.h |  4 ++++
 kernel/fork.c              | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

Comments

David Laight Oct. 31, 2019, 11:41 a.m. UTC | #1
From: Christian Brauner
> Sent: 31 October 2019 11:36
...
> /* Intentional user visible API change */
> clone3() was released with 5.3. Currently, it is not documented and very
> unclear to userspace how the stack and stack_size argument have to be
> passed. After talking to glibc folks we concluded that trying to change
> clone3() to setup the stack instead of requiring userspace to do this is
> the right course of action.

What happens if someone 'accidentally' runs a program compiled for
the new API on a system running the existing 5.3 kernel?

While it won't work, it needs to die reasonably gracefully.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Aleksa Sarai Oct. 31, 2019, 1:59 p.m. UTC | #2
On 2019-10-31, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> Validate the stack arguments and setup the stack depening on whether or not
> it is growing down or up.
> 
> Legacy clone() required userspace to know in which direction the stack is
> growing and pass down the stack pointer appropriately. To make things more
> confusing microblaze uses a variant of the clone() syscall selected by
> CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
> IA64 has a separate clone2() syscall which also takes an additional
> stack_size argument. Finally, parisc has a stack that is growing upwards.
> Userspace therefore has a lot nasty code like the following:
> 
>  #define __STACK_SIZE (8 * 1024 * 1024)
>  pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
>  {
>          pid_t ret;
>          void *stack;
> 
>          stack = malloc(__STACK_SIZE);
>          if (!stack)
>                  return -ENOMEM;
> 
>  #ifdef __ia64__
>          ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
>  #elif defined(__parisc__) /* stack grows up */
>          ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
>  #else
>          ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
>  #endif
>          return ret;
>  }
> 
> or even crazier variants such as [3].
> 
> With clone3() we have the ability to validate the stack. We can check that
> when stack_size is passed, the stack pointer is valid and the other way
> around. We can also check that the memory area userspace gave us is fine to
> use via access_ok(). Furthermore, we probably should not require
> userspace to know in which direction the stack is growing. It is easy
> for us to do this in the kernel and I couldn't find the original
> reasoning behind exposing this detail to userspace.
> 
> /* Intentional user visible API change */
> clone3() was released with 5.3. Currently, it is not documented and very
> unclear to userspace how the stack and stack_size argument have to be
> passed. After talking to glibc folks we concluded that trying to change
> clone3() to setup the stack instead of requiring userspace to do this is
> the right course of action.
> Note, that this is an explicit change in user visible behavior we introduce
> with this patch. If it breaks someone's use-case we will revert! (And then
> e.g. place the new behavior under an appropriate flag.)
> Breaking someone's use-case is very unlikely though. First, neither glibc
> nor musl currently expose a wrapper for clone3(). Second, there is no real
> motivation for anyone to use clone3() directly since it does not provide
> features that legacy clone doesn't. New features for clone3() will first
> happen in v5.5 which is why v5.4 is still a good time to try and make that
> change now and backport it to v5.3. Searches on [4] did not reveal any
> packages calling clone3().
> 
> [1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
> [2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
> [3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
> [4]: https://codesearch.debian.net
> Fixes: 7f192e3cd316 ("fork: add clone3")
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: linux-api@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: <stable@vger.kernel.org> # 5.3
> Cc: GNU C Library <libc-alpha@sourceware.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Looks reasonable, and it'll be lovely to not have to worry about stack
ordering anymore. As for the UAPI breakage, I agree that nobody is using
clone3() yet and so we still have a bit of lee-way to fix this
behaviour.

Acked-by: Aleksa Sarai <cyphar@cyphar.com>

> ---
>  include/uapi/linux/sched.h |  4 ++++
>  kernel/fork.c              | 33 ++++++++++++++++++++++++++++++++-
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 99335e1f4a27..25b4fa00bad1 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -51,6 +51,10 @@
>   *               sent when the child exits.
>   * @stack:       Specify the location of the stack for the
>   *               child process.
> + *               Note, @stack is expected to point to the
> + *               lowest address. The stack direction will be
> + *               determined by the kernel and set up
> + *               appropriately based on @stack_size.
>   * @stack_size:  The size of the stack for the child process.
>   * @tls:         If CLONE_SETTLS is set, the tls descriptor
>   *               is set to tls.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index bcdf53125210..55af6931c6ec 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2561,7 +2561,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  	return 0;
>  }
>  
> -static bool clone3_args_valid(const struct kernel_clone_args *kargs)
> +/**
> + * clone3_stack_valid - check and prepare stack
> + * @kargs: kernel clone args
> + *
> + * Verify that the stack arguments userspace gave us are sane.
> + * In addition, set the stack direction for userspace since it's easy for us to
> + * determine.
> + */
> +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (kargs->stack == 0) {
> +		if (kargs->stack_size > 0)
> +			return false;
> +	} else {
> +		if (kargs->stack_size == 0)
> +			return false;
> +
> +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> +			return false;
> +
> +#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
> +		kargs->stack += kargs->stack_size;
> +#endif
> +	}
> +
> +	return true;
> +}
> +
> +static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  {
>  	/*
>  	 * All lower bits of the flag word are taken.
> @@ -2581,6 +2609,9 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
>  	    kargs->exit_signal)
>  		return false;
>  
> +	if (!clone3_stack_valid(kargs))
> +		return false;
> +
>  	return true;
>  }
David Laight Oct. 31, 2019, 2:27 p.m. UTC | #3
From Christian Brauner
> Sent: 31 October 2019 11:36
> 
> Validate the stack arguments and setup the stack depening on whether or not
> it is growing down or up.
> 
...
> -static bool clone3_args_valid(const struct kernel_clone_args *kargs)
> +/**
> + * clone3_stack_valid - check and prepare stack
> + * @kargs: kernel clone args
> + *
> + * Verify that the stack arguments userspace gave us are sane.
> + * In addition, set the stack direction for userspace since it's easy for us to
> + * determine.
> + */
> +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (kargs->stack == 0) {
> +		if (kargs->stack_size > 0)
> +			return false;
> +	} else {
> +		if (kargs->stack_size == 0)
> +			return false;
> +
> +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> +			return false;

Does access_ok() do anything useful here?
It only verifies that the buffer isn't in kernel space.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Oleg Nesterov Oct. 31, 2019, 4:46 p.m. UTC | #4
On 10/31, Christian Brauner wrote:
>
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -51,6 +51,10 @@
>   *               sent when the child exits.
>   * @stack:       Specify the location of the stack for the
>   *               child process.
> + *               Note, @stack is expected to point to the
> + *               lowest address. The stack direction will be
> + *               determined by the kernel and set up
> + *               appropriately based on @stack_size.

I can't review this patch, I have no idea what does stack_size mean
if !arch/x86.

x86 doesn't use stack_size unless a kthread does kernel_thread(), so
this change is probably fine...

Hmm. Off-topic question, why did 7f192e3cd3 ("fork: add clone3") add
"& ~CSIGNAL" in kernel_thread() ? This looks pointless and confusing
to me...

> +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (kargs->stack == 0) {
> +		if (kargs->stack_size > 0)
> +			return false;
> +	} else {
> +		if (kargs->stack_size == 0)
> +			return false;

So to implement clone3_wrapper(void *bottom_of_stack) you need to do

	clone3_wrapper(void *bottom_of_stack)
	{
		struct clone_args args = {
			...
			// make clone3_stack_valid() happy
			.stack = bottom_of_stack - 1,
			.stack_size = 1,
		};
	}

looks a bit strange. OK, I agree, this example is very artificial.
But why do you think clone3() should nack stack_size == 0 ?

> +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> +			return false;

Why?

Oleg.
Christian Brauner Nov. 1, 2019, 11:06 a.m. UTC | #5
On Thu, Oct 31, 2019 at 05:46:53PM +0100, Oleg Nesterov wrote:
> On 10/31, Christian Brauner wrote:
> >
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -51,6 +51,10 @@
> >   *               sent when the child exits.
> >   * @stack:       Specify the location of the stack for the
> >   *               child process.
> > + *               Note, @stack is expected to point to the
> > + *               lowest address. The stack direction will be
> > + *               determined by the kernel and set up
> > + *               appropriately based on @stack_size.
> 
> I can't review this patch, I have no idea what does stack_size mean
> if !arch/x86.

In short: nothing at all if it weren't for ia64 (and maybe parisc).
But let me provide some (hopefully useful) context. (Probably most of
that is well-know, so sorry for superflous info. :))

The stack and stack_size argument are used in copy_thread_tls() and in
copy_thread(). What the arch will end up calling depends on
CONFIG_HAVE_COPY_THREAD. Afaict, mips, powerpc, s390, and x86
call copy_thread_tls(). The other arches call copy_thread().
On all arches _except_ IA64 copy_thread{_tls}() just assigns "stack" to
the right register and is done with it.
On all arches _except_ parisc "stack" needs to point to the highest
address. On parisc it needs to point to the lowest
(CONFIG_STACK_GROWSUP).
IA64 has a downwards growing stack like all the other architectures but
it expects "stack" to poin to the _lowest_ address nonetheless. In
contrast to all the other arches it does:

	child_ptregs->r12 = user_stack_base + user_stack_size - 16;

so ia64 sets up the stack pointer itself.

So now we have:
 parisc ->   upwards growing stack, stack_size _unused_ for user stacks
!parisc -> downwards growing stack, stack_size _unused_ for user stacks
   ia64 -> downwards growing stack, stack_size   _used_ for user stacks

Now it gets more confusing since the clone() syscall layout is arch
dependent as well. Let's ignore the case of arches that have a clone
syscall version with switched flags and stack argument and only focus on
arches with an _additional_ stack_size argument:

microblaze -> clone(stack, stack_size)

Then there's clone2() for ia64 which is a _separate_ syscall with an
additional stack_size argument:

ia64       -> clone2(stack, stack_size)

Now, contrary to what you'd expect, microblaze ignores the stack_size
argument.

So the stack_size argument _would_ be completely meaningless if it
weren't for ia64 and parisc.

> 
> x86 doesn't use stack_size unless a kthread does kernel_thread(), so
> this change is probably fine...
> 
> Hmm. Off-topic question, why did 7f192e3cd3 ("fork: add clone3") add
> "& ~CSIGNAL" in kernel_thread() ? This looks pointless and confusing
> to me...

(Can we discuss this over a patch that removes this restriction if we
think this is pointless?)

> 
> > +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> > +{
> > +	if (kargs->stack == 0) {
> > +		if (kargs->stack_size > 0)
> > +			return false;
> > +	} else {
> > +		if (kargs->stack_size == 0)
> > +			return false;
> 
> So to implement clone3_wrapper(void *bottom_of_stack) you need to do
> 
> 	clone3_wrapper(void *bottom_of_stack)
> 	{
> 		struct clone_args args = {
> 			...
> 			// make clone3_stack_valid() happy
> 			.stack = bottom_of_stack - 1,
> 			.stack_size = 1,
> 		};
> 	}
> 
> looks a bit strange. OK, I agree, this example is very artificial.
> But why do you think clone3() should nack stack_size == 0 ?

In short, consistency.
I think prior clone() versions (on accident) have exposed the stack
direction as an implementation detail to userspace. Userspace clone()
code wrapping code is _wild_ and buggy partially because of that.

The best thing imho, is to clearly communicate to userspace that stack
needs to point to the lowest address and stack_size to the initial range
of the stack pointer or both are 0.

The alternative is to let userspace either give us a stack pointer that
we expect to be setup correctly by userspace or a stack pointer to the
lowest address and a stack_size argument. That's just an invitation for
more confusion and we have proof with legacy clone that this is not a
good idea.

> 
> > +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> > +			return false;
> 
> Why?

It's nice of us to tell userspace _before_ we have created a thread that
it messed up its parameters instead of starting a thread that then
immediately crashes.

Christian
Oleg Nesterov Nov. 1, 2019, 12:32 p.m. UTC | #6
On 11/01, Christian Brauner wrote:
>
> On Thu, Oct 31, 2019 at 05:46:53PM +0100, Oleg Nesterov wrote:
> > On 10/31, Christian Brauner wrote:
> > >
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > > @@ -51,6 +51,10 @@
> > >   *               sent when the child exits.
> > >   * @stack:       Specify the location of the stack for the
> > >   *               child process.
> > > + *               Note, @stack is expected to point to the
> > > + *               lowest address. The stack direction will be
> > > + *               determined by the kernel and set up
> > > + *               appropriately based on @stack_size.
> >
> > I can't review this patch, I have no idea what does stack_size mean
> > if !arch/x86.
>
> In short: nothing at all if it weren't for ia64 (and maybe parisc).
> But let me provide some (hopefully useful) context.

Thanks...

> (Probably most of
> that is well-know,

Certainly not to me ;) Thanks.

> > > +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> > > +{
> > > +	if (kargs->stack == 0) {
> > > +		if (kargs->stack_size > 0)
> > > +			return false;
> > > +	} else {
> > > +		if (kargs->stack_size == 0)
> > > +			return false;
> >
> > So to implement clone3_wrapper(void *bottom_of_stack) you need to do
> >
> > 	clone3_wrapper(void *bottom_of_stack)
> > 	{
> > 		struct clone_args args = {
> > 			...
> > 			// make clone3_stack_valid() happy
> > 			.stack = bottom_of_stack - 1,
> > 			.stack_size = 1,
> > 		};
> > 	}
> >
> > looks a bit strange. OK, I agree, this example is very artificial.
> > But why do you think clone3() should nack stack_size == 0 ?
>
> In short, consistency.

And in my opinion this stack_size == 0 check destroys the consistency,
see below.

But just in case, let me say that overall I personally like this change.

> The best thing imho, is to clearly communicate to userspace that stack
> needs to point to the lowest address and stack_size to the initial range
> of the stack pointer

Agreed.

But the kernel can't verify that "stack" actually points to the lowest
address and stack_size is actually the stack size. Consider another
artificial

    	clone3_wrapper(void *bottom_of_stack, unsigned long offs)
    	{
    		struct clone_args args = {
    			...
    			// make clone3_stack_valid() happy
    			.stack = bottom_of_stack - offs,
    			.stack_size = offs,
    		};
    		sys_clone3(args);
    	}
	
Now,

	clone3_wrapper(bottom_of_stack, offs);

is same thing for _any_ offs except offs == 0 will fail. Why? To me this
is not consistent, I think the "stack_size == 0" check buys nothing and
only adds some confusion.

Say, stack_size == 1 is "obviously wrong" too, this certainly means that
"stack" doesn't point to the lowest address (or the child will corrupt the
memory), but it works.

OK, I won't insist. Perhaps it can help to detect the case when a user
forgets to pass the correct stack size.

> > > +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> > > +			return false;
> >
> > Why?
>
> It's nice of us to tell userspace _before_ we have created a thread that
> it messed up its parameters instead of starting a thread that then
> immediately crashes.

Heh. Then why this code doesn't verify that at least stack + stack_size is
properly mmaped with PROT_READ|WRITE?

Oleg.
Christian Brauner Nov. 1, 2019, 2:40 p.m. UTC | #7
On Fri, Nov 01, 2019 at 01:32:57PM +0100, Oleg Nesterov wrote:
> On 11/01, Christian Brauner wrote:
> >
> > On Thu, Oct 31, 2019 at 05:46:53PM +0100, Oleg Nesterov wrote:
> > > On 10/31, Christian Brauner wrote:
> > > >
> > > > --- a/include/uapi/linux/sched.h
> > > > +++ b/include/uapi/linux/sched.h
> > > > @@ -51,6 +51,10 @@
> > > >   *               sent when the child exits.
> > > >   * @stack:       Specify the location of the stack for the
> > > >   *               child process.
> > > > + *               Note, @stack is expected to point to the
> > > > + *               lowest address. The stack direction will be
> > > > + *               determined by the kernel and set up
> > > > + *               appropriately based on @stack_size.
> > >
> > > I can't review this patch, I have no idea what does stack_size mean
> > > if !arch/x86.
> >
> > In short: nothing at all if it weren't for ia64 (and maybe parisc).
> > But let me provide some (hopefully useful) context.
> 
> Thanks...
> 
> > (Probably most of
> > that is well-know,
> 
> Certainly not to me ;) Thanks.
> 
> > > > +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> > > > +{
> > > > +	if (kargs->stack == 0) {
> > > > +		if (kargs->stack_size > 0)
> > > > +			return false;
> > > > +	} else {
> > > > +		if (kargs->stack_size == 0)
> > > > +			return false;
> > >
> > > So to implement clone3_wrapper(void *bottom_of_stack) you need to do
> > >
> > > 	clone3_wrapper(void *bottom_of_stack)
> > > 	{
> > > 		struct clone_args args = {
> > > 			...
> > > 			// make clone3_stack_valid() happy
> > > 			.stack = bottom_of_stack - 1,
> > > 			.stack_size = 1,
> > > 		};
> > > 	}
> > >
> > > looks a bit strange. OK, I agree, this example is very artificial.
> > > But why do you think clone3() should nack stack_size == 0 ?
> >
> > In short, consistency.
> 
> And in my opinion this stack_size == 0 check destroys the consistency,
> see below.
> 
> But just in case, let me say that overall I personally like this change.
> 
> > The best thing imho, is to clearly communicate to userspace that stack
> > needs to point to the lowest address and stack_size to the initial range
> > of the stack pointer
> 
> Agreed.
> 
> But the kernel can't verify that "stack" actually points to the lowest
> address and stack_size is actually the stack size. Consider another
> artificial

Sure, but that's the similar to other structs that are passed via a
pointer and come with a size. You could pass:

setxattr(..., ..., value - size, size, ...);

and the kernel would be confused as well.

> 
>     	clone3_wrapper(void *bottom_of_stack, unsigned long offs)
>     	{
>     		struct clone_args args = {
>     			...
>     			// make clone3_stack_valid() happy
>     			.stack = bottom_of_stack - offs,
>     			.stack_size = offs,
>     		};
>     		sys_clone3(args);
>     	}
> 	
> Now,
> 
> 	clone3_wrapper(bottom_of_stack, offs);
> 
> is same thing for _any_ offs except offs == 0 will fail. Why? To me this
> is not consistent, I think the "stack_size == 0" check buys nothing and
> only adds some confusion.

I disagree. It's a very easy contract: pass a stack and a size or
request copy-on-write by passing both as 0.
Sure, you can flaunt that contract but that's true of every other
pointer + size api. The point is: the api we endorse should be simple
and stack + stack_size is very simple.

> 
> Say, stack_size == 1 is "obviously wrong" too, this certainly means that
> "stack" doesn't point to the lowest address (or the child will corrupt the
> memory), but it works.
> 
> OK, I won't insist. Perhaps it can help to detect the case when a user
> forgets to pass the correct stack size.
> 
> > > > +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> > > > +			return false;
> > >
> > > Why?
> >
> > It's nice of us to tell userspace _before_ we have created a thread that
> > it messed up its parameters instead of starting a thread that then
> > immediately crashes.
> 
> Heh. Then why this code doesn't verify that at least stack + stack_size is
> properly mmaped with PROT_READ|WRITE?

access_ok() is uncomplicated.
The other check makes a lot more assumptions. Theare are users that might
want to have a PROT_NONE part of their stack as their own "private"
guard page (Jann just made that point) and there are other corner cases.

Christian
Szabolcs Nagy Nov. 1, 2019, 2:57 p.m. UTC | #8
On 31/10/2019 11:36, Christian Brauner wrote:
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 99335e1f4a27..25b4fa00bad1 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -51,6 +51,10 @@
>   *               sent when the child exits.
>   * @stack:       Specify the location of the stack for the
>   *               child process.
> + *               Note, @stack is expected to point to the
> + *               lowest address. The stack direction will be
> + *               determined by the kernel and set up
> + *               appropriately based on @stack_size.
>   * @stack_size:  The size of the stack for the child process.
>   * @tls:         If CLONE_SETTLS is set, the tls descriptor
>   *               is set to tls.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index bcdf53125210..55af6931c6ec 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2561,7 +2561,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  	return 0;
>  }
>  
> -static bool clone3_args_valid(const struct kernel_clone_args *kargs)
> +/**
> + * clone3_stack_valid - check and prepare stack
> + * @kargs: kernel clone args
> + *
> + * Verify that the stack arguments userspace gave us are sane.
> + * In addition, set the stack direction for userspace since it's easy for us to
> + * determine.
> + */
> +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (kargs->stack == 0) {
> +		if (kargs->stack_size > 0)
> +			return false;
> +	} else {
> +		if (kargs->stack_size == 0)
> +			return false;
> +
> +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> +			return false;
> +
> +#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
> +		kargs->stack += kargs->stack_size;
> +#endif
> +	}

from the description it is not clear whose
responsibility it is to guarantee the alignment
of sp on entry.

i think 0 stack size may work if signals are
blocked and then prohibiting it might not be
the right thing.

it's not clear how libc should deal with v5.3
kernels which don't have the stack+=stack_size
logic.
Christian Brauner Nov. 1, 2019, 3:10 p.m. UTC | #9
On Fri, Nov 01, 2019 at 02:57:10PM +0000, Szabolcs Nagy wrote:
> On 31/10/2019 11:36, Christian Brauner wrote:
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 99335e1f4a27..25b4fa00bad1 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -51,6 +51,10 @@
> >   *               sent when the child exits.
> >   * @stack:       Specify the location of the stack for the
> >   *               child process.
> > + *               Note, @stack is expected to point to the
> > + *               lowest address. The stack direction will be
> > + *               determined by the kernel and set up
> > + *               appropriately based on @stack_size.
> >   * @stack_size:  The size of the stack for the child process.
> >   * @tls:         If CLONE_SETTLS is set, the tls descriptor
> >   *               is set to tls.
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index bcdf53125210..55af6931c6ec 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2561,7 +2561,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> >  	return 0;
> >  }
> >  
> > -static bool clone3_args_valid(const struct kernel_clone_args *kargs)
> > +/**
> > + * clone3_stack_valid - check and prepare stack
> > + * @kargs: kernel clone args
> > + *
> > + * Verify that the stack arguments userspace gave us are sane.
> > + * In addition, set the stack direction for userspace since it's easy for us to
> > + * determine.
> > + */
> > +static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
> > +{
> > +	if (kargs->stack == 0) {
> > +		if (kargs->stack_size > 0)
> > +			return false;
> > +	} else {
> > +		if (kargs->stack_size == 0)
> > +			return false;
> > +
> > +		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
> > +			return false;
> > +
> > +#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
> > +		kargs->stack += kargs->stack_size;
> > +#endif
> > +	}
> 
> from the description it is not clear whose
> responsibility it is to guarantee the alignment
> of sp on entry.

Userspace.

> 
> i think 0 stack size may work if signals are
> blocked and then prohibiting it might not be
> the right thing.

Note that stack size 0 is allowed:

struct clone_args args = {
	.exit_signal = SIGCHLD,
};

clone3(&args, sizeof(args));

will just work fine.

> 
> it's not clear how libc should deal with v5.3
> kernels which don't have the stack+=stack_size
> logic.

stable is already Cced and the change will be backported to v5.3.
Nearly all distros track pull in stable updates.

Florian, thoughts on this?

Christian
Arnd Bergmann Nov. 5, 2019, 2:23 p.m. UTC | #10
On Thu, Oct 31, 2019 at 12:36 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> Validate the stack arguments and setup the stack depening on whether or not
> it is growing down or up.
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>
Michael Kerrisk (man-pages) Nov. 5, 2019, 8:23 p.m. UTC | #11
On Tue, 5 Nov 2019 at 15:25, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Oct 31, 2019 at 12:36 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > Validate the stack arguments and setup the stack depening on whether or not
> > it is growing down or up.
> >
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>
> Acked-by: Arnd Bergmann <arnd@arndb.de>

and
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian Brauner Nov. 6, 2019, 2:18 a.m. UTC | #12
On Thu, Oct 31, 2019 at 12:40:36PM +0000, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 7f192e3cd316b fork: add clone3.
> 
> The bot has tested the following trees: v5.3.8.
> 
> v5.3.8: Failed to apply! Possible dependencies:
>     78f6face5af34 ("sched: add kernel-doc for struct clone_args")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?

Hey Sasha,

This has now landed in mainline (cf. [2]).
I would suggest to backport [1] together with [2].
The patch in [1] only documents struct clone_args and has no functional
changes.
If you prefer to only backport a v5.3 specific version of [2] you can
find it inline (cf. [3]) including the base commit info for the 5.3 stable
tree.

Christian

[1]: 78f6face5af3 ("sched: add kernel-doc for struct clone_args")
[2]: fa729c4df558 ("clone3: validate stack arguments")
[3]:

From 5bc5279d0dfa90cc6af385b6e3f65958f223ccab Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Thu, 31 Oct 2019 12:36:08 +0100
Subject: [PATCH] clone3: validate stack arguments

Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.

Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:

 #define __STACK_SIZE (8 * 1024 * 1024)
 pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
 {
         pid_t ret;
         void *stack;

         stack = malloc(__STACK_SIZE);
         if (!stack)
                 return -ENOMEM;

 #ifdef __ia64__
         ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #elif defined(__parisc__) /* stack grows up */
         ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
 #else
         ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #endif
         return ret;
 }

or even crazier variants such as [3].

With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.

/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().

[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd316 ("fork: add clone3")
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha@sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
---
 kernel/fork.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3647097e6783..8bbd39585301 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2586,7 +2586,35 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 	return 0;
 }
 
-static bool clone3_args_valid(const struct kernel_clone_args *kargs)
+/**
+ * clone3_stack_valid - check and prepare stack
+ * @kargs: kernel clone args
+ *
+ * Verify that the stack arguments userspace gave us are sane.
+ * In addition, set the stack direction for userspace since it's easy for us to
+ * determine.
+ */
+static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (kargs->stack == 0) {
+		if (kargs->stack_size > 0)
+			return false;
+	} else {
+		if (kargs->stack_size == 0)
+			return false;
+
+		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
+			return false;
+
+#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+		kargs->stack += kargs->stack_size;
+#endif
+	}
+
+	return true;
+}
+
+static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/*
 	 * All lower bits of the flag word are taken.
@@ -2606,6 +2634,9 @@ static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
+	if (!clone3_stack_valid(kargs))
+		return false;
+
 	return true;
 }
 

base-commit: db0655e705be645ad673b0a70160921e088517c0

Patch
diff mbox series

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 99335e1f4a27..25b4fa00bad1 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -51,6 +51,10 @@ 
  *               sent when the child exits.
  * @stack:       Specify the location of the stack for the
  *               child process.
+ *               Note, @stack is expected to point to the
+ *               lowest address. The stack direction will be
+ *               determined by the kernel and set up
+ *               appropriately based on @stack_size.
  * @stack_size:  The size of the stack for the child process.
  * @tls:         If CLONE_SETTLS is set, the tls descriptor
  *               is set to tls.
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf53125210..55af6931c6ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2561,7 +2561,35 @@  noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 	return 0;
 }
 
-static bool clone3_args_valid(const struct kernel_clone_args *kargs)
+/**
+ * clone3_stack_valid - check and prepare stack
+ * @kargs: kernel clone args
+ *
+ * Verify that the stack arguments userspace gave us are sane.
+ * In addition, set the stack direction for userspace since it's easy for us to
+ * determine.
+ */
+static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (kargs->stack == 0) {
+		if (kargs->stack_size > 0)
+			return false;
+	} else {
+		if (kargs->stack_size == 0)
+			return false;
+
+		if (!access_ok((void __user *)kargs->stack, kargs->stack_size))
+			return false;
+
+#if !defined(CONFIG_STACK_GROWSUP) && !defined(CONFIG_IA64)
+		kargs->stack += kargs->stack_size;
+#endif
+	}
+
+	return true;
+}
+
+static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/*
 	 * All lower bits of the flag word are taken.
@@ -2581,6 +2609,9 @@  static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
+	if (!clone3_stack_valid(kargs))
+		return false;
+
 	return true;
 }