Message ID | 20220503170032.2531471-1-adhemerval.zanella@linaro.org |
---|---|
State | New |
Headers | show |
Series | [v3] linux: Add fallback for clone failure on posix_spawn (BZ#29115) | expand |
On 2022-05-03 20:00, Adhemerval Zanella wrote: > Even though it might characterizeis as a kernel bug (incomplete kernel > support for CLONE_VM used along with CLONE_VFORK), time namespace > support > is already presented in a some kernel releases (since 2020). > > Although I agree with Carlos that it results in unexpected surprise > behaviour, I also see that providing a reasonable fallback that does > not > add possible issues like race conditions (as some syscall fallbacks in > the > past) provides a easier transition and make the interfaca to work on > multiple scenarios. > > The fix itself does adds some corner cases that need to be handled, but > I still think it is maintanable and a general improvement. > > -- > > Linux clone with CLONE_VM may fail for some namespace restriction (for > instance if kernel does not allow processes in different time > namespaces > to share the sameaddress space). > > In this case clone fails with EINVAL and posix_spawn can not spawn a > new > process. However the same process can be spawned with fork and exec. > > The patch fixes by retrying the clone syscall with just CLONE_VFORK > if clone fails with a non transient failure (ENOMEM and EAGAIN still > returns failure to caller). > > Error on preparation phase or execve is returned by a pipe now that > there is no shared memory that ca be used. It requires some extra care > for some file operations: > > * If the file operation would clobber the pipe file descriptor, the > helper process dup the pipe onto an unoccupied file descriptor. > > * dup2 file action that targets the pipe file descriptor returns > EBADF. > > * If closefrom argument overlaps the pipe file descriptor, it is > splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou]. > > Failure on prepare phase in helper process does not trigger the fork > and exec fallback. > > Checked on x86_64-linux-gnu. > --- > v3: Fixed closerange fallback,. > v2: Fixed closerange file actions, handle EPIPE, return all errors, > handle if pipe fd is used in file actions. > --- > include/unistd.h | 4 +- > io/closefrom.c | 2 +- > sysdeps/unix/sysv/linux/closefrom_fallback.c | 4 +- > sysdeps/unix/sysv/linux/spawni.c | 235 +++++++++++++++---- > 4 files changed, 196 insertions(+), 49 deletions(-) > > diff --git a/include/unistd.h b/include/unistd.h > index af795a37c8..2643991a09 100644 > --- a/include/unistd.h > +++ b/include/unistd.h > @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int > __lowfd, _Bool dirfd_fallback) > return false; > } > # else > -extern _Bool __closefrom_fallback (int __lowfd, _Bool) > attribute_hidden; > +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int > __to, > + _Bool) > + attribute_hidden; > # endif > extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); > libc_hidden_proto (__read) > diff --git a/io/closefrom.c b/io/closefrom.c > index 48cac2e3d1..a07de06e9b 100644 > --- a/io/closefrom.c > +++ b/io/closefrom.c > @@ -30,7 +30,7 @@ __closefrom (int lowfd) > if (r == 0) > return ; > > - if (!__closefrom_fallback (l, true)) > + if (!__closefrom_fallback (l, ~0U, true)) > __fortify_fail ("closefrom failed to close a file descriptor"); > } > weak_alias (__closefrom, closefrom) > diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c > b/sysdeps/unix/sysv/linux/closefrom_fallback.c > index a9dd0c46b2..ad301e1910 100644 > --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c > +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c > @@ -28,7 +28,7 @@ > /proc/self/fd open will trigger a fallback that tries to close a > file > descriptor before proceed. */ > _Bool > -__closefrom_fallback (int from, _Bool dirfd_fallback) > +__closefrom_fallback (unsigned int from, unsigned int to, _Bool > dirfd_fallback) > { > int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | > O_DIRECTORY, > 0); > @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback) > > if (fd == dirfd || fd < from) > continue; > + if (fd > to) > + break; > __closefrom_fallback() currently ignores "to" if it can't open /proc/self/fd and dirfd_fallback is true. This path is not used by posix_spawn(), but it's a potential future bug. > /* We ignore close errors because EBADF, EINTR, and EIO > means the > descriptor has been released. */ > diff --git a/sysdeps/unix/sysv/linux/spawni.c > b/sysdeps/unix/sysv/linux/spawni.c > index d6f5ca89cd..0481791013 100644 > --- a/sysdeps/unix/sysv/linux/spawni.c > +++ b/sysdeps/unix/sysv/linux/spawni.c > @@ -44,7 +44,11 @@ > third issue is done by a stack allocation in parent, and by using a > field in struct spawn_args where the child can write an error > code. CLONE_VFORK ensures that the parent does not run until the > - child has either exec'ed successfully or exited. */ > + child has either exec'ed successfully or exited. > + > + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel > limitation > + such as time namespace), only CLONE_VFORK is used instead and the > + preparation and execve failures are communicated with a pipe. */ > > > /* The Unix standard contains a long explanation of the way to signal > @@ -67,6 +71,7 @@ struct posix_spawn_args > char *const *envp; > int xflags; > int err; > + int pipe[2]; > }; > > /* Older version requires that shell script without shebang definition > @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args > *args) > } > } > > +/* If the file operation would clobber the pipe fd used to communicate > with > + parent, dup the pipe onto an unoccupied file descriptor. */ > +static inline bool > +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) > +{ > + int fd; > + > + switch (fa->tag) > + { > + case spawn_do_close: > + fd = fa->action.close_action.fd; > + break; > + case spawn_do_open: > + fd = fa->action.open_action.fd; > + break; > + case spawn_do_dup2: > + fd = fa->action.dup2_action.newfd; > + break; > + case spawn_do_fchdir: > + fd = fa->action.fchdir_action.fd; > + default: > + return true; > + } > + > + if (fd == p[1]) > + { > + int r = __fcntl (p[1], F_DUPFD_CLOEXEC); > + if (r < 0) > + return false; > + __close_nocancel (p[1]); > + p[1] = r; > + } > + > + return true; > +} > + > +static inline bool > +spawni_fa_closerange (int from, int to) > +{ > +#if 0 > + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); > + return r == 0 || __closefrom_fallback (from, to, false); > +#else > + return __closefrom_fallback (from, to, false); > +#endif > +} > + > /* Function used in the clone call to setup the signals mask, > posix_spawn > attributes, and file actions. It run on its own stack (provided by > the > posix_spawn call). */ > -static int > -__spawni_child (void *arguments) > +static _Noreturn int > +spawni_child (void *arguments) > { > struct posix_spawn_args *args = arguments; > const posix_spawnattr_t *restrict attr = args->attr; > const posix_spawn_file_actions_t *file_actions = args->fa; > + bool use_pipe = args->pipe[0] != -1; > > /* The child must ensure that no signal handler are enabled because > it shared > memory with parent, so the signal disposition must be either > SIG_DFL or > @@ -113,6 +166,9 @@ __spawni_child (void *arguments) > struct sigaction sa; > memset (&sa, '\0', sizeof (sa)); > > + if (use_pipe) > + __close (args->pipe[0]); > + > sigset_t hset; > __sigprocmask (SIG_BLOCK, 0, &hset); > for (int sig = 1; sig < _NSIG; ++sig) > @@ -181,6 +237,9 @@ __spawni_child (void *arguments) > { > struct __spawn_action *action = &file_actions->__actions[cnt]; > > + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) > + goto fail; > + > switch (action->tag) > { > case spawn_do_close: > @@ -233,6 +292,11 @@ __spawni_child (void *arguments) > break; > > case spawn_do_dup2: > + if (use_pipe && action->action.dup2_action.fd == args->pipe[1]) > + { > + errno = EBADF; > + goto fail; > + } > /* Austin Group issue #411 requires adddup2 action with source > and destination being equal to remove close-on-exec flag. */ > if (action->action.dup2_action.fd > @@ -264,8 +328,18 @@ __spawni_child (void *arguments) > case spawn_do_closefrom: > { > int lowfd = action->action.closefrom_action.from; > - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0); > - if (r != 0 && !__closefrom_fallback (lowfd, false)) > + /* Skip the pipe descriptor if it is used. No need to handle > + it since it is created with O_CLOEXEC. */ > + if (use_pipe && args->pipe[1] >= lowfd) > + { > + if (args->pipe[1] == lowfd > + && !spawni_fa_closerange (lowfd + 1, ~0U)) > + goto fail; > + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1) > + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) The "else" branch will also run if "args->pipe[1] == lowfd" and spawni_fa_closerange (lowfd + 1, ~0U) succeeds. > + goto fail; > + } > + else if (!spawni_fa_closerange (lowfd, ~0U)) > goto fail; > } break; > > @@ -300,10 +374,112 @@ fail: > (EINTERNALBUG) describing that, use ECHILD. Another option would > be to set args->err to some negative sentinel and have the parent > abort(), but that seems needlessly harsh. */ > - args->err = errno ? : ECHILD; > + int ret = errno ? : ECHILD; > + if (use_pipe) > + { > + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0) > + if (errno == EPIPE || errno == EBADF) > + break; > + } > + else > + args->err = ret; > + > _exit (SPAWN_ERROR); > } > > +static pid_t > +clone_call (struct posix_spawn_args *args, int flags, void *stack, > + size_t stack_size) > +{ > + struct clone_args clone_args = > + { > + .flags = flags, > + .exit_signal = SIGCHLD, > + .stack = (uintptr_t) stack, > + .stack_size = stack_size, > + }; > + return __clone_internal (&clone_args, spawni_child, args); > +} > + > +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to > optimize > + memory and overcommit) and return TRUE if the helper was created or > if the > + failure was not due resource exhaustion. */ > +static bool > +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec, > + void *stack, size_t stack_size) > +{ > + /* The clone flags used will create a new child that will run in the > same > + memory space (CLONE_VM) and the execution of calling thread will > be > + suspend until the child calls execve or _exit. > + > + Also since the calling thread execution will be suspend, there is > not > + need for CLONE_SETTLS. Although parent and child share the same > TLS > + namespace, there will be no concurrent access for TLS variables > (errno > + for instance). */ > + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, > stack_size); > + > + /* It needs to collect the case where the auxiliary process was > created > + but failed to execute the file (due either any preparation step > or > + for execve itself). */ > + if (*new_pid > 0) > + { > + /* Also, it handles the unlikely case where the auxiliary > process was > + terminated before calling execve as if it was successfully. The > + args.err is set to 0 as default and changed to a positive value > + only in case of failure, so in case of premature termination > + due a signal args.err will remain zeroed and it will be up to > + caller to actually collect it. */ > + *ec = args->err; > + if (*ec > 0) > + /* There still an unlikely case where the child is cancelled after > + setting args.err, due to a positive error value. Also there is > + possible pid reuse race (where the kernel allocated the same pid > + to an unrelated process). Unfortunately due synchronization > + issues where the kernel might not have the process collected > + the waitpid below can not use WNOHANG. */ > + __waitpid (*new_pid, NULL, 0); > + } > + else > + *ec = errno; > + > + /* There is no much point in retrying with fork and exec if kernel > returns a > + failure due resource exhaustion. */ > + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); > +} > + > +/* Fallback spawn case which does not use CLONE_VM. Any preparation > step or > + execve failure is passed with a pipe, which requires additional > care by > + the helper stating process since it additional file descriptors > handle. */ > +static void > +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, > + char *stack, size_t stack_size) > +{ > + if (__pipe2 (args->pipe, O_CLOEXEC) != 0) > + { > + *ec = errno; > + return; > + } > + > + /* Do not trigger atfork handler nor any internal state reset since > the > + helper process will call execve. */ > + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); > + > + __close (args->pipe[1]); > + > + if (*new_pid > 0) > + { > + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) > + /* A successful execve will close the helper process pipe end. */ > + *ec = 0; > + else > + __waitpid (*new_pid, NULL, 0); > + } > + else > + *ec = errno; > + > + __close (args->pipe[0]); > +} > + > /* Spawn a new process executing PATH with the attributes describes in > *ATTRP. > Before running the process perform the actions described in > FILE-ACTIONS. */ > static int > @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, > args.argc = argc; > args.envp = envp; > args.xflags = xflags; > + args.pipe[0] = args.pipe[1] = -1; > > __libc_signal_block_all (&args.oldmask); > > - /* The clone flags used will create a new child that will run in the > same > - memory space (CLONE_VM) and the execution of calling thread will > be > - suspend until the child calls execve or _exit. > - > - Also since the calling thread execution will be suspend, there is > not > - need for CLONE_SETTLS. Although parent and child share the same > TLS > - namespace, there will be no concurrent access for TLS variables > (errno > - for instance). */ > - struct clone_args clone_args = > - { > - .flags = CLONE_VM | CLONE_VFORK, > - .exit_signal = SIGCHLD, > - .stack = (uintptr_t) stack, > - .stack_size = stack_size, > - }; > - new_pid = __clone_internal (&clone_args, __spawni_child, &args); > - > - /* It needs to collect the case where the auxiliary process was > created > - but failed to execute the file (due either any preparation step > or > - for execve itself). */ > - if (new_pid > 0) > - { > - /* Also, it handles the unlikely case where the auxiliary > process was > - terminated before calling execve as if it was successfully. The > - args.err is set to 0 as default and changed to a positive value > - only in case of failure, so in case of premature termination > - due a signal args.err will remain zeroed and it will be up to > - caller to actually collect it. */ > - ec = args.err; > - if (ec > 0) > - /* There still an unlikely case where the child is cancelled after > - setting args.err, due to a positive error value. Also there is > - possible pid reuse race (where the kernel allocated the same pid > - to an unrelated process). Unfortunately due synchronization > - issues where the kernel might not have the process collected > - the waitpid below can not use WNOHANG. */ > - __waitpid (new_pid, NULL, 0); > - } > - else > - ec = errno; > + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace > restriction > + (for instance Linux does not allow processes in different time > namespaces > + to share address space) and in this case clone fails with EINVAL. > Retry > + with fork and exec. */ > + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) > + spawni_fork (&args, &new_pid, &ec, stack, stack_size); > > __munmap (stack, stack_size);
On 03/05/2022 14:36, Alexey Izbyshev wrote: > On 2022-05-03 20:00, Adhemerval Zanella wrote: >> Even though it might characterizeis as a kernel bug (incomplete kernel >> support for CLONE_VM used along with CLONE_VFORK), time namespace >> support >> is already presented in a some kernel releases (since 2020). >> >> Although I agree with Carlos that it results in unexpected surprise >> behaviour, I also see that providing a reasonable fallback that does not >> add possible issues like race conditions (as some syscall fallbacks in >> the >> past) provides a easier transition and make the interfaca to work on >> multiple scenarios. >> >> The fix itself does adds some corner cases that need to be handled, but >> I still think it is maintanable and a general improvement. >> >> -- >> >> Linux clone with CLONE_VM may fail for some namespace restriction (for >> instance if kernel does not allow processes in different time namespaces >> to share the sameaddress space). >> >> In this case clone fails with EINVAL and posix_spawn can not spawn a new >> process. However the same process can be spawned with fork and exec. >> >> The patch fixes by retrying the clone syscall with just CLONE_VFORK >> if clone fails with a non transient failure (ENOMEM and EAGAIN still >> returns failure to caller). >> >> Error on preparation phase or execve is returned by a pipe now that >> there is no shared memory that ca be used. It requires some extra care >> for some file operations: >> >> * If the file operation would clobber the pipe file descriptor, the >> helper process dup the pipe onto an unoccupied file descriptor. >> >> * dup2 file action that targets the pipe file descriptor returns >> EBADF. >> >> * If closefrom argument overlaps the pipe file descriptor, it is >> splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou]. >> >> Failure on prepare phase in helper process does not trigger the fork >> and exec fallback. >> >> Checked on x86_64-linux-gnu. >> --- >> v3: Fixed closerange fallback,. >> v2: Fixed closerange file actions, handle EPIPE, return all errors, >> handle if pipe fd is used in file actions. >> --- >> include/unistd.h | 4 +- >> io/closefrom.c | 2 +- >> sysdeps/unix/sysv/linux/closefrom_fallback.c | 4 +- >> sysdeps/unix/sysv/linux/spawni.c | 235 +++++++++++++++---- >> 4 files changed, 196 insertions(+), 49 deletions(-) >> >> diff --git a/include/unistd.h b/include/unistd.h >> index af795a37c8..2643991a09 100644 >> --- a/include/unistd.h >> +++ b/include/unistd.h >> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int >> __lowfd, _Bool dirfd_fallback) >> return false; >> } >> # else >> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) attribute_hidden; >> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int __to, >> + _Bool) >> + attribute_hidden; >> # endif >> extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); >> libc_hidden_proto (__read) >> diff --git a/io/closefrom.c b/io/closefrom.c >> index 48cac2e3d1..a07de06e9b 100644 >> --- a/io/closefrom.c >> +++ b/io/closefrom.c >> @@ -30,7 +30,7 @@ __closefrom (int lowfd) >> if (r == 0) >> return ; >> >> - if (!__closefrom_fallback (l, true)) >> + if (!__closefrom_fallback (l, ~0U, true)) >> __fortify_fail ("closefrom failed to close a file descriptor"); >> } >> weak_alias (__closefrom, closefrom) >> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c >> b/sysdeps/unix/sysv/linux/closefrom_fallback.c >> index a9dd0c46b2..ad301e1910 100644 >> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c >> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c >> @@ -28,7 +28,7 @@ >> /proc/self/fd open will trigger a fallback that tries to close a file >> descriptor before proceed. */ >> _Bool >> -__closefrom_fallback (int from, _Bool dirfd_fallback) >> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool dirfd_fallback) >> { >> int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | O_DIRECTORY, >> 0); >> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback) >> >> if (fd == dirfd || fd < from) >> continue; >> + if (fd > to) >> + break; >> > __closefrom_fallback() currently ignores "to" if it can't open /proc/self/fd and dirfd_fallback is true. This path is not used by posix_spawn(), but it's a potential future bug. Indeed, I think iterate over [from, to] instead of [from, INT_MAX] should fix it. > >> /* We ignore close errors because EBADF, EINTR, and EIO means the >> descriptor has been released. */ >> diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c >> index d6f5ca89cd..0481791013 100644 >> --- a/sysdeps/unix/sysv/linux/spawni.c >> +++ b/sysdeps/unix/sysv/linux/spawni.c >> @@ -44,7 +44,11 @@ >> third issue is done by a stack allocation in parent, and by using a >> field in struct spawn_args where the child can write an error >> code. CLONE_VFORK ensures that the parent does not run until the >> - child has either exec'ed successfully or exited. */ >> + child has either exec'ed successfully or exited. >> + >> + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel limitation >> + such as time namespace), only CLONE_VFORK is used instead and the >> + preparation and execve failures are communicated with a pipe. */ >> >> >> /* The Unix standard contains a long explanation of the way to signal >> @@ -67,6 +71,7 @@ struct posix_spawn_args >> char *const *envp; >> int xflags; >> int err; >> + int pipe[2]; >> }; >> >> /* Older version requires that shell script without shebang definition >> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args *args) >> } >> } >> >> +/* If the file operation would clobber the pipe fd used to communicate with >> + parent, dup the pipe onto an unoccupied file descriptor. */ >> +static inline bool >> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) >> +{ >> + int fd; >> + >> + switch (fa->tag) >> + { >> + case spawn_do_close: >> + fd = fa->action.close_action.fd; >> + break; >> + case spawn_do_open: >> + fd = fa->action.open_action.fd; >> + break; >> + case spawn_do_dup2: >> + fd = fa->action.dup2_action.newfd; >> + break; >> + case spawn_do_fchdir: >> + fd = fa->action.fchdir_action.fd; >> + default: >> + return true; >> + } >> + >> + if (fd == p[1]) >> + { >> + int r = __fcntl (p[1], F_DUPFD_CLOEXEC); >> + if (r < 0) >> + return false; >> + __close_nocancel (p[1]); >> + p[1] = r; >> + } >> + >> + return true; >> +} >> + >> +static inline bool >> +spawni_fa_closerange (int from, int to) >> +{ >> +#if 0 >> + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); >> + return r == 0 || __closefrom_fallback (from, to, false); >> +#else >> + return __closefrom_fallback (from, to, false); >> +#endif >> +} >> + >> /* Function used in the clone call to setup the signals mask, posix_spawn >> attributes, and file actions. It run on its own stack (provided by the >> posix_spawn call). */ >> -static int >> -__spawni_child (void *arguments) >> +static _Noreturn int >> +spawni_child (void *arguments) >> { >> struct posix_spawn_args *args = arguments; >> const posix_spawnattr_t *restrict attr = args->attr; >> const posix_spawn_file_actions_t *file_actions = args->fa; >> + bool use_pipe = args->pipe[0] != -1; >> >> /* The child must ensure that no signal handler are enabled because it shared >> memory with parent, so the signal disposition must be either SIG_DFL or >> @@ -113,6 +166,9 @@ __spawni_child (void *arguments) >> struct sigaction sa; >> memset (&sa, '\0', sizeof (sa)); >> >> + if (use_pipe) >> + __close (args->pipe[0]); >> + >> sigset_t hset; >> __sigprocmask (SIG_BLOCK, 0, &hset); >> for (int sig = 1; sig < _NSIG; ++sig) >> @@ -181,6 +237,9 @@ __spawni_child (void *arguments) >> { >> struct __spawn_action *action = &file_actions->__actions[cnt]; >> >> + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) >> + goto fail; >> + >> switch (action->tag) >> { >> case spawn_do_close: >> @@ -233,6 +292,11 @@ __spawni_child (void *arguments) >> break; >> >> case spawn_do_dup2: >> + if (use_pipe && action->action.dup2_action.fd == args->pipe[1]) >> + { >> + errno = EBADF; >> + goto fail; >> + } >> /* Austin Group issue #411 requires adddup2 action with source >> and destination being equal to remove close-on-exec flag. */ >> if (action->action.dup2_action.fd >> @@ -264,8 +328,18 @@ __spawni_child (void *arguments) >> case spawn_do_closefrom: >> { >> int lowfd = action->action.closefrom_action.from; >> - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0); >> - if (r != 0 && !__closefrom_fallback (lowfd, false)) >> + /* Skip the pipe descriptor if it is used. No need to handle >> + it since it is created with O_CLOEXEC. */ >> + if (use_pipe && args->pipe[1] >= lowfd) >> + { >> + if (args->pipe[1] == lowfd >> + && !spawni_fa_closerange (lowfd + 1, ~0U)) >> + goto fail; >> + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1) >> + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) > > The "else" branch will also run if "args->pipe[1] == lowfd" and spawni_fa_closerange (lowfd + 1, ~0U) succeeds. I was trying to be too clever here, I think the following is more clear: if (use_pipe && args->pipe[1] == lowfd) { if (!spawni_fa_closerange (lowfd + 1, ~0U)) goto fail; } else if (use_pipe && args->pipe[1] > lowfd) { if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1) || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) goto fail; } else if (!spawni_fa_closerange (lowfd, ~0U)) goto fail; > >> + goto fail; >> + } >> + else if (!spawni_fa_closerange (lowfd, ~0U)) >> goto fail; >> } break; >> >> @@ -300,10 +374,112 @@ fail: >> (EINTERNALBUG) describing that, use ECHILD. Another option would >> be to set args->err to some negative sentinel and have the parent >> abort(), but that seems needlessly harsh. */ >> - args->err = errno ? : ECHILD; >> + int ret = errno ? : ECHILD; >> + if (use_pipe) >> + { >> + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0) >> + if (errno == EPIPE || errno == EBADF) >> + break; >> + } >> + else >> + args->err = ret; >> + >> _exit (SPAWN_ERROR); >> } >> >> +static pid_t >> +clone_call (struct posix_spawn_args *args, int flags, void *stack, >> + size_t stack_size) >> +{ >> + struct clone_args clone_args = >> + { >> + .flags = flags, >> + .exit_signal = SIGCHLD, >> + .stack = (uintptr_t) stack, >> + .stack_size = stack_size, >> + }; >> + return __clone_internal (&clone_args, spawni_child, args); >> +} >> + >> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to optimize >> + memory and overcommit) and return TRUE if the helper was created or if the >> + failure was not due resource exhaustion. */ >> +static bool >> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec, >> + void *stack, size_t stack_size) >> +{ >> + /* The clone flags used will create a new child that will run in the same >> + memory space (CLONE_VM) and the execution of calling thread will be >> + suspend until the child calls execve or _exit. >> + >> + Also since the calling thread execution will be suspend, there is not >> + need for CLONE_SETTLS. Although parent and child share the same TLS >> + namespace, there will be no concurrent access for TLS variables (errno >> + for instance). */ >> + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, stack_size); >> + >> + /* It needs to collect the case where the auxiliary process was created >> + but failed to execute the file (due either any preparation step or >> + for execve itself). */ >> + if (*new_pid > 0) >> + { >> + /* Also, it handles the unlikely case where the auxiliary process was >> + terminated before calling execve as if it was successfully. The >> + args.err is set to 0 as default and changed to a positive value >> + only in case of failure, so in case of premature termination >> + due a signal args.err will remain zeroed and it will be up to >> + caller to actually collect it. */ >> + *ec = args->err; >> + if (*ec > 0) >> + /* There still an unlikely case where the child is cancelled after >> + setting args.err, due to a positive error value. Also there is >> + possible pid reuse race (where the kernel allocated the same pid >> + to an unrelated process). Unfortunately due synchronization >> + issues where the kernel might not have the process collected >> + the waitpid below can not use WNOHANG. */ >> + __waitpid (*new_pid, NULL, 0); >> + } >> + else >> + *ec = errno; >> + >> + /* There is no much point in retrying with fork and exec if kernel returns a >> + failure due resource exhaustion. */ >> + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); >> +} >> + >> +/* Fallback spawn case which does not use CLONE_VM. Any preparation step or >> + execve failure is passed with a pipe, which requires additional care by >> + the helper stating process since it additional file descriptors handle. */ >> +static void >> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, >> + char *stack, size_t stack_size) >> +{ >> + if (__pipe2 (args->pipe, O_CLOEXEC) != 0) >> + { >> + *ec = errno; >> + return; >> + } >> + >> + /* Do not trigger atfork handler nor any internal state reset since the >> + helper process will call execve. */ >> + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); >> + >> + __close (args->pipe[1]); >> + >> + if (*new_pid > 0) >> + { >> + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) >> + /* A successful execve will close the helper process pipe end. */ >> + *ec = 0; >> + else >> + __waitpid (*new_pid, NULL, 0); >> + } >> + else >> + *ec = errno; >> + >> + __close (args->pipe[0]); >> +} >> + >> /* Spawn a new process executing PATH with the attributes describes in *ATTRP. >> Before running the process perform the actions described in FILE-ACTIONS. */ >> static int >> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, >> args.argc = argc; >> args.envp = envp; >> args.xflags = xflags; >> + args.pipe[0] = args.pipe[1] = -1; >> >> __libc_signal_block_all (&args.oldmask); >> >> - /* The clone flags used will create a new child that will run in the same >> - memory space (CLONE_VM) and the execution of calling thread will be >> - suspend until the child calls execve or _exit. >> - >> - Also since the calling thread execution will be suspend, there is not >> - need for CLONE_SETTLS. Although parent and child share the same TLS >> - namespace, there will be no concurrent access for TLS variables (errno >> - for instance). */ >> - struct clone_args clone_args = >> - { >> - .flags = CLONE_VM | CLONE_VFORK, >> - .exit_signal = SIGCHLD, >> - .stack = (uintptr_t) stack, >> - .stack_size = stack_size, >> - }; >> - new_pid = __clone_internal (&clone_args, __spawni_child, &args); >> - >> - /* It needs to collect the case where the auxiliary process was created >> - but failed to execute the file (due either any preparation step or >> - for execve itself). */ >> - if (new_pid > 0) >> - { >> - /* Also, it handles the unlikely case where the auxiliary process was >> - terminated before calling execve as if it was successfully. The >> - args.err is set to 0 as default and changed to a positive value >> - only in case of failure, so in case of premature termination >> - due a signal args.err will remain zeroed and it will be up to >> - caller to actually collect it. */ >> - ec = args.err; >> - if (ec > 0) >> - /* There still an unlikely case where the child is cancelled after >> - setting args.err, due to a positive error value. Also there is >> - possible pid reuse race (where the kernel allocated the same pid >> - to an unrelated process). Unfortunately due synchronization >> - issues where the kernel might not have the process collected >> - the waitpid below can not use WNOHANG. */ >> - __waitpid (new_pid, NULL, 0); >> - } >> - else >> - ec = errno; >> + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace restriction >> + (for instance Linux does not allow processes in different time namespaces >> + to share address space) and in this case clone fails with EINVAL. Retry >> + with fork and exec. */ >> + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) >> + spawni_fork (&args, &new_pid, &ec, stack, stack_size); >> >> __munmap (stack, stack_size);
On 2022-05-03 20:50, Adhemerval Zanella wrote: > On 03/05/2022 14:36, Alexey Izbyshev wrote: >> On 2022-05-03 20:00, Adhemerval Zanella wrote: >>> Even though it might characterizeis as a kernel bug (incomplete >>> kernel >>> support for CLONE_VM used along with CLONE_VFORK), time namespace >>> support >>> is already presented in a some kernel releases (since 2020). >>> >>> Although I agree with Carlos that it results in unexpected surprise >>> behaviour, I also see that providing a reasonable fallback that does >>> not >>> add possible issues like race conditions (as some syscall fallbacks >>> in >>> the >>> past) provides a easier transition and make the interfaca to work on >>> multiple scenarios. >>> >>> The fix itself does adds some corner cases that need to be handled, >>> but >>> I still think it is maintanable and a general improvement. >>> >>> -- >>> >>> Linux clone with CLONE_VM may fail for some namespace restriction >>> (for >>> instance if kernel does not allow processes in different time >>> namespaces >>> to share the sameaddress space). >>> >>> In this case clone fails with EINVAL and posix_spawn can not spawn a >>> new >>> process. However the same process can be spawned with fork and exec. >>> >>> The patch fixes by retrying the clone syscall with just CLONE_VFORK >>> if clone fails with a non transient failure (ENOMEM and EAGAIN still >>> returns failure to caller). >>> >>> Error on preparation phase or execve is returned by a pipe now that >>> there is no shared memory that ca be used. It requires some extra >>> care >>> for some file operations: >>> >>> * If the file operation would clobber the pipe file descriptor, the >>> helper process dup the pipe onto an unoccupied file descriptor. >>> >>> * dup2 file action that targets the pipe file descriptor returns >>> EBADF. >>> >>> * If closefrom argument overlaps the pipe file descriptor, it is >>> splited in two calls: [lowdp, pipe - 1] and [pipe + 1, ~Ou]. >>> >>> Failure on prepare phase in helper process does not trigger the fork >>> and exec fallback. >>> >>> Checked on x86_64-linux-gnu. >>> --- >>> v3: Fixed closerange fallback,. >>> v2: Fixed closerange file actions, handle EPIPE, return all errors, >>> handle if pipe fd is used in file actions. >>> --- >>> include/unistd.h | 4 +- >>> io/closefrom.c | 2 +- >>> sysdeps/unix/sysv/linux/closefrom_fallback.c | 4 +- >>> sysdeps/unix/sysv/linux/spawni.c | 235 >>> +++++++++++++++---- >>> 4 files changed, 196 insertions(+), 49 deletions(-) >>> >>> diff --git a/include/unistd.h b/include/unistd.h >>> index af795a37c8..2643991a09 100644 >>> --- a/include/unistd.h >>> +++ b/include/unistd.h >>> @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int >>> __lowfd, _Bool dirfd_fallback) >>> return false; >>> } >>> # else >>> -extern _Bool __closefrom_fallback (int __lowfd, _Bool) >>> attribute_hidden; >>> +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int >>> __to, >>> + _Bool) >>> + attribute_hidden; >>> # endif >>> extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); >>> libc_hidden_proto (__read) >>> diff --git a/io/closefrom.c b/io/closefrom.c >>> index 48cac2e3d1..a07de06e9b 100644 >>> --- a/io/closefrom.c >>> +++ b/io/closefrom.c >>> @@ -30,7 +30,7 @@ __closefrom (int lowfd) >>> if (r == 0) >>> return ; >>> >>> - if (!__closefrom_fallback (l, true)) >>> + if (!__closefrom_fallback (l, ~0U, true)) >>> __fortify_fail ("closefrom failed to close a file descriptor"); >>> } >>> weak_alias (__closefrom, closefrom) >>> diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> b/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> index a9dd0c46b2..ad301e1910 100644 >>> --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c >>> @@ -28,7 +28,7 @@ >>> /proc/self/fd open will trigger a fallback that tries to close a >>> file >>> descriptor before proceed. */ >>> _Bool >>> -__closefrom_fallback (int from, _Bool dirfd_fallback) >>> +__closefrom_fallback (unsigned int from, unsigned int to, _Bool >>> dirfd_fallback) >>> { >>> int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | >>> O_DIRECTORY, >>> 0); >>> @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool >>> dirfd_fallback) >>> >>> if (fd == dirfd || fd < from) >>> continue; >>> + if (fd > to) >>> + break; >>> >> __closefrom_fallback() currently ignores "to" if it can't open >> /proc/self/fd and dirfd_fallback is true. This path is not used by >> posix_spawn(), but it's a potential future bug. > > Indeed, I think iterate over [from, to] instead of [from, INT_MAX] > should fix it. > Yes, with some care for the fact that "to" is unsigned. > >> >>> /* We ignore close errors because EBADF, EINTR, and EIO >>> means the >>> descriptor has been released. */ >>> diff --git a/sysdeps/unix/sysv/linux/spawni.c >>> b/sysdeps/unix/sysv/linux/spawni.c >>> index d6f5ca89cd..0481791013 100644 >>> --- a/sysdeps/unix/sysv/linux/spawni.c >>> +++ b/sysdeps/unix/sysv/linux/spawni.c >>> @@ -44,7 +44,11 @@ >>> third issue is done by a stack allocation in parent, and by using >>> a >>> field in struct spawn_args where the child can write an error >>> code. CLONE_VFORK ensures that the parent does not run until the >>> - child has either exec'ed successfully or exited. */ >>> + child has either exec'ed successfully or exited. >>> + >>> + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel >>> limitation >>> + such as time namespace), only CLONE_VFORK is used instead and the >>> + preparation and execve failures are communicated with a pipe. */ >>> >>> >>> /* The Unix standard contains a long explanation of the way to >>> signal >>> @@ -67,6 +71,7 @@ struct posix_spawn_args >>> char *const *envp; >>> int xflags; >>> int err; >>> + int pipe[2]; >>> }; >>> >>> /* Older version requires that shell script without shebang >>> definition >>> @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args >>> *args) >>> } >>> } >>> >>> +/* If the file operation would clobber the pipe fd used to >>> communicate with >>> + parent, dup the pipe onto an unoccupied file descriptor. */ >>> +static inline bool >>> +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) >>> +{ >>> + int fd; >>> + >>> + switch (fa->tag) >>> + { >>> + case spawn_do_close: >>> + fd = fa->action.close_action.fd; >>> + break; >>> + case spawn_do_open: >>> + fd = fa->action.open_action.fd; >>> + break; >>> + case spawn_do_dup2: >>> + fd = fa->action.dup2_action.newfd; >>> + break; >>> + case spawn_do_fchdir: >>> + fd = fa->action.fchdir_action.fd; >>> + default: >>> + return true; >>> + } >>> + >>> + if (fd == p[1]) >>> + { >>> + int r = __fcntl (p[1], F_DUPFD_CLOEXEC); >>> + if (r < 0) >>> + return false; >>> + __close_nocancel (p[1]); >>> + p[1] = r; >>> + } >>> + >>> + return true; >>> +} >>> + >>> +static inline bool >>> +spawni_fa_closerange (int from, int to) >>> +{ >>> +#if 0 >>> + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); >>> + return r == 0 || __closefrom_fallback (from, to, false); >>> +#else >>> + return __closefrom_fallback (from, to, false); >>> +#endif >>> +} >>> + >>> /* Function used in the clone call to setup the signals mask, >>> posix_spawn >>> attributes, and file actions. It run on its own stack (provided >>> by the >>> posix_spawn call). */ >>> -static int >>> -__spawni_child (void *arguments) >>> +static _Noreturn int >>> +spawni_child (void *arguments) >>> { >>> struct posix_spawn_args *args = arguments; >>> const posix_spawnattr_t *restrict attr = args->attr; >>> const posix_spawn_file_actions_t *file_actions = args->fa; >>> + bool use_pipe = args->pipe[0] != -1; >>> >>> /* The child must ensure that no signal handler are enabled >>> because it shared >>> memory with parent, so the signal disposition must be either >>> SIG_DFL or >>> @@ -113,6 +166,9 @@ __spawni_child (void *arguments) >>> struct sigaction sa; >>> memset (&sa, '\0', sizeof (sa)); >>> >>> + if (use_pipe) >>> + __close (args->pipe[0]); >>> + >>> sigset_t hset; >>> __sigprocmask (SIG_BLOCK, 0, &hset); >>> for (int sig = 1; sig < _NSIG; ++sig) >>> @@ -181,6 +237,9 @@ __spawni_child (void *arguments) >>> { >>> struct __spawn_action *action = &file_actions->__actions[cnt]; >>> >>> + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) >>> + goto fail; >>> + >>> switch (action->tag) >>> { >>> case spawn_do_close: >>> @@ -233,6 +292,11 @@ __spawni_child (void *arguments) >>> break; >>> >>> case spawn_do_dup2: >>> + if (use_pipe && action->action.dup2_action.fd == >>> args->pipe[1]) >>> + { >>> + errno = EBADF; >>> + goto fail; >>> + } >>> /* Austin Group issue #411 requires adddup2 action with >>> source >>> and destination being equal to remove close-on-exec flag. >>> */ >>> if (action->action.dup2_action.fd >>> @@ -264,8 +328,18 @@ __spawni_child (void *arguments) >>> case spawn_do_closefrom: >>> { >>> int lowfd = action->action.closefrom_action.from; >>> - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, >>> 0); >>> - if (r != 0 && !__closefrom_fallback (lowfd, false)) >>> + /* Skip the pipe descriptor if it is used. No need to >>> handle >>> + it since it is created with O_CLOEXEC. */ >>> + if (use_pipe && args->pipe[1] >= lowfd) >>> + { >>> + if (args->pipe[1] == lowfd >>> + && !spawni_fa_closerange (lowfd + 1, ~0U)) >>> + goto fail; >>> + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - >>> 1) >>> + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) >> >> The "else" branch will also run if "args->pipe[1] == lowfd" and >> spawni_fa_closerange (lowfd + 1, ~0U) succeeds. > > I was trying to be too clever here, I think the following is more > clear: > > if (use_pipe && args->pipe[1] == lowfd) > { > if (!spawni_fa_closerange (lowfd + 1, ~0U)) > goto fail; > } > else if (use_pipe && args->pipe[1] > lowfd) > { > if (!spawni_fa_closerange (lowfd, args->pipe[1] - > 1) > || !spawni_fa_closerange (args->pipe[1] + 1, > ~0U)) > goto fail; > } > else if (!spawni_fa_closerange (lowfd, ~0U)) > goto fail; > Yeah, it should work. In a (theoretical?) case when args->pipe[1] == INT_MAX it might also make sense to avoid signed overflow (UB) on additions (e.g. by using "lowfd + 1u"). >>> } break; >>> >>> @@ -300,10 +374,112 @@ fail: >>> (EINTERNALBUG) describing that, use ECHILD. Another option >>> would >>> be to set args->err to some negative sentinel and have the >>> parent >>> abort(), but that seems needlessly harsh. */ >>> - args->err = errno ? : ECHILD; >>> + int ret = errno ? : ECHILD; >>> + if (use_pipe) >>> + { >>> + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < >>> 0) >>> + if (errno == EPIPE || errno == EBADF) >>> + break; >>> + } >>> + else >>> + args->err = ret; >>> + >>> _exit (SPAWN_ERROR); >>> } >>> >>> +static pid_t >>> +clone_call (struct posix_spawn_args *args, int flags, void *stack, >>> + size_t stack_size) >>> +{ >>> + struct clone_args clone_args = >>> + { >>> + .flags = flags, >>> + .exit_signal = SIGCHLD, >>> + .stack = (uintptr_t) stack, >>> + .stack_size = stack_size, >>> + }; >>> + return __clone_internal (&clone_args, spawni_child, args); >>> +} >>> + >>> +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to >>> optimize >>> + memory and overcommit) and return TRUE if the helper was created >>> or if the >>> + failure was not due resource exhaustion. */ >>> +static bool >>> +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int >>> *ec, >>> + void *stack, size_t stack_size) >>> +{ >>> + /* The clone flags used will create a new child that will run in >>> the same >>> + memory space (CLONE_VM) and the execution of calling thread >>> will be >>> + suspend until the child calls execve or _exit. >>> + >>> + Also since the calling thread execution will be suspend, there >>> is not >>> + need for CLONE_SETTLS. Although parent and child share the >>> same TLS >>> + namespace, there will be no concurrent access for TLS variables >>> (errno >>> + for instance). */ >>> + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, >>> stack_size); >>> + >>> + /* It needs to collect the case where the auxiliary process was >>> created >>> + but failed to execute the file (due either any preparation step >>> or >>> + for execve itself). */ >>> + if (*new_pid > 0) >>> + { >>> + /* Also, it handles the unlikely case where the auxiliary >>> process was >>> + terminated before calling execve as if it was successfully. >>> The >>> + args.err is set to 0 as default and changed to a positive value >>> + only in case of failure, so in case of premature termination >>> + due a signal args.err will remain zeroed and it will be up to >>> + caller to actually collect it. */ >>> + *ec = args->err; >>> + if (*ec > 0) >>> + /* There still an unlikely case where the child is cancelled >>> after >>> + setting args.err, due to a positive error value. Also there >>> is >>> + possible pid reuse race (where the kernel allocated the same >>> pid >>> + to an unrelated process). Unfortunately due synchronization >>> + issues where the kernel might not have the process collected >>> + the waitpid below can not use WNOHANG. */ >>> + __waitpid (*new_pid, NULL, 0); >>> + } >>> + else >>> + *ec = errno; >>> + >>> + /* There is no much point in retrying with fork and exec if kernel >>> returns a >>> + failure due resource exhaustion. */ >>> + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); >>> +} >>> + >>> +/* Fallback spawn case which does not use CLONE_VM. Any preparation >>> step or >>> + execve failure is passed with a pipe, which requires additional >>> care by >>> + the helper stating process since it additional file descriptors >>> handle. */ >>> +static void >>> +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, >>> + char *stack, size_t stack_size) >>> +{ >>> + if (__pipe2 (args->pipe, O_CLOEXEC) != 0) >>> + { >>> + *ec = errno; >>> + return; >>> + } >>> + >>> + /* Do not trigger atfork handler nor any internal state reset >>> since the >>> + helper process will call execve. */ >>> + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); >>> + >>> + __close (args->pipe[1]); >>> + >>> + if (*new_pid > 0) >>> + { >>> + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) >>> + /* A successful execve will close the helper process pipe end. >>> */ >>> + *ec = 0; >>> + else >>> + __waitpid (*new_pid, NULL, 0); >>> + } >>> + else >>> + *ec = errno; >>> + >>> + __close (args->pipe[0]); >>> +} >>> + >>> /* Spawn a new process executing PATH with the attributes describes >>> in *ATTRP. >>> Before running the process perform the actions described in >>> FILE-ACTIONS. */ >>> static int >>> @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, >>> args.argc = argc; >>> args.envp = envp; >>> args.xflags = xflags; >>> + args.pipe[0] = args.pipe[1] = -1; >>> >>> __libc_signal_block_all (&args.oldmask); >>> >>> - /* The clone flags used will create a new child that will run in >>> the same >>> - memory space (CLONE_VM) and the execution of calling thread >>> will be >>> - suspend until the child calls execve or _exit. >>> - >>> - Also since the calling thread execution will be suspend, there >>> is not >>> - need for CLONE_SETTLS. Although parent and child share the >>> same TLS >>> - namespace, there will be no concurrent access for TLS variables >>> (errno >>> - for instance). */ >>> - struct clone_args clone_args = >>> - { >>> - .flags = CLONE_VM | CLONE_VFORK, >>> - .exit_signal = SIGCHLD, >>> - .stack = (uintptr_t) stack, >>> - .stack_size = stack_size, >>> - }; >>> - new_pid = __clone_internal (&clone_args, __spawni_child, &args); >>> - >>> - /* It needs to collect the case where the auxiliary process was >>> created >>> - but failed to execute the file (due either any preparation step >>> or >>> - for execve itself). */ >>> - if (new_pid > 0) >>> - { >>> - /* Also, it handles the unlikely case where the auxiliary >>> process was >>> - terminated before calling execve as if it was successfully. >>> The >>> - args.err is set to 0 as default and changed to a positive value >>> - only in case of failure, so in case of premature termination >>> - due a signal args.err will remain zeroed and it will be up to >>> - caller to actually collect it. */ >>> - ec = args.err; >>> - if (ec > 0) >>> - /* There still an unlikely case where the child is cancelled >>> after >>> - setting args.err, due to a positive error value. Also there >>> is >>> - possible pid reuse race (where the kernel allocated the same >>> pid >>> - to an unrelated process). Unfortunately due synchronization >>> - issues where the kernel might not have the process collected >>> - the waitpid below can not use WNOHANG. */ >>> - __waitpid (new_pid, NULL, 0); >>> - } >>> - else >>> - ec = errno; >>> + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace >>> restriction >>> + (for instance Linux does not allow processes in different time >>> namespaces >>> + to share address space) and in this case clone fails with >>> EINVAL. Retry >>> + with fork and exec. */ >>> + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) >>> + spawni_fork (&args, &new_pid, &ec, stack, stack_size); >>> >>> __munmap (stack, stack_size);
diff --git a/include/unistd.h b/include/unistd.h index af795a37c8..2643991a09 100644 --- a/include/unistd.h +++ b/include/unistd.h @@ -167,7 +167,9 @@ static inline _Bool __closefrom_fallback (int __lowfd, _Bool dirfd_fallback) return false; } # else -extern _Bool __closefrom_fallback (int __lowfd, _Bool) attribute_hidden; +extern _Bool __closefrom_fallback (unsigned int __from, unsigned int __to, + _Bool) + attribute_hidden; # endif extern ssize_t __read (int __fd, void *__buf, size_t __nbytes); libc_hidden_proto (__read) diff --git a/io/closefrom.c b/io/closefrom.c index 48cac2e3d1..a07de06e9b 100644 --- a/io/closefrom.c +++ b/io/closefrom.c @@ -30,7 +30,7 @@ __closefrom (int lowfd) if (r == 0) return ; - if (!__closefrom_fallback (l, true)) + if (!__closefrom_fallback (l, ~0U, true)) __fortify_fail ("closefrom failed to close a file descriptor"); } weak_alias (__closefrom, closefrom) diff --git a/sysdeps/unix/sysv/linux/closefrom_fallback.c b/sysdeps/unix/sysv/linux/closefrom_fallback.c index a9dd0c46b2..ad301e1910 100644 --- a/sysdeps/unix/sysv/linux/closefrom_fallback.c +++ b/sysdeps/unix/sysv/linux/closefrom_fallback.c @@ -28,7 +28,7 @@ /proc/self/fd open will trigger a fallback that tries to close a file descriptor before proceed. */ _Bool -__closefrom_fallback (int from, _Bool dirfd_fallback) +__closefrom_fallback (unsigned int from, unsigned int to, _Bool dirfd_fallback) { int dirfd = __open_nocancel (FD_TO_FILENAME_PREFIX, O_RDONLY | O_DIRECTORY, 0); @@ -84,6 +84,8 @@ __closefrom_fallback (int from, _Bool dirfd_fallback) if (fd == dirfd || fd < from) continue; + if (fd > to) + break; /* We ignore close errors because EBADF, EINTR, and EIO means the descriptor has been released. */ diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c index d6f5ca89cd..0481791013 100644 --- a/sysdeps/unix/sysv/linux/spawni.c +++ b/sysdeps/unix/sysv/linux/spawni.c @@ -44,7 +44,11 @@ third issue is done by a stack allocation in parent, and by using a field in struct spawn_args where the child can write an error code. CLONE_VFORK ensures that the parent does not run until the - child has either exec'ed successfully or exited. */ + child has either exec'ed successfully or exited. + + If the clone with CLONE_VM and CLONE_VFORK fails (due any kernel limitation + such as time namespace), only CLONE_VFORK is used instead and the + preparation and execve failures are communicated with a pipe. */ /* The Unix standard contains a long explanation of the way to signal @@ -67,6 +71,7 @@ struct posix_spawn_args char *const *envp; int xflags; int err; + int pipe[2]; }; /* Older version requires that shell script without shebang definition @@ -94,15 +99,63 @@ maybe_script_execute (struct posix_spawn_args *args) } } +/* If the file operation would clobber the pipe fd used to communicate with + parent, dup the pipe onto an unoccupied file descriptor. */ +static inline bool +spawni_fa_handle_pipe (const struct __spawn_action *fa, int p[]) +{ + int fd; + + switch (fa->tag) + { + case spawn_do_close: + fd = fa->action.close_action.fd; + break; + case spawn_do_open: + fd = fa->action.open_action.fd; + break; + case spawn_do_dup2: + fd = fa->action.dup2_action.newfd; + break; + case spawn_do_fchdir: + fd = fa->action.fchdir_action.fd; + default: + return true; + } + + if (fd == p[1]) + { + int r = __fcntl (p[1], F_DUPFD_CLOEXEC); + if (r < 0) + return false; + __close_nocancel (p[1]); + p[1] = r; + } + + return true; +} + +static inline bool +spawni_fa_closerange (int from, int to) +{ +#if 0 + int r = INLINE_SYSCALL_CALL (close_range, from, to, 0); + return r == 0 || __closefrom_fallback (from, to, false); +#else + return __closefrom_fallback (from, to, false); +#endif +} + /* Function used in the clone call to setup the signals mask, posix_spawn attributes, and file actions. It run on its own stack (provided by the posix_spawn call). */ -static int -__spawni_child (void *arguments) +static _Noreturn int +spawni_child (void *arguments) { struct posix_spawn_args *args = arguments; const posix_spawnattr_t *restrict attr = args->attr; const posix_spawn_file_actions_t *file_actions = args->fa; + bool use_pipe = args->pipe[0] != -1; /* The child must ensure that no signal handler are enabled because it shared memory with parent, so the signal disposition must be either SIG_DFL or @@ -113,6 +166,9 @@ __spawni_child (void *arguments) struct sigaction sa; memset (&sa, '\0', sizeof (sa)); + if (use_pipe) + __close (args->pipe[0]); + sigset_t hset; __sigprocmask (SIG_BLOCK, 0, &hset); for (int sig = 1; sig < _NSIG; ++sig) @@ -181,6 +237,9 @@ __spawni_child (void *arguments) { struct __spawn_action *action = &file_actions->__actions[cnt]; + if (use_pipe && !spawni_fa_handle_pipe (action, args->pipe)) + goto fail; + switch (action->tag) { case spawn_do_close: @@ -233,6 +292,11 @@ __spawni_child (void *arguments) break; case spawn_do_dup2: + if (use_pipe && action->action.dup2_action.fd == args->pipe[1]) + { + errno = EBADF; + goto fail; + } /* Austin Group issue #411 requires adddup2 action with source and destination being equal to remove close-on-exec flag. */ if (action->action.dup2_action.fd @@ -264,8 +328,18 @@ __spawni_child (void *arguments) case spawn_do_closefrom: { int lowfd = action->action.closefrom_action.from; - int r = INLINE_SYSCALL_CALL (close_range, lowfd, ~0U, 0); - if (r != 0 && !__closefrom_fallback (lowfd, false)) + /* Skip the pipe descriptor if it is used. No need to handle + it since it is created with O_CLOEXEC. */ + if (use_pipe && args->pipe[1] >= lowfd) + { + if (args->pipe[1] == lowfd + && !spawni_fa_closerange (lowfd + 1, ~0U)) + goto fail; + else if (!spawni_fa_closerange (lowfd, args->pipe[1] - 1) + || !spawni_fa_closerange (args->pipe[1] + 1, ~0U)) + goto fail; + } + else if (!spawni_fa_closerange (lowfd, ~0U)) goto fail; } break; @@ -300,10 +374,112 @@ fail: (EINTERNALBUG) describing that, use ECHILD. Another option would be to set args->err to some negative sentinel and have the parent abort(), but that seems needlessly harsh. */ - args->err = errno ? : ECHILD; + int ret = errno ? : ECHILD; + if (use_pipe) + { + while (__write_nocancel (args->pipe[1], &ret, sizeof (ret)) < 0) + if (errno == EPIPE || errno == EBADF) + break; + } + else + args->err = ret; + _exit (SPAWN_ERROR); } +static pid_t +clone_call (struct posix_spawn_args *args, int flags, void *stack, + size_t stack_size) +{ + struct clone_args clone_args = + { + .flags = flags, + .exit_signal = SIGCHLD, + .stack = (uintptr_t) stack, + .stack_size = stack_size, + }; + return __clone_internal (&clone_args, spawni_child, args); +} + +/* Spawn a new process using clone with CLONE_VM | CLONE_VFORK (to optimize + memory and overcommit) and return TRUE if the helper was created or if the + failure was not due resource exhaustion. */ +static bool +spawni_clone (struct posix_spawn_args *args, pid_t *new_pid, int *ec, + void *stack, size_t stack_size) +{ + /* The clone flags used will create a new child that will run in the same + memory space (CLONE_VM) and the execution of calling thread will be + suspend until the child calls execve or _exit. + + Also since the calling thread execution will be suspend, there is not + need for CLONE_SETTLS. Although parent and child share the same TLS + namespace, there will be no concurrent access for TLS variables (errno + for instance). */ + *new_pid = clone_call (args, CLONE_VM | CLONE_VFORK, stack, stack_size); + + /* It needs to collect the case where the auxiliary process was created + but failed to execute the file (due either any preparation step or + for execve itself). */ + if (*new_pid > 0) + { + /* Also, it handles the unlikely case where the auxiliary process was + terminated before calling execve as if it was successfully. The + args.err is set to 0 as default and changed to a positive value + only in case of failure, so in case of premature termination + due a signal args.err will remain zeroed and it will be up to + caller to actually collect it. */ + *ec = args->err; + if (*ec > 0) + /* There still an unlikely case where the child is cancelled after + setting args.err, due to a positive error value. Also there is + possible pid reuse race (where the kernel allocated the same pid + to an unrelated process). Unfortunately due synchronization + issues where the kernel might not have the process collected + the waitpid below can not use WNOHANG. */ + __waitpid (*new_pid, NULL, 0); + } + else + *ec = errno; + + /* There is no much point in retrying with fork and exec if kernel returns a + failure due resource exhaustion. */ + return *new_pid > 0 || (errno == ENOMEM || errno == EAGAIN); +} + +/* Fallback spawn case which does not use CLONE_VM. Any preparation step or + execve failure is passed with a pipe, which requires additional care by + the helper stating process since it additional file descriptors handle. */ +static void +spawni_fork (struct posix_spawn_args *args, pid_t *new_pid, int *ec, + char *stack, size_t stack_size) +{ + if (__pipe2 (args->pipe, O_CLOEXEC) != 0) + { + *ec = errno; + return; + } + + /* Do not trigger atfork handler nor any internal state reset since the + helper process will call execve. */ + *new_pid = clone_call (args, CLONE_VFORK, stack, stack_size); + + __close (args->pipe[1]); + + if (*new_pid > 0) + { + if (__read (args->pipe[0], ec, sizeof *ec) != sizeof *ec) + /* A successful execve will close the helper process pipe end. */ + *ec = 0; + else + __waitpid (*new_pid, NULL, 0); + } + else + *ec = errno; + + __close (args->pipe[0]); +} + /* Spawn a new process executing PATH with the attributes describes in *ATTRP. Before running the process perform the actions described in FILE-ACTIONS. */ static int @@ -367,49 +543,16 @@ __spawnix (pid_t * pid, const char *file, args.argc = argc; args.envp = envp; args.xflags = xflags; + args.pipe[0] = args.pipe[1] = -1; __libc_signal_block_all (&args.oldmask); - /* The clone flags used will create a new child that will run in the same - memory space (CLONE_VM) and the execution of calling thread will be - suspend until the child calls execve or _exit. - - Also since the calling thread execution will be suspend, there is not - need for CLONE_SETTLS. Although parent and child share the same TLS - namespace, there will be no concurrent access for TLS variables (errno - for instance). */ - struct clone_args clone_args = - { - .flags = CLONE_VM | CLONE_VFORK, - .exit_signal = SIGCHLD, - .stack = (uintptr_t) stack, - .stack_size = stack_size, - }; - new_pid = __clone_internal (&clone_args, __spawni_child, &args); - - /* It needs to collect the case where the auxiliary process was created - but failed to execute the file (due either any preparation step or - for execve itself). */ - if (new_pid > 0) - { - /* Also, it handles the unlikely case where the auxiliary process was - terminated before calling execve as if it was successfully. The - args.err is set to 0 as default and changed to a positive value - only in case of failure, so in case of premature termination - due a signal args.err will remain zeroed and it will be up to - caller to actually collect it. */ - ec = args.err; - if (ec > 0) - /* There still an unlikely case where the child is cancelled after - setting args.err, due to a positive error value. Also there is - possible pid reuse race (where the kernel allocated the same pid - to an unrelated process). Unfortunately due synchronization - issues where the kernel might not have the process collected - the waitpid below can not use WNOHANG. */ - __waitpid (new_pid, NULL, 0); - } - else - ec = errno; + /* clone with CLONE_VM | CLONE_VFORK may fail for some namespace restriction + (for instance Linux does not allow processes in different time namespaces + to share address space) and in this case clone fails with EINVAL. Retry + with fork and exec. */ + if (!spawni_clone (&args, &new_pid, &ec, stack, stack_size)) + spawni_fork (&args, &new_pid, &ec, stack, stack_size); __munmap (stack, stack_size);