mbox series

[v2,00/15] Make the user mode driver code a better citizen

Message ID 87bll17ili.fsf_-_@x220.int.ebiederm.org
Headers show
Series Make the user mode driver code a better citizen | expand

Message

Eric W. Biederman June 29, 2020, 7:55 p.m. UTC
This is the second round of my changeset to split the user mode driver
code from the user mode helper code, and to make the code use common
facilities to get things done instead of recreating them just
for the user mode driver code.

I have split the changes into small enough pieces so they should be
easily readable and testable.

The changes lean into the preexisting interfaces in the kernel and
remove special cases for user mode driver code in favor of solutions
that don't need special cases.  This results in smaller code with fewer
bugs.

At a practical level this removes the maintenance burden of the user
mode drivers from the user mode helper code and from exec as the special
cases are removed.

Similarly the LSM interaction bugs are fixed by not having unnecessary
special cases for user mode drivers.

I have tested thes changes by booting with the code compiled in and
by killing "bpfilter_umh" and running iptables -vnL to restart
the userspace driver.

I have compiled tested each change with and without CONFIG_BPFILTER
enabled.

I made a few very small changes from v1 to v2:
- Updated the function name in a comment when the function is renamed
- Moved some more code so that the the !CONFIG_BPFILTER case continues
  to compile when I moved the code into umd.c
- A fix for the module loading case to really flush the file descriptor.
- Removed split_argv entirely from fork_usermode_driver.
  There was nothing to split so it was just confusing.

Please let me know if you see any bugs.  Once the code review is
finished I plan to place the code in a non-rebasing branch
so I can pull it into my tree and so it can also be pulled into
the bpf-next tree.

Eric W. Biederman (15):
      umh: Capture the pid in umh_pipe_setup
      umh: Move setting PF_UMH into umh_pipe_setup
      umh: Rename the user mode driver helpers for clarity
      umh: Remove call_usermodehelper_setup_file.
      umh: Separate the user mode driver and the user mode helper support
      umd: For clarity rename umh_info umd_info
      umd: Rename umd_info.cmdline umd_info.driver_name
      umd: Transform fork_usermode_blob into fork_usermode_driver
      umh: Stop calling do_execve_file
      exec: Remove do_execve_file
      bpfilter: Move bpfilter_umh back into init data
      umd: Track user space drivers with struct pid
      bpfilter: Take advantage of the facilities of struct pid
      umd: Remove exit_umh
      umd: Stop using split_argv

 fs/exec.c                        |  38 ++------
 include/linux/binfmts.h          |   1 -
 include/linux/bpfilter.h         |   7 +-
 include/linux/sched.h            |   9 --
 include/linux/umd.h              |  18 ++++
 include/linux/umh.h              |  15 ----
 kernel/Makefile                  |   1 +
 kernel/exit.c                    |   1 -
 kernel/umd.c                     | 182 +++++++++++++++++++++++++++++++++++++++
 kernel/umh.c                     | 171 +-----------------------------------
 net/bpfilter/bpfilter_kern.c     |  38 ++++----
 net/bpfilter/bpfilter_umh_blob.S |   2 +-
 net/ipv4/bpfilter/sockopt.c      |  20 +++--
 13 files changed, 248 insertions(+), 255 deletions(-)

v1: https://lkml.kernel.org/r/87pn9mgfc2.fsf_-_@x220.int.ebiederm.org
---
git range-diff master v1 v2

 1:  2b76f9b3158d !  1:  d8fb851fa3d8 umh: Capture the pid in umh_pipe_setup
    @@ Commit message
         code that is specific to user mode drivers from the common user path of
         user mode helpers.
     
    +    Link: https://lkml.kernel.org/r/87h7uygf9i.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umh.h ##
 2:  d853e933ae32 !  2:  b191c5df43ec umh: Move setting PF_UMH into umh_pipe_setup
    @@ Commit message
         Setting PF_UMH unconditionally is harmless as an action will only
         happen if it is paired with an entry on umh_list.
     
    +    Link: https://lkml.kernel.org/r/87bll6gf8t.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## kernel/umh.c ##
 3:  92d2550f0d6a !  3:  74e8c0bf3076 umh: Rename the user mode driver helpers for clarity
    @@ Commit message
         don't make much sense.  Instead name them  umd_setup and umd_cleanup
         for the functional role in setting up user mode drivers.
     
    +    Link: https://lkml.kernel.org/r/875zbegf82.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## kernel/umh.c ##
    @@ kernel/umh.c: static int umh_pipe_setup(struct subprocess_info *info, struct cre
      {
      	struct umh_info *umh_info = info->data;
      
    +-	/* cleanup if umh_pipe_setup() was successful but exec failed */
    ++	/* cleanup if umh_setup() was successful but exec failed */
    + 	if (info->retval) {
    + 		fput(umh_info->pipe_to_umh);
    + 		fput(umh_info->pipe_from_umh);
     @@ kernel/umh.c: int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
      	}
      
 4:  5a9cc2c6c64f !  4:  6652f7c0a909 umh: Remove call_usermodehelper_setup_file.
    @@ Commit message
         For this to work the argv_free is moved from umh_clean_and_save_pid
         to fork_usermode_blob.
     
    +    Link: https://lkml.kernel.org/r/87zh8qf0mp.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umh.h ##
 5:  03ed13fa8eee !  5:  2a1ccb05cf9f umh: Separate the user mode driver and the user mode helper support
    @@ Commit message
         This makes the kernel smaller for everyone who does not use a usermode
         driver.
     
    +    v2: Moved exit_umh from sched.h to umd.h and handle the case when the
    +    code is compiled out.
    +
    +    Link: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/bpfilter.h ##
    @@ include/linux/bpfilter.h
      struct sock;
      int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
     
    + ## include/linux/sched.h ##
    +@@ include/linux/sched.h: static inline void rseq_execve(struct task_struct *t)
    + 
    + #endif
    + 
    +-void __exit_umh(struct task_struct *tsk);
    +-
    +-static inline void exit_umh(struct task_struct *tsk)
    +-{
    +-	if (unlikely(tsk->flags & PF_UMH))
    +-		__exit_umh(tsk);
    +-}
    +-
    + #ifdef CONFIG_DEBUG_RSEQ
    + 
    + void rseq_syscall(struct pt_regs *regs);
    +
      ## include/linux/umd.h (new) ##
     @@
     +#ifndef __LINUX_UMD_H__
    @@ include/linux/umd.h (new)
     +
     +#include <linux/umh.h>
     +
    ++#ifdef CONFIG_BPFILTER
    ++void __exit_umh(struct task_struct *tsk);
    ++
    ++static inline void exit_umh(struct task_struct *tsk)
    ++{
    ++	if (unlikely(tsk->flags & PF_UMH))
    ++		__exit_umh(tsk);
    ++}
    ++#else
    ++static inline void exit_umh(struct task_struct *tsk)
    ++{
    ++}
    ++#endif
    ++
     +struct umh_info {
     +	const char *cmdline;
     +	struct file *pipe_to_umh;
    @@ kernel/Makefile: obj-y     = fork.o exec_domain.o panic.o \
      obj-$(CONFIG_MULTIUSER) += groups.o
      
     
    + ## kernel/exit.c ##
    +@@
    + #include <linux/random.h>
    + #include <linux/rcuwait.h>
    + #include <linux/compat.h>
    ++#include <linux/umd.h>
    + 
    + #include <linux/uaccess.h>
    + #include <asm/unistd.h>
    +
      ## kernel/umd.c (new) ##
     @@
     +// SPDX-License-Identifier: GPL-2.0-only
    @@ kernel/umd.c (new)
     +{
     +	struct umh_info *umh_info = info->data;
     +
    -+	/* cleanup if umh_pipe_setup() was successful but exec failed */
    ++	/* cleanup if umh_setup() was successful but exec failed */
     +	if (info->retval) {
     +		fput(umh_info->pipe_to_umh);
     +		fput(umh_info->pipe_from_umh);
    @@ kernel/umh.c: struct subprocess_info *call_usermodehelper_setup(const char *path
     -{
     -	struct umh_info *umh_info = info->data;
     -
    --	/* cleanup if umh_pipe_setup() was successful but exec failed */
    +-	/* cleanup if umh_setup() was successful but exec failed */
     -	if (info->retval) {
     -		fput(umh_info->pipe_to_umh);
     -		fput(umh_info->pipe_from_umh);
 6:  698bfbcb6c7f !  6:  b16081fb8d92 umd: For clarity rename umh_info umd_info
    @@ Commit message
         This structure is only used for user mode drivers so change
         the prefix from umh to umd to make that clear.
     
    +    Link: https://lkml.kernel.org/r/87o8p6f0kw.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/bpfilter.h ##
    @@ include/linux/bpfilter.h: int bpfilter_ip_set_sockopt(struct sock *sk, int optna
      	int (*sockopt)(struct sock *sk, int optname,
     
      ## include/linux/umd.h ##
    -@@
    - 
    - #include <linux/umh.h>
    +@@ include/linux/umd.h: static inline void exit_umh(struct task_struct *tsk)
    + }
    + #endif
      
     -struct umh_info {
     +struct umd_info {
    @@ kernel/umd.c: static int umd_setup(struct subprocess_info *info, struct cred *ne
     -	struct umh_info *umh_info = info->data;
     +	struct umd_info *umd_info = info->data;
      
    - 	/* cleanup if umh_pipe_setup() was successful but exec failed */
    + 	/* cleanup if umh_setup() was successful but exec failed */
      	if (info->retval) {
     -		fput(umh_info->pipe_to_umh);
     -		fput(umh_info->pipe_from_umh);
 7:  9cdcb5e7fc61 !  7:  42c13aa9c526 umd: Rename umd_info.cmdline umd_info.driver_name
    @@ Commit message
         driver_name any place where the code is looking for a name
         of the binary.
     
    +    Link: https://lkml.kernel.org/r/87imfef0k3.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umd.h ##
    -@@
    - #include <linux/umh.h>
    +@@ include/linux/umd.h: static inline void exit_umh(struct task_struct *tsk)
    + #endif
      
      struct umd_info {
     -	const char *cmdline;
 8:  5ada2f70ae21 !  8:  385ed14a025b umd: Transform fork_usermode_blob into fork_usermode_driver
    @@ Commit message
         path based LSMs there are no new special cases.
     
         [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
    +    Link: https://lkml.kernel.org/r/87d05mf0j9.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umd.h ##
    @@ include/linux/umd.h
      #include <linux/umh.h>
     +#include <linux/path.h>
      
    - struct umd_info {
    - 	const char *driver_name;
    + #ifdef CONFIG_BPFILTER
    + void __exit_umh(struct task_struct *tsk);
     @@ include/linux/umd.h: struct umd_info {
      	struct file *pipe_from_umh;
      	struct list_head list;
    @@ kernel/umd.c
      #include <linux/pipe_fs_i.h>
     +#include <linux/mount.h>
     +#include <linux/fs_struct.h>
    ++#include <linux/task_work.h>
      #include <linux/umd.h>
      
      static LIST_HEAD(umh_list);
    @@ kernel/umd.c
     +		return ERR_PTR(err);
     +	}
     +
    -+	__fput_sync(file);
    ++	fput(file);
    ++
    ++	/* Flush delayed fput so exec can open the file read-only */
    ++	flush_delayed_fput();
    ++	task_work_run();
     +	return mnt;
     +}
     +
 9:  e4ff478e77c9 !  9:  eeae92e3f0da umh: Stop calling do_execve_file
    @@ Commit message
         call_usermodehelper_exec_async that would call do_execve_file instead
         of do_execve if file was set.
     
    +    Link: https://lkml.kernel.org/r/877dvuf0i7.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umh.h ##
10:  dc0a38f6bd51 ! 10:  c7fdaf5660b8 exec: Remove do_execve_file
    @@ Commit message
     
         Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
         [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
    +    Link: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## fs/exec.c ##
11:  d0c0c2ddf53b ! 11:  43d08e6986a7 bpfilter: Move bpfilter_umh back into init data
    @@ Commit message
         the blob the blob no longer needs to live .rodata to allow for restarting.
         So move the blob back to .init.rodata.
     
    +    Link: https://lkml.kernel.org/r/87sgeidlvq.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## net/bpfilter/bpfilter_umh_blob.S ##
12:  51b703ad75dd ! 12:  729ee744af46 umd: Track user space drivers with struct pid
    @@ Commit message
         As the tgid is now refcounted verify the tgid is NULL at the start of
         fork_usermode_driver to avoid the possibility of silent pid leaks.
     
    +    Link: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/umd.h ##
13:  cdadf89503c9 ! 13:  2d85b10b965e bpfilter: Take advantage of the facilities of struct pid
    @@ Commit message
         struct pid can be tested to see if a process still exists, and that
         struct pid has a wait queue that notifies when the process dies.
     
    +    Link: https://lkml.kernel.org/r/87h7uydlu9.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/bpfilter.h ##
14:  1d621649e144 ! 14:  6e7e8ddd2b44 umd: Remove exit_umh
    @@ Commit message
         callback is what exit_umh exists to call.  So remove exit_umh and all
         of it's associated booking.
     
    +    Link: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org
    +    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
     
      ## include/linux/sched.h ##
    @@ include/linux/sched.h: extern struct pid *cad_pid;
      #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
      #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
      #define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
    -@@ include/linux/sched.h: static inline void rseq_execve(struct task_struct *t)
    - 
    - #endif
    +
    + ## include/linux/umd.h ##
    +@@
    + #include <linux/umh.h>
    + #include <linux/path.h>
      
    +-#ifdef CONFIG_BPFILTER
     -void __exit_umh(struct task_struct *tsk);
     -
     -static inline void exit_umh(struct task_struct *tsk)
    @@ include/linux/sched.h: static inline void rseq_execve(struct task_struct *t)
     -	if (unlikely(tsk->flags & PF_UMH))
     -		__exit_umh(tsk);
     -}
    +-#else
    +-static inline void exit_umh(struct task_struct *tsk)
    +-{
    +-}
    +-#endif
     -
    - #ifdef CONFIG_DEBUG_RSEQ
    - 
    - void rseq_syscall(struct pt_regs *regs);
    -
    - ## include/linux/umd.h ##
    -@@ include/linux/umd.h: struct umd_info {
    + struct umd_info {
      	const char *driver_name;
      	struct file *pipe_to_umh;
      	struct file *pipe_from_umh;
    @@ include/linux/umd.h: struct umd_info {
      };
     
      ## kernel/exit.c ##
    +@@
    + #include <linux/random.h>
    + #include <linux/rcuwait.h>
    + #include <linux/compat.h>
    +-#include <linux/umd.h>
    + 
    + #include <linux/uaccess.h>
    + #include <asm/unistd.h>
     @@ kernel/exit.c: void __noreturn do_exit(long code)
      	exit_task_namespaces(tsk);
      	exit_task_work(tsk);
    @@ kernel/exit.c: void __noreturn do_exit(long code)
     
      ## kernel/umd.c ##
     @@
    - #include <linux/fs_struct.h>
    + #include <linux/task_work.h>
      #include <linux/umd.h>
      
     -static LIST_HEAD(umh_list);
 -:  ------------ > 15:  662deff06d76 umd: Stop using split_argv

Comments

Alexei Starovoitov June 29, 2020, 10:12 p.m. UTC | #1
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
> 
> I have tested thes changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
> 
> I have compiled tested each change with and without CONFIG_BPFILTER
> enabled.

With
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
it doesn't build:

ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!

I've added:
+EXPORT_SYMBOL(kill_pid_info);
to continue testing...

And then did:
while true; do iptables -L;rmmod bpfilter; done
 
Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

I suspect patch 13 is somehow responsible:
+	if (tgid) {
+		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
+		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+		bpfilter_umh_cleanup(info);
+	}

I cannot figure out why it hangs. Some sort of race ?
Since adding short delay between kill and wait makes it work.
Eric W. Biederman June 30, 2020, 1:13 a.m. UTC | #2
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>> 
>> I have tested thes changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>> 
>> I have compiled tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...
>
> And then did:
> while true; do iptables -L;rmmod bpfilter; done
>  
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>
> I suspect patch 13 is somehow responsible:
> +	if (tgid) {
> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +		bpfilter_umh_cleanup(info);
> +	}
>
> I cannot figure out why it hangs. Some sort of race ?
> Since adding short delay between kill and wait makes it work.

Thanks.  I will take a look.

Eric
Tetsuo Handa June 30, 2020, 6:16 a.m. UTC | #3
On 2020/06/30 10:13, Eric W. Biederman wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
>> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>>>
>>> I have tested thes changes by booting with the code compiled in and
>>> by killing "bpfilter_umh" and running iptables -vnL to restart
>>> the userspace driver.
>>>
>>> I have compiled tested each change with and without CONFIG_BPFILTER
>>> enabled.
>>
>> With
>> CONFIG_BPFILTER=y
>> CONFIG_BPFILTER_UMH=m
>> it doesn't build:
>>
>> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>>
>> I've added:
>> +EXPORT_SYMBOL(kill_pid_info);
>> to continue testing...

kill_pid() is already exported.

>>
>> And then did:
>> while true; do iptables -L;rmmod bpfilter; done
>>  
>> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>>
>> I suspect patch 13 is somehow responsible:
>> +	if (tgid) {
>> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +		bpfilter_umh_cleanup(info);
>> +	}
>>
>> I cannot figure out why it hangs. Some sort of race ?
>> Since adding short delay between kill and wait makes it work.

Because there is a race window that detach_pid() from __unhash_process() from
__exit_signal() from release_task() from exit_notify() from do_exit() is called
some time after wake_up_all(&pid->wait_pidfd) from do_notify_pidfd() from
do_notify_parent() from exit_notify() from do_exit() was called (in other words,
we can't use pid->wait_pidfd when pid_task() is used at wait_event()) ?

Below are changes I suggest.

diff --git a/kernel/umd.c b/kernel/umd.c
index ff79fb16d738..f688813b8830 100644
--- a/kernel/umd.c
+++ b/kernel/umd.c
@@ -26,7 +26,7 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na
 	if (IS_ERR(mnt))
 		return mnt;
 
-	file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY, 0700);
+	file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY | O_EXCL, 0700);
 	if (IS_ERR(file)) {
 		mntput(mnt);
 		return ERR_CAST(file);
@@ -52,16 +52,18 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na
 
 /**
  * umd_load_blob - Remember a blob of bytes for fork_usermode_driver
- * @info: information about usermode driver
- * @data: a blob of bytes that can be executed as a file
- * @len:  The lentgh of the blob
+ * @info: information about usermode driver (shouldn't be NULL)
+ * @data: a blob of bytes that can be executed as a file (shouldn't be NULL)
+ * @len:  The lentgh of the blob (shouldn't be 0)
  *
  */
 int umd_load_blob(struct umd_info *info, const void *data, size_t len)
 {
 	struct vfsmount *mnt;
 
-	if (WARN_ON_ONCE(info->wd.dentry || info->wd.mnt))
+	if (!info || !info->driver_name || !data || !len)
+		return -EINVAL;
+	if (info->wd.dentry || info->wd.mnt)
 		return -EBUSY;
 
 	mnt = blob_to_mnt(data, len, info->driver_name);
@@ -76,15 +78,14 @@ EXPORT_SYMBOL_GPL(umd_load_blob);
 
 /**
  * umd_unload_blob - Disassociate @info from a previously loaded blob
- * @info: information about usermode driver
+ * @info: information about usermode driver (shouldn't be NULL)
  *
  */
 int umd_unload_blob(struct umd_info *info)
 {
-	if (WARN_ON_ONCE(!info->wd.mnt ||
-			 !info->wd.dentry ||
-			 info->wd.mnt->mnt_root != info->wd.dentry))
+	if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
 		return -EINVAL;
+	BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
 
 	kern_unmount(info->wd.mnt);
 	info->wd.mnt = NULL;
@@ -138,7 +139,7 @@ static void umd_cleanup(struct subprocess_info *info)
 {
 	struct umd_info *umd_info = info->data;
 
-	/* cleanup if umh_setup() was successful but exec failed */
+	/* cleanup if umd_setup() was successful but exec failed */
 	if (info->retval) {
 		fput(umd_info->pipe_to_umh);
 		fput(umd_info->pipe_from_umh);
@@ -163,7 +164,10 @@ int fork_usermode_driver(struct umd_info *info)
 	const char *argv[] = { info->driver_name, NULL };
 	int err;
 
-	if (WARN_ON_ONCE(info->tgid))
+	if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
+		return -EINVAL;
+	BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
+	if (info->tgid)
 		return -EBUSY;
 
 	err = -ENOMEM;
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..9dd70aacb81a 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,13 @@ static void shutdown_umh(void)
 	struct pid *tgid = info->tgid;
 
 	if (tgid) {
-		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+		kill_pid(tgid, SIGKILL, 1);
+		while (({ bool done;
+			  rcu_read_lock();
+			  done = !pid_task(tgid, PIDTYPE_TGID);
+			  rcu_read_unlock();
+			  done; }))
+			schedule_timeout_uninterruptible(1);
 		bpfilter_umh_cleanup(info);
 	}
 }
Eric W. Biederman June 30, 2020, 12:29 p.m. UTC | #4
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

2> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>> 
>> I have tested thes changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>> 
>> I have compiled tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...

I am rather surprised I thought Tetsuo had already compile tested
modules.


> I suspect patch 13 is somehow responsible:
> +	if (tgid) {
> +		kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> +		wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +		bpfilter_umh_cleanup(info);
> +	}
>
> I cannot figure out why it hangs. Some sort of race ?
> Since adding short delay between kill and wait makes it work.

Having had a chance to sleep kill_pid_info was a thinko, as was
!pid_task.  It should have been !pid_has_task as that takes the proper
rcu locking.

I don't know if that is going to be enough to fix the wait_event
but those are obvious bugs that need to be fixed.

diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..3e1874030daa 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,8 @@ static void shutdown_umh(void)
        struct pid *tgid = info->tgid;
 
        if (tgid) {
-               kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-               wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+               kill_pid(tgid, SIGKILL, 1);
+               wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
                bpfilter_umh_cleanup(info);
        }
 }

> And then did:
> while true; do iptables -L;rmmod bpfilter; done
>  
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
release_task is called so there is a race.  As it is possible to wake
up and then go back to sleep before pid_has_task becomes false.

So I think I need a friendly helper that does:

bool task_has_exited(struct pid *tgid)
{
	bool exited = false;

	rcu_read_lock();
        tsk = pid_task(tgid, PIDTYPE_TGID);
        exited = !!tsk;
        if (tsk) {
        	exited = !!tsk->exit_state;
out:
	rcu_unlock();
	return exited;
}

There should be a sensible way to do that.

Eric
Tetsuo Handa June 30, 2020, 1:21 p.m. UTC | #5
On 2020/06/30 21:29, Eric W. Biederman wrote:
> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
> release_task is called so there is a race.  As it is possible to wake
> up and then go back to sleep before pid_has_task becomes false.

What is the reason we want to wait until pid_has_task() becomes false?

- wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
+ while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));




By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
that use of flush_delayed_fput() has to be careful. Al, is it safe to call
flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
called from both kernel thread and from process context (e.g. init_module()
syscall by /sbin/insmod )) ?
Alexei Starovoitov June 30, 2020, 4:52 p.m. UTC | #6
On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
> 
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index 91474884ddb7..3e1874030daa 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
>         struct pid *tgid = info->tgid;
>  
>         if (tgid) {
> -               kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> -               wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +               kill_pid(tgid, SIGKILL, 1);
> +               wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>                 bpfilter_umh_cleanup(info);
>         }
>  }
> 
> > And then did:
> > while true; do iptables -L;rmmod bpfilter; done
> >  
> > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
> 
> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
> release_task is called so there is a race.  As it is possible to wake
> up and then go back to sleep before pid_has_task becomes false.
> 
> So I think I need a friendly helper that does:
> 
> bool task_has_exited(struct pid *tgid)
> {
> 	bool exited = false;
> 
> 	rcu_read_lock();
>         tsk = pid_task(tgid, PIDTYPE_TGID);
>         exited = !!tsk;
>         if (tsk) {
>         	exited = !!tsk->exit_state;
> out:
> 	rcu_unlock();
> 	return exited;
> }

All makes sense to me.
If I understood the race condition such helper should indeed solve it.
Are you going to add such patch to your series?
I'll proceed with my work on top of your series and will ignore this
race for now, but I think it should be fixed before we land this set
into multiple trees.
Eric W. Biederman July 1, 2020, 5:12 p.m. UTC | #7
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
>> 
>> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
>> index 91474884ddb7..3e1874030daa 100644
>> --- a/net/bpfilter/bpfilter_kern.c
>> +++ b/net/bpfilter/bpfilter_kern.c
>> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
>>         struct pid *tgid = info->tgid;
>>  
>>         if (tgid) {
>> -               kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> -               wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +               kill_pid(tgid, SIGKILL, 1);
>> +               wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>                 bpfilter_umh_cleanup(info);
>>         }
>>  }
>> 
>> > And then did:
>> > while true; do iptables -L;rmmod bpfilter; done
>> >  
>> > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>> 
>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>> release_task is called so there is a race.  As it is possible to wake
>> up and then go back to sleep before pid_has_task becomes false.
>> 
>> So I think I need a friendly helper that does:
>> 
>> bool task_has_exited(struct pid *tgid)
>> {
>> 	bool exited = false;
>> 
>> 	rcu_read_lock();
>>         tsk = pid_task(tgid, PIDTYPE_TGID);
>>         exited = !!tsk;
>>         if (tsk) {
>>         	exited = !!tsk->exit_state;
>> out:
>> 	rcu_unlock();
>> 	return exited;
>> }
>
> All makes sense to me.
> If I understood the race condition such helper should indeed solve it.
> Are you going to add such patch to your series?
> I'll proceed with my work on top of your series and will ignore this
> race for now, but I think it should be fixed before we land this set
> into multiple trees.

Yes. I am just finishing it up now.

Eric
Eric W. Biederman July 2, 2020, 1:08 p.m. UTC | #8
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/06/30 21:29, Eric W. Biederman wrote:
>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>> release_task is called so there is a race.  As it is possible to wake
>> up and then go back to sleep before pid_has_task becomes false.
>
> What is the reason we want to wait until pid_has_task() becomes false?
>
> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));

So that it is safe to call bpfilter_umh_cleanup.  The previous code
performed the wait by having a callback in do_exit.

It might be possible to call bpf_umh_cleanup early but I have not done
that analysis.

To perform the test correctly what I have right now is:

bool thread_group_exited(struct pid *pid)
{
	struct task_struct *tsk;
	bool exited;

	rcu_read_lock();
	tsk = pid_task(pid, PIDTYPE_PID);
	exited = !tsk || (READ_ONCE(tsk->exit_state) && thread_group_empty(tsk));
	rcu_read_unlock();

	return exited;
}

Which is factored out of pidfd_poll.  Which means that this won't be
something that the bpfilter code has to maintain.  That seems to be a
fundamentally good facility to have regardless of bpfilter.

I will post the whole thing in a bit once I have a chance to dot my i's
and cross my t's.

> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
> called from both kernel thread and from process context (e.g. init_module()
> syscall by /sbin/insmod )) ?

And __fput_sync needs to be even more careful.
umd_load_blob is called in these changes without any locks held.

We fundamentally AKA in any correct version of this code need to flush
the file descriptor before we call exec or exec can not open it a
read-only denying all writes from any other opens.

The use case of flush_delayed_fput is exactly the same as that used
when loading the initramfs.

Eric
Tetsuo Handa July 2, 2020, 1:40 p.m. UTC | #9
On 2020/07/02 22:08, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
> 
>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>>> release_task is called so there is a race.  As it is possible to wake
>>> up and then go back to sleep before pid_has_task becomes false.
>>
>> What is the reason we want to wait until pid_has_task() becomes false?
>>
>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
> 
> So that it is safe to call bpfilter_umh_cleanup.  The previous code
> performed the wait by having a callback in do_exit.

But bpfilter_umh_cleanup() does only

	fput(info->pipe_to_umh);
	fput(info->pipe_from_umh);
	put_pid(info->tgid);
	info->tgid = NULL;

which is (I think) already safe regardless of the usermode process because
bpfilter_umh_cleanup() merely closes one side of two pipes used between
two processes and forgets about the usermode process.

> 
> It might be possible to call bpf_umh_cleanup early but I have not done
> that analysis.
> 
> To perform the test correctly what I have right now is:

Waiting for the termination of a SIGKILLed usermode process is not
such simple. If a usermode process was killed by the OOM killer, it
might take minutes for the killed process to reach do_exit() due to
invisible memory allocation dependency chain. Since the OOM killer
kicks the OOM reaper, and the OOM reaper forgets about the killed
process after one second if mmap_sem could not be held (in order to
avoid OOM deadlock), the OOM situation will be eventually solved; but
there is no guarantee that the killed process can reach do_exit()
in a short period.
Eric W. Biederman July 2, 2020, 4:02 p.m. UTC | #10
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
>> 
>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>>>> release_task is called so there is a race.  As it is possible to wake
>>>> up and then go back to sleep before pid_has_task becomes false.
>>>
>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>
>>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>> 
>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>> performed the wait by having a callback in do_exit.
>
> But bpfilter_umh_cleanup() does only
>
> 	fput(info->pipe_to_umh);
> 	fput(info->pipe_from_umh);
> 	put_pid(info->tgid);
> 	info->tgid = NULL;
>
> which is (I think) already safe regardless of the usermode process because
> bpfilter_umh_cleanup() merely closes one side of two pipes used between
> two processes and forgets about the usermode process.

It is not safe.

Baring bugs there is only one use of shtudown_umh that matters.  The one
in fini_umh.  The use of the file by the mm must be finished before
umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

>> It might be possible to call bpf_umh_cleanup early but I have not done
>> that analysis.
>> 
>> To perform the test correctly what I have right now is:
>
> Waiting for the termination of a SIGKILLed usermode process is not
> such simple.

The waiting is that simple.

You are correct it might not be a quick process.

A good general principle is to start with something simple and correct
for what it does, and then to make it more complicated when real world
cases show up, and it can be understood what the real challenges are.

I am not going to merge known broken code but I am also not going to
overcomplicate it.

Dealing with very rare and pathological cases that are not handled or
considered today is out of scope for my patchset.

Eric
Tetsuo Handa July 3, 2020, 1:19 p.m. UTC | #11
On 2020/07/02 22:08, Eric W. Biederman wrote:
>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>> called from both kernel thread and from process context (e.g. init_module()
>> syscall by /sbin/insmod )) ?
> 
> And __fput_sync needs to be even more careful.
> umd_load_blob is called in these changes without any locks held.

But where is the guarantee that a thread which called flush_delayed_fput() waits for
the completion of processing _all_ "struct file" linked into delayed_fput_list ?
If some other thread or delayed_fput_work (scheduled by fput_many()) called
flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
needs to be processed before execve(), can't it?

Also, I don't know how convoluted the dependency of all "struct file" linked into
delayed_fput_list might be, for there can be "struct file" which will not be a
simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.

On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
there is a guarantee that __fput_sync() waits for the completion of "struct file"
which needs to be flushed before execve(), isn't there?

> 
> We fundamentally AKA in any correct version of this code need to flush
> the file descriptor before we call exec or exec can not open it a
> read-only denying all writes from any other opens.
> 
> The use case of flush_delayed_fput is exactly the same as that used
> when loading the initramfs.

When loading the initramfs, the number of threads is quite few (which
means that the possibility of hitting the race window and convoluted
dependency is small).

But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
flush_delayed_fput() might be called after many number of threads already
started running.



On 2020/07/03 1:02, Eric W. Biederman wrote:
>>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>>>>> release_task is called so there is a race.  As it is possible to wake
>>>>> up and then go back to sleep before pid_has_task becomes false.
>>>>
>>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>>
>>>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>>>
>>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>>> performed the wait by having a callback in do_exit.
>>
>> But bpfilter_umh_cleanup() does only
>>
>> 	fput(info->pipe_to_umh);
>> 	fput(info->pipe_from_umh);
>> 	put_pid(info->tgid);
>> 	info->tgid = NULL;
>>
>> which is (I think) already safe regardless of the usermode process because
>> bpfilter_umh_cleanup() merely closes one side of two pipes used between
>> two processes and forgets about the usermode process.
> 
> It is not safe.
> 
> Baring bugs there is only one use of shtudown_umh that matters.  The one
> in fini_umh.  The use of the file by the mm must be finished before
> umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ?
LSM modules might prefer only one instance of filesystem for umd blobs.

For pathname based LSMs, since that filesystem is not visible from mount tree, only
info->driver_name can be used for distinction. Therefore, one instance of filesystem
with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable.

For inode based LSMs, reusing one instance of filesystem created upon early boot might
be convenient for labeling.

Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in
order to implement protections without labeling files. Then, we might also be able to
implement minimal protections without LSMs.
Eric W. Biederman July 3, 2020, 10:25 p.m. UTC | #12
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>> called from both kernel thread and from process context (e.g. init_module()
>>> syscall by /sbin/insmod )) ?
>> 
>> And __fput_sync needs to be even more careful.
>> umd_load_blob is called in these changes without any locks held.
>
> But where is the guarantee that a thread which called flush_delayed_fput() waits for
> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
> If some other thread or delayed_fput_work (scheduled by fput_many()) called
> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
> needs to be processed before execve(), can't it?

As a module the guarantee is we call task_work_run.
Built into the kernel the guarantee as best I can trace it is that
kthreadd hasn't started, and as such nothing that is scheduled has run
yet.

> Also, I don't know how convoluted the dependency of all "struct file" linked into
> delayed_fput_list might be, for there can be "struct file" which will not be a
> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>
> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
> there is a guarantee that __fput_sync() waits for the completion of "struct file"
> which needs to be flushed before execve(), isn't there?

There is really not a good helper or helpers, and this code suggests we
have something better.  Right now I have used the existing helpers to
the best of my ability.  If you or someone else wants to write a better
version of flushing so that exec can happen be my guest.

As far as I can tell what I have is good enough.

>> We fundamentally AKA in any correct version of this code need to flush
>> the file descriptor before we call exec or exec can not open it a
>> read-only denying all writes from any other opens.
>> 
>> The use case of flush_delayed_fput is exactly the same as that used
>> when loading the initramfs.
>
> When loading the initramfs, the number of threads is quite few (which
> means that the possibility of hitting the race window and convoluted
> dependency is small).

But the reality is the code run very early, before the initramfs is
initialized in practice.

> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
> flush_delayed_fput() might be called after many number of threads already
> started running.

At which point the code probably won't be runnig from a kernel thread
but instead will be running in a thread where task_work_run is relevant.

At worst it is a very small race, where someone else in another thread
starts flushing the file.  Which means the file could still be
completely close before exec.   Even that is not necessarily fatal,
as the usermode driver code has a respawn capability.

Code that is used enough that it hits that race sounds like a very
good problem to have from the perspective of the usermode driver code.

> Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ?
> LSM modules might prefer only one instance of filesystem for umd
> blobs.

It is simple. People are free to change it, but a single filesystem
seems like a very good place to start with this functionality.

> For pathname based LSMs, since that filesystem is not visible from mount tree, only
> info->driver_name can be used for distinction. Therefore, one instance of filesystem
> with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL)
> might be preferable.

I took a quick look and the creation and removal of files with the
in-kernel helpers is not particularly easy.  Certainly it is more work
and thus a higher likelyhood of bugs than what I have done.

A directory per driver does sound tempting.  Just more work that I am
willing to do.

> For inode based LSMs, reusing one instance of filesystem created upon early boot might
> be convenient for labeling.
>
> Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in
> order to implement protections without labeling files. Then, we might also be able to
> implement minimal protections without LSMs.

All valid points.  Nothing sets this design in stone.
Nothing says this is the endpoint of the evolution of this code.

The entire point of this patchset for me is that I remove the
unnecessary special cases from exec and do_exit, so I don't have to deal
with the usermode driver code anymore.

Eric
Tetsuo Handa July 4, 2020, 6:57 a.m. UTC | #13
On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
> 
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>>> called from both kernel thread and from process context (e.g. init_module()
>>>> syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() waits for
>> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
>> needs to be processed before execve(), can't it?
> 
> As a module the guarantee is we call task_work_run.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.

Have you ever checked how early the kthreadd (PID=2) gets started?

----------
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
        trace_task_newtask(p, clone_flags);
        uprobe_copy_process(p, clone_flags);

+       printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
        return p;

 bad_fork_cancel_cgroup:
----------

----------
[    0.090757][    T0] pid_max: default: 65536 minimum: 512
[    0.090890][    T0] LSM: Security Framework initializing
[    0.090890][    T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Disabled fast string operations
[    0.090890][    T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[    0.090890][    T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[    0.090890][    T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.090890][    T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[    0.090890][    T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.090890][    T0] SRBDS: Unknown: Dependent on hypervisor status
[    0.090890][    T0] MDS: Mitigation: Clear CPU buffers
[    0.090890][    T0] Freeing SMP alternatives memory: 24K
[    0.090890][    T0] Created PID: 1 Comm: swapper/0
[    0.090890][    T0] Created PID: 2 Comm: swapper/0
[    0.090890][    T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[    0.091000][    T2] Created PID: 3 Comm: kthreadd
[    0.091995][    T2] Created PID: 4 Comm: kthreadd
[    0.093028][    T2] Created PID: 5 Comm: kthreadd
[    0.093997][    T2] Created PID: 6 Comm: kthreadd
[    0.094995][    T2] Created PID: 7 Comm: kthreadd
[    0.096037][    T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[    0.135716][    T2] Created PID: 13 Comm: kthreadd
[    0.135716][    T1] smp: Bringing up secondary CPUs ...
[    0.135716][    T2] Created PID: 14 Comm: kthreadd
[    0.135716][    T2] Created PID: 15 Comm: kthreadd
[    0.135716][    T2] Created PID: 16 Comm: kthreadd
[    0.135716][    T2] Created PID: 17 Comm: kthreadd
[    0.135716][    T2] Created PID: 18 Comm: kthreadd
[    0.135716][    T1] x86: Booting SMP configuration:
(...snipped...)
[    0.901990][    T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.902145][    T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.902213][    T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes
[    0.902224][    T1] Trying to unpack rootfs image as initramfs...
[    1.107993][    T1] Freeing initrd memory: 18876K
[    1.109049][    T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.111003][    T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB)
[    1.112136][    T1] check: Scanning for low memory corruption every 60 seconds
[    1.115040][    T2] Created PID: 52 Comm: kthreadd
[    1.116110][    T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[    1.120936][    T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled
[    1.129626][    T2] Created PID: 53 Comm: kthreadd
[    1.131403][    T2] Created PID: 54 Comm: kthreadd
----------

kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init (PID=1) was created by
swapper/0 (PID=0). It is even before secondary CPUs are brought up, and far earlier than unpacking
initramfs.

And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel
thread that interferes fput() starts running? blob_to_mnt() needs to be prepared for being
called after many processes already started running.

> 
>> Also, I don't know how convoluted the dependency of all "struct file" linked into
>> delayed_fput_list might be, for there can be "struct file" which will not be a
>> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of "struct file"
>> which needs to be flushed before execve(), isn't there?
> 
> There is really not a good helper or helpers, and this code suggests we
> have something better.  Right now I have used the existing helpers to
> the best of my ability.  If you or someone else wants to write a better
> version of flushing so that exec can happen be my guest.
> 
> As far as I can tell what I have is good enough.

Just saying what you think is not a "review". I'm waiting for answer from Al Viro
because I consider that Al will be the most familiar with fput()'s behavior.
At least I consider that

	if (current->flags & PF_KTHREAD) {
		__fput_sync(file);
	} else {
		fput(file);
		task_work_run();
	}

is a candidate for closing the race window. And depending on Al's answer,
removing

	BUG_ON(!(task->flags & PF_KTHREAD));

 from __fput_sync() and unconditionally using

	__fput_sync(file);

 from blob_to_mnt() might be the better choice. Anyway, I consider that
Al's response is important for this "review".

> 
>>> We fundamentally AKA in any correct version of this code need to flush
>>> the file descriptor before we call exec or exec can not open it a
>>> read-only denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
> 
> But the reality is the code run very early, before the initramfs is
> initialized in practice.

Such expectation is not a reality.

> 
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after many number of threads already
>> started running.
> 
> At which point the code probably won't be runnig from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> 
> At worst it is a very small race, where someone else in another thread
> starts flushing the file.  Which means the file could still be
> completely close before exec.   Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
> 
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.

In general, unconditionally retrying call_usermodehelper() when it returned
a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which
code is an implementation of "a respawn capability"; I'd like to check where
that code is and whether that code is checking -ETXTBSY.
Eric W. Biederman July 8, 2020, 4:46 a.m. UTC | #14
Just to make certain I understand what is going on I instrumented a
kernel with some print statements.

a) The workqueues and timers start before populate_rootfs.

b) populate_rootfs does indeed happen long before the bpfilter
   module is intialized.

c) What prevents populate_rootfs and the umd_load_blob from
   having problems when they call flush_delayed_put is the
   fact that fput_many does:
   "schedule_delayed_work(&delayed_fput_work,1)".

   That 1 requests a delay of at least 1 jiffy.  A jiffy is between
   1ms and 10ms depending on how Linux is configured.

   In my test configuration running a kernel in kvm printing to a serial
   console I measured 0.8ms between the fput in blob_to_mnt and
   flush_delayed_fput which immediately follows it.

   So unless the fput becomes incredibly slow there is nothing to worry
   about in blob_to_mnt.

d) As the same mechanism is used by populate_rootfs.  A but in the
   mechanism applies to both.

e) No one appears to have reported a problem executing files out of
   initramfs these last several years since the flush_delayed_fput was
   introduced.
 
f) The code works for me.  There is real reason to believe the code will
   work for everyone else, as the exact same logic is used by initramfs.
   So it should be perfectly fine for the patchset and the
   usermode_driver code to go ahead as written.

h) If there is something to be fixed it is flush_delayed_fput as that is
   much more important than anything in the usermode driver code.

Eric

p.s.) When I talked of restarts of the usermode driver code ealier I was
   referring to the code that restarts the usermode driver if it is
   killed, the next time the kernel tries to talk to it.

   That could mask an -ETXTBUSY except if it happens on the first exec
   the net/bfilter/bpfilter_kern.c:load_umh() will return an error.
Luis Chamberlain July 8, 2020, 5:20 a.m. UTC | #15
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
> 
> I have tested thes changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
> 
> I have compiled tested each change with and without CONFIG_BPFILTER
> enabled.

Sounds like grounds for a selftests driver and respective selftest?
And if so, has the other issues Tetsuo reported be hacked into one?

  Luis