Message ID: 20210414055217.543246-1-avagin@gmail.com
Series: Allow executing code and syscalls in another address space
On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote: > We already have process_vm_readv and process_vm_writev to read and write > to a process memory faster than we can do this with ptrace. And now it > is time for process_vm_exec that allows executing code in an address > space of another process. We can do this with ptrace but it is much > slower. > > = Use-cases = It seems to me like your proposed API doesn't really fit either one of those usecases well... > Here are two known use-cases. The first one is “application kernel” > sandboxes like User-mode Linux and gVisor. In this case, we have a > process that runs the sandbox kernel and a set of stub processes that > are used to manage guest address spaces. Guest code is executed in the > context of stub processes but all system calls are intercepted and > handled in the sandbox kernel. Right now, these sort of sandboxes use > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > significantly speed them up. In this case, since you really only want an mm_struct to run code under, it seems weird to create a whole task with its own PID and so on. It seems to me like something similar to the /dev/kvm API would be more appropriate here? Implementation options that I see for that would be: 1. mm_struct-based: a set of syscalls to create a new mm_struct, change memory mappings under that mm_struct, and switch to it 2. pagetable-mirroring-based: like /dev/kvm, an API to create a new pagetable, mirror parts of the mm_struct's pagetables over into it with modified permissions (like KVM_SET_USER_MEMORY_REGION), and run code under that context. page fault handling would first handle the fault against mm->pgd as normal, then mirror the PTE over into the secondary pagetables. invalidation could be handled with MMU notifiers. > Another use-case is CRIU (Checkpoint/Restore in User-space). Several > process properties can be received only from the process itself. 
Right > now, we use a parasite code that is injected into the process. We do > this with ptrace but it is slow, unsafe, and tricky. But this API will only let you run code under the *mm* of the target process, not fully in the context of a target *task*, right? So you still won't be able to use this for accessing anything other than memory? That doesn't seem very generically useful to me. Also, I don't doubt that anything involving ptrace is kinda tricky, but it would be nice to have some more detail on what exactly makes this slow, unsafe and tricky. Are there API additions for ptrace that would make this work better? I imagine you're thinking of things like an API for injecting a syscall into the target process without having to first somehow find an existing SYSCALL instruction in the target process? > process_vm_exec can > simplify the process of injecting a parasite code and it will allow > pre-dump memory without stopping processes. The pre-dump here is when we > enable a memory tracker and dump the memory while a process is continue > running. On each interaction we dump memory that has been changed from > the previous iteration. In the final step, we will stop processes and > dump their full state. Right now the most effective way to dump process > memory is to create a set of pipes and splice memory into these pipes > from the parasite code. With process_vm_exec, we will be able to call > vmsplice directly. It means that we will not need to stop a process to > inject the parasite code. Alternatively you could add splice support to /proc/$pid/mem or add a syscall similar to process_vm_readv() that splices into a pipe, right?
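To make Jann's option 1 above concrete, a hypothetical "mmfd" interface could look roughly like the sketch below. Every name here (mm_create, mm_map, mm_exec) is invented purely for illustration; nothing like this exists in the kernel today, and the real design would have to settle questions such as how mappings are populated and how faults are reported:

```
/* Hypothetical sketch of the mm_struct-based option (1).
 * None of these syscalls exist; names and signatures are invented. */

/* Create an empty mm_struct and get a file descriptor for it. */
int mmfd = mm_create(0);

/* Establish a mapping inside that mm, analogous to mmap(), but
 * against the foreign address space instead of our own. */
void *addr = mm_map(mmfd, NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

/* Switch to the mm and run guest code from a register snapshot;
 * returns when the guest traps (syscall, signal, fault). */
struct sigcontext ctx = { /* initial register state */ };
int ret = mm_exec(mmfd, &ctx, 0);
```

This mirrors the /dev/kvm flow (create VM, set memory regions, run vCPU), just with an mm_struct standing in for guest page tables.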
On 14/04/2021 06:52, Andrei Vagin wrote: > We already have process_vm_readv and process_vm_writev to read and write > to a process memory faster than we can do this with ptrace. And now it > is time for process_vm_exec that allows executing code in an address > space of another process. We can do this with ptrace but it is much > slower. > > = Use-cases = > > Here are two known use-cases. The first one is “application kernel” > sandboxes like User-mode Linux and gVisor. In this case, we have a > process that runs the sandbox kernel and a set of stub processes that > are used to manage guest address spaces. Guest code is executed in the > context of stub processes but all system calls are intercepted and > handled in the sandbox kernel. Right now, these sort of sandboxes use > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > significantly speed them up. Certainly interesting, but will require um to rework most of its memory management and we will most likely need extra mm support to make use of it in UML. We are not likely to get away just with one syscall there. > > Another use-case is CRIU (Checkpoint/Restore in User-space). Several > process properties can be received only from the process itself. Right > now, we use a parasite code that is injected into the process. We do > this with ptrace but it is slow, unsafe, and tricky. process_vm_exec can > simplify the process of injecting a parasite code and it will allow > pre-dump memory without stopping processes. The pre-dump here is when we > enable a memory tracker and dump the memory while a process is continue > running. On each interaction we dump memory that has been changed from > the previous iteration. In the final step, we will stop processes and > dump their full state. Right now the most effective way to dump process > memory is to create a set of pipes and splice memory into these pipes > from the parasite code. With process_vm_exec, we will be able to call > vmsplice directly. 
It means that we will not need to stop a process to > inject the parasite code. > > = How it works = > > process_vm_exec has two modes: > > * Execute code in an address space of a target process and stop on any > signal or system call. > > * Execute a system call in an address space of a target process. > > int process_vm_exec(pid_t pid, struct sigcontext uctx, > unsigned long flags, siginfo_t siginfo, > sigset_t *sigmask, size_t sizemask) > > PID - target process identification. We can consider using pidfd > instead of PID here. > > sigcontext contains the process state with which the process will be > resumed after switching the address space; when the process is stopped, > its state will be saved back to sigcontext. > > siginfo is information about a signal that has interrupted the process. > If a process is interrupted by a system call, siginfo will contain a > synthetic siginfo of the SIGSYS signal. > > sigmask is a set of signals that process_vm_exec returns via siginfo. > > = How fast is it = > > In the fourth patch, you can find two benchmarks that execute a function > that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap > system calls, process_vm_exec uses the process_vm_exec syscall to do the > same thing. > > ptrace_vm_exec: 1446 ns/syscall > process_vm_exec: 289 ns/syscall > > PS: This version is just a prototype. Its goal is to collect the initial > feedback, to discuss the interfaces, and maybe to get some advice on > implementation. 
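For clarity, driving a stub process with the proposed syscall might look like the sketch below. This is hedged heavily: process_vm_exec exists only in this RFC and never landed upstream, and the helper functions (setup_guest_state, emulate_syscall, deliver_signal) are invented placeholders for what a sandbox kernel would do:

```
/* Sketch of a sandbox event loop over the proposed process_vm_exec().
 * The syscall is from this patch series only; treat this as pseudocode. */
struct sigcontext ctx;
siginfo_t si;
sigset_t mask;

sigfillset(&mask);            /* report every signal back via siginfo */
setup_guest_state(&ctx);      /* hypothetical: fill in guest registers */

for (;;) {
        /* Runs guest code in the stub's address space until a syscall
         * or signal occurs; the stopped state comes back in ctx/si. */
        if (process_vm_exec(stub_pid, &ctx, 0, &si, &mask, sizeof(mask)) < 0)
                err(1, "process_vm_exec");

        if (si.si_signo == SIGSYS)
                emulate_syscall(&ctx);   /* hypothetical: handle the trap */
        else
                deliver_signal(&si);     /* hypothetical */
}
```

Compared with PTRACE_SYSEMU, the key difference is that the trap returns directly into the sandbox kernel's loop instead of going through a scheduler round-trip, which is where the claimed 1446 ns vs 289 ns gap comes from.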
> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Andy Lutomirski <luto@kernel.org> > Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> > Cc: Christian Brauner <christian.brauner@ubuntu.com> > Cc: Dmitry Safonov <0x7f454c46@gmail.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Jeff Dike <jdike@addtoit.com> > Cc: Mike Rapoport <rppt@linux.ibm.com> > Cc: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> > Cc: Oleg Nesterov <oleg@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Richard Weinberger <richard@nod.at> > Cc: Thomas Gleixner <tglx@linutronix.de> > > Andrei Vagin (4): > signal: add a helper to restore a process state from sigcontex > arch/x86: implement the process_vm_exec syscall > arch/x86: allow to execute syscalls via process_vm_exec > selftests: add tests for process_vm_exec > > arch/Kconfig | 15 ++ > arch/x86/Kconfig | 1 + > arch/x86/entry/common.c | 19 +++ > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/x86/include/asm/sigcontext.h | 2 + > arch/x86/kernel/Makefile | 1 + > arch/x86/kernel/process_vm_exec.c | 160 ++++++++++++++++++ > arch/x86/kernel/signal.c | 125 ++++++++++---- > include/linux/entry-common.h | 2 + > include/linux/process_vm_exec.h | 17 ++ > include/linux/sched.h | 7 + > include/linux/syscalls.h | 6 + > include/uapi/asm-generic/unistd.h | 4 +- > include/uapi/linux/process_vm_exec.h | 8 + > kernel/entry/common.c | 2 +- > kernel/fork.c | 9 + > kernel/sys_ni.c | 2 + > .../selftests/process_vm_exec/Makefile | 7 + > tools/testing/selftests/process_vm_exec/log.h | 26 +++ > .../process_vm_exec/process_vm_exec.c | 105 ++++++++++++ > .../process_vm_exec/process_vm_exec_fault.c | 111 ++++++++++++ > .../process_vm_exec/process_vm_exec_syscall.c | 81 +++++++++ > .../process_vm_exec/ptrace_vm_exec.c | 111 ++++++++++++ > 23 files changed, 785 insertions(+), 37 deletions(-) > create mode 100644 arch/x86/kernel/process_vm_exec.c > create mode 100644 include/linux/process_vm_exec.h > create mode 100644 
include/uapi/linux/process_vm_exec.h > create mode 100644 tools/testing/selftests/process_vm_exec/Makefile > create mode 100644 tools/testing/selftests/process_vm_exec/log.h > create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c > create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c > create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c > create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c >
On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote: > On 14/04/2021 06:52, Andrei Vagin wrote: > > We already have process_vm_readv and process_vm_writev to read and write > > to a process memory faster than we can do this with ptrace. And now it > > is time for process_vm_exec that allows executing code in an address > > space of another process. We can do this with ptrace but it is much > > slower. > > > > = Use-cases = > > > > Here are two known use-cases. The first one is “application kernel” > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > process that runs the sandbox kernel and a set of stub processes that > > are used to manage guest address spaces. Guest code is executed in the > > context of stub processes but all system calls are intercepted and > > handled in the sandbox kernel. Right now, these sort of sandboxes use > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > significantly speed them up. > > Certainly interesting, but will require um to rework most of its memory > management and we will most likely need extra mm support to make use of > it in UML. We are not likely to get away just with one syscall there. Might help the seccomp mode though: https://patchwork.ozlabs.org/project/linux-um/list/?series=231980 johannes
On Wed, 2021-04-14 at 09:34 +0200, Johannes Berg wrote: > On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote: > > On 14/04/2021 06:52, Andrei Vagin wrote: > > > We already have process_vm_readv and process_vm_writev to read and > > > write > > > to a process memory faster than we can do this with ptrace. And now > > > it > > > is time for process_vm_exec that allows executing code in an > > > address > > > space of another process. We can do this with ptrace but it is much > > > slower. > > > > > > = Use-cases = > > > > > > Here are two known use-cases. The first one is “application kernel” > > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > > process that runs the sandbox kernel and a set of stub processes > > > that > > > are used to manage guest address spaces. Guest code is executed in > > > the > > > context of stub processes but all system calls are intercepted and > > > handled in the sandbox kernel. Right now, these sort of sandboxes > > > use > > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > > significantly speed them up. > > > > Certainly interesting, but will require um to rework most of its > > memory > > management and we will most likely need extra mm support to make use > > of > > it in UML. We are not likely to get away just with one syscall there. > > Might help the seccomp mode though: > > https://patchwork.ozlabs.org/project/linux-um/list/?series=231980 Hmm, to me it sounds like it replaces both ptrace and seccomp mode while completely avoiding the scheduling overhead that these techniques have. I think everything UML needs is covered: * The new API can do syscalls in the target memory space (we can modify the address space) * The new API can run code until the next syscall happens (or a signal happens, which means SIGALRM for scheduling works) * Single step tracing should work by setting EFLAGS I think the memory management itself stays fundamentally the same. 
We just do the initial clone() using CLONE_STOPPED. We don't need any stub code/data and we have everything we need to modify the address space and run the userspace process. Benjamin
* Andrei Vagin: > We already have process_vm_readv and process_vm_writev to read and write > to a process memory faster than we can do this with ptrace. And now it > is time for process_vm_exec that allows executing code in an address > space of another process. We can do this with ptrace but it is much > slower. > > = Use-cases = We also have some vaguely related needs within the same address space: running code on another thread, without modifying its stack, while it has signal handlers blocked, and without causing system calls to fail with EINTR. This can be used to implement certain kinds of memory barriers. It is also necessary to implement set*id with POSIX semantics in userspace. (Linux only changes the current thread's credentials; POSIX requires process-wide changes.) We currently use a signal for set*id, but it has issues (it can be blocked, the signal could come from somewhere else, etc.). We can't use signals for barriers because of the EINTR issue, and because the signal context is stored on the stack. Thanks, Florian
On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Andrei Vagin: > > > We already have process_vm_readv and process_vm_writev to read and write > > to a process memory faster than we can do this with ptrace. And now it > > is time for process_vm_exec that allows executing code in an address > > space of another process. We can do this with ptrace but it is much > > slower. > > > > = Use-cases = > > We also have some vaguely related within the same address space: running > code on another thread, without modifying its stack, while it has signal > handlers blocked, and without causing system calls to fail with EINTR. > This can be used to implement certain kinds of memory barriers. That's what the membarrier() syscall is for, right? Unless you don't want to register all threads for expedited membarrier use? > It is > also necessary to implement set*id with POSIX semantics in userspace. > (Linux only changes the current thread credentials, POSIX requires > process-wide changes.) We currently use a signal for set*id, but it has > issues (it can be blocked, the signal could come from somewhere, etc.). > We can't use signals for barriers because of the EINTR issue, and > because the signal context is stored on the stack. This essentially becomes a question of "how much is set*id allowed to block and what level of guarantee should there be by the time it returns that no threads will perform privileged actions anymore after it returns", right? Like, if some piece of kernel code grabs a pointer to the current credentials or acquires a temporary reference to some privileged resource, then blocks on reading an argument from userspace, and then performs a privileged action using the previously-grabbed credentials or resource, what behavior do you want? Should setuid() block until that privileged action has completed? Should it abort that action (which is kinda what you get with the signals approach)? 
Should it just return immediately even though an attacker who can write to process memory at that point might still be able to influence a privileged operation that hasn't read all its inputs yet? Should the kernel be designed to keep track of whether it is currently holding a privileged resource? Or should the kernel just specifically permit credential changes in specific places where it is known that a task might block for a long time and it is not holding any privileged resources (kinda like the approach taken for freezer stuff)? If userspace wants multithreaded setuid() without syscall aborting, things get gnarly really fast; and having an interface to remotely perform operations under another task's context isn't really relevant to the core problem here, I think.
* Jann Horn: > On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Andrei Vagin: >> >> > We already have process_vm_readv and process_vm_writev to read and write >> > to a process memory faster than we can do this with ptrace. And now it >> > is time for process_vm_exec that allows executing code in an address >> > space of another process. We can do this with ptrace but it is much >> > slower. >> > >> > = Use-cases = >> >> We also have some vaguely related within the same address space: running >> code on another thread, without modifying its stack, while it has signal >> handlers blocked, and without causing system calls to fail with EINTR. >> This can be used to implement certain kinds of memory barriers. > > That's what the membarrier() syscall is for, right? Unless you don't > want to register all threads for expedited membarrier use? membarrier is not sufficiently powerful for revoking biased locks, for example. For the EINTR issue, <https://github.com/golang/go/issues/38836> is an example. I believe CIFS has since seen a few fixes (after someone reported that tar on CIFS wouldn't work because a SIGCHLD caused utimensat to fail, and there isn't even a signal handler for SIGCHLD!), but the time it took to get to this point doesn't give me confidence that it is safe to send signals to a thread that is running unknown code. But as you explained regarding the set*id broadcast, it seems that if we had this run-on-another-thread functionality, we would likely encounter issues similar to those with SA_RESTART. We don't see the issue with set*id today because it's a rare operation, and multi-threaded file servers that need to change credentials frequently opt out of the set*id broadcast anyway. (What I have in mind is a future world where any printf call, any malloc call, can trigger such a broadcast.) The cross-VM CRIU scenario would probably be somewhere in between (not quite the printf/malloc level, but more frequent than set*id). 
Thanks, Florian
On Wed, Apr 14, 2021 at 2:20 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Jann Horn: > > > On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Andrei Vagin: > >> > >> > We already have process_vm_readv and process_vm_writev to read and write > >> > to a process memory faster than we can do this with ptrace. And now it > >> > is time for process_vm_exec that allows executing code in an address > >> > space of another process. We can do this with ptrace but it is much > >> > slower. > >> > > >> > = Use-cases = > >> > >> We also have some vaguely related within the same address space: running > >> code on another thread, without modifying its stack, while it has signal > >> handlers blocked, and without causing system calls to fail with EINTR. > >> This can be used to implement certain kinds of memory barriers. > > > > That's what the membarrier() syscall is for, right? Unless you don't > > want to register all threads for expedited membarrier use? > > membarrier is not sufficiently powerful for revoking biased locks, for > example.

But on Linux >=5.10, together with rseq, it is, right? Then lock acquisition could look roughly like this, in pseudo-C (yes, I know, real rseq doesn't quite look like that, you'd need inline asm for that unless the compiler adds special support for this):

enum local_state {
	STATE_FREE_OR_BIASED,
	STATE_LOCKED
};
#define OWNER_LOCKBIT    (1U<<31)
#define OWNER_WAITER_BIT (1U<<30) /* notify futex when OWNER_LOCKBIT is cleared */
struct biased_lock {
	unsigned int owner_with_lockbit;
	enum local_state local_state;
};

void lock(struct biased_lock *L)
{
	unsigned int my_tid = THREAD_SELF->tid;

	RSEQ_SEQUENCE_START(); // restart here on failure
	if (READ_ONCE(L->owner_with_lockbit) == my_tid) {
		if (READ_ONCE(L->local_state) == STATE_LOCKED) {
			RSEQ_SEQUENCE_END();
			/*
			 * Deadlock, abort execution.
			 * Note that we are not necessarily actually *holding* the lock;
			 * this can also happen if we entered a signal handler while we
			 * were in the process of acquiring the lock.
			 * But in that case it could just as well have happened that we
			 * already grabbed the lock, so the caller is wrong anyway.
			 */
			fatal_error();
		}
		RSEQ_COMMIT(L->local_state = STATE_LOCKED);
		return; /* fastpath success */
	}
	RSEQ_SEQUENCE_END();

	/* slowpath */
	/* acquire and lock owner field */
	unsigned int old_owner_with_lockbit;
	while (1) {
		old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
		if (old_owner_with_lockbit & OWNER_LOCKBIT) {
			if (!__sync_bool_compare_and_swap(&L->owner_with_lockbit,
					old_owner_with_lockbit,
					my_tid | OWNER_LOCKBIT | OWNER_WAITER_BIT))
				continue;
			futex(&L->owner_with_lockbit, FUTEX_WAIT,
			      old_owner_with_lockbit, NULL, NULL, 0);
			continue;
		} else {
			if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
					old_owner_with_lockbit,
					my_tid | OWNER_LOCKBIT))
				break;
		}
	}

	/*
	 * ensure old owner won't lock local_state anymore.
	 * we only have to worry about the owner that directly preceded us here;
	 * it will have done this step for the owners that preceded it before
	 * clearing the LOCKBIT; so if we were the old owner, we don't have to sync.
	 */
	if (old_owner_with_lockbit != my_tid) {
		if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0))
			fatal_error();
	}

	/*
	 * As soon as the lock becomes STATE_FREE_OR_BIASED, we own it; but
	 * at this point it might still be locked.
	 */
	while (READ_ONCE(L->local_state) == STATE_LOCKED)
		futex(&L->local_state, FUTEX_WAIT, STATE_LOCKED, NULL, NULL, 0);

	/* OK, now the lock is biased to us and we can grab it. */
	WRITE_ONCE(L->local_state, STATE_LOCKED);

	/* drop lockbit */
	while (1) {
		old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
		if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
				old_owner_with_lockbit, my_tid))
			break;
	}
	if (old_owner_with_lockbit & OWNER_WAITER_BIT)
		futex(&L->owner_with_lockbit, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

void unlock(struct biased_lock *L)
{
	unsigned int my_tid = THREAD_SELF->tid;

	/*
	 * If we run before the membarrier(), the lock() path will immediately
	 * see the lock as uncontended, and we don't need to call futex().
	 * If we run after the membarrier(), the ->owner_with_lockbit read
	 * here will observe the new owner and we'll wake the futex.
	 */
	RSEQ_SEQUENCE_START();
	unsigned int old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
	RSEQ_COMMIT(WRITE_ONCE(L->local_state, STATE_FREE_OR_BIASED));
	if (old_owner_with_lockbit != my_tid)
		futex(&L->local_state, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote: > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote: > > We already have process_vm_readv and process_vm_writev to read and write > > to a process memory faster than we can do this with ptrace. And now it > > is time for process_vm_exec that allows executing code in an address > > space of another process. We can do this with ptrace but it is much > > slower. > > > > = Use-cases = > > It seems to me like your proposed API doesn't really fit either one of > those usecases well... We definitely can invent more specific interfaces for each of these problems. Sure, they would handle their use-cases a bit better than this generic one. But do we want to have two very specific interfaces with two separate kernel implementations? My previous experience has shown that the kernel community doesn't like interfaces that are specific to only one narrow use-case. So when I was working on process_vm_exec, I tried to design one interface that would be good enough for all these use-cases. > > > Here are two known use-cases. The first one is “application kernel” > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > process that runs the sandbox kernel and a set of stub processes that > > are used to manage guest address spaces. Guest code is executed in the > > context of stub processes but all system calls are intercepted and > > handled in the sandbox kernel. Right now, these sort of sandboxes use > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > significantly speed them up. > > In this case, since you really only want an mm_struct to run code > under, it seems weird to create a whole task with its own PID and so > on. It seems to me like something similar to the /dev/kvm API would be > more appropriate here? Implementation options that I see for that > would be: > > 1. mm_struct-based: > a set of syscalls to create a new mm_struct, > change memory mappings under that mm_struct, and switch to it > 2. pagetable-mirroring-based: > like /dev/kvm, an API to create a new pagetable, mirror parts of > the mm_struct's pagetables over into it with modified permissions > (like KVM_SET_USER_MEMORY_REGION), > and run code under that context. > page fault handling would first handle the fault against mm->pgd > as normal, then mirror the PTE over into the secondary pagetables. > invalidation could be handled with MMU notifiers. We are ready to discuss this sort of interface if the community agrees to accept it. Are there any users other than sandboxes that would need something like this? Is the sandbox use-case enough to justify the addition of this interface? > > > Another use-case is CRIU (Checkpoint/Restore in User-space). Several > > process properties can be received only from the process itself. Right > > now, we use a parasite code that is injected into the process. We do > > this with ptrace but it is slow, unsafe, and tricky. > > But this API will only let you run code under the *mm* of the target > process, not fully in the context of a target *task*, right? So you > still won't be able to use this for accessing anything other than > memory? That doesn't seem very generically useful to me. You are right, this will not rid us of the need to run parasite code. I only meant that it will make the process of injecting parasite code a bit simpler. > > Also, I don't doubt that anything involving ptrace is kinda tricky, > but it would be nice to have some more detail on what exactly makes > this slow, unsafe and tricky. Are there API additions for ptrace that > would make this work better? I imagine you're thinking of things like > an API for injecting a syscall into the target process without having > to first somehow find an existing SYSCALL instruction in the target > process? You describe the first problem correctly. We need to find or inject a syscall instruction into a target process. Right now, we need to do these steps to execute a system call: * inject the syscall instruction (PTRACE_PEEKDATA/PTRACE_POKEDATA) * get the original registers * set new registers * get the signal mask * block signals * resume the process * stop it on the next syscall-exit * get registers * set the original registers * restore the signal mask. One of the CRIU principles is to avoid changing a process's state, so if CRIU is interrupted, processes must be resumed and continue running. The procedure of injecting a system call creates a window when a process is in an inconsistent state, and a disappearing CRIU at such a moment would be fatal for the process. We don't think that we can eliminate such windows, but we want to make them smaller. In CRIU, we have a self-healing parasite. The idea is to inject the parasite code together with a signal frame that contains the original process state. The parasite runs in an "RPC daemon mode" and gets commands from CRIU via a unix socket. If it detects that CRIU has disappeared, it calls rt_sigreturn and resumes the original process. As for ptrace performance, there are a few reasons why it is slow. First, there is the sheer number of steps we need to perform. Second, there are two synchronous context switches. Even if we solve the first problem with a new ptrace command, that will not be enough to stop using a parasite in CRIU. > > > process_vm_exec can > > simplify the process of injecting a parasite code and it will allow > > pre-dump memory without stopping processes. The pre-dump here is when we > > enable a memory tracker and dump the memory while a process is continue > > running. On each interaction we dump memory that has been changed from > > the previous iteration. In the final step, we will stop processes and > > dump their full state. Right now the most effective way to dump process > > memory is to create a set of pipes and splice memory into these pipes > > from the parasite code. With process_vm_exec, we will be able to call > > vmsplice directly. It means that we will not need to stop a process to > > inject the parasite code. > > Alternatively you could add splice support to /proc/$pid/mem or add a > syscall similar to process_vm_readv() that splices into a pipe, right? We sent patches to introduce process_vm_splice: https://lore.kernel.org/patchwork/cover/871116/ but they were not merged; the main reason was a lack of enough users to justify its addition.
On Tue, Apr 13, 2021 at 10:52:13PM -0700, Andrei Vagin wrote: > We already have process_vm_readv and process_vm_writev to read and write > to a process memory faster than we can do this with ptrace. And now it > is time for process_vm_exec that allows executing code in an address > space of another process. We can do this with ptrace but it is much > slower. I'd like to add that there are cases when using ptrace is hardly possible at all: in my situation, one process needs to modify the address space of another process while that target process is blocked in a page fault. From https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/notes.txt#L149-171 , https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L395-397 : ---- 8< ---- Client cannot be ptraced while under pagefault ============================================== We cannot use ptrace to run code on client thread that is under pagefault: The kernel sends SIGSTOP to interrupt tracee, but the signal will be processed only when the process returns from kernel space, e.g. here https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160 This way the tracer won't receive obligatory information that tracee stopped (via wait...) 
and even though ptrace(ATTACH) succeeds, all other ptrace commands will fail: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207 My original idea was to use ptrace to run code in process to change it's memory mappings, while the triggering process is under pagefault/read to wcfs, and the above shows it won't work - trying to ptrace the client from under wcfs will just block forever (the kernel will be waiting for read operation to finish for ptrace, and read will be first waiting on ptrace stopping to complete = deadlock) ... // ( one could imagine adjusting mappings synchronously via running // wcfs-trusted code via ptrace that wcfs injects into clients, but ptrace // won't work when client thread is blocked under pagefault or syscall(^) ) ---- 8< ---- To work around that, I need to add a special thread into the target process and implement a custom additional "isolation protocol" between my filesystem and the client processes that use it: https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L94-182 https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.h#L20-96 https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.cpp#L24-203 Most parts of that dance would be much easier, or completely unnecessary, if it were possible to reliably make changes to the address space of a target process from the outside. Kirill
Just to add to the list of use cases for PROCESS_VM_EXEC_SYSCALL, another use case is initializing a process from the "outside", instead of from the "inside" as fork requires. This can be much easier to work with. http://catern.com/rsys21.pdf goes into this use case in some depth. It relies heavily on a remote syscall primitive: https://github.com/catern/rsyscall. The PROCESS_VM_EXEC_SYSCALL API proposed in this patch would be a great replacement for the current implementation, which relies on running code inside the target process.
On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote: > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote: > > We already have process_vm_readv and process_vm_writev to read and write > > to a process memory faster than we can do this with ptrace. And now it > > is time for process_vm_exec that allows executing code in an address > > space of another process. We can do this with ptrace but it is much > > slower. > > > > = Use-cases = > > It seems to me like your proposed API doesn't really fit either one of > those usecases well... > > > Here are two known use-cases. The first one is “application kernel” > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > process that runs the sandbox kernel and a set of stub processes that > > are used to manage guest address spaces. Guest code is executed in the > > context of stub processes but all system calls are intercepted and > > handled in the sandbox kernel. Right now, these sort of sandboxes use > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > significantly speed them up. > > In this case, since you really only want an mm_struct to run code > under, it seems weird to create a whole task with its own PID and so > on. It seems to me like something similar to the /dev/kvm API would be > more appropriate here? Implementation options that I see for that > would be: > > 1. mm_struct-based: > a set of syscalls to create a new mm_struct, > change memory mappings under that mm_struct, and switch to it I like the idea to have a handle for mm. Instead of pid, we will pass this handle to process_vm_exec. We have pidfd for processes and we can introduce mmfd for mm_struct. > 2. pagetable-mirroring-based: > like /dev/kvm, an API to create a new pagetable, mirror parts of > the mm_struct's pagetables over into it with modified permissions > (like KVM_SET_USER_MEMORY_REGION), > and run code under that context. 
> page fault handling would first handle the fault against mm->pgd > as normal, then mirror the PTE over into the secondary pagetables. > invalidation could be handled with MMU notifiers. > I found this idea interesting and decided to look at it more closely. After reading the kernel code for a few days, I realized that it would not be easy to implement something like this, but more important is that I don’t understand what problem it solves. Will it simplify the user-space code? I don’t think so. Will it improve performance? It is unclear for me too. First, in the KVM case, we have a few big linear mappings and need to support one “shadow” address space. In the case of sandboxes, we can have a tremendous amount of mappings and many address spaces that we need to manage. Memory mappings will be mapped with different addresses in a supervisor address space and “guest” address spaces. If guest address spaces will not have their mm_structs, we will need to reinvent vma-s in some form. If guest address spaces have mm_structs, this will look similar to https://lwn.net/Articles/830648/. Second, each pagetable is tied to an mm_struct. You suggest creating new pagetables that will not have their mm_struct-s (sorry if I misunderstood something). I am not sure that it will be easy to implement. How many corner cases will there be? As for page faults in a secondary address space, we will need to find a fault address in the main address space, handle the fault there and then mirror the PTE to the secondary pagetable. Effectively, it means that page faults will be handled in two address spaces. Right now, we use memfd and shared mappings. It means that each fault is handled only in one address space, and we map a guest memory region to the supervisor address space only when we need to access it. A large portion of guest anonymous memory is never mapped to the supervisor address space. Will the overhead of mirrored address spaces be smaller than that of memfd shared mappings? 
I am not sure. Third, this approach will not get rid of the need for process_vm_exec. We will need to switch to a guest address space with a specified state and switch back on faults or syscalls. If the main concern is the ability to run syscalls on a remote mm, we can think about how to fix this. I see two ways we can do this: * Specify the exact list of system calls that are allowed. The first three candidates are mmap, munmap, and vmsplice. * Instead of allowing us to run system calls, we can implement this in the form of commands. In the case of sandboxes, we need to implement only two commands to create and destroy memory mappings in a target address space. Thanks, Andrei
On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <avagin@gmail.com> wrote: > On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote: > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote: > > > We already have process_vm_readv and process_vm_writev to read and write > > > to a process memory faster than we can do this with ptrace. And now it > > > is time for process_vm_exec that allows executing code in an address > > > space of another process. We can do this with ptrace but it is much > > > slower. > > > > > > = Use-cases = > > > > It seems to me like your proposed API doesn't really fit either one of > > those usecases well... > > > > > Here are two known use-cases. The first one is “application kernel” > > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > > process that runs the sandbox kernel and a set of stub processes that > > > are used to manage guest address spaces. Guest code is executed in the > > > context of stub processes but all system calls are intercepted and > > > handled in the sandbox kernel. Right now, these sort of sandboxes use > > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > > significantly speed them up. > > > > In this case, since you really only want an mm_struct to run code > > under, it seems weird to create a whole task with its own PID and so > > on. It seems to me like something similar to the /dev/kvm API would be > > more appropriate here? Implementation options that I see for that > > would be: > > > > 1. mm_struct-based: > > a set of syscalls to create a new mm_struct, > > change memory mappings under that mm_struct, and switch to it > > I like the idea to have a handle for mm. Instead of pid, we will pass > this handle to process_vm_exec. We have pidfd for processes and we can > introduce mmfd for mm_struct. 
I personally think that it might be quite unwieldy when it comes to the restrictions you get from trying to have shared memory with the owning process - I'm having trouble figuring out how you can implement copy-on-write semantics without relying on copy-on-write logic in the host OS and without being able to use userfaultfd. But if that's not a problem somehow, and you can find some reasonable way to handle memory usage accounting and fix up everything that assumes that multithreaded userspace threads don't switch ->mm, I guess this might work for your usecase. > > 2. pagetable-mirroring-based: > > like /dev/kvm, an API to create a new pagetable, mirror parts of > > the mm_struct's pagetables over into it with modified permissions > > (like KVM_SET_USER_MEMORY_REGION), > > and run code under that context. > > page fault handling would first handle the fault against mm->pgd > > as normal, then mirror the PTE over into the secondary pagetables. > > invalidation could be handled with MMU notifiers. > > > > I found this idea interesting and decided to look at it more closely. > After reading the kernel code for a few days, I realized that it would > not be easy to implement something like this, Yeah, it might need architecture-specific code to flip the page tables on userspace entry/exit, and maybe also for mirroring them. And for the TLB flushing logic... > but more important is that > I don’t understand what problem it solves. Will it simplify the > user-space code? I don’t think so. Will it improve performance? It is > unclear for me too. Some reasons I can think of are: - direct guest memory access: I imagined you'd probably want to be able to directly access userspace memory from the supervisor, and with this approach that'd become easy. - integration with on-demand paging of the host OS: You'd be able to create things like file-backed copy-on-write mappings from the host filesystem, or implement your own mappings backed by some kind of storage using userfaultfd. 
- sandboxing: For sandboxing usecases (not your usecase), it would be possible to e.g. create a read-only clone of the entire address space of a process and give write access to specific parts of it, or something like that. These address space clones could potentially be created and destroyed fairly quickly. - accounting: memory usage would be automatically accounted to the supervisor process, so even without a parasite process, you'd be able to see the memory usage correctly in things like "top". - small (non-pageable) memory footprint in the host kernel: The only things the host kernel would have to persistently store would be the normal MM data structures for the supervisor plus the mappings from "guest userspace" memory ranges to supervisor memory ranges; userspace pagetables would be discardable, and could even be shared with those of the supervisor in cases where the alignment fits. So with this, large anonymous mappings with 4K granularity only cost you ~0.20% overhead across host and guest address space; without this, if you used shared mappings instead, you'd pay twice that for every 2MiB range from which parts are accessed in both contexts, plus probably another ~0.2% or so for the "struct address_space"? - all memory-management-related syscalls could be directly performed in the "kernel" process But yeah, some of those aren't really relevant for your usecase, and I guess things like the accounting aspect could just as well be solved differently... > First, in the KVM case, we have a few big linear mappings and need to > support one “shadow” address space. In the case of sandboxes, we can > have a tremendous amount of mappings and many address spaces that we > need to manage. Memory mappings will be mapped with different addresses > in a supervisor address space and “guest” address spaces. If guest > address spaces will not have their mm_structs, we will need to reinvent > vma-s in some form. 
If guest address spaces have mm_structs, this will > look similar to https://lwn.net/Articles/830648/. > > Second, each pagetable is tied to an mm_struct. You suggest creating > new pagetables that will not have their mm_struct-s (sorry if I > misunderstood something). Yeah, that's what I had in mind, page tables without an mm_struct. > I am not sure that it will be easy to > implement. How many corner cases will there be? Yeah, it would require some work around TLB flushing and entry/exit from userspace. But from a high-level perspective it feels to me like a change with less systematic impact. Maybe I'm wrong about that. > As for page faults in a secondary address space, we will need to find a > fault address in the main address space, handle the fault there and then > mirror the PTE to the secondary pagetable. Right. > Effectively, it means that > page faults will be handled in two address spaces. Right now, we use > memfd and shared mappings. It means that each fault is handled only in > one address space, and we map a guest memory region to the supervisor > address space only when we need to access it. A large portion of guest > anonymous memory is never mapped to the supervisor address space. > Will the overhead of mirrored address spaces be smaller than that of memfd > shared mappings? I am not sure. But as long as the mappings are sufficiently big and aligned properly, or you explicitly manage the supervisor address space, some of that cost disappears: E.g. even if a page is mapped in both address spaces, you wouldn't have a memory cost for the second mapping if the page tables are shared. > Third, this approach will not get rid of the need for process_vm_exec. We will > need to switch to a guest address space with a specified state and > switch back on faults or syscalls. Yeah, you'd still need a syscall for running code under a different set of page tables. But that's something that KVM _almost_ already does. 
> If the main concern is the ability to > run syscalls on a remote mm, we can think about how to fix this. I see > two ways we can do this: > > * Specify the exact list of system calls that are allowed. The first > three candidates are mmap, munmap, and vmsplice. > > * Instead of allowing us to run system calls, we can implement this in > the form of commands. In the case of sandboxes, we need to implement > only two commands to create and destroy memory mappings in a target > address space. FWIW, there is precedent for something similar: The Android folks already added process_madvise() for remotely messing with the VMAs of another process to some degree.
On 4/13/21 10:52 PM, Andrei Vagin wrote: > process_vm_exec has two modes: > > * Execute code in an address space of a target process and stop on any > signal or system call. We already have a perfectly good context switch mechanism: context switches. If you execute code, you are basically guaranteed to be subject to being hijacked, which means you pretty much can't allow syscalls. But there's a lot of non-syscall state, and I think context switching needs to be done with extreme care. (Just as an example, suppose you switch mms, then set %gs to point to the LDT, then switch back. Now you're in a weird state. With %ss the plot is a bit thicker. And there are emulated vsyscalls and such.) If you, PeterZ, and the UMCG folks could all find an acceptable, efficient way to wake-and-wait so you can switch into an injected task in the target process and switch back quickly, then I think a much nicer solution will become available. > > * Execute a system call in an address space of a target process. I could get behind this, but there are plenty of cans of worms to watch out for. Serious auditing would be needed.
On Fri, Jul 02, 2021 at 05:12:02PM +0200, Jann Horn wrote: > On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <avagin@gmail.com> wrote: > > On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote: > > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@gmail.com> wrote: > > > > We already have process_vm_readv and process_vm_writev to read and write > > > > to a process memory faster than we can do this with ptrace. And now it > > > > is time for process_vm_exec that allows executing code in an address > > > > space of another process. We can do this with ptrace but it is much > > > > slower. > > > > > > > > = Use-cases = > > > > > > It seems to me like your proposed API doesn't really fit either one of > > > those usecases well... > > > > > > > Here are two known use-cases. The first one is “application kernel” > > > > sandboxes like User-mode Linux and gVisor. In this case, we have a > > > > process that runs the sandbox kernel and a set of stub processes that > > > > are used to manage guest address spaces. Guest code is executed in the > > > > context of stub processes but all system calls are intercepted and > > > > handled in the sandbox kernel. Right now, these sort of sandboxes use > > > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can > > > > significantly speed them up. > > > > > > In this case, since you really only want an mm_struct to run code > > > under, it seems weird to create a whole task with its own PID and so > > > on. It seems to me like something similar to the /dev/kvm API would be > > > more appropriate here? Implementation options that I see for that > > > would be: > > > > > > 1. mm_struct-based: > > > a set of syscalls to create a new mm_struct, > > > change memory mappings under that mm_struct, and switch to it > > > > I like the idea to have a handle for mm. Instead of pid, we will pass > > this handle to process_vm_exec. We have pidfd for processes and we can > > introduce mmfd for mm_struct. 
> I personally think that it might be quite unwieldy when it comes to > the restrictions you get from trying to have shared memory with the > owning process - I'm having trouble figuring out how you can implement > copy-on-write semantics without relying on copy-on-write logic in the > host OS and without being able to use userfaultfd. It is easy. COW mappings are mapped to guest address spaces without the write permission. If one of the processes wants to write something, it triggers a fault that is handled in the Sentry (supervisor/kernel). > > But if that's not a problem somehow, and you can find some reasonable > way to handle memory usage accounting and fix up everything that > assumes that multithreaded userspace threads don't switch ->mm, I > guess this might work for your usecase. > > > > 2. pagetable-mirroring-based: > > > like /dev/kvm, an API to create a new pagetable, mirror parts of > > > the mm_struct's pagetables over into it with modified permissions > > > (like KVM_SET_USER_MEMORY_REGION), > > > and run code under that context. > > > page fault handling would first handle the fault against mm->pgd > > > as normal, then mirror the PTE over into the secondary pagetables. > > > invalidation could be handled with MMU notifiers. > > > > > > > I found this idea interesting and decided to look at it more closely. > > After reading the kernel code for a few days, I realized that it would > > not be easy to implement something like this, > > Yeah, it might need architecture-specific code to flip the page tables > on userspace entry/exit, and maybe also for mirroring them. And for > the TLB flushing logic... > > > but more important is that > > I don’t understand what problem it solves. Will it simplify the > > user-space code? I don’t think so. Will it improve performance? It is > > unclear for me too. 
> > Some reasons I can think of are: > > - direct guest memory access: I imagined you'd probably want to be able to > directly access userspace memory from the supervisor, and > with this approach that'd become easy. Right now, we use shared memory regions for that and they work fine. As I already mentioned, most of the memory is never mapped to the supervisor address space. > > - integration with on-demand paging of the host OS: You'd be able to > create things like file-backed copy-on-write mappings from the > host filesystem, or implement your own mappings backed by some kind > of storage using userfaultfd. This isn't a problem either... > > - sandboxing: For sandboxing usecases (not your usecase), it would be > possible to e.g. create a read-only clone of the entire address space of a > process and give write access to specific parts of it, or something > like that. > These address space clones could potentially be created and destroyed > fairly quickly. This is a very valid example and I would assume this is where your idea came from. I have some doubts about the idea of additional sub-page-tables in the kernel, but I know a good way to implement your idea with KVM. You can look at how the KVM platform is implemented in gVisor and this sort of sandboxing can be implemented in the same way. In a few words, we create a KVM virtual machine, replicate the process address space in the guest ring0, implement basic operating system-level stubs, so that the process can jump between the host ring3 and the guest ring0. https://github.com/google/gvisor/blob/master/pkg/ring0/ https://github.com/google/gvisor/tree/master/pkg/sentry/platform/kvm When we have all these bits, we can create any page tables for a guest ring3 and run untrusted code there. The sandbox process switches to the guest ring0 and then it switches to a guest ring3 with specified page tables and a specified state. 
https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/machine_amd64.go;l=356 With this scheme, the sandbox process will have direct access to page tables and will be able to change them. > > - accounting: memory usage would be automatically accounted to the > supervisor process, so even without a parasite process, you'd be able > to see the memory usage correctly in things like "top". > > - small (non-pageable) memory footprint in the host kernel: > The only things the host kernel would have to persistently store would be > the normal MM data structures for the supervisor plus the mappings > from "guest userspace" memory ranges to supervisor memory ranges; > userspace pagetables would be discardable, and could even be shared > with those of the supervisor in cases where the alignment fits. > So with this, large anonymous mappings with 4K granularity only cost you > ~0.20% overhead across host and guest address space; without this, if you > used shared mappings instead, you'd pay twice that for every 2MiB range > from which parts are accessed in both contexts, plus probably another > ~0.2% or so for the "struct address_space"? If we use shared mappings, we don't map most of the guest memory to the supervisor address space and don't have page tables for it there. I would say it is an open question which approach's memory footprint will be smaller... > > - all memory-management-related syscalls could be directly performed > in the "kernel" process > > But yeah, some of those aren't really relevant for your usecase, and I > guess things like the accounting aspect could just as well be solved > differently... > > > First, in the KVM case, we have a few big linear mappings and need to > > support one “shadow” address space. In the case of sandboxes, we can > > have a tremendous amount of mappings and many address spaces that we > > need to manage. 
Memory mappings will be mapped with different addresses > > in a supervisor address space and “guest” address spaces. If guest > > address spaces will not have their mm_structs, we will need to reinvent > > vma-s in some form. If guest address spaces have mm_structs, this will > > look similar to https://lwn.net/Articles/830648/. > > > > Second, each pagetable is tied to an mm_struct. You suggest creating > > new pagetables that will not have their mm_struct-s (sorry if I > > misunderstood something). > > Yeah, that's what I had in mind, page tables without an mm_struct. > > > I am not sure that it will be easy to > > implement. How many corner cases will there be? > > Yeah, it would require some work around TLB flushing and entry/exit > from userspace. But from a high-level perspective it feels to me like > a change with less systematic impact. Maybe I'm wrong about that. > > > As for page faults in a secondary address space, we will need to find a > > fault address in the main address space, handle the fault there and then > > mirror the PTE to the secondary pagetable. > > Right. > > > Effectively, it means that > > page faults will be handled in two address spaces. Right now, we use > > memfd and shared mappings. It means that each fault is handled only in > > one address space, and we map a guest memory region to the supervisor > > address space only when we need to access it. A large portion of guest > > anonymous memory is never mapped to the supervisor address space. > > Will the overhead of mirrored address spaces be smaller than that of memfd > > shared mappings? I am not sure. > > But as long as the mappings are sufficiently big and aligned properly, > or you explicitly manage the supervisor address space, some of that > cost disappears: E.g. even if a page is mapped in both address spaces, > you wouldn't have a memory cost for the second mapping if the page > tables are shared. You are right. It is interesting how many pte-s will be shared. 
For example, if a guest process forks a child, all anon memory will be COW, which means we will need to remove the W bit from the pte-s, and so we will need to allocate pte-s for both processes... > > > Third, this approach will not get rid of the need for process_vm_exec. We will > > need to switch to a guest address space with a specified state and > > switch back on faults or syscalls. > > Yeah, you'd still need a syscall for running code under a different > set of page tables. But that's something that KVM _almost_ already > does. I don't understand this analogy with KVM... > > > If the main concern is the ability to > > run syscalls on a remote mm, we can think about how to fix this. I see > > two ways we can do this: > > > > * Specify the exact list of system calls that are allowed. The first > > three candidates are mmap, munmap, and vmsplice. > > > > * Instead of allowing us to run system calls, we can implement this in > > the form of commands. In the case of sandboxes, we need to implement > > only two commands to create and destroy memory mappings in a target > > address space. > > FWIW, there is precedent for something similar: The Android folks > already added process_madvise() for remotely messing with the VMAs of > another process to some degree. I know. We tried to implement process_vm_mmap and process_vm_splice: https://lkml.org/lkml/2018/1/9/32 https://patchwork.kernel.org/project/linux-mm/cover/155836064844.2441.10911127801797083064.stgit@localhost.localdomain/ Thanks, Andrei
On Fri, Jul 02, 2021 at 03:44:41PM -0700, Andy Lutomirski wrote: > On 4/13/21 10:52 PM, Andrei Vagin wrote: > > > process_vm_exec has two modes: > > > > * Execute code in an address space of a target process and stop on any > > signal or system call. > > We already have a perfectly good context switch mechanism: context > switches. If you execute code, you are basically guaranteed to be > subject to being hijacked, which means you pretty much can't allow > syscalls. But there's a lot of non-syscall state, and I think context > switching needs to be done with extreme care. > > (Just as an example, suppose you switch mms, then set %gs to point to the > LDT, then switch back. Now you're in a weird state. With %ss the plot > is a bit thicker. And there are emulated vsyscalls and such.) > > If you, PeterZ, and the UMCG folks could all find an acceptable, efficient way > to wake-and-wait so you can switch into an injected task in the target > process and switch back quickly, then I think a much nicer solution will > become available. I know about umcg and I even did a prototype that used futex_swap (an earlier incarnation of UMCG). Here are a few problems and maybe you will have some ideas on how to solve them. The main question is how to hijack a stub process where guest code is executing. We need to trap system calls, memory faults, and other exceptions and handle them in the Sentry (supervisor/kernel). All events of interest except system calls generate signals. We can use seccomp to get signals on system calls too. In my prototype, guest code runs in stub processes, one stub process for each guest address space. In a stub process, I set a signal handler for SIGSEGV, SIGBUS, SIGFPE, SIGSYS, SIGILL, set an alternate signal stack, and set seccomp rules. The signal handler communicates with the Sentry (supervisor/kernel) via shared memory and uses futex_swap to make fast switches to the Sentry and back to a stub process. Here are a few problems. 
First, we have the signal handler code, its stack, and a shared memory region in the guest address space, and we need to guarantee that guest code will not be able to use them to do something unexpected. The second problem is performance. It is much faster than the ptrace platform, but it is still a few times slower than process_vm_exec. Signal handling is expensive. The kernel has to generate a signal frame, execute a signal handler, and then it needs to call rt_sigreturn. Futex_swap makes fast context switches, but it is still slower than process_vm_exec. UMCG should be faster because it doesn’t have a futex overhead. Andy, what do you think about the idea of reworking process_vm_exec so that it executes code and syscalls in the context of a target process? Maybe you see other ways we can “hijack” a remote process? Thanks, Andrei > > > > > * Execute a system call in an address space of a target process. > > I could get behind this, but there are plenty of cans of worms to watch > out for. Serious auditing would be needed.