Message ID | 20210417163949.530921-1-zhuyifei1999@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | uml/helper: Fix stack alignment | expand |
On Sat, 2021-04-17 at 11:39 -0500, YiFei Zhu wrote: > GCC assumes that stack is aligned to 16-byte on function entry [1]. Fun. > Therefore, there is no need to have the stack pointer > pointer always point to valid memory unless the stack is poped; > so the `- sizeof(void *)` in the code is unnecessary. > I've always wondered about that - sizeof(void *) thing there ... didn't seem to make much sense :) > On the other hand, glibc reserves the 16 bytes it needs on stack > and pops itself, so by the call instruction the stack pointer > is exactly the caller-supplied sp. It then push the 16 bytes of > the return address and the saved stack pointer, so the base > pointer will be 16-byte aligned if and only if the caller > supplied sp is 16-byte aligned. Therefore, the caller must supply > a 16-byte aligned pointer, which `stack + UM_KERN_PAGE_SIZE` > already satisfies. > > On a side note, musl is unaffected by this issue because it forces > 16 byte alignment via `and $-16,%rsi` in its clone wrapper. > Similarly, glibc i386 is also uneffected because it has typo: unaffected :) > To reproduce this bug, enable CONFIG_UML_RTC. uml_rtc will call > add_sigio_fd which will then cause write_sigio_thread to go > into segfault loop. Probably not the only way you can do this, but yeah ... How about the same pattern in start_userspace() and new_thread()? Doesn't matter for some reason? Or just didn't hit it yet because there's no 16-byte "thing" on the stack? johannes
On Sat, Apr 17, 2021 at 2:40 PM Johannes Berg <johannes@sipsolutions.net> wrote: > How about the same pattern in start_userspace() and new_thread()? > Doesn't matter for some reason? Or just didn't hit it yet because > there's no 16-byte "thing" on the stack? I see. I was not aware they were there. Didn't hit :) * start_userspace -- yes, the exact same pattern via clone. * new_thread -- no, this is a longjmp buffer that sets the instruction pointer and stack pointer directly. It emulates a call instruction, as if the return address has already been pushed. Let me give an example. The clone trace works like: ; 16-aligned callq *%rax ; push 8 push %rbp ; push 8 ; 16 aligned mov %rsp,%rbp However, for a longjmp buffer, it will jump directly to `push %rbp`, so a 8-offset is needed to account for the missing offset from `callq *%rax`. In my case, if I remove this offset from new_thread, init_thread_regs, and start_idle_thread, it will segfault loop in ZSTD_getParams while initializing btrfs. Other places I found similar patterns: * init_thread_regs & start_idle_thread -- no, longjmp buffers, same as new_thread. * ubd_driver_init -- yes, same as run_helper. * set_sigstack -- This one is interesting: * um on native x86_64: a quick test reveals that whether the offset is used does not affect the stack pointer. In the code, upon signal delivery (in the host kernel) get_sigframe calls align_sigframe, so unaligned stack pointers make no difference. * um on um x86_64: I'm having trouble testing um within um, getting a weird error ("start_userspace : expected SIGSTOP, got status = 2943" when starting init, might try to debug later), but the code in handle_signal also aligns the stack. * The man page of sigaltstack and a quick google search shows that the conventional usage of sigaltstack does not reduce the size of stack by sizeof(void *), so I think it's probably best to follow the convention here. * stub_clone_handler -- This looks fragile: * As far as I understand, it splits the STUB_DATA page into half, the higher-address half for the parent to run the stub (from init_thread_regs) and the lower half are switched to in the child. This is fragile. The clone syscall to the host is done in the middle of a function body and we are depending on the compiler optimizations that it won't store any shared variables on stack that would be missing once the stack pointer is set to something else. Similarly, the base pointer is still shared between the two processes. It's also dependent on optimizations to say shared variables won't overwrite each other. * I'm thinking, considering that we don't have CLONE_VM set in the stub, is it possible to force the child to map a private page for use as a stack, so modifications to the stack done inside the child are not seen by the parent? * That said, it's still potentially trouble-causing if during the body of the function the stack pointer isn't aligned. I can't think of any case where GCC would want to use aligned SSE instructions in the stub, however. I also realized, I've been compiling um with CONFIG_FRAME_POINTER, hence the use of rbp. I just also checked how if the bug can be reproduced without CONFIG_FRAME_POINTER, and yes, the same segfault loop, and the assembly just bases the pointers off of rsp rather than rbp, still assuming the 16 byte alignment. YiFei Zhu
On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > * um on um x86_64: I'm having trouble testing um within um, getting > a weird error ("start_userspace : expected SIGSTOP, got status = 2943" > when starting init, might try to debug later), but the code in > handle_signal also aligns the stack. Figured this one out. The inner um, in userspace_tramp, is trying to mmap the syscall stub to the same syscall stub at the same location as the outer um, and that fails with ENOMEM. In theory, this would cause the printk of "mapping mmap stub at ... failed, errno = ..." to occur, but because: * call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi -> nmi_count -> preempt_count -> current_thread_info * um's current_thread_info is at the current stack pointer & mask, hence it is often not valid when on small temporary stacks. Therefore, userspace_tramp can't printk. I'm wondering, is this issue of printk being broken in userspace_tramp an issue worth fixing? Has there been prior discussions on it? YiFei Zhu
On 18/04/2021 07:56, YiFei Zhu wrote: > On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: >> * um on um x86_64: I'm having trouble testing um within um, getting >> a weird error ("start_userspace : expected SIGSTOP, got status = 2943" >> when starting init, might try to debug later), but the code in >> handle_signal also aligns the stack. > > Figured this one out. The inner um, in userspace_tramp, is trying to > mmap the syscall stub to the same syscall stub at the same location as > the outer um, and that fails with ENOMEM. In theory, this would cause > the printk of "mapping mmap stub at ... failed, errno = ..." to occur, > but because: > * call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi > -> nmi_count -> preempt_count -> current_thread_info > * um's current_thread_info is at the current stack pointer & mask, > hence it is often not valid when on small temporary stacks. > Therefore, userspace_tramp can't printk. > > I'm wondering, is this issue of printk being broken in userspace_tramp > an issue worth fixing? Has there been prior discussions on it? > > YiFei Zhu > Based ob experience - printk does not work correctly out of some uml threads. We had to kill printk use in the ubd helper thread. I never got to the bottom of that, it was easier to kill it. In that case it was not particularly informative.
On Sun, 2021-04-18 at 01:56 -0500, YiFei Zhu wrote: > On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com> > wrote: > > * um on um x86_64: I'm having trouble testing um within um, getting > > a weird error ("start_userspace : expected SIGSTOP, got status = > > 2943" > > when starting init, might try to debug later), but the code in > > handle_signal also aligns the stack. > > Figured this one out. The inner um, in userspace_tramp, is trying to > mmap the syscall stub to the same syscall stub at the same location as > the outer um, and that fails with ENOMEM. In theory, this would cause > the printk of "mapping mmap stub at ... failed, errno = ..." to occur, > but because: > * call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi > -> nmi_count -> preempt_count -> current_thread_info > * um's current_thread_info is at the current stack pointer & mask, > hence it is often not valid when on small temporary stacks. > Therefore, userspace_tramp can't printk. > > I'm wondering, is this issue of printk being broken in userspace_tramp > an issue worth fixing? Has there been prior discussions on it? Yeah, printk is only usable from kernel threads. So in other cases logging should happen directly. But do note that os_info with its current glibc printf implementation seems to also create issues sometimes as it may require more stack than available. In my seccomp patchset, I added a few patches to fix related issues, see https://patchwork.ozlabs.org/project/linux-um/list/?series=231980 specifically * [06/27] um: Don't use vfprintf() for os_info() * [07/27] um: Do not use printk in SIGWINCH helper thread * [09/27] um: Do not use printk in userspace trampoline Benjamin
On Mon, Apr 19, 2021 at 8:35 AM Benjamin Berg <benjamin@sipsolutions.net> wrote: > Yeah, printk is only usable from kernel threads. So in other cases > logging should happen directly. But do note that os_info with its > current glibc printf implementation seems to also create issues > sometimes as it may require more stack than available. > > In my seccomp patchset, I added a few patches to fix related issues, > see > https://patchwork.ozlabs.org/project/linux-um/list/?series=231980 > specifically > * [06/27] um: Don't use vfprintf() for os_info() > * [07/27] um: Do not use printk in SIGWINCH helper thread > * [09/27] um: Do not use printk in userspace trampoline > > Benjamin I see! Thanks. Good to know there's a patch working to fix this issue. YiFei Zhu
diff --git a/arch/um/os-Linux/helper.c b/arch/um/os-Linux/helper.c index 9fa6e4187d4f..32e88baf18dd 100644 --- a/arch/um/os-Linux/helper.c +++ b/arch/um/os-Linux/helper.c @@ -64,7 +64,7 @@ int run_helper(void (*pre_exec)(void *), void *pre_data, char **argv) goto out_close; } - sp = stack + UM_KERN_PAGE_SIZE - sizeof(void *); + sp = stack + UM_KERN_PAGE_SIZE; data.pre_exec = pre_exec; data.pre_data = pre_data; data.argv = argv; @@ -120,7 +120,7 @@ int run_helper_thread(int (*proc)(void *), void *arg, unsigned int flags, if (stack == 0) return -ENOMEM; - sp = stack + UM_KERN_PAGE_SIZE - sizeof(void *); + sp = stack + UM_KERN_PAGE_SIZE; pid = clone(proc, (void *) sp, flags, arg); if (pid < 0) { err = -errno;
GCC assumes that stack is aligned to 16-byte on function entry [1]. Since GCC 8, GCC began using 16-byte aligned SSE instructions to implement assignments to structs on stack. This affects os-Linux/sigio.c, write_sigio_thread: struct pollfds *fds, tmp; tmp = current_poll; Note that struct pollfds is exactly 16 bytes in size. GCC 8+ generates assembly similar to: movdqa (%rdi),%xmm0 movaps %xmm0,-0x50(%rbp) This is an issue, because movaps will #GP if -0x50(%rbp) is not aligned to 16 bytes [2], and how rbp gets assigned to is via glibc clone thread_start, then function prologue, going though execution trace similar to (showing only relevant instructions): sub $0x10,%rsi mov %rcx,0x8(%rsi) mov %rdi,(%rsi) syscall pop %rax pop %rdi callq *%rax push %rbp mov %rsp,%rbp The stack pointer always points to the topmost element on stack, rather then the space right above the topmost. On push, the pointer decrements first before writing to the memory pointed to by it. Therefore, there is no need to have the stack pointer pointer always point to valid memory unless the stack is poped; so the `- sizeof(void *)` in the code is unnecessary. On the other hand, glibc reserves the 16 bytes it needs on stack and pops itself, so by the call instruction the stack pointer is exactly the caller-supplied sp. It then push the 16 bytes of the return address and the saved stack pointer, so the base pointer will be 16-byte aligned if and only if the caller supplied sp is 16-byte aligned. Therefore, the caller must supply a 16-byte aligned pointer, which `stack + UM_KERN_PAGE_SIZE` already satisfies. On a side note, musl is unaffected by this issue because it forces 16 byte alignment via `and $-16,%rsi` in its clone wrapper. Similarly, glibc i386 is also uneffected because it has `andl $0xfffffff0, %ecx`. To reproduce this bug, enable CONFIG_UML_RTC. uml_rtc will call add_sigio_fd which will then cause write_sigio_thread to go into segfault loop. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 [2] https://c9x.me/x86/html/file_module_x86_id_180.html Signed-off-by: YiFei Zhu <zhuyifei1999@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") --- arch/um/os-Linux/helper.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)