diff mbox series

uml/helper: Fix stack alignment

Message ID 20210417163949.530921-1-zhuyifei1999@gmail.com
State Superseded
Headers show
Series uml/helper: Fix stack alignment | expand

Commit Message

YiFei Zhu April 17, 2021, 4:39 p.m. UTC
GCC assumes that stack is aligned to 16-byte on function entry [1].
Since GCC 8, GCC began using 16-byte aligned SSE instructions to
implement assignments to structs on stack. This affects
os-Linux/sigio.c, write_sigio_thread:

  struct pollfds *fds, tmp;
  tmp = current_poll;

Note that struct pollfds is exactly 16 bytes in size.
GCC 8+ generates assembly similar to:

  movdqa (%rdi),%xmm0
  movaps %xmm0,-0x50(%rbp)

This is an issue, because movaps will #GP if -0x50(%rbp) is not
aligned to 16 bytes [2], and how rbp gets assigned to is via glibc
clone thread_start, then function prologue, going though execution
trace similar to (showing only relevant instructions):

  sub    $0x10,%rsi
  mov    %rcx,0x8(%rsi)
  mov    %rdi,(%rsi)
  syscall
  pop    %rax
  pop    %rdi
  callq  *%rax
  push   %rbp
  mov    %rsp,%rbp

The stack pointer always points to the topmost element on stack,
rather then the space right above the topmost. On push, the
pointer decrements first before writing to the memory pointed to
by it. Therefore, there is no need to have the stack pointer
pointer always point to valid memory unless the stack is poped;
so the `- sizeof(void *)` in the code is unnecessary.

On the other hand, glibc reserves the 16 bytes it needs on stack
and pops itself, so by the call instruction the stack pointer
is exactly the caller-supplied sp. It then push the 16 bytes of
the return address and the saved stack pointer, so the base
pointer will be 16-byte aligned if and only if the caller
supplied sp is 16-byte aligned. Therefore, the caller must supply
a 16-byte aligned pointer, which `stack + UM_KERN_PAGE_SIZE`
already satisfies.

On a side note, musl is unaffected by this issue because it forces
16 byte alignment via `and $-16,%rsi` in its clone wrapper.
Similarly, glibc i386 is also uneffected because it has
`andl $0xfffffff0, %ecx`.

To reproduce this bug, enable CONFIG_UML_RTC. uml_rtc will call
add_sigio_fd which will then cause write_sigio_thread to go
into segfault loop.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838
[2] https://c9x.me/x86/html/file_module_x86_id_180.html

Signed-off-by: YiFei Zhu <zhuyifei1999@gmail.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
---
 arch/um/os-Linux/helper.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Johannes Berg April 17, 2021, 7:40 p.m. UTC | #1
On Sat, 2021-04-17 at 11:39 -0500, YiFei Zhu wrote:
> GCC assumes that stack is aligned to 16-byte on function entry [1].

Fun.

> Therefore, there is no need to have the stack pointer
> pointer always point to valid memory unless the stack is poped;
> so the `- sizeof(void *)` in the code is unnecessary.
> 

I've always wondered about that - sizeof(void *) thing there ... didn't
seem to make much sense :)

> On the other hand, glibc reserves the 16 bytes it needs on stack
> and pops itself, so by the call instruction the stack pointer
> is exactly the caller-supplied sp. It then push the 16 bytes of
> the return address and the saved stack pointer, so the base
> pointer will be 16-byte aligned if and only if the caller
> supplied sp is 16-byte aligned. Therefore, the caller must supply
> a 16-byte aligned pointer, which `stack + UM_KERN_PAGE_SIZE`
> already satisfies.
> 
> On a side note, musl is unaffected by this issue because it forces
> 16 byte alignment via `and $-16,%rsi` in its clone wrapper.
> Similarly, glibc i386 is also uneffected because it has

typo: unaffected :)


> To reproduce this bug, enable CONFIG_UML_RTC. uml_rtc will call
> add_sigio_fd which will then cause write_sigio_thread to go
> into segfault loop.

Probably not the only way you can do this, but yeah ...


How about the same pattern in start_userspace() and new_thread()?
Doesn't matter for some reason? Or just didn't hit it yet because
there's no 16-byte "thing" on the stack?

johannes
YiFei Zhu April 18, 2021, 4:56 a.m. UTC | #2
On Sat, Apr 17, 2021 at 2:40 PM Johannes Berg <johannes@sipsolutions.net> wrote:
> How about the same pattern in start_userspace() and new_thread()?
> Doesn't matter for some reason? Or just didn't hit it yet because
> there's no 16-byte "thing" on the stack?

I see. I was not aware they were there. Didn't hit :)

* start_userspace -- yes, the exact same pattern via clone.
* new_thread -- no, this is a longjmp buffer that sets the instruction
pointer and stack pointer directly. It emulates a call instruction, as
if the return address has already been pushed. Let me give an example.

The clone trace works like:

  ; 16-aligned
  callq  *%rax ; push 8
  push   %rbp ; push 8
  ; 16 aligned
  mov    %rsp,%rbp

However, for a longjmp buffer, it will jump directly to `push %rbp`,
so a 8-offset is needed to account for the missing offset from `callq
*%rax`.

In my case, if I remove this offset from new_thread, init_thread_regs,
and start_idle_thread, it will segfault loop in ZSTD_getParams while
initializing btrfs.

Other places I found similar patterns:
* init_thread_regs & start_idle_thread -- no,  longjmp buffers, same
as new_thread.
* ubd_driver_init -- yes, same as run_helper.
* set_sigstack -- This one is interesting:
  * um on native x86_64: a quick test reveals that whether the offset
is used does not affect the stack pointer. In the code, upon signal
delivery (in the host kernel) get_sigframe calls align_sigframe, so
unaligned stack pointers make no difference.
  * um on um x86_64: I'm having trouble testing um within um, getting
a weird error ("start_userspace : expected SIGSTOP, got status = 2943"
when starting init, might try to debug later), but the code in
handle_signal also aligns the stack.
  * The man page of sigaltstack and a quick google search shows that
the conventional usage of sigaltstack does not reduce the size of
stack by sizeof(void *), so I think it's probably best to follow the
convention here.
* stub_clone_handler -- This looks fragile:
  * As far as I understand, it splits the STUB_DATA page into half,
the higher-address half for the parent to run the stub (from
init_thread_regs) and the lower half are switched to in the child.
This is fragile. The clone syscall to the host is done in the middle
of a function body and we are depending on the compiler optimizations
that it won't store any shared variables on stack that would be
missing once the stack pointer is set to something else. Similarly,
the base pointer is still shared between the two processes. It's also
dependent on optimizations to say shared variables won't overwrite
each other.
  * I'm thinking, considering that we don't have CLONE_VM set in the
stub, is it possible to force the child to map a private page for use
as a stack, so modifications to the stack done inside the child are
not seen by the parent?
  * That said, it's still potentially trouble-causing if during the
body of the function the stack pointer isn't aligned. I can't think of
any case where GCC would want to use aligned SSE instructions in the
stub, however.

I also realized, I've been compiling um with CONFIG_FRAME_POINTER,
hence the use of rbp. I just also checked how if the bug can be
reproduced without CONFIG_FRAME_POINTER, and yes, the same segfault
loop, and the assembly just bases the pointers off of rsp rather than
rbp, still assuming the 16 byte alignment.

YiFei Zhu
YiFei Zhu April 18, 2021, 6:56 a.m. UTC | #3
On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>   * um on um x86_64: I'm having trouble testing um within um, getting
> a weird error ("start_userspace : expected SIGSTOP, got status = 2943"
> when starting init, might try to debug later), but the code in
> handle_signal also aligns the stack.

Figured this one out. The inner um, in userspace_tramp, is trying to
mmap the syscall stub to the same syscall stub at the same location as
the outer um, and that fails with ENOMEM. In theory, this would cause
the printk of "mapping mmap stub at ... failed, errno = ..." to occur,
but because:
* call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi
-> nmi_count -> preempt_count -> current_thread_info
* um's current_thread_info is at the current stack pointer & mask,
hence it is often not valid when on small temporary stacks.
Therefore, userspace_tramp can't printk.

I'm wondering, is this issue of printk being broken in userspace_tramp
an issue worth fixing? Has there been prior discussions on it?

YiFei Zhu
Anton Ivanov April 18, 2021, 7:36 a.m. UTC | #4
On 18/04/2021 07:56, YiFei Zhu wrote:
> On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>    * um on um x86_64: I'm having trouble testing um within um, getting
>> a weird error ("start_userspace : expected SIGSTOP, got status = 2943"
>> when starting init, might try to debug later), but the code in
>> handle_signal also aligns the stack.
> 
> Figured this one out. The inner um, in userspace_tramp, is trying to
> mmap the syscall stub to the same syscall stub at the same location as
> the outer um, and that fails with ENOMEM. In theory, this would cause
> the printk of "mapping mmap stub at ... failed, errno = ..." to occur,
> but because:
> * call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi
> -> nmi_count -> preempt_count -> current_thread_info
> * um's current_thread_info is at the current stack pointer & mask,
> hence it is often not valid when on small temporary stacks.
> Therefore, userspace_tramp can't printk.
> 
> I'm wondering, is this issue of printk being broken in userspace_tramp
> an issue worth fixing? Has there been prior discussions on it?
> 
> YiFei Zhu
> 

Based ob experience - printk does not work correctly out of some uml 
threads. We had to kill printk use in the ubd helper thread.

I never got to the bottom of that, it was easier to kill it. In that 
case it was not particularly informative.
Benjamin Berg April 19, 2021, 1:34 p.m. UTC | #5
On Sun, 2021-04-18 at 01:56 -0500, YiFei Zhu wrote:
> On Sat, Apr 17, 2021 at 11:56 PM YiFei Zhu <zhuyifei1999@gmail.com>
> wrote:
> >   * um on um x86_64: I'm having trouble testing um within um, getting
> > a weird error ("start_userspace : expected SIGSTOP, got status =
> > 2943"
> > when starting init, might try to debug later), but the code in
> > handle_signal also aligns the stack.
> 
> Figured this one out. The inner um, in userspace_tramp, is trying to
> mmap the syscall stub to the same syscall stub at the same location as
> the outer um, and that fails with ENOMEM. In theory, this would cause
> the printk of "mapping mmap stub at ... failed, errno = ..." to occur,
> but because:
> * call stack: vprintk_store -> printk_caller_id -> in_task -> in_nmi
> -> nmi_count -> preempt_count -> current_thread_info
> * um's current_thread_info is at the current stack pointer & mask,
> hence it is often not valid when on small temporary stacks.
> Therefore, userspace_tramp can't printk.
> 
> I'm wondering, is this issue of printk being broken in userspace_tramp
> an issue worth fixing? Has there been prior discussions on it?

Yeah, printk is only usable from kernel threads. So in other cases
logging should happen directly. But do note that os_info with its
current glibc printf implementation seems to also create issues
sometimes as it may require more stack than available.

In my seccomp patchset, I added a few patches to fix related issues,
see
 https://patchwork.ozlabs.org/project/linux-um/list/?series=231980
specifically
 * [06/27] um: Don't use vfprintf() for os_info()
 * [07/27] um: Do not use printk in SIGWINCH helper thread
 * [09/27] um: Do not use printk in userspace trampoline

Benjamin
YiFei Zhu April 19, 2021, 3:07 p.m. UTC | #6
On Mon, Apr 19, 2021 at 8:35 AM Benjamin Berg <benjamin@sipsolutions.net> wrote:
> Yeah, printk is only usable from kernel threads. So in other cases
> logging should happen directly. But do note that os_info with its
> current glibc printf implementation seems to also create issues
> sometimes as it may require more stack than available.
>
> In my seccomp patchset, I added a few patches to fix related issues,
> see
>  https://patchwork.ozlabs.org/project/linux-um/list/?series=231980
> specifically
>  * [06/27] um: Don't use vfprintf() for os_info()
>  * [07/27] um: Do not use printk in SIGWINCH helper thread
>  * [09/27] um: Do not use printk in userspace trampoline
>
> Benjamin

I see! Thanks. Good to know there's a patch working to fix this issue.

YiFei Zhu
diff mbox series

Patch

diff --git a/arch/um/os-Linux/helper.c b/arch/um/os-Linux/helper.c
index 9fa6e4187d4f..32e88baf18dd 100644
--- a/arch/um/os-Linux/helper.c
+++ b/arch/um/os-Linux/helper.c
@@ -64,7 +64,7 @@  int run_helper(void (*pre_exec)(void *), void *pre_data, char **argv)
 		goto out_close;
 	}
 
-	sp = stack + UM_KERN_PAGE_SIZE - sizeof(void *);
+	sp = stack + UM_KERN_PAGE_SIZE;
 	data.pre_exec = pre_exec;
 	data.pre_data = pre_data;
 	data.argv = argv;
@@ -120,7 +120,7 @@  int run_helper_thread(int (*proc)(void *), void *arg, unsigned int flags,
 	if (stack == 0)
 		return -ENOMEM;
 
-	sp = stack + UM_KERN_PAGE_SIZE - sizeof(void *);
+	sp = stack + UM_KERN_PAGE_SIZE;
 	pid = clone(proc, (void *) sp, flags, arg);
 	if (pid < 0) {
 		err = -errno;