diff mbox series

[RFC,v2,10/13] x86/um: nommu: signal handling

Message ID 5a769da2dcc8e7f9b89fbdbc4bccd0b8a1660309.1731290567.git.thehajime@gmail.com
State RFC
Headers show
Series nommu UML | expand

Commit Message

Hajime Tazaki Nov. 11, 2024, 6:27 a.m. UTC
This commit updates the behavior of signal handling under !MMU
environment. 1) the stack preparation for the signal handlers and
2) restoration of stack after rt_sigreturn(2) syscall.  Those are needed
as the stack usage on vfork(2) syscall is different.

It also adds the follow up routine for SIGSEGV as a signal delivery runs
in the same stack frame while we have to avoid endless SIGSEGV.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
---
 arch/um/include/shared/kern_util.h |  3 +++
 arch/um/kernel/trap.c              | 10 ++++++++
 arch/um/os-Linux/signal.c          | 18 ++++++++++++++-
 arch/x86/um/signal.c               | 37 +++++++++++++++++++++++++++++-
 4 files changed, 66 insertions(+), 2 deletions(-)

Comments

Benjamin Berg Nov. 28, 2024, 10:37 a.m. UTC | #1
Hi,

On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> This commit updates the behavior of signal handling under !MMU
> environment. 1) the stack preparation for the signal handlers and
> 2) restoration of stack after rt_sigreturn(2) syscall.  Those are needed
> as the stack usage on vfork(2) syscall is different.
> 
> It also adds the follow up routine for SIGSEGV as a signal delivery runs
> in the same stack frame while we have to avoid endless SIGSEGV.
> 
> Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
> ---
>  arch/um/include/shared/kern_util.h |  3 +++
>  arch/um/kernel/trap.c              | 10 ++++++++
>  arch/um/os-Linux/signal.c          | 18 ++++++++++++++-
>  arch/x86/um/signal.c               | 37 +++++++++++++++++++++++++++++-
>  4 files changed, 66 insertions(+), 2 deletions(-)
> 
> [SNIP]
> diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
> index 52852018a3ad..a06622415d8f 100644
> --- a/arch/um/os-Linux/signal.c
> +++ b/arch/um/os-Linux/signal.c
> @@ -36,7 +36,15 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
>   struct uml_pt_regs r;
>   int save_errno = errno;
>  
> - r.is_user = 0;
> +#ifndef CONFIG_MMU
> + memset(&r, 0, sizeof(r));
> + /* mark is_user=1 when the IP is from userspace code. */
> + if (mc && (REGS_IP(mc->gregs) > uml_reserved
> +    && REGS_IP(mc->gregs) < high_physmem))
> + r.is_user = 1;
> + else
> +#endif
> + r.is_user = 0;

Does this work if we load modules dynamically?

I suppose one could map them into a separate memory area rather than
running them directly from the physical memory.
Otherwise we'll also get problem with the SECCOMP filter.

>   if (sig == SIGSEGV) {
>   /* For segfaults, we want the data from the sigcontext. */
>   get_regs_from_mc(&r, mc);
> @@ -191,6 +199,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>   ucontext_t *uc = p;
>   mcontext_t *mc = &uc->uc_mcontext;
>   unsigned long pending = 1UL << sig;
> + int is_segv = 0;
>  
>   do {
>   int nested, bail;
> @@ -214,6 +223,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>  
>   while ((sig = ffs(pending)) != 0){
>   sig--;
> + is_segv = (sig == SIGSEGV) ? 1 : 0;
>   pending &= ~(1 << sig);
>   (*handlers[sig])(sig, (struct siginfo *)si, mc);
>   }
> @@ -227,6 +237,12 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
>   if (!nested)
>   pending = from_irq_stack(nested);
>   } while (pending);
> +
> +#ifndef CONFIG_MMU
> + /* if there is SIGSEGV notified, let the userspace run w/ __noreturn */
> + if (is_segv)
> + sigsegv_post_routine();
> +#endif
>  }

I am confused, this doesn't feel quite correct to me.

So, for normal UML, I think we always do an rt_sigreturn. Which means,
we always go back to the corresponding *kernel* task. To schedule in
response to SIGALRM, we forward the signal to the userspace process.
I believe that means:
   1. We cannot schedule kernel threads (that seems like a bug)
   2. Scheduling for userspace happens once the signal is delivered.
      Then userspace() saves the state and calls interrupt_end().


Now, keep in mind that we are on the separate signal stack here. If we
jump anywhere directly, we abandon the old state information stored by
the host kernel into the mcontext. We can absolutely do that, but we
need to be careful to not forget anything.

As such, I wonder whether nommu should:
   1. When entering from kernel, update "current->thread.switch_buf"
      from the mcontext.
       - If we need to schedule, push a stack frame that calls the scheduling
         code and returns with the correct state.
   2. When entering from user, store the task registers from the
      mcontext. At some point (here or earlier) ensure that the
      "current->thread.switch_buf" is set up so that we can return to
      userspace by restoring the task registers.
       - To schedule, piggy back on 1. or add special code.
   3. Always do a UML_LONGJMP() back into the "current" task.

That said, I am probably not having the full picture right now.

Benjamin

PS: On a further note, I think the current code to enter userspace
cannot handle single stepping. I suppose that is fine, but you should
probably set arch_has_single_step to 0 for nommu.

>  void set_handler(int sig)
> diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
> index 75087e85b6fd..b7365c75a967 100644
> --- a/arch/x86/um/signal.c
> +++ b/arch/x86/um/signal.c
> @@ -371,6 +371,13 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
>   round_down(stack_top - sizeof(struct rt_sigframe), 16);
>  
>   /* Add required space for math frame */
> +#ifndef CONFIG_MMU
> + /*
> + * the sig_frame on !MMU needs be aligned for SSE as
> + * the frame is used as-is.
> + */
> + math_size = round_down(math_size, 16);
> +#endif
>   frame = (struct rt_sigframe __user *)((unsigned long)frame - math_size);
>  
>   /* Subtract 128 for a red zone and 8 for proper alignment */
> @@ -417,6 +424,18 @@ int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
>   /* could use a vstub here */
>   return err;
>  
> +#ifndef CONFIG_MMU
> + /*
> + * we need to push handler address at top of stack, as
> + * __kernel_vsyscall, called after this returns with ret with
> + * stack contents, thus push the handler here.
> + */
> + frame = (struct rt_sigframe __user *) ((unsigned long) frame -
> +        sizeof(unsigned long));
> + err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
> +   (unsigned long *)frame);
> +#endif
> +
>   if (err)
>   return err;
>  
> @@ -442,9 +461,25 @@ SYSCALL_DEFINE0(rt_sigreturn)
>   unsigned long sp = PT_REGS_SP(&current->thread.regs);
>   struct rt_sigframe __user *frame =
>   (struct rt_sigframe __user *)(sp - sizeof(long));
> - struct ucontext __user *uc = &frame->uc;
> + struct ucontext __user *uc;
>   sigset_t set;
>  
> +#ifndef CONFIG_MMU
> + /**
> + * we enter here with:
> + *
> + * __restore_rt:
> + *     mov $15, %rax
> + *     call *%rax (translated from syscall)
> + *
> + * (code is from musl libc)
> + * so, stack needs to be popped of "call"ed address before
> + * looking at rt_sigframe.
> + */
> + frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
> +#endif
> + uc = &frame->uc;
> +
>   if (copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
>   goto segfault;
>
Hajime Tazaki Dec. 1, 2024, 1:38 a.m. UTC | #2
Hello,

On Thu, 28 Nov 2024 19:37:21 +0900,
Benjamin Berg wrote:

> > +#ifndef CONFIG_MMU
> > + memset(&r, 0, sizeof(r));
> > + /* mark is_user=1 when the IP is from userspace code. */
> > + if (mc && (REGS_IP(mc->gregs) > uml_reserved
> > +    && REGS_IP(mc->gregs) < high_physmem))
> > + r.is_user = 1;
> > + else
> > +#endif
> > + r.is_user = 0;
> 
> Does this work if we load modules dynamically?
> 
> I suppose one could map them into a separate memory area rather than
> running them directly from the physical memory.
> Otherwise we'll also get problem with the SECCOMP filter.

currently, I thought modules use the separate area from execmem, but
nommu allocator ignores this location info to map the memory; instead
mixing up with area used by userspace programs.

we may be able to come up with execmem_arch_setup() to fix this
situation.

so, no, this is_user detection doesn't work; modules also become
is_user=1.

MMU full allocator (normal UML and seccomp asl well ?) seems to be
fine as long as using execmem.

I will look into detail how we should handle.

> >   if (sig == SIGSEGV) {
> >   /* For segfaults, we want the data from the sigcontext. */
> >   get_regs_from_mc(&r, mc);
> > @@ -191,6 +199,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >   ucontext_t *uc = p;
> >   mcontext_t *mc = &uc->uc_mcontext;
> >   unsigned long pending = 1UL << sig;
> > + int is_segv = 0;
> >  
> >   do {
> >   int nested, bail;
> > @@ -214,6 +223,7 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >  
> >   while ((sig = ffs(pending)) != 0){
> >   sig--;
> > + is_segv = (sig == SIGSEGV) ? 1 : 0;
> >   pending &= ~(1 << sig);
> >   (*handlers[sig])(sig, (struct siginfo *)si, mc);
> >   }
> > @@ -227,6 +237,12 @@ static void hard_handler(int sig, siginfo_t *si, void *p)
> >   if (!nested)
> >   pending = from_irq_stack(nested);
> >   } while (pending);
> > +
> > +#ifndef CONFIG_MMU
> > + /* if there is SIGSEGV notified, let the userspace run w/ __noreturn */
> > + if (is_segv)
> > + sigsegv_post_routine();
> > +#endif
> >  }
> 
> I am confused, this doesn't feel quite correct to me.

thanks for pointing this out.  the above code, which I spot the
working example under nommu, is indeed suspicious and doesn't look a
right code.

that signal handing (this patch) is immature, and need more work to
understand existing code, nommu characteristic, etc.

> So, for normal UML, I think we always do an rt_sigreturn. Which means,
> we always go back to the corresponding *kernel* task. To schedule in
> response to SIGALRM, we forward the signal to the userspace process.
> I believe that means:
>    1. We cannot schedule kernel threads (that seems like a bug)
>    2. Scheduling for userspace happens once the signal is delivered.
>       Then userspace() saves the state and calls interrupt_end().
> 
> 
> Now, keep in mind that we are on the separate signal stack here. If we
> jump anywhere directly, we abandon the old state information stored by
> the host kernel into the mcontext. We can absolutely do that, but we
> need to be careful to not forget anything.
> 
> As such, I wonder whether nommu should:
>    1. When entering from kernel, update "current->thread.switch_buf"
>       from the mcontext.
>        - If we need to schedule, push a stack frame that calls the scheduling
>          code and returns with the correct state.
>    2. When entering from user, store the task registers from the
>       mcontext. At some point (here or earlier) ensure that the
>       "current->thread.switch_buf" is set up so that we can return to
>       userspace by restoring the task registers.
>        - To schedule, piggy back on 1. or add special code.
>    3. Always do a UML_LONGJMP() back into the "current" task.

thanks, the current code jumps in the signal handler and unblocking
signals without returning the handler (and not calling rt_sigreturn at
host either) upon SIGSEGV, which should not work as you mentioned.

I will also investigate how I can handle.

> That said, I am probably not having the full picture right now.
> 
> Benjamin
> 
> PS: On a further note, I think the current code to enter userspace
> cannot handle single stepping. I suppose that is fine, but you should
> probably set arch_has_single_step to 0 for nommu.

I did almost zero tests with ptrace(2) (inside nommu UM) and might
miss a lot of features that mmu-UM could.  will also look into that.

thanks,

-- Hajime
diff mbox series

Patch

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index f21dc8517538..bcc8d28279ae 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -62,6 +62,9 @@  extern int singlestepping(void);
 extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs);
 extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs);
 extern void fatal_sigsegv(void) __attribute__ ((noreturn));
+#ifndef CONFIG_MMU
+extern void sigsegv_post_routine(void);
+#endif
 
 void um_idle_sleep(void);
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index a7519b3de4bf..b9b54e777894 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -174,6 +174,16 @@  void fatal_sigsegv(void)
 	os_dump_core();
 }
 
+#ifndef CONFIG_MMU
+void sigsegv_post_routine(void)
+{
+	change_sig(SIGIO, 1);
+	change_sig(SIGALRM, 1);
+	change_sig(SIGWINCH, 1);
+	userspace(&current->thread.regs.regs);
+}
+#endif
+
 /**
  * segv_handler() - the SIGSEGV handler
  * @sig:	the signal number
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 52852018a3ad..a06622415d8f 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -36,7 +36,15 @@  static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	struct uml_pt_regs r;
 	int save_errno = errno;
 
-	r.is_user = 0;
+#ifndef CONFIG_MMU
+	memset(&r, 0, sizeof(r));
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (mc && (REGS_IP(mc->gregs) > uml_reserved
+		   && REGS_IP(mc->gregs) < high_physmem))
+		r.is_user = 1;
+	else
+#endif
+		r.is_user = 0;
 	if (sig == SIGSEGV) {
 		/* For segfaults, we want the data from the sigcontext. */
 		get_regs_from_mc(&r, mc);
@@ -191,6 +199,7 @@  static void hard_handler(int sig, siginfo_t *si, void *p)
 	ucontext_t *uc = p;
 	mcontext_t *mc = &uc->uc_mcontext;
 	unsigned long pending = 1UL << sig;
+	int is_segv = 0;
 
 	do {
 		int nested, bail;
@@ -214,6 +223,7 @@  static void hard_handler(int sig, siginfo_t *si, void *p)
 
 		while ((sig = ffs(pending)) != 0){
 			sig--;
+			is_segv = (sig == SIGSEGV) ? 1 : 0;
 			pending &= ~(1 << sig);
 			(*handlers[sig])(sig, (struct siginfo *)si, mc);
 		}
@@ -227,6 +237,12 @@  static void hard_handler(int sig, siginfo_t *si, void *p)
 		if (!nested)
 			pending = from_irq_stack(nested);
 	} while (pending);
+
+#ifndef CONFIG_MMU
+	/* if there is SIGSEGV notified, let the userspace run w/ __noreturn */
+	if (is_segv)
+		sigsegv_post_routine();
+#endif
 }
 
 void set_handler(int sig)
diff --git a/arch/x86/um/signal.c b/arch/x86/um/signal.c
index 75087e85b6fd..b7365c75a967 100644
--- a/arch/x86/um/signal.c
+++ b/arch/x86/um/signal.c
@@ -371,6 +371,13 @@  int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		round_down(stack_top - sizeof(struct rt_sigframe), 16);
 
 	/* Add required space for math frame */
+#ifndef CONFIG_MMU
+	/*
+	 * the sig_frame on !MMU needs be aligned for SSE as
+	 * the frame is used as-is.
+	 */
+	math_size = round_down(math_size, 16);
+#endif
 	frame = (struct rt_sigframe __user *)((unsigned long)frame - math_size);
 
 	/* Subtract 128 for a red zone and 8 for proper alignment */
@@ -417,6 +424,18 @@  int setup_signal_stack_si(unsigned long stack_top, struct ksignal *ksig,
 		/* could use a vstub here */
 		return err;
 
+#ifndef CONFIG_MMU
+	/*
+	 * we need to push handler address at top of stack, as
+	 * __kernel_vsyscall, called after this returns with ret with
+	 * stack contents, thus push the handler here.
+	 */
+	frame = (struct rt_sigframe __user *) ((unsigned long) frame -
+					       sizeof(unsigned long));
+	err |= __put_user((unsigned long)ksig->ka.sa.sa_handler,
+			  (unsigned long *)frame);
+#endif
+
 	if (err)
 		return err;
 
@@ -442,9 +461,25 @@  SYSCALL_DEFINE0(rt_sigreturn)
 	unsigned long sp = PT_REGS_SP(&current->thread.regs);
 	struct rt_sigframe __user *frame =
 		(struct rt_sigframe __user *)(sp - sizeof(long));
-	struct ucontext __user *uc = &frame->uc;
+	struct ucontext __user *uc;
 	sigset_t set;
 
+#ifndef CONFIG_MMU
+	/**
+	 * we enter here with:
+	 *
+	 * __restore_rt:
+	 *     mov $15, %rax
+	 *     call *%rax (translated from syscall)
+	 *
+	 * (code is from musl libc)
+	 * so, stack needs to be popped of "call"ed address before
+	 * looking at rt_sigframe.
+	 */
+	frame = (struct rt_sigframe __user *)((unsigned long)frame + sizeof(long));
+#endif
+	uc = &frame->uc;
+
 	if (copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
 		goto segfault;