diff mbox

[GIT] Sparc

Message ID 20100922183200.GC19804@ZenIV.linux.org.uk
State RFC
Delegated to: David Miller
Headers show

Commit Message

Al Viro Sept. 22, 2010, 6:32 p.m. UTC
On Wed, Sep 22, 2010 at 11:10:19AM -0700, David Miller wrote:
> 
> Al Viro has been auditing signal handling in various architectures,
> and he found some sparc bugs along the way.
> 
> Those he found are fixed here, as well as a short patch to add
> raw perf event support.

Actually, there's one more problem; I have a tentative fix for sparc64,
but not for sparc32 (yet).

Basically, sparc is unlike all other architectures in this respect:
normally, after do_signal() has handled a signal the syscall exit
glue loops back, rechecks if we need to be rescheduled, if we need
to deal with more signals, etc.  As the result, we build all sigframes
at once.  sparc (and, until recently, alpha) normally calls do_signal()
only once; sparc64 _may_ call it twice, but that's it.  sparc32 may
call it up to three times if we had userland window spill on entry.

That has unpleasant results - for starters, delivery of SIGSEGV upon
failure to set sigframe up is delayed unpredictably; we will take it
only when we trap again.  Moreover, sigsuspend() has inconsistent
behaviour in case when we have several pending signals unmasked by
new mask.  E.g. if we do the following on anything but sparc
	block signals 1 and 2
	set handlers for both
	get signals 1 and 2 sent to you
	sigsuspend(&emptyset)
we'll get both handlers run.  On sparc we'll get only one of them, except
that if we get an IRQ while running in its handler, we'll get both run...

What happens is that sparc (and alpha until it got fixed) sets up one sigframe
and happily buggers off to userland; the first handler runs until we hit
a trap.  If that trap happens to be the call of sigreturn() in the end of
handler, the first thing it'll do will be restoring the original sigmask.
Anything else will leave sigmask alone, see that we have another pending
signal and handle it...

Anyway, the sparc64 part of fix follows; I'm still digging through the
sparc32 side of things.  Dave, could you take a look at that and ACK or NAK
it?

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Linus Torvalds Sept. 22, 2010, 6:46 p.m. UTC | #1
On Wed, Sep 22, 2010 at 11:32 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> That has unpleasant results - for starters, delivery of SIGSEGV upon
> failure to set sigframe up is delayed unpredictably; we will take it
> only when we trap again.

I think this whole argument is a total red herring.

It's a bug in next_signal() if we allow this to happen. We need to
enqueue those synchronous signals first, and NO AMOUNT OF SIGNAL
QUEUEING will ever change that.

The fact is, even if you queue up all the signals at once, you need to
queue up the synchronous ones first. Otherwise their instruction
pointer information etc will simply be bogus. It's that simple. Your
argument about queuing up one, two, or more signals is bogus, for the
simple reason that it doesn't matter: whether you queue or not is
irrelevant, since the "innermost" one in the queue always has to be
the SIGSEGV.

Whether we queue other signals on top of that (and they get executed
first, since it's a stack) doesn't matter. That's a timing issue, and
the program acts as if those asynchronous signals happened before the
trap. But that's fine. All that matters is that the actual synchronous
signal has the register contents of the time of the synchronous trap,
ie it gets enqueued first.

It's why we have that "if (x & SYNCHRONOUS_MASK)" in next_signal().
It's not pretty, it's not perfect, but it's required.

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 22, 2010, 6:53 p.m. UTC | #2
On Wed, Sep 22, 2010 at 11:46:27AM -0700, Linus Torvalds wrote:
> On Wed, Sep 22, 2010 at 11:32 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > That has unpleasant results - for starters, delivery of SIGSEGV upon
> > failure to set sigframe up is delayed unpredictably; we will take it
> > only when we trap again.
> 
> I think this whole argument is a total red herring.
> 
> It's a bug in next_signal() if we allow this to happen. We need to
> enqueue those synchronous signals first, and NO AMOUNT OF SIGNAL
> QUEUEING will ever change that.
> 
> The fact is, even if you queue up all the signals at once, you need to
> queue up the synchronous ones first. Otherwise their instruction
> pointer information etc will simply be bogus. It's that simple. Your
> argument about queuing up one, two, or more signals is bogus, for the
> simple reason that it doesn't matter: whether you queue or not is
> irrelevant, since the "innermost" one in the queue always has to be
> the SIGSEGV.
> 
> Whether we queue other signals on top of that (and they get executed
> first, since it's a stack) doesn't matter. That's a timing issue, and
> the program acts as if those asynchronous signals happened before the
> trap. But that's fine. All that matters is that the actual synchronous
> signal has the register contents of the time of the synchronous trap,
> ie it gets enqueued first.
> 
> It's why we have that "if (x & SYNCHRONOUS_MASK)" in next_signal().
> It's not pretty, it's not perfect, but it's required.

Um, no.  You've *already* called get_signal_to_deliver().  There had been
no SIGSEGV in sight.  You happily went on to set a sigframe for e.g.
SIGHUP, but ran out of stack.  At that point you get force_sigsegv()
from handle_signal().  _NOW_ you have a pending SIGSEGV; whether you'll
be able to handle it (e.g. if your SIGSEGV handler is set to run on
altstack) or not, you won't get to it until you call get_signal_to_deliver()
again.  Which requires do_signal() to run.

Sure, it will be the first one to be picked, but we need to try and pick
_something_ to get it.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 22, 2010, 7:04 p.m. UTC | #3
On Wed, Sep 22, 2010 at 07:53:28PM +0100, Al Viro wrote:

> Um, no.  You've *already* called get_signal_to_deliver().  There had been
> no SIGSEGV in sight.  You happily went on to set a sigframe for e.g.
> SIGHUP, but ran out of stack.  At that point you get force_sigsegv()
> from handle_signal().  _NOW_ you have a pending SIGSEGV; whether you'll
> be able to handle it (e.g. if your SIGSEGV handler is set to run on
> altstack) or not, you won't get to it until you call get_signal_to_deliver()
> again.  Which requires do_signal() to run.
> 
> Sure, it will be the first one to be picked, but we need to try and pick
> _something_ to get it.

PS: on something like x86 that'll happen before we return to userland,
since the glue on syscall exit does that:

int_signal:
        testl $_TIF_DO_NOTIFY_MASK,%edx
        jz 1f
        movq %rsp,%rdi          # &ptregs -> arg1
        xorl %esi,%esi          # oldset -> arg2
        call do_notify_resume
1:      movl $_TIF_WORK_MASK,%edi
int_restore_rest:
        RESTORE_REST
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF
        jmp int_with_check

and int_with_check leads back to int_signal if _TIF_SIGPENDING is still set.

That kind of looping happens on everything except sparc and that's exactly
the difference I'm talking about.  On sparc the sucker manages to escape
to userland with SIGSEGV still pending.  Try it and you'll see - simple
signal(SIGHUP, something) + munmap a little bit under your %sp +
raise(SIGHUP), then look at the resulting coredump...  On sparc you'll
get that SIGSEGV later; in fact, putting write(1, "ouch\n", 5); after
that raise() has a good chance of saying ouch before it dumps core.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Sept. 22, 2010, 7:08 p.m. UTC | #4
On Wed, Sep 22, 2010 at 11:53 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Um, no.  You've *already* called get_signal_to_deliver().  There had been
> no SIGSEGV in sight.  You happily went on to set a sigframe for e.g.
> SIGHUP, but ran out of stack.  At that point you get force_sigsegv()
> from handle_signal().  _NOW_ you have a pending SIGSEGV

Ahh. Ok. Different case from the one I thought you were worried about.
And yeah, I guess that one does require us to mess with the low-level
asm code (although I do wonder if we could not make the whole
do_notify_resume + reschedule code be generic C code - it's a lot of
duplicated subtle asm as it is).

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 22, 2010, 7:53 p.m. UTC | #5
On Wed, Sep 22, 2010 at 12:08:53PM -0700, Linus Torvalds wrote:
> On Wed, Sep 22, 2010 at 11:53 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Um, no. ?You've *already* called get_signal_to_deliver(). ?There had been
> > no SIGSEGV in sight. ?You happily went on to set a sigframe for e.g.
> > SIGHUP, but ran out of stack. ?At that point you get force_sigsegv()
> > from handle_signal(). ?_NOW_ you have a pending SIGSEGV
> 
> Ahh. Ok. Different case from the one I thought you were worried about.
> And yeah, I guess that one does require us to mess with the low-level
> asm code (although I do wonder if we could not make the whole
> do_notify_resume + reschedule code be generic C code - it's a lot of
> duplicated subtle asm as it is).

Worse than just that...  Note that on sparc you need to deal with
fault_in_user_windows(), which can also trigger signals.  So much
for merging it cross-architecture, even if we ignore the differences
between the ways we pass data required for restart to do_signal().
Note that e.g. on alpha we pass _two_ values - one for overwritten
v0 (syscall number), another for overwritten a3 (error flag), etc.

Worse, quite a few targets do not save all registers on syscall entry.
Callee-saved stuff doesn't have to be in pt_regs.  Except that you
want *all* of them saved on stack when it's time to fill sigframes.
And once you've entered a C function you can't guarantee that compiler
hasn't already modified them; sure, they'll be restored on return, but
that doesn't help you to get to their values.  So you get switch_stack
built on stack around calling do_notify_resume() on those.  And you
really want to avoid doing that if you've got no signals - remember,
we hit that on exit from irqs as well, so it's one hell of a hot path.
For processors with big flat register file it can get very ugly.

There is a lot of ugly almost-duplication in there, no arguments about that.
Asm glue is actually not the worst part...
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 22, 2010, 8:43 p.m. UTC | #6
On Wed, Sep 22, 2010 at 08:53:49PM +0100, Al Viro wrote:
> On Wed, Sep 22, 2010 at 12:08:53PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 22, 2010 at 11:53 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Um, no. ?You've *already* called get_signal_to_deliver(). ?There had been
> > > no SIGSEGV in sight. ?You happily went on to set a sigframe for e.g.
> > > SIGHUP, but ran out of stack. ?At that point you get force_sigsegv()
> > > from handle_signal(). ?_NOW_ you have a pending SIGSEGV
> > 
> > Ahh. Ok. Different case from the one I thought you were worried about.
> > And yeah, I guess that one does require us to mess with the low-level
> > asm code (although I do wonder if we could not make the whole
> > do_notify_resume + reschedule code be generic C code - it's a lot of
> > duplicated subtle asm as it is).
> 
> Worse than just that...  Note that on sparc you need to deal with
> fault_in_user_windows(), which can also trigger signals.

	Actually, I wonder why don't we do the following:
1) check wsaved first, do fault_in_user_windows() if needed (and probably do
Something Cruel(tm) if we fail copy_to_user() in there)
2) in a loop check if we need to reschedule / if we need to handle signals
3) don't bother with wsaved checks in setup_frame() variants at all -
wsaved can't grow back at that point; we also can use flush_user_windows()
instead of full synchronize_user_stack() in there.

It's definitely a separate patch, but it looks like it might be worth
doing...  Comments?
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 22, 2010, 9:15 p.m. UTC | #7
From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Wed, 22 Sep 2010 21:43:24 +0100

> 	Actually, I wonder why don't we do the following:
> 1) check wsaved first, do fault_in_user_windows() if needed (and probably do
> Something Cruel(tm) if we fail copy_to_user() in there)
> 2) in a loop check if we need to reschedule / if we need to handle signals
> 3) don't bother with wsaved checks in setup_frame() variants at all -
> wsaved can't grow back at that point; we also can use flush_user_windows()
> instead of full synchronize_user_stack() in there.
> 
> It's definitely a separate patch, but it looks like it might be worth
> doing...  Comments?

Ok, let me think about this.

I think there are two things:

1) I think I originally intended to allow the signal dispatch to
   succeed even if we had windows we couldn't fault in.

   The idea is that the wsaved windows would be put into the signal
   frame.

   This never was implemented, however...

2) It would be nice to, in this case, still be able to let the debugger
   have a look.

   And part of "having a look" would mean letting it see the registers
   from the windows we couldn't save onto the stack.

   Otherwise GDB is just going to try and access the stack pages we
   were unable to.

Making #2 work could be done with an implementation of #1, since GDB
would need to be able to fetch the values easily from somewhere and
the signal frame storage we create for #1 would be as good as any.

I can't see much utility for a user signal handler for #1, however,
other than getting an accurate stack backtrace to emit a crash log
message or similar.

Alternatively, #2 could be implemented using a special ptrace getregs
call created specifically to fetch the windowed registers.  And
the regset implementation of that could be used for dumping them
into core files as well, and this specifically I've been meaning to
do at some point.

Again, let me think about this some more.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 22, 2010, 11:12 p.m. UTC | #8
On Wed, Sep 22, 2010 at 02:15:14PM -0700, David Miller wrote:

> Alternatively, #2 could be implemented using a special ptrace getregs
> call created specifically to fetch the windowed registers.  And
> the regset implementation of that could be used for dumping them
> into core files as well, and this specifically I've been meaning to
> do at some point.
> 
> Again, let me think about this some more.

OK... sparc32 question: just what the !@#!@# happens if sun4c_rett_stackchk
find (%fp & 7) != 0?  We go to ret_trap_user_stack_is_bolixed, which
tries to page in the underlying page.  OK, suppose it's already there and
readable; we return without doing anything - and go to signal_p.  Which finds
itself with nothing to do, and %fp is *still* buggered.  Spin ad infinitum?
srmmu_rett_stackchk will do the same, BTW...
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 23, 2010, 4:59 a.m. UTC | #9
On Wed, Sep 22, 2010 at 12:08:53PM -0700, Linus Torvalds wrote:
> On Wed, Sep 22, 2010 at 11:53 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Um, no. ?You've *already* called get_signal_to_deliver(). ?There had been
> > no SIGSEGV in sight. ?You happily went on to set a sigframe for e.g.
> > SIGHUP, but ran out of stack. ?At that point you get force_sigsegv()
> > from handle_signal(). ?_NOW_ you have a pending SIGSEGV
> 
> Ahh. Ok. Different case from the one I thought you were worried about.
> And yeah, I guess that one does require us to mess with the low-level
> asm code (although I do wonder if we could not make the whole
> do_notify_resume + reschedule code be generic C code - it's a lot of
> duplicated subtle asm as it is).

BTW, there's an interesting idea in s390 implementation (and I have to say
that I'm bloody impressed by them - it's the only architecture besides x86
where I haven't found serious holes in signal handling yet; there are QOI
issues, but that's it so far).  What they do with syscall restarts is unusual
and they might have a good point there.

1) They deal with restart immediately on the entry to do_signal(); if
restarts are not suppressed and if the error is one of the restart-worthy
ones, they do what should be done for no-handler case.

2) They store the pre-restart address, post-restart address and error.  Then
they call get_signal_to_deliver().  Of course, it may return us a positive
signal number.  In that case they may need to discard the restart they'd
done.  And they do it, but only if the address has remained equal to
post-restart one.

3) They ignore ERESTART_RESTARTBLOCK if the address has changed.  Actually,
I suspect that they might need to clear the ->restart_block.fn in that case,
but I haven't done analysis of that yet.

They do have a reason for doing it that way and it's worth considering on
other platforms.  Think what happens if we are getting traced.  We'll be
stopped and tracer will be notified.  Normally it'll tell us to continue
execution, possibly with a different signal *AND* with a different userland
address to return to.  Suppose we've got a different return address set
for us (e.g. with PTRACE_POKEUSR).  Should we ever shift it back by what
hopefully is a size of syscall insn?
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 24, 2010, 4:48 a.m. UTC | #10
From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Thu, 23 Sep 2010 00:12:19 +0100

> OK... sparc32 question: just what the !@#!@# happens if sun4c_rett_stackchk
> find (%fp & 7) != 0?  We go to ret_trap_user_stack_is_bolixed, which
> tries to page in the underlying page.  OK, suppose it's already there and
> readable; we return without doing anything - and go to signal_p.  Which finds
> itself with nothing to do, and %fp is *still* buggered.  Spin ad infinitum?
> srmmu_rett_stackchk will do the same, BTW...

That's a bug.

Likely all of the window_*_fault() routines should force a SIGILL when
the stack is mis-aligned.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 24, 2010, 5:01 a.m. UTC | #11
From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Thu, 23 Sep 2010 05:59:22 +0100

> They do have a reason for doing it that way and it's worth considering on
> other platforms.  Think what happens if we are getting traced.  We'll be
> stopped and tracer will be notified.  Normally it'll tell us to continue
> execution, possibly with a different signal *AND* with a different userland
> address to return to.  Suppose we've got a different return address set
> for us (e.g. with PTRACE_POKEUSR).  Should we ever shift it back by what
> hopefully is a size of syscall insn?

These "tracer changing the program counter" issues are why on sparc we
have this software state bit in the %tstate/%psr we report via regsets
(and thus via ptrace) to debuggers which tells if we are inside of a
syscall or not.

GDB uses this to know whether the kernel signal handling is going to
modify the program counter or not when the inferior continues.

If GDB ever writes the program counter of the inferior, it clears
this "in-syscall" bit, and this short-circuits the restart syscall
logic in the signal dispatch code.

This is what the GDB code looks like:

static void
sparc_linux_write_pc (struct regcache *regcache, CORE_ADDR pc)
{
  struct gdbarch_tdep *tdep = gdbarch_tdep (get_regcache_arch (regcache));
  ULONGEST psr;

  regcache_cooked_write_unsigned (regcache, tdep->pc_regnum, pc);
  regcache_cooked_write_unsigned (regcache, tdep->npc_regnum, pc + 4);

  /* Clear the "in syscall" bit to prevent the kernel from
     messing with the PCs we just installed, if we happen to be
     within an interrupted system call that the kernel wants to
     restart.

     Note that after we return from the dummy call, the PSR et al.
     registers will be automatically restored, and the kernel
     continues to restart the system call at this point.  */
  regcache_cooked_read_unsigned (regcache, SPARC32_PSR_REGNUM, &psr);
  psr &= ~PSR_SYSCALL;
  regcache_cooked_write_unsigned (regcache, SPARC32_PSR_REGNUM, psr);
}
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/sparc/kernel/rtrap_64.S b/arch/sparc/kernel/rtrap_64.S
index 090b9e9..77f1b95 100644
--- a/arch/sparc/kernel/rtrap_64.S
+++ b/arch/sparc/kernel/rtrap_64.S
@@ -34,37 +34,9 @@  __handle_preemption:
 __handle_user_windows:
 		call			fault_in_user_windows
 		 wrpr			%g0, RTRAP_PSTATE, %pstate
-		wrpr			%g0, RTRAP_PSTATE_IRQOFF, %pstate
-		/* Redo sched+sig checks */
-		ldx			[%g6 + TI_FLAGS], %l0
-		andcc			%l0, _TIF_NEED_RESCHED, %g0
-
-		be,pt			%xcc, 1f
-		 nop
-		call			schedule
-		 wrpr			%g0, RTRAP_PSTATE, %pstate
-		wrpr			%g0, RTRAP_PSTATE_IRQOFF, %pstate
-		ldx			[%g6 + TI_FLAGS], %l0
-
-1:		andcc			%l0, _TIF_DO_NOTIFY_RESUME_MASK, %g0
-		be,pt			%xcc, __handle_user_windows_continue
-		 nop
-		mov			%l5, %o1
-		add			%sp, PTREGS_OFF, %o0
-		mov			%l0, %o2
-
-		call			do_notify_resume
-		 wrpr			%g0, RTRAP_PSTATE, %pstate
-		wrpr			%g0, RTRAP_PSTATE_IRQOFF, %pstate
-		/* Signal delivery can modify pt_regs tstate, so we must
-		 * reload it.
-		 */
-		ldx			[%sp + PTREGS_OFF + PT_V9_TSTATE], %l1
-		sethi			%hi(0xf << 20), %l4
-		and			%l1, %l4, %l4
-		ba,pt			%xcc, __handle_user_windows_continue
+		ba,pt			%xcc, __handle_preemption_continue
+		 wrpr			%g0, RTRAP_PSTATE_IRQOFF, %pstate
 
-		 andn			%l1, %l4, %l1
 __handle_userfpu:
 		rd			%fprs, %l5
 		andcc			%l5, FPRS_FEF, %g0
@@ -87,7 +59,7 @@  __handle_signal:
 		ldx			[%sp + PTREGS_OFF + PT_V9_TSTATE], %l1
 		sethi			%hi(0xf << 20), %l4
 		and			%l1, %l4, %l4
-		ba,pt			%xcc, __handle_signal_continue
+		ba,pt			%xcc, __handle_preemption_continue
 		 andn			%l1, %l4, %l1
 
 		/* When returning from a NMI (%pil==15) interrupt we want to
@@ -177,11 +149,9 @@  __handle_preemption_continue:
 		bne,pn			%xcc, __handle_preemption
 		 andcc			%l0, _TIF_DO_NOTIFY_RESUME_MASK, %g0
 		bne,pn			%xcc, __handle_signal
-__handle_signal_continue:
 		 ldub			[%g6 + TI_WSAVED], %o2
 		brnz,pn			%o2, __handle_user_windows
 		 nop
-__handle_user_windows_continue:
 		sethi			%hi(TSTATE_PEF), %o0
 		andcc			%l1, %o0, %g0