mbox series

[0/4] next round of OPAL debugging improvements

Message ID 20180408064939.18879-1-npiggin@gmail.com
Headers show
Series next round of OPAL debugging improvements | expand

Message

Nicholas Piggin April 8, 2018, 6:49 a.m. UTC
The r13 clobber bug has been fixed in Linux upstream now, the next
bit is the skiboot stack corruption caused by re-entry from the OS.

Patches 1-3 have been posted here before in various RFCs. They give
nicer back traces, make quiescing a bit more robust, and stop the
stack from being destroyed, and make quiescing a bit more robust.

Patch 4 is one I've found useful when testing with sreset injection
into various parts of skiboot code.

We can in general interrupt and resume OPAL after this series.
Many specific cases are unlikely to work for other reasons, but
we have a cut down sub-set of OPAL API that we allow through, so
the task of making those reasonably robust should be manageable.

Thanks,
Nick

Nicholas Piggin (4):
  core/stack: backtrace unwind basic OPAL call details
  asm/head: implement quiescing without stack or clobbering regs
  core/opal: Emergency stack for re-entry
  core/opal: Allow poller re-entry if OPAL was re-entered

 asm/asm-offsets.c |   2 ++
 asm/head.S        | 100 ++++++++++++++++++++++++++++++++++++++++++++++++------
 core/cpu.c        |  13 ++++---
 core/opal.c       |  53 ++++++++++++-----------------
 core/stack.c      |  46 ++++++++++++++++++++-----
 include/cpu.h     |   3 +-
 include/mem-map.h |  10 +++---
 include/stack.h   |  34 ++++++++++++++++---
 8 files changed, 196 insertions(+), 65 deletions(-)

Comments

Stewart Smith April 19, 2018, 6:24 a.m. UTC | #1
Nicholas Piggin <npiggin@gmail.com> writes:
> The r13 clobber bug has been fixed in Linux upstream now, the next
> bit is the skiboot stack corruption caused by re-entry from the OS.
>
> Patches 1-3 have been posted here before in various RFCs. They give
> nicer back traces, make quiescing a bit more robust, and stop the
> stack from being destroyed, and make quiescing a bit more robust.
>
> Patch 4 is one I've found useful when testing with sreset injection
> into various parts of skiboot code.
>
> We can in general interrupt and resume OPAL after this series.
> Many specific cases are unlikely to work for other reasons, but
> we have a cut down sub-set of OPAL API that we allow through, so
> the task of making those reasonably robust should be manageable.

Neat.

I haven't hammered the series with sreset and fun like that, but we
should probably add something along those lines to op-test (hit it a lot
more when we had more bugs :)

When I build with STACK_CHECK=1, it seems that we never get below 10k of
free stack, but I'm not quite bold enough to suggest "8k is enough for
everyone", and just eating the extra memory for stacks is probably okay.

Series merged to master as of 87f55507195a166fb12ed1240ed3221964f11f40
Nicholas Piggin April 19, 2018, 6:57 a.m. UTC | #2
On Thu, 19 Apr 2018 16:24:06 +1000
Stewart Smith <stewart@linux.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> > The r13 clobber bug has been fixed in Linux upstream now, the next
> > bit is the skiboot stack corruption caused by re-entry from the OS.
> >
> > Patches 1-3 have been posted here before in various RFCs. They give
> > nicer back traces, make quiescing a bit more robust, and stop the
> > stack from being destroyed, and make quiescing a bit more robust.
> >
> > Patch 4 is one I've found useful when testing with sreset injection
> > into various parts of skiboot code.
> >
> > We can in general interrupt and resume OPAL after this series.
> > Many specific cases are unlikely to work for other reasons, but
> > we have a cut down sub-set of OPAL API that we allow through, so
> > the task of making those reasonably robust should be manageable.  
> 
> Neat.
> 
> I haven't hammered the series with sreset and fun like that, but we
> should probably add something along those lines to op-test (hit it a lot
> more when we had more bugs :)

Yeah, I've mostly been injecting in mambo, so far. We should be able
to add some tests. I'll think about it too.

> 
> When I build with STACK_CHECK=1, it seems that we never get below 10k of
> free stack, but I'm not quite bold enough to suggest "8k is enough for
> everyone", and just eating the extra memory for stacks is probably okay.

Yeah I don't see it's a really big problem. Worth the added robustness.

> Series merged to master as of 87f55507195a166fb12ed1240ed3221964f11f40

Cool. I think this brings us to a *reasonably* good place in terms of
coping with NMIs and re-entry from Linux. Maybe a few minor tweaks, but
we should be getting more solid.

Next chapter is some kind of OPAL_DEBUG API we've talked about, but
that'll be for another day. Concentrate on getting all this well tested
and stable first.

Thanks,
Nick
Stewart Smith April 20, 2018, 5:54 a.m. UTC | #3
Nicholas Piggin <npiggin@gmail.com> writes:
> Cool. I think this brings us to a *reasonably* good place in terms of
> coping with NMIs and re-entry from Linux. Maybe a few minor tweaks, but
> we should be getting more solid.
>
> Next chapter is some kind of OPAL_DEBUG API we've talked about, but
> that'll be for another day. Concentrate on getting all this well tested
> and stable first.

I compare to stability of turning on stop4/5 and think this is the most
solid thing in the entire universe.

(i think that was meant to be my inside voice....)