[v7,8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

Message ID 153365146712.14256.11869543914717297278.stgit@jupiter.in.ibm.com
State Changes Requested
Headers show
Series
  • powerpc/pseries: Machine check handler improvements.
Related show

Checks

Context Check Description
snowpatch_ozlabs/apply_patch fail Failed to apply to any branch

Commit Message

Mahesh J Salgaonkar Aug. 7, 2018, 2:17 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Introduce recovery action for recovered memory errors (MCEs). There are
soft memory errors like SLB Multihit, which can be a result of a bad
hardware OR software BUG. Kernel can easily recover from these soft errors
by flushing SLB contents. After the recovery kernel can still continue to
function without any issue. But in some scenario's we may keep getting
these soft errors until the root cause is fixed. To be able to analyze and
find the root cause, best way is to gather enough data and system state at
the time of MCE. Hence this patch introduces a sysctl knob where user can
decide either to continue after recovery or panic the kernel to capture the
dump. This will allow one to configure a kernel to capture a dump on MCE
and then toggle back to recovery while dump is being analyzed.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/mce.h         |    2 +
 arch/powerpc/kernel/mce.c              |   58 ++++++++++++++++++++++++++++++++
 arch/powerpc/kernel/traps.c            |    3 +-
 arch/powerpc/platforms/powernv/setup.c |    4 ++
 4 files changed, 66 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Aug. 8, 2018, 2:56 p.m. | #1
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> Introduce recovery action for recovered memory errors (MCEs). There are
> soft memory errors like SLB Multihit, which can be a result of a bad
> hardware OR software BUG. Kernel can easily recover from these soft errors
> by flushing SLB contents. After the recovery kernel can still continue to
> function without any issue. But in some scenario's we may keep getting
> these soft errors until the root cause is fixed. To be able to analyze and
> find the root cause, best way is to gather enough data and system state at
> the time of MCE. Hence this patch introduces a sysctl knob where user can
> decide either to continue after recovery or panic the kernel to capture the
> dump.

I'm not convinced we want this.

As we've discovered it's often not possible to reconstruct what happened
based on a dump anyway.

The key thing you need is the content of the SLB and that's not included
in a dump.

So I think we should dump the SLB content when we get the MCE (which
this series does) and any other useful info, and then if we can recover
we should.

cheers
Aneesh Kumar K.V Aug. 8, 2018, 3:37 p.m. | #2
On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> Introduce recovery action for recovered memory errors (MCEs). There are
>> soft memory errors like SLB Multihit, which can be a result of a bad
>> hardware OR software BUG. Kernel can easily recover from these soft errors
>> by flushing SLB contents. After the recovery kernel can still continue to
>> function without any issue. But in some scenario's we may keep getting
>> these soft errors until the root cause is fixed. To be able to analyze and
>> find the root cause, best way is to gather enough data and system state at
>> the time of MCE. Hence this patch introduces a sysctl knob where user can
>> decide either to continue after recovery or panic the kernel to capture the
>> dump.
> 
> I'm not convinced we want this.
> 
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
> 
> The key thing you need is the content of the SLB and that's not included
> in a dump.
> 
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.
> 

The reasoning there is what if we got multi-hit due to some corruption 
in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
data structure due to wrong pointer. Now that is far fetched, but then 
possible right?. Hence the idea that, if we don't have much insight into 
why a slb multi-hit occur from the dmesg which include slb content, 
slb_cache contents etc, there should be an easy way to force a dump that 
might assist in further debug.

-aneesh
Michal Suchánek Aug. 8, 2018, 4:09 p.m. | #3
On Wed, 8 Aug 2018 21:07:11 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:  
> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>
> >> Introduce recovery action for recovered memory errors (MCEs).
> >> There are soft memory errors like SLB Multihit, which can be a
> >> result of a bad hardware OR software BUG. Kernel can easily
> >> recover from these soft errors by flushing SLB contents. After the
> >> recovery kernel can still continue to function without any issue.
> >> But in some scenario's we may keep getting these soft errors until
> >> the root cause is fixed. To be able to analyze and find the root
> >> cause, best way is to gather enough data and system state at the
> >> time of MCE. Hence this patch introduces a sysctl knob where user
> >> can decide either to continue after recovery or panic the kernel
> >> to capture the dump.  
> > 
> > I'm not convinced we want this.
> > 
> > As we've discovered it's often not possible to reconstruct what
> > happened based on a dump anyway.
> > 
> > The key thing you need is the content of the SLB and that's not
> > included in a dump.
> > 
> > So I think we should dump the SLB content when we get the MCE (which
> > this series does) and any other useful info, and then if we can
> > recover we should.
> >   
> 
> The reasoning there is what if we got multi-hit due to some
> corruption in slb_cache_ptr. ie. some part of kernel is wrongly
> updating the paca data structure due to wrong pointer. Now that is
> far fetched, but then possible right?. Hence the idea that, if we
> don't have much insight into why a slb multi-hit occur from the dmesg
> which include slb content, slb_cache contents etc, there should be an
> easy way to force a dump that might assist in further debug.

Nonetheless this turns all MCEs into crashes. Are there any MCEs that
could happen during normal operation and should be handled by default?

Thanks

Michal
Nicholas Piggin Aug. 9, 2018, 1:43 a.m. | #4
On Thu, 09 Aug 2018 00:56:00 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >
> > Introduce recovery action for recovered memory errors (MCEs). There are
> > soft memory errors like SLB Multihit, which can be a result of a bad
> > hardware OR software BUG. Kernel can easily recover from these soft errors
> > by flushing SLB contents. After the recovery kernel can still continue to
> > function without any issue. But in some scenario's we may keep getting
> > these soft errors until the root cause is fixed. To be able to analyze and
> > find the root cause, best way is to gather enough data and system state at
> > the time of MCE. Hence this patch introduces a sysctl knob where user can
> > decide either to continue after recovery or panic the kernel to capture the
> > dump.  
> 
> I'm not convinced we want this.
> 
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
> 
> The key thing you need is the content of the SLB and that's not included
> in a dump.
> 
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.

Yeah it's a lot of knobs that administrators can hardly be expected to
tune. Hypervisor or firmware should really eventually make the MCE
unrecoverable if we aren't making progress.

That said, x86 has a bunch of options, and for debugging a rare crash
or specialised installations it might be useful. But we should follow
the normal format, /proc/sys/kernel/panic_on_mce.

Thanks,
Nick
Michael Ellerman Aug. 9, 2018, 6:34 a.m. | #5
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> Introduce recovery action for recovered memory errors (MCEs). There are
>>> soft memory errors like SLB Multihit, which can be a result of a bad
>>> hardware OR software BUG. Kernel can easily recover from these soft errors
>>> by flushing SLB contents. After the recovery kernel can still continue to
>>> function without any issue. But in some scenario's we may keep getting
>>> these soft errors until the root cause is fixed. To be able to analyze and
>>> find the root cause, best way is to gather enough data and system state at
>>> the time of MCE. Hence this patch introduces a sysctl knob where user can
>>> decide either to continue after recovery or panic the kernel to capture the
>>> dump.
>> 
>> I'm not convinced we want this.
>> 
>> As we've discovered it's often not possible to reconstruct what happened
>> based on a dump anyway.
>> 
>> The key thing you need is the content of the SLB and that's not included
>> in a dump.
>> 
>> So I think we should dump the SLB content when we get the MCE (which
>> this series does) and any other useful info, and then if we can recover
>> we should.
>
> The reasoning there is what if we got multi-hit due to some corruption 
> in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> data structure due to wrong pointer. Now that is far fetched, but then 
> possible right?. Hence the idea that, if we don't have much insight into 
> why a slb multi-hit occur from the dmesg which include slb content, 
> slb_cache contents etc, there should be an easy way to force a dump that 
> might assist in further debug.

If you're debugging something complex that you can't determine from the
SLB dump then you should be running a debug kernel anyway. And if
anything you want to drop into xmon and sit there, preserving the most
state, rather than taking a dump.

The last SLB multi-hit I debugged was this:

  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db7130d63fd8


Which took quite a while to track down, including a bunch of tracing and
so on. A dump would not have helped in the slightest.

cheers
Nicholas Piggin Aug. 9, 2018, 8:02 a.m. | #6
On Thu, 09 Aug 2018 16:34:07 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:  
> >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>>
> >>> Introduce recovery action for recovered memory errors (MCEs). There are
> >>> soft memory errors like SLB Multihit, which can be a result of a bad
> >>> hardware OR software BUG. Kernel can easily recover from these soft errors
> >>> by flushing SLB contents. After the recovery kernel can still continue to
> >>> function without any issue. But in some scenario's we may keep getting
> >>> these soft errors until the root cause is fixed. To be able to analyze and
> >>> find the root cause, best way is to gather enough data and system state at
> >>> the time of MCE. Hence this patch introduces a sysctl knob where user can
> >>> decide either to continue after recovery or panic the kernel to capture the
> >>> dump.  
> >> 
> >> I'm not convinced we want this.
> >> 
> >> As we've discovered it's often not possible to reconstruct what happened
> >> based on a dump anyway.
> >> 
> >> The key thing you need is the content of the SLB and that's not included
> >> in a dump.
> >> 
> >> So I think we should dump the SLB content when we get the MCE (which
> >> this series does) and any other useful info, and then if we can recover
> >> we should.  
> >
> > The reasoning there is what if we got multi-hit due to some corruption 
> > in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> > data structure due to wrong pointer. Now that is far fetched, but then 
> > possible right?. Hence the idea that, if we don't have much insight into 
> > why a slb multi-hit occur from the dmesg which include slb content, 
> > slb_cache contents etc, there should be an easy way to force a dump that 
> > might assist in further debug.  
> 
> If you're debugging something complex that you can't determine from the
> SLB dump then you should be running a debug kernel anyway. And if
> anything you want to drop into xmon and sit there, preserving the most
> state, rather than taking a dump.

I'm not saying for a dump specifically, just some form of crash. And we
really should have an option to xmon on panic, but that's another story.

I think HA/failover kind of environments use options like this too. If
anything starts going bad they don't want to try limping along but stop
ASAP.

Thanks,
Nick
Ananth N Mavinakayanahalli Aug. 9, 2018, 8:09 a.m. | #7
On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> On Thu, 09 Aug 2018 16:34:07 +1000
> Michael Ellerman <mpe@ellerman.id.au> wrote:
> 
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:  
> > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > >>>
> > >>> Introduce recovery action for recovered memory errors (MCEs). There are
> > >>> soft memory errors like SLB Multihit, which can be a result of a bad
> > >>> hardware OR software BUG. Kernel can easily recover from these soft errors
> > >>> by flushing SLB contents. After the recovery kernel can still continue to
> > >>> function without any issue. But in some scenario's we may keep getting
> > >>> these soft errors until the root cause is fixed. To be able to analyze and
> > >>> find the root cause, best way is to gather enough data and system state at
> > >>> the time of MCE. Hence this patch introduces a sysctl knob where user can
> > >>> decide either to continue after recovery or panic the kernel to capture the
> > >>> dump.  
> > >> 
> > >> I'm not convinced we want this.
> > >> 
> > >> As we've discovered it's often not possible to reconstruct what happened
> > >> based on a dump anyway.
> > >> 
> > >> The key thing you need is the content of the SLB and that's not included
> > >> in a dump.
> > >> 
> > >> So I think we should dump the SLB content when we get the MCE (which
> > >> this series does) and any other useful info, and then if we can recover
> > >> we should.  
> > >
> > > The reasoning there is what if we got multi-hit due to some corruption 
> > > in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> > > data structure due to wrong pointer. Now that is far fetched, but then 
> > > possible right?. Hence the idea that, if we don't have much insight into 
> > > why a slb multi-hit occur from the dmesg which include slb content, 
> > > slb_cache contents etc, there should be an easy way to force a dump that 
> > > might assist in further debug.  
> > 
> > If you're debugging something complex that you can't determine from the
> > SLB dump then you should be running a debug kernel anyway. And if
> > anything you want to drop into xmon and sit there, preserving the most
> > state, rather than taking a dump.
> 
> I'm not saying for a dump specifically, just some form of crash. And we
> really should have an option to xmon on panic, but that's another story.

That's fine during development or in a lab, not something we could
enforce in a customer environment, could we?

> I think HA/failover kind of environments use options like this too. If
> anything starts going bad they don't want to try limping along but stop
> ASAP.

Right. And in this particular case, can we guarantee no corruption
(leading to or post the multihit recovery) when running a customer workload,
is the question...

Ananth
Nicholas Piggin Aug. 9, 2018, 8:33 a.m. | #8
On Thu, 9 Aug 2018 13:39:45 +0530
Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> wrote:

> On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > On Thu, 09 Aug 2018 16:34:07 +1000
> > Michael Ellerman <mpe@ellerman.id.au> wrote:
> >   
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:  
> > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:    
> > > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:    
> > > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > >>>
> > > >>> Introduce recovery action for recovered memory errors (MCEs). There are
> > > >>> soft memory errors like SLB Multihit, which can be a result of a bad
> > > >>> hardware OR software BUG. Kernel can easily recover from these soft errors
> > > >>> by flushing SLB contents. After the recovery kernel can still continue to
> > > >>> function without any issue. But in some scenario's we may keep getting
> > > >>> these soft errors until the root cause is fixed. To be able to analyze and
> > > >>> find the root cause, best way is to gather enough data and system state at
> > > >>> the time of MCE. Hence this patch introduces a sysctl knob where user can
> > > >>> decide either to continue after recovery or panic the kernel to capture the
> > > >>> dump.    
> > > >> 
> > > >> I'm not convinced we want this.
> > > >> 
> > > >> As we've discovered it's often not possible to reconstruct what happened
> > > >> based on a dump anyway.
> > > >> 
> > > >> The key thing you need is the content of the SLB and that's not included
> > > >> in a dump.
> > > >> 
> > > >> So I think we should dump the SLB content when we get the MCE (which
> > > >> this series does) and any other useful info, and then if we can recover
> > > >> we should.    
> > > >
> > > > The reasoning there is what if we got multi-hit due to some corruption 
> > > > in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> > > > data structure due to wrong pointer. Now that is far fetched, but then 
> > > > possible right?. Hence the idea that, if we don't have much insight into 
> > > > why a slb multi-hit occur from the dmesg which include slb content, 
> > > > slb_cache contents etc, there should be an easy way to force a dump that 
> > > > might assist in further debug.    
> > > 
> > > If you're debugging something complex that you can't determine from the
> > > SLB dump then you should be running a debug kernel anyway. And if
> > > anything you want to drop into xmon and sit there, preserving the most
> > > state, rather than taking a dump.  
> > 
> > I'm not saying for a dump specifically, just some form of crash. And we
> > really should have an option to xmon on panic, but that's another story.  
> 
> That's fine during development or in a lab, not something we could
> enforce in a customer environment, could we?

xmon on panic? Not something to enforce but IMO (without thinking about
it too much but having encountered it several times) it should probably
be tied xmon on BUG option.

> 
> > I think HA/failover kind of environments use options like this too. If
> > anything starts going bad they don't want to try limping along but stop
> > ASAP.  
> 
> Right. And in this particular case, can we guarantee no corruption
> (leading to or post the multihit recovery) when running a customer workload,
> is the question...

I think that's an element of it. If SLB corruption is caused by
software then we could already have memory corruption. If it's hardware
then presumably we're supposed to have some guarantee of error rates.
But still you would say a machine that has taken no MCEs is less likely
to have a problem than one that has taken some MCEs!

It's not just corruption either, I've run into bugs where we get huge
streams of HMIs for example which all get recovered properly but
performance would have been in the toilet.

Anyway, being policy maybe we could drop this patch out of the SLB MCE
series and introduce it afterwards if we think it's necessary. For
SLB multi hit caused by software bug in slb handling, I'd say Michael's
pretty right about just needing the MCE output with SLB contents.

Thanks,
Nick
Michal Suchánek Aug. 9, 2018, 10:26 a.m. | #9
On Thu, 9 Aug 2018 18:33:33 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Thu, 9 Aug 2018 13:39:45 +0530
> Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:  
> > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > >     
> > > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:    
> > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:      
> > > > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:      
> > > > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > > >>>
> > > > >>> Introduce recovery action for recovered memory errors
> > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > >>> which can be a result of a bad hardware OR software BUG.
> > > > >>> Kernel can easily recover from these soft errors by
> > > > >>> flushing SLB contents. After the recovery kernel can still
> > > > >>> continue to function without any issue. But in some
> > > > >>> scenario's we may keep getting these soft errors until the
> > > > >>> root cause is fixed. To be able to analyze and find the
> > > > >>> root cause, best way is to gather enough data and system
> > > > >>> state at the time of MCE. Hence this patch introduces a
> > > > >>> sysctl knob where user can decide either to continue after
> > > > >>> recovery or panic the kernel to capture the dump.      
> > > > >> 
> > > > >> I'm not convinced we want this.
> > > > >> 
> > > > >> As we've discovered it's often not possible to reconstruct
> > > > >> what happened based on a dump anyway.
> > > > >> 
> > > > >> The key thing you need is the content of the SLB and that's
> > > > >> not included in a dump.
> > > > >> 
> > > > >> So I think we should dump the SLB content when we get the
> > > > >> MCE (which this series does) and any other useful info, and
> > > > >> then if we can recover we should.      
> > > > >
> > > > > The reasoning there is what if we got multi-hit due to some
> > > > > corruption in slb_cache_ptr. ie. some part of kernel is
> > > > > wrongly updating the paca data structure due to wrong
> > > > > pointer. Now that is far fetched, but then possible right?.
> > > > > Hence the idea that, if we don't have much insight into why a
> > > > > slb multi-hit occur from the dmesg which include slb content,
> > > > > slb_cache contents etc, there should be an easy way to force
> > > > > a dump that might assist in further debug.      
> > > > 
> > > > If you're debugging something complex that you can't determine
> > > > from the SLB dump then you should be running a debug kernel
> > > > anyway. And if anything you want to drop into xmon and sit
> > > > there, preserving the most state, rather than taking a dump.    
> > > 
> > > I'm not saying for a dump specifically, just some form of crash.
> > > And we really should have an option to xmon on panic, but that's
> > > another story.    
> > 
> > That's fine during development or in a lab, not something we could
> > enforce in a customer environment, could we?  
> 
> xmon on panic? Not something to enforce but IMO (without thinking
> about it too much but having encountered it several times) it should
> probably be tied xmon on BUG option.

You should get that with this patch and xmon=on or am I missing
something?

Thanks

Michal
Nicholas Piggin Aug. 10, 2018, 7:31 a.m. | #10
On Thu, 9 Aug 2018 12:26:46 +0200
Michal Suchánek <msuchanek@suse.de> wrote:

> On Thu, 9 Aug 2018 18:33:33 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Thu, 9 Aug 2018 13:39:45 +0530
> > Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> wrote:
> >   
> > > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:    
> > > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > > >       
> > > > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:      
> > > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:        
> > > > > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:        
> > > > > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > > > >>>
> > > > > >>> Introduce recovery action for recovered memory errors
> > > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > > >>> which can be a result of a bad hardware OR software BUG.
> > > > > >>> Kernel can easily recover from these soft errors by
> > > > > >>> flushing SLB contents. After the recovery kernel can still
> > > > > >>> continue to function without any issue. But in some
> > > > > >>> scenario's we may keep getting these soft errors until the
> > > > > >>> root cause is fixed. To be able to analyze and find the
> > > > > >>> root cause, best way is to gather enough data and system
> > > > > >>> state at the time of MCE. Hence this patch introduces a
> > > > > >>> sysctl knob where user can decide either to continue after
> > > > > >>> recovery or panic the kernel to capture the dump.        
> > > > > >> 
> > > > > >> I'm not convinced we want this.
> > > > > >> 
> > > > > >> As we've discovered it's often not possible to reconstruct
> > > > > >> what happened based on a dump anyway.
> > > > > >> 
> > > > > >> The key thing you need is the content of the SLB and that's
> > > > > >> not included in a dump.
> > > > > >> 
> > > > > >> So I think we should dump the SLB content when we get the
> > > > > >> MCE (which this series does) and any other useful info, and
> > > > > >> then if we can recover we should.        
> > > > > >
> > > > > > The reasoning there is what if we got multi-hit due to some
> > > > > > corruption in slb_cache_ptr. ie. some part of kernel is
> > > > > > wrongly updating the paca data structure due to wrong
> > > > > > pointer. Now that is far fetched, but then possible right?.
> > > > > > Hence the idea that, if we don't have much insight into why a
> > > > > > slb multi-hit occur from the dmesg which include slb content,
> > > > > > slb_cache contents etc, there should be an easy way to force
> > > > > > a dump that might assist in further debug.        
> > > > > 
> > > > > If you're debugging something complex that you can't determine
> > > > > from the SLB dump then you should be running a debug kernel
> > > > > anyway. And if anything you want to drop into xmon and sit
> > > > > there, preserving the most state, rather than taking a dump.      
> > > > 
> > > > I'm not saying for a dump specifically, just some form of crash.
> > > > And we really should have an option to xmon on panic, but that's
> > > > another story.      
> > > 
> > > That's fine during development or in a lab, not something we could
> > > enforce in a customer environment, could we?    
> > 
> > xmon on panic? Not something to enforce but IMO (without thinking
> > about it too much but having encountered it several times) it should
> > probably be tied xmon on BUG option.  
> 
> You should get that with this patch and xmon=on or am I missing
> something?

Oh yeah, I just got a bit side tracked and added something not very
relevant -- a panic() call should drop to xmon if we have xmon=on. It
doesn't today (or last I looked), but that's nothing to do with this
patch.

Thanks,
Nick
Michael Ellerman Aug. 10, 2018, 11:04 a.m. | #11
Michal Suchánek <msuchanek@suse.de> writes:
> On Wed, 8 Aug 2018 21:07:11 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> > Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:  
>> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> >>
>> >> Introduce recovery action for recovered memory errors (MCEs).
>> >> There are soft memory errors like SLB Multihit, which can be a
>> >> result of a bad hardware OR software BUG. Kernel can easily
>> >> recover from these soft errors by flushing SLB contents. After the
>> >> recovery kernel can still continue to function without any issue.
>> >> But in some scenario's we may keep getting these soft errors until
>> >> the root cause is fixed. To be able to analyze and find the root
>> >> cause, best way is to gather enough data and system state at the
>> >> time of MCE. Hence this patch introduces a sysctl knob where user
>> >> can decide either to continue after recovery or panic the kernel
>> >> to capture the dump.  
>> > 
>> > I'm not convinced we want this.
>> > 
>> > As we've discovered it's often not possible to reconstruct what
>> > happened based on a dump anyway.
>> > 
>> > The key thing you need is the content of the SLB and that's not
>> > included in a dump.
>> > 
>> > So I think we should dump the SLB content when we get the MCE (which
>> > this series does) and any other useful info, and then if we can
>> > recover we should.
>> 
>> The reasoning there is what if we got multi-hit due to some
>> corruption in slb_cache_ptr. ie. some part of kernel is wrongly
>> updating the paca data structure due to wrong pointer. Now that is
>> far fetched, but then possible right?. Hence the idea that, if we
>> don't have much insight into why a slb multi-hit occur from the dmesg
>> which include slb content, slb_cache contents etc, there should be an
>> easy way to force a dump that might assist in further debug.
>
> Nonetheless this turns all MCEs into crashes. Are there any MCEs that
> could happen during normal operation and should be handled by default?

An MCE should always be an indication of an abnormal condition, but
the exact set of things that are reported as MCEs is CPU specific, and
potentially even configurable at the hardware level.

However we only "handle" certain types of MCEs, so if we get an MCE for
something we don't understand then we'll panic already.

SLB multi-hit / parity error is one that we do handle (on bare metal),
because there is a well defined recovery action.

cheers

Patch

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 3a1226e9b465..d46e1903878d 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -202,6 +202,8 @@  struct mce_error_info {
 #define MCE_EVENT_RELEASE	true
 #define MCE_EVENT_DONTRELEASE	false
 
+extern int recover_on_mce;
+
 extern void save_mce_event(struct pt_regs *regs, long handled,
 			   struct mce_error_info *mce_err, uint64_t nip,
 			   uint64_t addr, uint64_t phys_addr);
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ae17d8aa60c4..5e2ab5cade81 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,6 +28,7 @@ 
 #include <linux/percpu.h>
 #include <linux/export.h>
 #include <linux/irq_work.h>
+#include <linux/moduleparam.h>
 
 #include <asm/machdep.h>
 #include <asm/mce.h>
@@ -631,3 +632,60 @@  long hmi_exception_realmode(struct pt_regs *regs)
 
 	return 1;
 }
+
+/*
+ * Recovery action for recovered memory errors.
+ *
+ * There are soft memory errors like SLB Multihit, which can be a result of
+ * a bad hardware OR software BUG. Kernel can easily recover from these
+ * soft errors by flushing SLB contents. After the recovery kernel can
+ * still continue to function without any issue. But in some scenario's we
+ * may keep getting these soft errors until the root cause is fixed. To be
+ * able to analyze and find the root cause, best way is to gather enough
+ * data and system state at the time of MCE. Introduce a sysctl knob where
+ * user can decide either to continue after recovery or panic the kernel
+ * to capture the dump. This will allow one to configure a kernel to capture
+ * dump on MCE and then toggle back to recovery while dump is being analyzed.
+ *
+ * recover_on_mce == 0
+ *	panic/crash the kernel to trigger dump capture.
+ *
+ * recover_on_mce == 1
+ *	continue after MCE recovery. (no panic)
+ */
+int recover_on_mce;
+
+#ifdef CONFIG_SYSCTL
+/*
+ * Register the sysctl to define memory error recovery action.
+ */
+static struct ctl_table machine_check_ctl_table[] = {
+	{
+		.procname	= "recover_on_mce",
+		.data		= &recover_on_mce,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table machine_check_sysctl_root[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= machine_check_ctl_table,
+	},
+	{}
+};
+
+static int __init register_machine_check_sysctl(void)
+{
+	register_sysctl_table(machine_check_sysctl_root);
+
+	return 0;
+}
+__initcall(register_machine_check_sysctl);
+#endif /* CONFIG_SYSCTL */
+
+core_param(recover_on_mce, recover_on_mce, int, 0644);
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..246477c790e8 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -70,6 +70,7 @@ 
 #include <asm/hmi.h>
 #include <sysdev/fsl_pci.h>
 #include <asm/kprobes.h>
+#include <asm/mce.h>
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -727,7 +728,7 @@  void machine_check_exception(struct pt_regs *regs)
 	else if (cur_cpu_spec->machine_check)
 		recover = cur_cpu_spec->machine_check(regs);
 
-	if (recover > 0)
+	if ((recover > 0) && recover_on_mce)
 		goto bail;
 
 	if (debugger_fault_handler(regs))
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index b74c93bc2e55..d13278029a94 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -39,6 +39,7 @@ 
 #include <asm/tm.h>
 #include <asm/setup.h>
 #include <asm/security_features.h>
+#include <asm/mce.h>
 
 #include "powernv.h"
 
@@ -147,6 +148,9 @@  static void __init pnv_setup_arch(void)
 	/* Enable NAP mode */
 	powersave_nap = 1;
 
+	/* Recovery action on recovered MCE. By default enable it on PowerNV */
+	recover_on_mce = 1;
+
 	/* XXX PMCS */
 }