diff mbox

[RFC] hmi: clear xscom and unknown bits from HMER

Message ID 20170628170415.38d744e7@roar.ozlabs.ibm.com
State Rejected
Headers show

Commit Message

Nicholas Piggin June 28, 2017, 7:04 a.m. UTC
On Wed, 28 Jun 2017 12:02:55 +0530
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> On 06/28/2017 10:11 AM, Nicholas Piggin wrote:
> > On Wed, 28 Jun 2017 09:00:05 +0530
> > Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> >   
> >> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:  
> >>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:    
> >>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:    
> >>>>> It has been observed the xscom bit in HMER gets stuck (as-yet    
> >>>>
> >>>> We see that stuck because opal never clears it after scom read/write.
> >>>> The bit is cleared just before the next scom read/write. I am not sure
> >>>> what it was left uncleared until next scom read/write kicks in.    
> >>>
> >>> Because we don't care ?     
> >>
> >> looking at the code it looks like we didn't care. I sent out a patch
> >> that clears them once scom operation is complete.
> >>  
> >>> It should not be enabled in HMEER...    
> >>
> >> Yes, we don't enable them in HMEER.
> >>  
> >>>>    
> >>>>> unkonwn root cause -- HMEER should disable those exceptions).
> >>>>> This causes HMIs to be continually taken.
> >>>>>
> >>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
> >>>>>
> >>>>> Add some attempt to handle this by clearing the HMER and HMEER.
> >>>>>
> >>>>> Try to clear HMER for other unknown HMIs (alternative is to not
> >>>>> recover).    
> >>>>
> >>>> I think we should be just ok with clearing out and masking them again.    
> >>>
> >>> Right but we need to understand why we are taking the HMI in the first
> >>> place since it's not enabled in HMEER unless something's wrong there.
> >>> Is that reproduceable ?    
> >>
> >> We did debug it yesterday and found the reason. Akshay sent out a patch
> >> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/  
> > 
> > Given that this bug was caused by Linux, and not due to an actual
> > HMI (and therefore would not be fixed by clearing the HMER/HMEER
> > bits), I wonder if this patch is still warranted. HMEER could be
> > messed up somehow, so maybe a simplified version that just notes
> > the unexpected HMI and masks out HMEER.
> > 
> > Any opinions?  
> 
> Yeah I agree with having simplified version so that it will help us to
> detect if we at all mess up with HMEER in future.

How about something very simple?

[PATCH] hmi: report HMEER and fail when encountering unknown HMI

In the interest of minimising corruption and improving debugging, when
encountering an unknown HMI, report HMEER and HMER with PR_ERR, and do
not recover.

This is an improvement on existing behaviour, which in the best case
will just continue to take HMIs because the exception remains asserted
and this will cause the system to become unstable.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 core/hmi.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Mahesh J Salgaonkar June 28, 2017, 10:40 a.m. UTC | #1
On 06/28/2017 12:34 PM, Nicholas Piggin wrote:
> On Wed, 28 Jun 2017 12:02:55 +0530
> Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> 
>> On 06/28/2017 10:11 AM, Nicholas Piggin wrote:
>>> On Wed, 28 Jun 2017 09:00:05 +0530
>>> Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>>>   
>>>> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:  
>>>>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:    
>>>>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:    
>>>>>>> It has been observed the xscom bit in HMER gets stuck (as-yet    
>>>>>>
>>>>>> We see that stuck because opal never clears it after scom read/write.
>>>>>> The bit is cleared just before the next scom read/write. I am not sure
>>>>>> what it was left uncleared until next scom read/write kicks in.    
>>>>>
>>>>> Because we don't care ?     
>>>>
>>>> looking at the code it looks like we didn't care. I sent out a patch
>>>> that clears them once scom operation is complete.
>>>>  
>>>>> It should not be enabled in HMEER...    
>>>>
>>>> Yes, we don't enable them in HMEER.
>>>>  
>>>>>>    
>>>>>>> unkonwn root cause -- HMEER should disable those exceptions).
>>>>>>> This causes HMIs to be continually taken.
>>>>>>>
>>>>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>>>>>
>>>>>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>>>>>
>>>>>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>>>>>> recover).    
>>>>>>
>>>>>> I think we should be just ok with clearing out and masking them again.    
>>>>>
>>>>> Right but we need to understand why we are taking the HMI in the first
>>>>> place since it's not enabled in HMEER unless something's wrong there.
>>>>> Is that reproduceable ?    
>>>>
>>>> We did debug it yesterday and found the reason. Akshay sent out a patch
>>>> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/  
>>>
>>> Given that this bug was caused by Linux, and not due to an actual
>>> HMI (and therefore would not be fixed by clearing the HMER/HMEER
>>> bits), I wonder if this patch is still warranted. HMEER could be
>>> messed up somehow, so maybe a simplified version that just notes
>>> the unexpected HMI and masks out HMEER.
>>>
>>> Any opinions?  
>>
>> Yeah I agree with having simplified version so that it will help us to
>> detect if we at all mess up with HMEER in future.
> 
> How about something very simple?
> 
> [PATCH] hmi: report HMEER and fail when encountering unknown HMI
> 
> In the interest of minimising corruption and improving debugging, when
> encountering an unknown HMI, report HMEER and HMER with PR_ERR, and do
> not recover.
> 
> This is an improvement on existing behaviour, which in the best case
> will just continue to take HMIs because the exception remains asserted
> and this will cause the system to become unstable.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  core/hmi.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/core/hmi.c b/core/hmi.c
> index 84f2c2d6..3329d055 100644
> --- a/core/hmi.c
> +++ b/core/hmi.c
> @@ -823,6 +823,15 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>  		}
>  	}
> 
> +	if (hmer) {

Should we check for messed up HMEER before treating it as unexpected HMI
? By the time we reach here, we would have cleared all the bits in the
hmer variable that we care about. For remaining bits that are still on
in hmer, if corresponding bit in HMEER is also set then we for sure know
that this is an unexpected HMI and HMEER has been messed up.

How about this check ?
	
	if (hmer & mfspr(SPR_HMEER)) {

> +		uint64_t hmeer = mfspr(SPR_HMEER);
> +
> +		prlog(PR_ERR, "HMI: Unexpected exception HMEER = 0x%016llx "
> +			      "HMER = 0x%016llx (not recovering)\n",
> +			      hmeer, hmer);
> +		recover = 0;
> +	}
> +
>  	if (recover == 0)
>  		disable_fast_reboot("Unrecoverable HMI");
>  	/*
> 

Thanks,
-Mahesh.
diff mbox

Patch

diff --git a/core/hmi.c b/core/hmi.c
index 84f2c2d6..3329d055 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -823,6 +823,15 @@  int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		}
 	}
 
+	if (hmer) {
+		uint64_t hmeer = mfspr(SPR_HMEER);
+
+		prlog(PR_ERR, "HMI: Unexpected exception HMEER = 0x%016llx "
+			      "HMER = 0x%016llx (not recovering)\n",
+			      hmeer, hmer);
+		recover = 0;
+	}
+
 	if (recover == 0)
 		disable_fast_reboot("Unrecoverable HMI");
 	/*