diff mbox

[RFC] hmi: clear xscom and unknown bits from HMER

Message ID 20170623121101.30781-1-npiggin@gmail.com
State Rejected
Headers show

Commit Message

Nicholas Piggin June 23, 2017, 12:11 p.m. UTC
It has been observed the xscom bit in HMER gets stuck (as-yet
unkonwn root cause -- HMEER should disable those exceptions).
This causes HMIs to be continually taken.

HMI: Received HMI interrupt: HMER = 0x0040000000000000

Add some attempt to handle this by clearing the HMER and HMEER.

Try to clear HMER for other unknown HMIs (alternative is to not
recover).

There seems to be no point in continually taking an HMI that will
never be handled. By not handling it we already implicitly are
trying to "continue" without solving anything aren't we?

---
 core/hmi.c          | 26 ++++++++++++++++++++++++++
 hw/xscom.c          |  5 +----
 include/processor.h |  7 +++++++
 3 files changed, 34 insertions(+), 4 deletions(-)

Comments

Mahesh J Salgaonkar June 27, 2017, 5:16 a.m. UTC | #1
On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
> It has been observed the xscom bit in HMER gets stuck (as-yet

We see that stuck because opal never clears it after scom read/write.
The bit is cleared just before the next scom read/write. I am not sure
what it was left uncleared until next scom read/write kicks in.

> unkonwn root cause -- HMEER should disable those exceptions).
> This causes HMIs to be continually taken.
> 
> HMI: Received HMI interrupt: HMER = 0x0040000000000000
> 
> Add some attempt to handle this by clearing the HMER and HMEER.
> 
> Try to clear HMER for other unknown HMIs (alternative is to not
> recover).

I think we should be just ok with clearing out and masking them again.

> 
> There seems to be no point in continually taking an HMI that will
> never be handled. By not handling it we already implicitly are
> trying to "continue" without solving anything aren't we?

We do handle the ones that could cause harm to system functioning. Rest
we mask it. Other than xscom related bits we also mask bit 6, 16 and 17
which does not look harmful. I think we should just mask them again in
HMEER if we get HMIs for the bits that we already masked.

> 
> ---
>  core/hmi.c          | 26 ++++++++++++++++++++++++++
>  hw/xscom.c          |  5 +----
>  include/processor.h |  7 +++++++
>  3 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/core/hmi.c b/core/hmi.c
> index 84f2c2d6..7ab5810d 100644
> --- a/core/hmi.c
> +++ b/core/hmi.c
> @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>  		}
>  	}
> 
> +	if (hmer & SPR_HMER_XSCOM_MASK) {
> +		hmer &= ~SPR_HMER_XSCOM_MASK;
> +		if (hmi_evt) {
> +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
> +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
> +			queue_hmi_event(hmi_evt, recover);
> +		}
> +		sync();
> +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
> +							SPR_HMER_XSCOM_DONE))
> +		isync();
> +
> +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
> +	}
> +
> +	if (hmer) {
> +		hmer = 0;
> +		if (hmi_evt) {
> +			hmi_evt->severity = OpalHMI_SEV_WARNING;
> +			hmi_evt->type = 0; /* Anything sane we can put here? */
> +			queue_hmi_event(hmi_evt, recover);
> +		}

This one is also unexpected, should we clear and mask this as well ?
Otherwise we would keep getting this HMI and warnings would flood host
kernel.

Thanks,
-Mahesh.
Nicholas Piggin June 27, 2017, 5:39 a.m. UTC | #2
On Tue, 27 Jun 2017 10:46:34 +0530
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
> > It has been observed the xscom bit in HMER gets stuck (as-yet  
> 
> We see that stuck because opal never clears it after scom read/write.
> The bit is cleared just before the next scom read/write. I am not sure
> what it was left uncleared until next scom read/write kicks in.

Right, but did we work out why it's taking a HMI or getting enabled
in the HMEER?

> > unkonwn root cause -- HMEER should disable those exceptions).
> > This causes HMIs to be continually taken.
> > 
> > HMI: Received HMI interrupt: HMER = 0x0040000000000000
> > 
> > Add some attempt to handle this by clearing the HMER and HMEER.
> > 
> > Try to clear HMER for other unknown HMIs (alternative is to not
> > recover).  
> 
> I think we should be just ok with clearing out and masking them again.
> 
> > 
> > There seems to be no point in continually taking an HMI that will
> > never be handled. By not handling it we already implicitly are
> > trying to "continue" without solving anything aren't we?  
> 
> We do handle the ones that could cause harm to system functioning. Rest
> we mask it. Other than xscom related bits we also mask bit 6, 16 and 17
> which does not look harmful. I think we should just mask them again in
> HMEER if we get HMIs for the bits that we already masked.

Okay. My thinking is that if there was a hw or sw error that causes
some unknown HME, would it be better to stop and crash ASAP?

Either way seems better than our current approach of trying to
continue without masking. So it's your call.

> 
> > 
> > ---
> >  core/hmi.c          | 26 ++++++++++++++++++++++++++
> >  hw/xscom.c          |  5 +----
> >  include/processor.h |  7 +++++++
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> > 
> > diff --git a/core/hmi.c b/core/hmi.c
> > index 84f2c2d6..7ab5810d 100644
> > --- a/core/hmi.c
> > +++ b/core/hmi.c
> > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
> >  		}
> >  	}
> > 
> > +	if (hmer & SPR_HMER_XSCOM_MASK) {
> > +		hmer &= ~SPR_HMER_XSCOM_MASK;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
> > +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> > +		sync();
> > +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
> > +							SPR_HMER_XSCOM_DONE))
> > +		isync();
> > +
> > +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
> > +	}
> > +
> > +	if (hmer) {
> > +		hmer = 0;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_WARNING;
> > +			hmi_evt->type = 0; /* Anything sane we can put here? */
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}  
> 
> This one is also unexpected, should we clear and mask this as well ?

Probably yes. Either that or set recover = 0. What do you think?

Thanks,
Nick
Benjamin Herrenschmidt June 27, 2017, 12:32 p.m. UTC | #3
On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:
> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
> > It has been observed the xscom bit in HMER gets stuck (as-yet
> 
> We see that stuck because opal never clears it after scom read/write.
> The bit is cleared just before the next scom read/write. I am not sure
> what it was left uncleared until next scom read/write kicks in.

Because we don't care ? It should not be enabled in HMEER...
> 
> > unkonwn root cause -- HMEER should disable those exceptions).
> > This causes HMIs to be continually taken.
> > 
> > HMI: Received HMI interrupt: HMER = 0x0040000000000000
> > 
> > Add some attempt to handle this by clearing the HMER and HMEER.
> > 
> > Try to clear HMER for other unknown HMIs (alternative is to not
> > recover).
> 
> I think we should be just ok with clearing out and masking them again.

Right but we need to understand why we are taking the HMI in the first
place since it's not enabled in HMEER unless something's wrong there.
Is that reproduceable ?

> > 
> > There seems to be no point in continually taking an HMI that will
> > never be handled. By not handling it we already implicitly are
> > trying to "continue" without solving anything aren't we?
> 
> We do handle the ones that could cause harm to system functioning. Rest
> we mask it. Other than xscom related bits we also mask bit 6, 16 and 17
> which does not look harmful. I think we should just mask them again in
> HMEER if we get HMIs for the bits that we already masked.

Ben.

> > 
> > ---
> >  core/hmi.c          | 26 ++++++++++++++++++++++++++
> >  hw/xscom.c          |  5 +----
> >  include/processor.h |  7 +++++++
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> > 
> > diff --git a/core/hmi.c b/core/hmi.c
> > index 84f2c2d6..7ab5810d 100644
> > --- a/core/hmi.c
> > +++ b/core/hmi.c
> > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
> >  		}
> >  	}
> > 
> > +	if (hmer & SPR_HMER_XSCOM_MASK) {
> > +		hmer &= ~SPR_HMER_XSCOM_MASK;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
> > +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> > +		sync();
> > +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
> > +							SPR_HMER_XSCOM_DONE))
> > +		isync();
> > +
> > +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
> > +	}
> > +
> > +	if (hmer) {
> > +		hmer = 0;
> > +		if (hmi_evt) {
> > +			hmi_evt->severity = OpalHMI_SEV_WARNING;
> > +			hmi_evt->type = 0; /* Anything sane we can put here? */
> > +			queue_hmi_event(hmi_evt, recover);
> > +		}
> 
> This one is also unexpected, should we clear and mask this as well ?
> Otherwise we would keep getting this HMI and warnings would flood host
> kernel.
> 
> Thanks,
> -Mahesh.
Mahesh J Salgaonkar June 28, 2017, 3:30 a.m. UTC | #4
On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:
>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:
>>> It has been observed the xscom bit in HMER gets stuck (as-yet
>>
>> We see that stuck because opal never clears it after scom read/write.
>> The bit is cleared just before the next scom read/write. I am not sure
>> what it was left uncleared until next scom read/write kicks in.
> 
> Because we don't care ? 

looking at the code it looks like we didn't care. I sent out a patch
that clears them once scom operation is complete.

> It should not be enabled in HMEER...

Yes, we don't enable them in HMEER.

>>
>>> unkonwn root cause -- HMEER should disable those exceptions).
>>> This causes HMIs to be continually taken.
>>>
>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>
>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>
>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>> recover).
>>
>> I think we should be just ok with clearing out and masking them again.
> 
> Right but we need to understand why we are taking the HMI in the first
> place since it's not enabled in HMEER unless something's wrong there.
> Is that reproduceable ?

We did debug it yesterday and found the reason. Akshay sent out a patch
that fixes the issue. http://patchwork.ozlabs.org/patch/781434/

Thanks,
-Mahesh.

> 
>>>
>>> There seems to be no point in continually taking an HMI that will
>>> never be handled. By not handling it we already implicitly are
>>> trying to "continue" without solving anything aren't we?
>>
>> We do handle the ones that could cause harm to system functioning. Rest
>> we mask it. Other than xscom related bits we also mask bit 6, 16 and 17
>> which does not look harmful. I think we should just mask them again in
>> HMEER if we get HMIs for the bits that we already masked.
> 
> Ben.
> 
>>>
>>> ---
>>>  core/hmi.c          | 26 ++++++++++++++++++++++++++
>>>  hw/xscom.c          |  5 +----
>>>  include/processor.h |  7 +++++++
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/core/hmi.c b/core/hmi.c
>>> index 84f2c2d6..7ab5810d 100644
>>> --- a/core/hmi.c
>>> +++ b/core/hmi.c
>>> @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
>>>  		}
>>>  	}
>>>
>>> +	if (hmer & SPR_HMER_XSCOM_MASK) {
>>> +		hmer &= ~SPR_HMER_XSCOM_MASK;
>>> +		if (hmi_evt) {
>>> +			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
>>> +			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
>>> +			queue_hmi_event(hmi_evt, recover);
>>> +		}
>>> +		sync();
>>> +		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
>>> +							SPR_HMER_XSCOM_DONE))
>>> +		isync();
>>> +
>>> +		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
>>> +	}
>>> +
>>> +	if (hmer) {
>>> +		hmer = 0;
>>> +		if (hmi_evt) {
>>> +			hmi_evt->severity = OpalHMI_SEV_WARNING;
>>> +			hmi_evt->type = 0; /* Anything sane we can put here? */
>>> +			queue_hmi_event(hmi_evt, recover);
>>> +		}
>>
>> This one is also unexpected, should we clear and mask this as well ?
>> Otherwise we would keep getting this HMI and warnings would flood host
>> kernel.
>>
>> Thanks,
>> -Mahesh.
>
Nicholas Piggin June 28, 2017, 4:41 a.m. UTC | #5
On Wed, 28 Jun 2017 09:00:05 +0530
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:  
> >> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:  
> >>> It has been observed the xscom bit in HMER gets stuck (as-yet  
> >>
> >> We see that stuck because opal never clears it after scom read/write.
> >> The bit is cleared just before the next scom read/write. I am not sure
> >> what it was left uncleared until next scom read/write kicks in.  
> > 
> > Because we don't care ?   
> 
> looking at the code it looks like we didn't care. I sent out a patch
> that clears them once scom operation is complete.
> 
> > It should not be enabled in HMEER...  
> 
> Yes, we don't enable them in HMEER.
> 
> >>  
> >>> unkonwn root cause -- HMEER should disable those exceptions).
> >>> This causes HMIs to be continually taken.
> >>>
> >>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
> >>>
> >>> Add some attempt to handle this by clearing the HMER and HMEER.
> >>>
> >>> Try to clear HMER for other unknown HMIs (alternative is to not
> >>> recover).  
> >>
> >> I think we should be just ok with clearing out and masking them again.  
> > 
> > Right but we need to understand why we are taking the HMI in the first
> > place since it's not enabled in HMEER unless something's wrong there.
> > Is that reproduceable ?  
> 
> We did debug it yesterday and found the reason. Akshay sent out a patch
> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/

Given that this bug was caused by Linux, and not due to an actual
HMI (and therefore would not be fixed by clearing the HMER/HMEER
bits), I wonder if this patch is still warranted. HMEER could be
messed up somehow, so maybe a simplified version that just notes
the unexpected HMI and masks out HMEER.

Any opinions?

Thanks,
Nick
Mahesh J Salgaonkar June 28, 2017, 6:32 a.m. UTC | #6
On 06/28/2017 10:11 AM, Nicholas Piggin wrote:
> On Wed, 28 Jun 2017 09:00:05 +0530
> Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> 
>> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote:  
>>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote:  
>>>>> It has been observed the xscom bit in HMER gets stuck (as-yet  
>>>>
>>>> We see that stuck because opal never clears it after scom read/write.
>>>> The bit is cleared just before the next scom read/write. I am not sure
>>>> what it was left uncleared until next scom read/write kicks in.  
>>>
>>> Because we don't care ?   
>>
>> looking at the code it looks like we didn't care. I sent out a patch
>> that clears them once scom operation is complete.
>>
>>> It should not be enabled in HMEER...  
>>
>> Yes, we don't enable them in HMEER.
>>
>>>>  
>>>>> unkonwn root cause -- HMEER should disable those exceptions).
>>>>> This causes HMIs to be continually taken.
>>>>>
>>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000
>>>>>
>>>>> Add some attempt to handle this by clearing the HMER and HMEER.
>>>>>
>>>>> Try to clear HMER for other unknown HMIs (alternative is to not
>>>>> recover).  
>>>>
>>>> I think we should be just ok with clearing out and masking them again.  
>>>
>>> Right but we need to understand why we are taking the HMI in the first
>>> place since it's not enabled in HMEER unless something's wrong there.
>>> Is that reproduceable ?  
>>
>> We did debug it yesterday and found the reason. Akshay sent out a patch
>> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/
> 
> Given that this bug was caused by Linux, and not due to an actual
> HMI (and therefore would not be fixed by clearing the HMER/HMEER
> bits), I wonder if this patch is still warranted. HMEER could be
> messed up somehow, so maybe a simplified version that just notes
> the unexpected HMI and masks out HMEER.
> 
> Any opinions?

Yeah I agree with having simplified version so that it will help us to
detect if we at all mess up with HMEER in future.

Thanks,
-Mahesh.
diff mbox

Patch

diff --git a/core/hmi.c b/core/hmi.c
index 84f2c2d6..7ab5810d 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -823,6 +823,32 @@  int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		}
 	}
 
+	if (hmer & SPR_HMER_XSCOM_MASK) {
+		hmer &= ~SPR_HMER_XSCOM_MASK;
+		if (hmi_evt) {
+			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
+			hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE;
+			queue_hmi_event(hmi_evt, recover);
+		}
+		sync();
+		mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL |
+							SPR_HMER_XSCOM_DONE))
+		isync();
+
+		prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n");
+	}
+
+	if (hmer) {
+		hmer = 0;
+		if (hmi_evt) {
+			hmi_evt->severity = OpalHMI_SEV_WARNING;
+			hmi_evt->type = 0; /* Anything sane we can put here? */
+			queue_hmi_event(hmi_evt, recover);
+		}
+		prlog(PR_DEBUG, "HMI: Unhandled (attempting to continue).\n");
+	}
+
+
 	if (recover == 0)
 		disable_fast_reboot("Unrecoverable HMI");
 	/*
diff --git a/hw/xscom.c b/hw/xscom.c
index 63813f1e..47a78e87 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -25,10 +25,7 @@ 
 #include <opal-api.h>
 #include <timebase.h>
 
-/* Mask of bits to clear in HMER before an access */
-#define HMER_CLR_MASK	(~(SPR_HMER_XSCOM_FAIL | \
-			   SPR_HMER_XSCOM_DONE | \
-			   SPR_HMER_XSCOM_STATUS))
+#define HMER_CLR_MASK (~SPR_HMER_XSCOM_MASK)
 
 DEFINE_LOG_ENTRY(OPAL_RC_XSCOM_RW, OPAL_PLATFORM_ERR_EVT, OPAL_XSCOM,
 		OPAL_CEC_HARDWARE, OPAL_PREDICTIVE_ERR_GENERAL,
diff --git a/include/processor.h b/include/processor.h
index 5906b865..07111856 100644
--- a/include/processor.h
+++ b/include/processor.h
@@ -150,6 +150,13 @@ 
 #define SPR_HMER_XSCOM_STATUS		PPC_BITMASK(21,23)
 
 /*
+ * Mask of bits to clear in HMER before an xscom access
+ */
+#define SPR_HMER_XSCOM_MASK		(SPR_HMER_XSCOM_FAIL | \
+					 SPR_HMER_XSCOM_DONE | \
+					 SPR_HMER_XSCOM_STATUS)
+
+/*
  * HMEER: initial bits for HMI interrupt enable mask.
  * Per Dave Larson, never enable 8,9,21-23
  */