Message ID | 20170623121101.30781-1-npiggin@gmail.com |
---|---|
State | Rejected |
Headers | show |
On 06/23/2017 05:41 PM, Nicholas Piggin wrote: > It has been observed the xscom bit in HMER gets stuck (as-yet We see that stuck because opal never clears it after scom read/write. The bit is cleared just before the next scom read/write. I am not sure what it was left uncleared until next scom read/write kicks in. > unkonwn root cause -- HMEER should disable those exceptions). > This causes HMIs to be continually taken. > > HMI: Received HMI interrupt: HMER = 0x0040000000000000 > > Add some attempt to handle this by clearing the HMER and HMEER. > > Try to clear HMER for other unknown HMIs (alternative is to not > recover). I think we should be just ok with clearing out and masking them again. > > There seems to be no point in continually taking an HMI that will > never be handled. By not handling it we already implicitly are > trying to "continue" without solving anything aren't we? We do handle the ones that could cause harm to system functioning. Rest we mask it. Other than xscom related bits we also mask bit 6, 16 and 17 which does not look harmful. I think we should just mask them again in HMEER if we get HMIs for the bits that we already masked. > > --- > core/hmi.c | 26 ++++++++++++++++++++++++++ > hw/xscom.c | 5 +---- > include/processor.h | 7 +++++++ > 3 files changed, 34 insertions(+), 4 deletions(-) > > diff --git a/core/hmi.c b/core/hmi.c > index 84f2c2d6..7ab5810d 100644 > --- a/core/hmi.c > +++ b/core/hmi.c > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt) > } > } > > + if (hmer & SPR_HMER_XSCOM_MASK) { > + hmer &= ~SPR_HMER_XSCOM_MASK; > + if (hmi_evt) { > + hmi_evt->severity = OpalHMI_SEV_NO_ERROR; > + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE; > + queue_hmi_event(hmi_evt, recover); > + } > + sync(); > + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL | > + SPR_HMER_XSCOM_DONE)) > + isync(); > + > + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n"); > + } > + > + if (hmer) { > + hmer = 0; > + if (hmi_evt) { > + hmi_evt->severity = OpalHMI_SEV_WARNING; > + hmi_evt->type = 0; /* Anything sane we can put here? */ > + queue_hmi_event(hmi_evt, recover); > + } This one is also unexpected, should we clear and mask this as well ? Otherwise we would keep getting this HMI and warnings would flood host kernel. Thanks, -Mahesh.
On Tue, 27 Jun 2017 10:46:34 +0530 Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > On 06/23/2017 05:41 PM, Nicholas Piggin wrote: > > It has been observed the xscom bit in HMER gets stuck (as-yet > > We see that stuck because opal never clears it after scom read/write. > The bit is cleared just before the next scom read/write. I am not sure > what it was left uncleared until next scom read/write kicks in. Right, but did we work out why it's taking a HMI or getting enabled in the HMEER? > > unkonwn root cause -- HMEER should disable those exceptions). > > This causes HMIs to be continually taken. > > > > HMI: Received HMI interrupt: HMER = 0x0040000000000000 > > > > Add some attempt to handle this by clearing the HMER and HMEER. > > > > Try to clear HMER for other unknown HMIs (alternative is to not > > recover). > > I think we should be just ok with clearing out and masking them again. > > > > > There seems to be no point in continually taking an HMI that will > > never be handled. By not handling it we already implicitly are > > trying to "continue" without solving anything aren't we? > > We do handle the ones that could cause harm to system functioning. Rest > we mask it. Other than xscom related bits we also mask bit 6, 16 and 17 > which does not look harmful. I think we should just mask them again in > HMEER if we get HMIs for the bits that we already masked. Okay. My thinking is that if there was a hw or sw error that causes some unknown HME, would it be better to stop and crash ASAP? Either way seems better than our current approach of trying to continue without masking. So it's your call. > > > > > --- > > core/hmi.c | 26 ++++++++++++++++++++++++++ > > hw/xscom.c | 5 +---- > > include/processor.h | 7 +++++++ > > 3 files changed, 34 insertions(+), 4 deletions(-) > > > > diff --git a/core/hmi.c b/core/hmi.c > > index 84f2c2d6..7ab5810d 100644 > > --- a/core/hmi.c > > +++ b/core/hmi.c > > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt) > > } > > } > > > > + if (hmer & SPR_HMER_XSCOM_MASK) { > > + hmer &= ~SPR_HMER_XSCOM_MASK; > > + if (hmi_evt) { > > + hmi_evt->severity = OpalHMI_SEV_NO_ERROR; > > + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE; > > + queue_hmi_event(hmi_evt, recover); > > + } > > + sync(); > > + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL | > > + SPR_HMER_XSCOM_DONE)) > > + isync(); > > + > > + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n"); > > + } > > + > > + if (hmer) { > > + hmer = 0; > > + if (hmi_evt) { > > + hmi_evt->severity = OpalHMI_SEV_WARNING; > > + hmi_evt->type = 0; /* Anything sane we can put here? */ > > + queue_hmi_event(hmi_evt, recover); > > + } > > This one is also unexpected, should we clear and mask this as well ? Probably yes. Either that or set recover = 0. What do you think? Thanks, Nick
On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote: > On 06/23/2017 05:41 PM, Nicholas Piggin wrote: > > It has been observed the xscom bit in HMER gets stuck (as-yet > > We see that stuck because opal never clears it after scom read/write. > The bit is cleared just before the next scom read/write. I am not sure > what it was left uncleared until next scom read/write kicks in. Because we don't care ? It should not be enabled in HMEER... > > > unkonwn root cause -- HMEER should disable those exceptions). > > This causes HMIs to be continually taken. > > > > HMI: Received HMI interrupt: HMER = 0x0040000000000000 > > > > Add some attempt to handle this by clearing the HMER and HMEER. > > > > Try to clear HMER for other unknown HMIs (alternative is to not > > recover). > > I think we should be just ok with clearing out and masking them again. Right but we need to understand why we are taking the HMI in the first place since it's not enabled in HMEER unless something's wrong there. Is that reproduceable ? > > > > There seems to be no point in continually taking an HMI that will > > never be handled. By not handling it we already implicitly are > > trying to "continue" without solving anything aren't we? > > We do handle the ones that could cause harm to system functioning. Rest > we mask it. Other than xscom related bits we also mask bit 6, 16 and 17 > which does not look harmful. I think we should just mask them again in > HMEER if we get HMIs for the bits that we already masked. Ben. > > > > --- > > core/hmi.c | 26 ++++++++++++++++++++++++++ > > hw/xscom.c | 5 +---- > > include/processor.h | 7 +++++++ > > 3 files changed, 34 insertions(+), 4 deletions(-) > > > > diff --git a/core/hmi.c b/core/hmi.c > > index 84f2c2d6..7ab5810d 100644 > > --- a/core/hmi.c > > +++ b/core/hmi.c > > @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt) > > } > > } > > > > + if (hmer & SPR_HMER_XSCOM_MASK) { > > + hmer &= ~SPR_HMER_XSCOM_MASK; > > + if (hmi_evt) { > > + hmi_evt->severity = OpalHMI_SEV_NO_ERROR; > > + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE; > > + queue_hmi_event(hmi_evt, recover); > > + } > > + sync(); > > + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL | > > + SPR_HMER_XSCOM_DONE)) > > + isync(); > > + > > + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n"); > > + } > > + > > + if (hmer) { > > + hmer = 0; > > + if (hmi_evt) { > > + hmi_evt->severity = OpalHMI_SEV_WARNING; > > + hmi_evt->type = 0; /* Anything sane we can put here? */ > > + queue_hmi_event(hmi_evt, recover); > > + } > > This one is also unexpected, should we clear and mask this as well ? > Otherwise we would keep getting this HMI and warnings would flood host > kernel. > > Thanks, > -Mahesh.
On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote: > On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote: >> On 06/23/2017 05:41 PM, Nicholas Piggin wrote: >>> It has been observed the xscom bit in HMER gets stuck (as-yet >> >> We see that stuck because opal never clears it after scom read/write. >> The bit is cleared just before the next scom read/write. I am not sure >> what it was left uncleared until next scom read/write kicks in. > > Because we don't care ? looking at the code it looks like we didn't care. I sent out a patch that clears them once scom operation is complete. > It should not be enabled in HMEER... Yes, we don't enable them in HMEER. >> >>> unkonwn root cause -- HMEER should disable those exceptions). >>> This causes HMIs to be continually taken. >>> >>> HMI: Received HMI interrupt: HMER = 0x0040000000000000 >>> >>> Add some attempt to handle this by clearing the HMER and HMEER. >>> >>> Try to clear HMER for other unknown HMIs (alternative is to not >>> recover). >> >> I think we should be just ok with clearing out and masking them again. > > Right but we need to understand why we are taking the HMI in the first > place since it's not enabled in HMEER unless something's wrong there. > Is that reproduceable ? We did debug it yesterday and found the reason. Akshay sent out a patch that fixes the issue. http://patchwork.ozlabs.org/patch/781434/ Thanks, -Mahesh. > >>> >>> There seems to be no point in continually taking an HMI that will >>> never be handled. By not handling it we already implicitly are >>> trying to "continue" without solving anything aren't we? >> >> We do handle the ones that could cause harm to system functioning. Rest >> we mask it. Other than xscom related bits we also mask bit 6, 16 and 17 >> which does not look harmful. I think we should just mask them again in >> HMEER if we get HMIs for the bits that we already masked. > > Ben. > >>> >>> --- >>> core/hmi.c | 26 ++++++++++++++++++++++++++ >>> hw/xscom.c | 5 +---- >>> include/processor.h | 7 +++++++ >>> 3 files changed, 34 insertions(+), 4 deletions(-) >>> >>> diff --git a/core/hmi.c b/core/hmi.c >>> index 84f2c2d6..7ab5810d 100644 >>> --- a/core/hmi.c >>> +++ b/core/hmi.c >>> @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt) >>> } >>> } >>> >>> + if (hmer & SPR_HMER_XSCOM_MASK) { >>> + hmer &= ~SPR_HMER_XSCOM_MASK; >>> + if (hmi_evt) { >>> + hmi_evt->severity = OpalHMI_SEV_NO_ERROR; >>> + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE; >>> + queue_hmi_event(hmi_evt, recover); >>> + } >>> + sync(); >>> + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL | >>> + SPR_HMER_XSCOM_DONE)) >>> + isync(); >>> + >>> + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n"); >>> + } >>> + >>> + if (hmer) { >>> + hmer = 0; >>> + if (hmi_evt) { >>> + hmi_evt->severity = OpalHMI_SEV_WARNING; >>> + hmi_evt->type = 0; /* Anything sane we can put here? */ >>> + queue_hmi_event(hmi_evt, recover); >>> + } >> >> This one is also unexpected, should we clear and mask this as well ? >> Otherwise we would keep getting this HMI and warnings would flood host >> kernel. >> >> Thanks, >> -Mahesh. >
On Wed, 28 Jun 2017 09:00:05 +0530 Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote: > > On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote: > >> On 06/23/2017 05:41 PM, Nicholas Piggin wrote: > >>> It has been observed the xscom bit in HMER gets stuck (as-yet > >> > >> We see that stuck because opal never clears it after scom read/write. > >> The bit is cleared just before the next scom read/write. I am not sure > >> what it was left uncleared until next scom read/write kicks in. > > > > Because we don't care ? > > looking at the code it looks like we didn't care. I sent out a patch > that clears them once scom operation is complete. > > > It should not be enabled in HMEER... > > Yes, we don't enable them in HMEER. > > >> > >>> unkonwn root cause -- HMEER should disable those exceptions). > >>> This causes HMIs to be continually taken. > >>> > >>> HMI: Received HMI interrupt: HMER = 0x0040000000000000 > >>> > >>> Add some attempt to handle this by clearing the HMER and HMEER. > >>> > >>> Try to clear HMER for other unknown HMIs (alternative is to not > >>> recover). > >> > >> I think we should be just ok with clearing out and masking them again. > > > > Right but we need to understand why we are taking the HMI in the first > > place since it's not enabled in HMEER unless something's wrong there. > > Is that reproduceable ? > > We did debug it yesterday and found the reason. Akshay sent out a patch > that fixes the issue. http://patchwork.ozlabs.org/patch/781434/ Given that this bug was caused by Linux, and not due to an actual HMI (and therefore would not be fixed by clearing the HMER/HMEER bits), I wonder if this patch is still warranted. HMEER could be messed up somehow, so maybe a simplified version that just notes the unexpected HMI and masks out HMEER. Any opinions? Thanks, Nick
On 06/28/2017 10:11 AM, Nicholas Piggin wrote: > On Wed, 28 Jun 2017 09:00:05 +0530 > Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > >> On 06/27/2017 06:02 PM, Benjamin Herrenschmidt wrote: >>> On Tue, 2017-06-27 at 10:46 +0530, Mahesh Jagannath Salgaonkar wrote: >>>> On 06/23/2017 05:41 PM, Nicholas Piggin wrote: >>>>> It has been observed the xscom bit in HMER gets stuck (as-yet >>>> >>>> We see that stuck because opal never clears it after scom read/write. >>>> The bit is cleared just before the next scom read/write. I am not sure >>>> what it was left uncleared until next scom read/write kicks in. >>> >>> Because we don't care ? >> >> looking at the code it looks like we didn't care. I sent out a patch >> that clears them once scom operation is complete. >> >>> It should not be enabled in HMEER... >> >> Yes, we don't enable them in HMEER. >> >>>> >>>>> unkonwn root cause -- HMEER should disable those exceptions). >>>>> This causes HMIs to be continually taken. >>>>> >>>>> HMI: Received HMI interrupt: HMER = 0x0040000000000000 >>>>> >>>>> Add some attempt to handle this by clearing the HMER and HMEER. >>>>> >>>>> Try to clear HMER for other unknown HMIs (alternative is to not >>>>> recover). >>>> >>>> I think we should be just ok with clearing out and masking them again. >>> >>> Right but we need to understand why we are taking the HMI in the first >>> place since it's not enabled in HMEER unless something's wrong there. >>> Is that reproduceable ? >> >> We did debug it yesterday and found the reason. Akshay sent out a patch >> that fixes the issue. http://patchwork.ozlabs.org/patch/781434/ > > Given that this bug was caused by Linux, and not due to an actual > HMI (and therefore would not be fixed by clearing the HMER/HMEER > bits), I wonder if this patch is still warranted. HMEER could be > messed up somehow, so maybe a simplified version that just notes > the unexpected HMI and masks out HMEER. > > Any opinions? Yeah I agree with having simplified version so that it will help us to detect if we at all mess up with HMEER in future. Thanks, -Mahesh.
diff --git a/core/hmi.c b/core/hmi.c index 84f2c2d6..7ab5810d 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -823,6 +823,32 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt) } } + if (hmer & SPR_HMER_XSCOM_MASK) { + hmer &= ~SPR_HMER_XSCOM_MASK; + if (hmi_evt) { + hmi_evt->severity = OpalHMI_SEV_NO_ERROR; + hmi_evt->type = OpalHMI_ERROR_XSCOM_DONE; + queue_hmi_event(hmi_evt, recover); + } + sync(); + mtspr(SPR_HMEER, mfspr(SPR_HMEER) & ~(SPR_HMER_XSCOM_FAIL | + SPR_HMER_XSCOM_DONE)) + isync(); + + prlog(PR_DEBUG, "HMI: Unexpected XSCOM (clearing).\n"); + } + + if (hmer) { + hmer = 0; + if (hmi_evt) { + hmi_evt->severity = OpalHMI_SEV_WARNING; + hmi_evt->type = 0; /* Anything sane we can put here? */ + queue_hmi_event(hmi_evt, recover); + } + prlog(PR_DEBUG, "HMI: Unhandled (attempting to continue).\n"); + } + + if (recover == 0) disable_fast_reboot("Unrecoverable HMI"); /* diff --git a/hw/xscom.c b/hw/xscom.c index 63813f1e..47a78e87 100644 --- a/hw/xscom.c +++ b/hw/xscom.c @@ -25,10 +25,7 @@ #include <opal-api.h> #include <timebase.h> -/* Mask of bits to clear in HMER before an access */ -#define HMER_CLR_MASK (~(SPR_HMER_XSCOM_FAIL | \ - SPR_HMER_XSCOM_DONE | \ - SPR_HMER_XSCOM_STATUS)) +#define HMER_CLR_MASK (~SPR_HMER_XSCOM_MASK) DEFINE_LOG_ENTRY(OPAL_RC_XSCOM_RW, OPAL_PLATFORM_ERR_EVT, OPAL_XSCOM, OPAL_CEC_HARDWARE, OPAL_PREDICTIVE_ERR_GENERAL, diff --git a/include/processor.h b/include/processor.h index 5906b865..07111856 100644 --- a/include/processor.h +++ b/include/processor.h @@ -150,6 +150,13 @@ #define SPR_HMER_XSCOM_STATUS PPC_BITMASK(21,23) /* + * Mask of bits to clear in HMER before an xscom access + */ +#define SPR_HMER_XSCOM_MASK (SPR_HMER_XSCOM_FAIL | \ + SPR_HMER_XSCOM_DONE | \ + SPR_HMER_XSCOM_STATUS) + +/* * HMEER: initial bits for HMI interrupt enable mask. * Per Dave Larson, never enable 8,9,21-23 */