diff mbox series

xscom: Don't log xscom errors caused bu OPAL calls

Message ID 20191205131434.25091-1-oohall@gmail.com
State Accepted
Headers show
Series xscom: Don't log xscom errors caused bu OPAL calls | expand

Checks

Context Check Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot fail Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)

Commit Message

Oliver O'Halloran Dec. 5, 2019, 1:14 p.m. UTC
The XSCOM read/write OPAL calls are largely there to support running
PRD in the OS. PRD itself handles submitting error logs (if needed)
when a XSCOM operation fails so there's no need to send an error log
from inside of OPAL.

Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 core/opal.c |  3 +++
 hw/xscom.c  | 15 +++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

Comments

Gautham R. Shenoy Dec. 5, 2019, 3:32 p.m. UTC | #1
Hello Oliver,


On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote:
> The XSCOM read/write OPAL calls are largely there to support running
> PRD in the OS. PRD itself handles submitting error logs (if needed)
> when a XSCOM operation fails so there's no need to send an error log
> from inside of OPAL.
> 
> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

This is much better, simpler version than what I posted.

Only one question.
> ---
>  core/opal.c |  3 +++
>  hw/xscom.c  | 15 +++++++++++----
>  2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/core/opal.c b/core/opal.c
> index da746e805a3f..922dc5abddca 100644
> --- a/core/opal.c
> +++ b/core/opal.c
> @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe)
>  		      "Spent %llu msecs in OPAL call %llu!\n",
>  		      call_time, token);
>  	}
> +
> +	this_cpu->current_token = 0;
> +
>  	return retval;
>  }
> 
> diff --git a/hw/xscom.c b/hw/xscom.c
> index f3f4e1039ecf..381cf2c4aaa4 100644
> --- a/hw/xscom.c
> +++ b/hw/xscom.c
> @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
>  		break;
>  	}
> 
> -	/* XXX: Create error log entry ? */
> -	log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> -		"XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> -		is_write ? "write" : "read", gcid, pcb_addr, stat);
> +	/*
> +	 * If we're in an XSCOM opal call then squash the error
> +	 * we assume that the caller (probably opal-prd) will
> +	 * handle logging it
> +	 */
> +	if (this_cpu()->current_token != OPAL_XSCOM_READ &&
> +	    this_cpu()->current_token != OPAL_XSCOM_WRITE) {


Shouldn't we restrict to not creating PELs only when the error is
OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the
XSCOM read/writes initiated by HBRT were failing ?

> +		log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> +			"XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> +			is_write ? "write" : "read", gcid, pcb_addr, stat);
> +	}
> 
>  	/* We need to reset the XSCOM or we'll hang on the next access */
>  	xscom_reset(gcid, false);
> -- 
> 2.21.0
> 

--
Thanks and Regards
gautham.
Oliver O'Halloran Dec. 6, 2019, 2:51 a.m. UTC | #2
On Fri, Dec 6, 2019 at 2:32 AM Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote:
>
> Hello Oliver,
>
>
> On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote:
> > The XSCOM read/write OPAL calls are largely there to support running
> > PRD in the OS. PRD itself handles submitting error logs (if needed)
> > when a XSCOM operation fails so there's no need to send an error log
> > from inside of OPAL.
> >
> > Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>
> This is much better, simpler version than what I posted.
>
> Only one question.
> > ---
> >  core/opal.c |  3 +++
> >  hw/xscom.c  | 15 +++++++++++----
> >  2 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/core/opal.c b/core/opal.c
> > index da746e805a3f..922dc5abddca 100644
> > --- a/core/opal.c
> > +++ b/core/opal.c
> > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe)
> >                     "Spent %llu msecs in OPAL call %llu!\n",
> >                     call_time, token);
> >       }
> > +
> > +     this_cpu->current_token = 0;
> > +
> >       return retval;
> >  }
> >
> > diff --git a/hw/xscom.c b/hw/xscom.c
> > index f3f4e1039ecf..381cf2c4aaa4 100644
> > --- a/hw/xscom.c
> > +++ b/hw/xscom.c
> > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
> >               break;
> >       }
> >
> > -     /* XXX: Create error log entry ? */
> > -     log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> > -             "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> > -             is_write ? "write" : "read", gcid, pcb_addr, stat);
> > +     /*
> > +      * If we're in an XSCOM opal call then squash the error
> > +      * we assume that the caller (probably opal-prd) will
> > +      * handle logging it
> > +      */
> > +     if (this_cpu()->current_token != OPAL_XSCOM_READ &&
> > +         this_cpu()->current_token != OPAL_XSCOM_WRITE) {
>
>
> Shouldn't we restrict to not creating PELs only when the error is
> OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the
> XSCOM read/writes initiated by HBRT were failing ?

I'm pretty sure HBRT will creates PEL for any (unexpected) xscom
failure so we'd just be doubling up. The other error results we return
are:

#define OPAL_XSCOM_BUSY         OPAL_BUSY
#define OPAL_XSCOM_CHIPLET_OFF  OPAL_WRONG_STATE
#define OPAL_XSCOM_PARTIAL_GOOD -25
#define OPAL_XSCOM_ADDR_ERROR   -26
#define OPAL_XSCOM_CLOCK_ERROR  -27
#define OPAL_XSCOM_PARITY_ERROR -28
#define OPAL_XSCOM_TIMEOUT      -29
#define OPAL_XSCOM_CTR_OFFLINED -30

IMO these are all variations on the same theme. The caller did a scom
to a register that isn't accessible for $reason so we might as well
handle it consistently. If the error needs to be logged then the
caller can deal with it.


>
> > +             log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> > +                     "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> > +                     is_write ? "write" : "read", gcid, pcb_addr, stat);
> > +     }
> >
> >       /* We need to reset the XSCOM or we'll hang on the next access */
> >       xscom_reset(gcid, false);
> > --
> > 2.21.0
> >
>
> --
> Thanks and Regards
> gautham.
Vaidyanathan Srinivasan Dec. 6, 2019, 9:26 a.m. UTC | #3
* Oliver O'Halloran <oohall@gmail.com> [2019-12-06 13:51:46]:

> On Fri, Dec 6, 2019 at 2:32 AM Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote:
> >
> > Hello Oliver,
> >
> >
> > On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote:
> > > The XSCOM read/write OPAL calls are largely there to support running
> > > PRD in the OS. PRD itself handles submitting error logs (if needed)
> > > when a XSCOM operation fails so there's no need to send an error log
> > > from inside of OPAL.
> > >
> > > Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>

> >
> > This is much better, simpler version than what I posted.
> >
> > Only one question.
> > > ---
> > >  core/opal.c |  3 +++
> > >  hw/xscom.c  | 15 +++++++++++----
> > >  2 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/core/opal.c b/core/opal.c
> > > index da746e805a3f..922dc5abddca 100644
> > > --- a/core/opal.c
> > > +++ b/core/opal.c
> > > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe)
> > >                     "Spent %llu msecs in OPAL call %llu!\n",
> > >                     call_time, token);
> > >       }
> > > +
> > > +     this_cpu->current_token = 0;
> > > +
> > >       return retval;
> > >  }
> > >
> > > diff --git a/hw/xscom.c b/hw/xscom.c
> > > index f3f4e1039ecf..381cf2c4aaa4 100644
> > > --- a/hw/xscom.c
> > > +++ b/hw/xscom.c
> > > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
> > >               break;
> > >       }
> > >
> > > -     /* XXX: Create error log entry ? */
> > > -     log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> > > -             "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> > > -             is_write ? "write" : "read", gcid, pcb_addr, stat);
> > > +     /*
> > > +      * If we're in an XSCOM opal call then squash the error
> > > +      * we assume that the caller (probably opal-prd) will
> > > +      * handle logging it
> > > +      */
> > > +     if (this_cpu()->current_token != OPAL_XSCOM_READ &&
> > > +         this_cpu()->current_token != OPAL_XSCOM_WRITE) {
> >
> >
> > Shouldn't we restrict to not creating PELs only when the error is
> > OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the
> > XSCOM read/writes initiated by HBRT were failing ?

Thank Oliver for the nice solution to check for OPAL call context.
This takes care of HBRT and userspace xscom commandline for which we
need not log an error.  Anything else calling xscom during boot will
still log an error which is very rare anyway.  


> I'm pretty sure HBRT will creates PEL for any (unexpected) xscom
> failure so we'd just be doubling up. The other error results we return
> are:
> 
> #define OPAL_XSCOM_BUSY         OPAL_BUSY
> #define OPAL_XSCOM_CHIPLET_OFF  OPAL_WRONG_STATE
> #define OPAL_XSCOM_PARTIAL_GOOD -25
> #define OPAL_XSCOM_ADDR_ERROR   -26
> #define OPAL_XSCOM_CLOCK_ERROR  -27
> #define OPAL_XSCOM_PARITY_ERROR -28
> #define OPAL_XSCOM_TIMEOUT      -29
> #define OPAL_XSCOM_CTR_OFFLINED -30
> 
> IMO these are all variations on the same theme. The caller did a scom
> to a register that isn't accessible for $reason so we might as well
> handle it consistently. If the error needs to be logged then the
> caller can deal with it.

This is fine now since the xscom code is stable and we have coded
error recovery procedures.  We originally needed error logging where
some xscom fails mysteriously affects subsequent xscoms and make them
fail as well. Now we will get prlog prints if there where retries.
We can revisit this logging area if we run into new problems.  

--Vaidy
diff mbox series

Patch

diff --git a/core/opal.c b/core/opal.c
index da746e805a3f..922dc5abddca 100644
--- a/core/opal.c
+++ b/core/opal.c
@@ -189,6 +189,9 @@  int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe)
 		      "Spent %llu msecs in OPAL call %llu!\n",
 		      call_time, token);
 	}
+
+	this_cpu->current_token = 0;
+
 	return retval;
 }
 
diff --git a/hw/xscom.c b/hw/xscom.c
index f3f4e1039ecf..381cf2c4aaa4 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -272,10 +272,17 @@  static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
 		break;
 	}
 
-	/* XXX: Create error log entry ? */
-	log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
-		"XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
-		is_write ? "write" : "read", gcid, pcb_addr, stat);
+	/*
+	 * If we're in an XSCOM opal call then squash the error
+	 * we assume that the caller (probably opal-prd) will
+	 * handle logging it
+	 */
+	if (this_cpu()->current_token != OPAL_XSCOM_READ &&
+	    this_cpu()->current_token != OPAL_XSCOM_WRITE) {
+		log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
+			"XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
+			is_write ? "write" : "read", gcid, pcb_addr, stat);
+	}
 
 	/* We need to reset the XSCOM or we'll hang on the next access */
 	xscom_reset(gcid, false);