hw/xscom: Enable sw xstop by default on p9

Message ID 20190416015701.24170-1-oohall@gmail.com
State Accepted
Delegated to: Vasant Hegde

Commit Message

Oliver April 16, 2019, 1:57 a.m. UTC
This was disabled at some point during bringup to make life easier for
the lab folks trying to debug NVLink issues. This hack really should
have never made it out into the wild though, so we now have the
following situation occurring in the field:

 1) A bad thing happens
 2) The host kernel receives an unrecoverable HMI and calls into OPAL to
    request a platform reboot.
 3) OPAL rejects the reboot attempt and returns to the kernel with
    OPAL_PARAMETER.
 4) Kernel panics and attempts to kexec into a kdump kernel.

A side effect of the HMI seems to be CPUs becoming stuck which results
in the initialisation of the kdump kernel taking an extremely long time
(6+ hours). It's also been observed that after performing a dump the
kdump kernel then crashes itself because OPAL has ended up in a bad
state as a side effect of the HMI.

All up, it's not very good, so re-enable the software checkstop by
default. Anyone who still wants to turn it off can do so with the
opal-sw-xstop=disable nvram override.

Cc: skiboot-stable@lists.ozlabs.org
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 hw/xscom.c | 26 ++------------------------
 1 file changed, 2 insertions(+), 24 deletions(-)

Comments

Mahesh J Salgaonkar April 16, 2019, 3:30 p.m. UTC | #1
On 4/16/19 7:27 AM, Oliver O'Halloran wrote:
> This was disabled at some point during bringup to make life easier for
> the lab folks trying to debug NVLink issues. This hack really should
> have never made it out into the wild though, so we now have the
> following situation occurring in the field:
> 
>  1) A bad thing happens
>  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
>     request a platform reboot.
>  3) OPAL rejects the reboot attempt and returns to the kernel with
>     OPAL_PARAMETER.
>  4) Kernel panics and attempts to kexec into a kdump kernel.
> 
> A side effect of the HMI seems to be CPUs becoming stuck which results
> in the initialisation of the kdump kernel taking an extremely long time
> (6+ hours). It's also been observed that after performing a dump the
> kdump kernel then crashes itself because OPAL has ended up in a bad
> state as a side effect of the HMI.
> 
> All up, it's not very good, so re-enable the software checkstop by
> default. Anyone who still wants to turn it off can do so with the
> opal-sw-xstop=disable nvram override.
> 
> Cc: skiboot-stable@lists.ozlabs.org
> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Thanks,
-Mahesh.

> ---
>  hw/xscom.c | 26 ++------------------------
>  1 file changed, 2 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/xscom.c b/hw/xscom.c
> index 37f0705d1c2a..bf634d91a960 100644
> --- a/hw/xscom.c
> +++ b/hw/xscom.c
> @@ -833,30 +833,8 @@ int64_t xscom_trigger_xstop(void)
>  	int rc = OPAL_UNSUPPORTED;
>  	bool xstop_disabled = false;
> 
> -	/*
> -	 * Workaround until we iron out all checkstop issues at present.
> -	 *
> -	 * For p9:
> -	 * By default do not trigger sw checkstop unless explicitly enabled
> -	 * through nvram option 'opal-sw-xstop=enable'.
> -	 *
> -	 * For p8:
> -	 * Keep it enabled by default unless explicitly disabled.
> -	 *
> -	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
> -	 * the logic to enable sw checkstop by default on p9.
> -	 */
> -	switch (proc_gen) {
> -	case proc_gen_p8:
> -		if (nvram_query_eq("opal-sw-xstop", "disable"))
> -			xstop_disabled = true;
> -		break;
> -	case proc_gen_p9:
> -	default:
> -		if (!nvram_query_eq("opal-sw-xstop", "enable"))
> -			xstop_disabled = true;
> -		break;
> -	}
> +	if (nvram_query_eq("opal-sw-xstop", "disable"))
> +		xstop_disabled = true;
> 
>  	if (xstop_disabled) {
>  		prlog(PR_NOTICE, "Software initiated checkstop disabled.\n");
>
Stewart Smith April 17, 2019, 7:32 a.m. UTC | #2
"Oliver O'Halloran" <oohall@gmail.com> writes:
> This was disabled at some point during bringup to make life easier for
> the lab folks trying to debug NVLink issues. This hack really should
> have never made it out into the wild though, so we now have the
> following situation occurring in the field:
>
>  1) A bad thing happens
>  2) The host kernel receives an unrecoverable HMI and calls into OPAL to
>     request a platform reboot.
>  3) OPAL rejects the reboot attempt and returns to the kernel with
>     OPAL_PARAMETER.
>  4) Kernel panics and attempts to kexec into a kdump kernel.
>
> A side effect of the HMI seems to be CPUs becoming stuck which results
> in the initialisation of the kdump kernel taking an extremely long time
> (6+ hours). It's also been observed that after performing a dump the
> kdump kernel then crashes itself because OPAL has ended up in a bad
> state as a side effect of the HMI.
>
> All up, it's not very good, so re-enable the software checkstop by
> default. Anyone who still wants to turn it off can do so with the
> opal-sw-xstop=disable nvram override.
>
> Cc: skiboot-stable@lists.ozlabs.org
> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

I'll be the one rocking in the corner weeping and screaming incoherently
about some time in P9 bringup. If you listen closely, some of the things
I may be incoherently yelling are the words 'merge' and the string
"af5a3ee925d11f4e4e5276ccd5c6ec20b2d2df9f".

Patch

diff --git a/hw/xscom.c b/hw/xscom.c
index 37f0705d1c2a..bf634d91a960 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -833,30 +833,8 @@  int64_t xscom_trigger_xstop(void)
 	int rc = OPAL_UNSUPPORTED;
 	bool xstop_disabled = false;
 
-	/*
-	 * Workaround until we iron out all checkstop issues at present.
-	 *
-	 * For p9:
-	 * By default do not trigger sw checkstop unless explicitly enabled
-	 * through nvram option 'opal-sw-xstop=enable'.
-	 *
-	 * For p8:
-	 * Keep it enabled by default unless explicitly disabled.
-	 *
-	 * NOTE: Once all checkstop issues are resolved/stabilized reverse
-	 * the logic to enable sw checkstop by default on p9.
-	 */
-	switch (proc_gen) {
-	case proc_gen_p8:
-		if (nvram_query_eq("opal-sw-xstop", "disable"))
-			xstop_disabled = true;
-		break;
-	case proc_gen_p9:
-	default:
-		if (!nvram_query_eq("opal-sw-xstop", "enable"))
-			xstop_disabled = true;
-		break;
-	}
+	if (nvram_query_eq("opal-sw-xstop", "disable"))
+		xstop_disabled = true;
 
 	if (xstop_disabled) {
 		prlog(PR_NOTICE, "Software initiated checkstop disabled.\n");