npu2: Return sensible PCI error when not frozen

Message ID 20181205235013.68364-1-aik@ozlabs.ru
State Accepted

Checks

Context | Check | Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch | success | master/apply_patch Successfully applied

Commit Message

Alexey Kardashevskiy Dec. 5, 2018, 11:50 p.m.
The current kernel calls OPAL_PCI_EEH_FREEZE_STATUS with an uninitialized
@pci_error_type parameter and then analyzes it even if the OPAL call
returned OPAL_SUCCESS. This results in unexpected EEH events and NPU
freezes.

This initializes @pci_error_type and @severity to known safe values.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

The corresponding kernel patch is under review:
https://patchwork.ozlabs.org/patch/999630/
---
 hw/npu2.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
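
The failure mode described in the commit message can be sketched in isolation. This is a simplified model, not skiboot or kernel code: the two-argument callee signature, the function names, and the `stale` parameter (which stands in for whatever an uninitialized variable happened to contain) are all assumptions made for illustration, and the numeric values of the OPAL constants are placeholders.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for this sketch; the real ones live in the OPAL API headers. */
#define OPAL_SUCCESS                 0
#define OPAL_EEH_STOPPED_NOT_FROZEN  0
#define OPAL_EEH_NO_ERROR            0

typedef int64_t (*freeze_status_fn)(uint8_t *freeze_state,
                                    uint16_t *pci_error_type);

/* Models the pre-patch callee: only freeze_state is written. */
static int64_t freeze_status_old(uint8_t *freeze_state,
                                 uint16_t *pci_error_type)
{
    (void)pci_error_type; /* left untouched, as before the patch */
    *freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
    return OPAL_SUCCESS;
}

/* Models the patched callee: the error type is set explicitly. */
static int64_t freeze_status_new(uint8_t *freeze_state,
                                 uint16_t *pci_error_type)
{
    *freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
    *pci_error_type = OPAL_EEH_NO_ERROR;
    return OPAL_SUCCESS;
}

/*
 * Hypothetical caller that inspects pci_error_type after a successful call.
 * `stale` models the leftover contents of a variable the caller never
 * initialized. Returns true if the caller would treat the result as an error.
 */
static bool caller_sees_error(freeze_status_fn fn, uint16_t stale)
{
    uint8_t fstate = 0;
    uint16_t pcierr = stale;

    if (fn(&fstate, &pcierr) != OPAL_SUCCESS)
        return true;
    return pcierr != OPAL_EEH_NO_ERROR;
}
```

With the old callee the outcome depends entirely on the stale value; with the patched callee it no longer does.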

Comments

Andrew Donnellan Dec. 6, 2018, 4:05 a.m. | #1
On 6/12/18 10:50 am, Alexey Kardashevskiy wrote:
> The current kernel calls OPAL_PCI_EEH_FREEZE_STATUS with an uninitialized
> @pci_error_type parameter and then analyzes it even if the OPAL call
> returned OPAL_SUCCESS. This results in unexpected EEH events and NPU
> freezes.
> 
> This initializes @pci_error_type and @severity to known safe values.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
> 
> The corresponding kernel patch is under review:
> https://patchwork.ozlabs.org/patch/999630/
> ---
>   hw/npu2.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/npu2.c b/hw/npu2.c
> index 767306f..674ffee 100644
> --- a/hw/npu2.c
> +++ b/hw/npu2.c
> @@ -1313,8 +1313,8 @@ static struct pci_slot *npu2_slot_create(struct phb *phb)
>   int64_t npu2_freeze_status(struct phb *phb __unused,
>   			   uint64_t pe_number __unused,
>   			   uint8_t *freeze_state,
> -			   uint16_t *pci_error_type __unused,
> -			   uint16_t *severity __unused,
> +			   uint16_t *pci_error_type,
> +			   uint16_t *severity,
>   			   uint64_t *phb_status __unused)
>   {
>   	/*
> @@ -1324,6 +1324,10 @@ int64_t npu2_freeze_status(struct phb *phb __unused,
>   	 * it keeps the skiboot PCI enumeration going.
>   	 */
>   	*freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
> +	*pci_error_type = OPAL_EEH_NO_ERROR;
> +	if (severity)
> +		*severity = OPAL_EEH_SEV_NO_ERROR;
> +
>   	return OPAL_SUCCESS;
>   }
>   
>
Stewart Smith Dec. 11, 2018, 6:29 a.m. | #2
Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> The current kernel calls OPAL_PCI_EEH_FREEZE_STATUS with an uninitialized
> @pci_error_type parameter and then analyzes it even if the OPAL call
> returned OPAL_SUCCESS. This results in unexpected EEH events and NPU
> freezes.
>
> This initializes @pci_error_type and @severity to known safe values.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Merged to master as of 3e3defbf73e3603c5dc6f168c3de764c14b50e27

Patch

diff --git a/hw/npu2.c b/hw/npu2.c
index 767306f..674ffee 100644
--- a/hw/npu2.c
+++ b/hw/npu2.c
@@ -1313,8 +1313,8 @@  static struct pci_slot *npu2_slot_create(struct phb *phb)
 int64_t npu2_freeze_status(struct phb *phb __unused,
 			   uint64_t pe_number __unused,
 			   uint8_t *freeze_state,
-			   uint16_t *pci_error_type __unused,
-			   uint16_t *severity __unused,
+			   uint16_t *pci_error_type,
+			   uint16_t *severity,
 			   uint64_t *phb_status __unused)
 {
 	/*
@@ -1324,6 +1324,10 @@  int64_t npu2_freeze_status(struct phb *phb __unused,
 	 * it keeps the skiboot PCI enumeration going.
 	 */
 	*freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
+	*pci_error_type = OPAL_EEH_NO_ERROR;
+	if (severity)
+		*severity = OPAL_EEH_SEV_NO_ERROR;
+
 	return OPAL_SUCCESS;
 }
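
The patched function body can be exercised outside skiboot with a small stand-alone copy. In the sketch below, `sketch_freeze_status` and `outputs_cleared` are hypothetical names, `struct phb` is left opaque, and the constant values are placeholders rather than the real OPAL definitions. As in the patch, `severity` may be NULL while `pci_error_type` is dereferenced unconditionally, so the caller is assumed to always pass a valid pointer for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for this sketch; the real ones live in the OPAL API headers. */
#define OPAL_SUCCESS                 0
#define OPAL_EEH_STOPPED_NOT_FROZEN  0
#define OPAL_EEH_NO_ERROR            0
#define OPAL_EEH_SEV_NO_ERROR        0

struct phb; /* opaque here; the sketch never dereferences it */

/* Stand-alone copy of the patched body, renamed to avoid confusion. */
static int64_t sketch_freeze_status(struct phb *phb, uint64_t pe_number,
                                    uint8_t *freeze_state,
                                    uint16_t *pci_error_type,
                                    uint16_t *severity,
                                    uint64_t *phb_status)
{
    (void)phb;
    (void)pe_number;
    (void)phb_status;

    *freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
    *pci_error_type = OPAL_EEH_NO_ERROR; /* must be a valid pointer */
    if (severity)                        /* optional output */
        *severity = OPAL_EEH_SEV_NO_ERROR;

    return OPAL_SUCCESS;
}

/*
 * Pre-fills the outputs with poison values, calls the function, and
 * returns 1 if every supplied output was overwritten with a "no error"
 * value. When pass_severity is 0, severity is passed as NULL and the
 * local poison value must remain untouched.
 */
static int outputs_cleared(int pass_severity)
{
    uint8_t fs = 0xff;
    uint16_t err = 0xffff, sev = 0xffff;
    int64_t rc = sketch_freeze_status(NULL, 0, &fs, &err,
                                      pass_severity ? &sev : NULL, NULL);

    if (rc != OPAL_SUCCESS || fs != OPAL_EEH_STOPPED_NOT_FROZEN ||
        err != OPAL_EEH_NO_ERROR)
        return 0;
    return pass_severity ? sev == OPAL_EEH_SEV_NO_ERROR : sev == 0xffff;
}
```

Calling `outputs_cleared(1)` covers a caller that wants the severity, and `outputs_cleared(0)` covers one that passes NULL for it.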