[2/2] core/hmi: Do not call find_core_checkstop_reason() unconditionally

Message ID 20171130125051.1433-2-bsingharora@gmail.com
State New
Headers show
Series
  • [1/2] hw/npu2: Implement logging HMI actions
Related show

Commit Message

Balbir Singh Nov. 30, 2017, 12:50 p.m.
We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI
HMI handling paths found and accepted the reason for the local checkstop.
We also trash hmi_evt in the call.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 core/hmi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Mahesh Jagannath Salgaonkar Dec. 1, 2017, 8:05 a.m. | #1
On 11/30/2017 06:20 PM, Balbir Singh wrote:
> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI
> HMI handling paths found and accepted the reason for the local checkstop.

Yes, this is because there is a possibility of cascading checkstops
where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once.
Hence we need to make sure that we detect reasons from all FIRs
including core FIR.

> We also trash hmi_evt in the call.

Every find_*_checkstop*() caller fills and queues up the hmi_evt to be
sent to kernel. Once it is queued, the hmi_evt is free to use by next
caller.

Thanks,
-Mahesh.

> 
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  core/hmi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/core/hmi.c b/core/hmi.c
> index bb144512..62a916ba 100644
> --- a/core/hmi.c
> +++ b/core/hmi.c
> @@ -697,7 +697,8 @@ static void decode_malfunction(struct OpalHMIEvent *hmi_evt)
>  		}
>  	}
> 
> -	find_core_checkstop_reason(hmi_evt, &event_generated);
> +	if (!event_generated)
> +		find_core_checkstop_reason(hmi_evt, &event_generated);
> 
>  	/*
>  	 * If we fail to find checkstop reason, send an unknown HMI event.
>
Balbir Singh Dec. 4, 2017, 6:01 a.m. | #2
On Fri, Dec 1, 2017 at 7:05 PM, Mahesh Jagannath Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
> On 11/30/2017 06:20 PM, Balbir Singh wrote:
>> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI
>> HMI handling paths found and accepted the reason for the local checkstop.
>
> Yes, this is because there is a possibility of cascading checkstops
> where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once.
> Hence we need to make sure that we detect reasons from all FIRs
> including core FIR.
>

Cascading as in one event for all the triggered FIR's - each causing a
malfunction
error? Have you ever seen this case, several cascaded FIR bits causing just
one HMI malfunction error?

>> We also trash hmi_evt in the call.
>
> Every find_*_checkstop*() caller fills and queues up the hmi_evt to be
> sent to kernel. Once it is queued, the hmi_evt is free to use by next
> caller.
>

I agree on the queuing bits,

The code looks a bit odd to me

find_core_checkstop_reason() unconditionally initializes

        hmi_evt->severity = OpalHMI_SEV_FATAL;
        hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT;
        hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_CORE;

Then in decode_malfunction we do

        /*
         * If we fail to find checkstop reason, send an unknown HMI event.
         */
        if (!event_generated) {
                hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN;
                hmi_evt->u.xstop_error.xstop_reason = 0;
                queue_hmi_event(hmi_evt, false);
        }

Do we want to leave the severity as FATAL from there? Moreover the
code organization
needs revisiting


Balbir Singh
Mahesh Jagannath Salgaonkar Dec. 4, 2017, 10:54 a.m. | #3
On 12/04/2017 11:31 AM, Balbir Singh wrote:
> On Fri, Dec 1, 2017 at 7:05 PM, Mahesh Jagannath Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> On 11/30/2017 06:20 PM, Balbir Singh wrote:
>>> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI
>>> HMI handling paths found and accepted the reason for the local checkstop.
>>
>> Yes, this is because there is a possibility of cascading checkstops
>> where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once.
>> Hence we need to make sure that we detect reasons from all FIRs
>> including core FIR.
>>
> 
> Cascading as in one event for all the triggered FIR's - each causing a
> malfunction
> error? Have you ever seen this case, several cascaded FIR bits causing just
> one HMI malfunction error?

Nope. I have never come across such a case but there is a possibility. I
remember HW folks mentioning this when we initially implemented HMI
handling.

> 
>>> We also trash hmi_evt in the call.
>>
>> Every find_*_checkstop*() caller fills and queues up the hmi_evt to be
>> sent to kernel. Once it is queued, the hmi_evt is free to use by next
>> caller.
>>
> 
> I agree on the queuing bits,
> 
> The code looks a bit odd to me
> 
> find_core_checkstop_reason() unconditionally initializes
> 
>         hmi_evt->severity = OpalHMI_SEV_FATAL;
>         hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT;
>         hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_CORE;
> 
> Then in decode_malfunction we do
> 
>         /*
>          * If we fail to find checkstop reason, send an unknown HMI event.
>          */
>         if (!event_generated) {
>                 hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN;
>                 hmi_evt->u.xstop_error.xstop_reason = 0;
>                 queue_hmi_event(hmi_evt, false);
>         }
> 
> Do we want to leave the severity as FATAL from there? Moreover the
> code organization
> needs revisiting

That was the original intent for unknown malfunction alert since we
considered all malfunction alerts as fatal unrecovered HW errors. Hence
the code uses the hmi_evt that was initialized in
find_core_checkstop_reason() and just change the xstop_type to Unknown.
I agree code need to be more readable. We can re-initialize the hmi_evt
again in (!event_generated) case to remove the confusion.

Unknown malf alert means that opal handler has failed to detect the HW
error reason and we don't know whether it is good idea for kernel to
continue. Hence we leave the severity as FATAL and disposition to
unrecovered. This also means hmi handler need to be enhanced to add the
missing functionality.

Do you think we should mark it as a warning and send it as recovered event ?

Thanks,
-Mahesh.

Patch

diff --git a/core/hmi.c b/core/hmi.c
index bb144512..62a916ba 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -697,7 +697,8 @@  static void decode_malfunction(struct OpalHMIEvent *hmi_evt)
 		}
 	}
 
-	find_core_checkstop_reason(hmi_evt, &event_generated);
+	if (!event_generated)
+		find_core_checkstop_reason(hmi_evt, &event_generated);
 
 	/*
 	 * If we fail to find checkstop reason, send an unknown HMI event.