diff mbox series

opal/hmi: Display correct chip id while printing NPU FIRs.

Message ID 152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com
State Accepted
Headers show
Series opal/hmi: Display correct chip id while printing NPU FIRs. | expand

Commit Message

Mahesh J Salgaonkar May 31, 2018, 8:34 a.m. UTC
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

HMIs for NPU xstops are broadcasted to all chips. All cores on all the
chips receive HMI. HMI handler correctly identifies and extracts the
NPU FIR details from affected chip, but while printing FIR data it
prints chip id and location code details of this_cpu()->chip_id which
may not be correct. This patch fixes this issue.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 core/hmi.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Balbir Singh June 1, 2018, 3:24 p.m. UTC | #1
On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> chips receive HMI. HMI handler correctly identifies and extracts the
> NPU FIR details from affected chip, but while printing FIR data it
> prints chip id and location code details of this_cpu()->chip_id which
> may not be correct. This patch fixes this issue.
>

The core does not matter here, does it? It's the NPU, it's one per
socket/processor
If fir, mask and action match, this_cpu()->chip_id's location code
should be correct.
Am I missing something? Have you seen examples of this printing the wrong thing
or did you catch this via code-review?

Balbir
Mahesh J Salgaonkar June 4, 2018, 6:09 a.m. UTC | #2
On 06/01/2018 08:54 PM, Balbir Singh wrote:
> On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
>> chips receive HMI. HMI handler correctly identifies and extracts the
>> NPU FIR details from affected chip, but while printing FIR data it
>> prints chip id and location code details of this_cpu()->chip_id which
>> may not be correct. This patch fixes this issue.
>>
> 
> The core does not matter here, does it? It's the NPU, it's one per
> socket/processor
> If fir, mask and action match, this_cpu()->chip_id's location code
> should be correct.
> Am I missing something? Have you seen examples of this printing the wrong thing
> or did you catch this via code-review?

There is a BUG reported where submitter see same NPU FIR value shown for
both chip ids. It creates a confusion as which NPU got an error. The
code checks for pMisc Receive Malfunction Alert to find out correct chip
id (flat_chip_id) where NPU got an error. We query NPU FIRs using
flat_chip_id not this_cpu()->chip_id

# putscom -c 0x0  0x5013C00 0x4000000000000000
[84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0
FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
[84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8
FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff

The second o/p above creates the confusion. It prints the same info but
with different chip id and location code.

Thanks,
-Mahesh


> 
> Balbir
>
Balbir Singh June 5, 2018, 12:45 a.m. UTC | #3
On Mon, Jun 4, 2018 at 4:09 PM, Mahesh Jagannath Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
>
> On 06/01/2018 08:54 PM, Balbir Singh wrote:
> > On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
> > <mahesh@linux.vnet.ibm.com> wrote:
> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>
> >> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> >> chips receive HMI. HMI handler correctly identifies and extracts the
> >> NPU FIR details from affected chip, but while printing FIR data it
> >> prints chip id and location code details of this_cpu()->chip_id which
> >> may not be correct. This patch fixes this issue.
> >>
> >
> > The core does not matter here, does it? It's the NPU, it's one per
> > socket/processor
> > If fir, mask and action match, this_cpu()->chip_id's location code
> > should be correct.
> > Am I missing something? Have you seen examples of this printing the wrong thing
> > or did you catch this via code-review?
>
> There is a BUG reported where submitter see same NPU FIR value shown for
> both chip ids. It creates a confusion as which NPU got an error. The
> code checks for pMisc Receive Malfunction Alert to find out correct chip
> id (flat_chip_id) where NPU got an error. We query NPU FIRs using
> flat_chip_id not this_cpu()->chip_id
>
> # putscom -c 0x0  0x5013C00 0x4000000000000000
> [84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
> [84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
>
> The second o/p above creates the confusion. It prints the same info but
> with different chip id and location code.

OK, so both cores see the HMI (it's probably broadcast?) and pass the
following test

xscom_read(this_cpu()->chip_id, malf_alert_scom, &malf_alert);
if (!malf_alert)
    return;

They also have bit i set in the malf_alert, since they seem to pass,
so we know that for that this_cpu()->chip_id, malf_alert did point to
chiplet i as potentially having an error due to the

if (malf_alert & PPC_BIT(i)) check

Anyway, I am fine with the changes, if you feel it's the right thing.

Balbir Singh.
Stewart Smith June 5, 2018, 5:57 a.m. UTC | #4
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> chips receive HMI. HMI handler correctly identifies and extracts the
> NPU FIR details from affected chip, but while printing FIR data it
> prints chip id and location code details of this_cpu()->chip_id which
> may not be correct. This patch fixes this issue.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Thanks. Merged to master (with added Fixes and CC stable) as of
fa82d360a73a5e52131d8a19e9fdb9d6e9c2eeb9
diff mbox series

Patch

diff --git a/core/hmi.c b/core/hmi.c
index c0e2354ad..3bbdb2a34 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -634,7 +634,7 @@  static void dump_scoms(int flat_chip_id, const char *unit, uint32_t *scoms,
 		if (r != OPAL_SUCCESS)
 			continue;
 		prlog(PR_ERR, "%s: [Loc: %s] P:%d 0x%08x=0x%016llx\n",
-		      unit, loc, this_cpu()->chip_id, *scoms, value);
+		      unit, loc, flat_chip_id, *scoms, value);
 		scoms++;
 	}
 }
@@ -691,13 +691,13 @@  static void find_npu2_checkstop_reason(int flat_chip_id,
 		fatal_errors = npu2_fir & ~npu2_fir_mask & npu2_fir_action0 & npu2_fir_action1;
 
 		if (fatal_errors) {
-			loc = chip_loc_code(this_cpu()->chip_id);
+			loc = chip_loc_code(flat_chip_id);
 			if (!loc)
 				loc = "Not Available";
 			prlog(PR_ERR, "NPU2: [Loc: %s] P:%d FIR#%d FIR 0x%016llx mask 0x%016llx\n",
-					loc, this_cpu()->chip_id, i, npu2_fir, npu2_fir_mask);
+					loc, flat_chip_id, i, npu2_fir, npu2_fir_mask);
 			prlog(PR_ERR, "NPU2: [Loc: %s] P:%d ACTION0 0x%016llx, ACTION1 0x%016llx\n",
-					loc, this_cpu()->chip_id, npu2_fir_action0, npu2_fir_action1);
+					loc, flat_chip_id, npu2_fir_action0, npu2_fir_action1);
 			total_errors++;
 		}