Message ID | 152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com |
---|---|
State | Accepted |
Headers | show |
Series | opal/hmi: Display correct chip id while printing NPU FIRs. | expand |
On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > > HMIs for NPU xstops are broadcasted to all chips. All cores on all the > chips receive HMI. HMI handler correctly identifies and extracts the > NPU FIR details from affected chip, but while printing FIR data it > prints chip id and location code details of this_cpu()->chip_id which > may not be correct. This patch fixes this issue. > The core does not matter here, does it? It's the NPU, it's one per socket/processor If fir, mask and action match, this_cpu()->chip_id's location code should be correct. Am I missing something? Have you seen examples of this printing the wrong thing or did you catch this via code-review? Balbir
On 06/01/2018 08:54 PM, Balbir Singh wrote: > On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar > <mahesh@linux.vnet.ibm.com> wrote: >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> >> >> HMIs for NPU xstops are broadcasted to all chips. All cores on all the >> chips receive HMI. HMI handler correctly identifies and extracts the >> NPU FIR details from affected chip, but while printing FIR data it >> prints chip id and location code details of this_cpu()->chip_id which >> may not be correct. This patch fixes this issue. >> > > The core does not matter here, does it? It's the NPU, it's one per > socket/processor > If fir, mask and action match, this_cpu()->chip_id's location code > should be correct. > Am I missing something? Have you seen examples of this printing the wrong thing > or did you catch this via code-review? There is a BUG reported where submitter see same NPU FIR value shown for both chip ids. It creates a confusion as which NPU got an error. The code checks for pMisc Receive Malfunction Alert to find out correct chip id (flat_chip_id) where NPU got an error. We query NPU FIRs using flat_chip_id not this_cpu()->chip_id # putscom -c 0x0 0x5013C00 0x4000000000000000 [84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0 FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff [84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8 FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff The second o/p above creates the confusion. It prints the same info but with different chip id and location code. Thanks, -Mahesh > > Balbir >
On Mon, Jun 4, 2018 at 4:09 PM, Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > > On 06/01/2018 08:54 PM, Balbir Singh wrote: > > On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar > > <mahesh@linux.vnet.ibm.com> wrote: > >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > >> > >> HMIs for NPU xstops are broadcasted to all chips. All cores on all the > >> chips receive HMI. HMI handler correctly identifies and extracts the > >> NPU FIR details from affected chip, but while printing FIR data it > >> prints chip id and location code details of this_cpu()->chip_id which > >> may not be correct. This patch fixes this issue. > >> > > > > The core does not matter here, does it? It's the NPU, it's one per > > socket/processor > > If fir, mask and action match, this_cpu()->chip_id's location code > > should be correct. > > Am I missing something? Have you seen examples of this printing the wrong thing > > or did you catch this via code-review? > > There is a BUG reported where submitter see same NPU FIR value shown for > both chip ids. It creates a confusion as which NPU got an error. The > code checks for pMisc Receive Malfunction Alert to find out correct chip > id (flat_chip_id) where NPU got an error. We query NPU FIRs using > flat_chip_id not this_cpu()->chip_id > > # putscom -c 0x0 0x5013C00 0x4000000000000000 > [84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0 > FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff > [84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8 > FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff > > The second o/p above creates the confusion. It prints the same info but > with different chip id and location code. OK, so both cores see the HMI (it's probably broadcast?) and pass the following test xscom_read(this_cpu()->chip_id, malf_alert_scom, &malf_alert); if (!malf_alert) return; They also have bit i set in the malf_alert, since they seem to pass, so we know that for that this_cpu()->chip_id, malf_alert did point to chiplet i as potentially having an error due to the if (malf_alert & PPC_BIT(i)) check Anyway, I am fine with the changes, if you feel it's the right thing. Balbir Singh.
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes: > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> > > HMIs for NPU xstops are broadcasted to all chips. All cores on all the > chips receive HMI. HMI handler correctly identifies and extracts the > NPU FIR details from affected chip, but while printing FIR data it > prints chip id and location code details of this_cpu()->chip_id which > may not be correct. This patch fixes this issue. > > Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Thanks. Merged to master (with added Fixes and CC stable) as of fa82d360a73a5e52131d8a19e9fdb9d6e9c2eeb9
diff --git a/core/hmi.c b/core/hmi.c index c0e2354ad..3bbdb2a34 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -634,7 +634,7 @@ static void dump_scoms(int flat_chip_id, const char *unit, uint32_t *scoms, if (r != OPAL_SUCCESS) continue; prlog(PR_ERR, "%s: [Loc: %s] P:%d 0x%08x=0x%016llx\n", - unit, loc, this_cpu()->chip_id, *scoms, value); + unit, loc, flat_chip_id, *scoms, value); scoms++; } } @@ -691,13 +691,13 @@ static void find_npu2_checkstop_reason(int flat_chip_id, fatal_errors = npu2_fir & ~npu2_fir_mask & npu2_fir_action0 & npu2_fir_action1; if (fatal_errors) { - loc = chip_loc_code(this_cpu()->chip_id); + loc = chip_loc_code(flat_chip_id); if (!loc) loc = "Not Available"; prlog(PR_ERR, "NPU2: [Loc: %s] P:%d FIR#%d FIR 0x%016llx mask 0x%016llx\n", - loc, this_cpu()->chip_id, i, npu2_fir, npu2_fir_mask); + loc, flat_chip_id, i, npu2_fir, npu2_fir_mask); prlog(PR_ERR, "NPU2: [Loc: %s] P:%d ACTION0 0x%016llx, ACTION1 0x%016llx\n", - loc, this_cpu()->chip_id, npu2_fir_action0, npu2_fir_action1); + loc, flat_chip_id, npu2_fir_action0, npu2_fir_action1); total_errors++; }