opal/hmi: Display correct chip id while printing NPU FIRs.

Message ID	152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com
State	Accepted
Headers	show Return-Path: <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <skiboot@lists.ozlabs.org> from <mahesh@linux.vnet.ibm.com>; Thu, 31 May 2018 09:34:37 +0100 Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Thu, 31 May 2018 09:34:35 +0100 From: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> To: skiboot list <skiboot@lists.ozlabs.org> Date: Thu, 31 May 2018 14:04:33 +0530 User-Agent: StGit/unknown-version MIME-Version: 1.0 Message-Id: <152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com> Subject: [Skiboot] [PATCH] opal/hmi: Display correct chip id while printing NPU FIRs. Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>
Series	opal/hmi: Display correct chip id while printing NPU FIRs. \| expand opal/hmi: Display correct chip id while printing NPU FIRs.

Message ID

152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com

State

Accepted

Headers

From: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
To: skiboot list <skiboot@lists.ozlabs.org>
Date: Thu, 31 May 2018 14:04:33 +0530
User-Agent: StGit/unknown-version
MIME-Version: 1.0
Message-Id: <152775567297.21492.4184562233834175407.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH] opal/hmi: Display correct chip id while printing
	NPU FIRs.
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"
	<skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Series

opal/hmi: Display correct chip id while printing NPU FIRs. | expand

Commit Message

Mahesh J Salgaonkar May 31, 2018, 8:34 a.m. UTC

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

HMIs for NPU xstops are broadcasted to all chips. All cores on all the
chips receive HMI. HMI handler correctly identifies and extracts the
NPU FIR details from affected chip, but while printing FIR data it
prints chip id and location code details of this_cpu()->chip_id which
may not be correct. This patch fixes this issue.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 core/hmi.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Balbir Singh June 1, 2018, 3:24 p.m. UTC | #1

On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> chips receive HMI. HMI handler correctly identifies and extracts the
> NPU FIR details from affected chip, but while printing FIR data it
> prints chip id and location code details of this_cpu()->chip_id which
> may not be correct. This patch fixes this issue.
>

The core does not matter here, does it? It's the NPU, it's one per
socket/processor
If fir, mask and action match, this_cpu()->chip_id's location code
should be correct.
Am I missing something? Have you seen examples of this printing the wrong thing
or did you catch this via code-review?

Balbir

Mahesh J Salgaonkar June 4, 2018, 6:09 a.m. UTC | #2

On 06/01/2018 08:54 PM, Balbir Singh wrote:
> On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
>> chips receive HMI. HMI handler correctly identifies and extracts the
>> NPU FIR details from affected chip, but while printing FIR data it
>> prints chip id and location code details of this_cpu()->chip_id which
>> may not be correct. This patch fixes this issue.
>>
> 
> The core does not matter here, does it? It's the NPU, it's one per
> socket/processor
> If fir, mask and action match, this_cpu()->chip_id's location code
> should be correct.
> Am I missing something? Have you seen examples of this printing the wrong thing
> or did you catch this via code-review?

There is a BUG reported where submitter see same NPU FIR value shown for
both chip ids. It creates a confusion as which NPU got an error. The
code checks for pMisc Receive Malfunction Alert to find out correct chip
id (flat_chip_id) where NPU got an error. We query NPU FIRs using
flat_chip_id not this_cpu()->chip_id

# putscom -c 0x0  0x5013C00 0x4000000000000000
[84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0
FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
[84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8
FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff

The second o/p above creates the confusion. It prints the same info but
with different chip id and location code.

Thanks,
-Mahesh


> 
> Balbir
>

Balbir Singh June 5, 2018, 12:45 a.m. UTC | #3

On Mon, Jun 4, 2018 at 4:09 PM, Mahesh Jagannath Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
>
> On 06/01/2018 08:54 PM, Balbir Singh wrote:
> > On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
> > <mahesh@linux.vnet.ibm.com> wrote:
> >> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>
> >> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> >> chips receive HMI. HMI handler correctly identifies and extracts the
> >> NPU FIR details from affected chip, but while printing FIR data it
> >> prints chip id and location code details of this_cpu()->chip_id which
> >> may not be correct. This patch fixes this issue.
> >>
> >
> > The core does not matter here, does it? It's the NPU, it's one per
> > socket/processor
> > If fir, mask and action match, this_cpu()->chip_id's location code
> > should be correct.
> > Am I missing something? Have you seen examples of this printing the wrong thing
> > or did you catch this via code-review?
>
> There is a BUG reported where submitter see same NPU FIR value shown for
> both chip ids. It creates a confusion as which NPU got an error. The
> code checks for pMisc Receive Malfunction Alert to find out correct chip
> id (flat_chip_id) where NPU got an error. We query NPU FIRs using
> flat_chip_id not this_cpu()->chip_id
>
> # putscom -c 0x0  0x5013C00 0x4000000000000000
> [84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
> [84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
>
> The second o/p above creates the confusion. It prints the same info but
> with different chip id and location code.

OK, so both cores see the HMI (it's probably broadcast?) and pass the
following test

xscom_read(this_cpu()->chip_id, malf_alert_scom, &malf_alert);
if (!malf_alert)
    return;

They also have bit i set in the malf_alert, since they seem to pass,
so we know that for that this_cpu()->chip_id, malf_alert did point to
chiplet i as potentially having an error due to the

if (malf_alert & PPC_BIT(i)) check

Anyway, I am fine with the changes, if you feel it's the right thing.

Balbir Singh.

Stewart Smith June 5, 2018, 5:57 a.m. UTC | #4

Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> chips receive HMI. HMI handler correctly identifies and extracts the
> NPU FIR details from affected chip, but while printing FIR data it
> prints chip id and location code details of this_cpu()->chip_id which
> may not be correct. This patch fixes this issue.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Thanks. Merged to master (with added Fixes and CC stable) as of
fa82d360a73a5e52131d8a19e9fdb9d6e9c2eeb9

diff --git a/core/hmi.c b/core/hmi.c
index c0e2354ad..3bbdb2a34 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -634,7 +634,7 @@  static void dump_scoms(int flat_chip_id, const char *unit, uint32_t *scoms,
 		if (r != OPAL_SUCCESS)
 			continue;
 		prlog(PR_ERR, "%s: [Loc: %s] P:%d 0x%08x=0x%016llx\n",
-		      unit, loc, this_cpu()->chip_id, *scoms, value);
+		      unit, loc, flat_chip_id, *scoms, value);
 		scoms++;
 	}
 }
@@ -691,13 +691,13 @@  static void find_npu2_checkstop_reason(int flat_chip_id,
 		fatal_errors = npu2_fir & ~npu2_fir_mask & npu2_fir_action0 & npu2_fir_action1;
 
 		if (fatal_errors) {
-			loc = chip_loc_code(this_cpu()->chip_id);
+			loc = chip_loc_code(flat_chip_id);
 			if (!loc)
 				loc = "Not Available";
 			prlog(PR_ERR, "NPU2: [Loc: %s] P:%d FIR#%d FIR 0x%016llx mask 0x%016llx\n",
-					loc, this_cpu()->chip_id, i, npu2_fir, npu2_fir_mask);
+					loc, flat_chip_id, i, npu2_fir, npu2_fir_mask);
 			prlog(PR_ERR, "NPU2: [Loc: %s] P:%d ACTION0 0x%016llx, ACTION1 0x%016llx\n",
-					loc, this_cpu()->chip_id, npu2_fir_action0, npu2_fir_action1);
+					loc, flat_chip_id, npu2_fir_action0, npu2_fir_action1);
 			total_errors++;
 		}

opal/hmi: Display correct chip id while printing NPU FIRs.

Commit Message

Comments

Patch