
npu2: Clear fence on all bricks

Message ID 20191122000422.49503-1-aik@ozlabs.ru
State Accepted
Headers show
Series npu2: Clear fence on all bricks

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Alexey Kardashevskiy Nov. 22, 2019, 12:04 a.m. UTC
A bug in the NVidia driver can cause a UR HMI which fences bricks
(links). At the moment we clear the fence status only for the bricks of a
specific device; however, this does not appear to be enough and we need to
clear the fences for all bricks. This is fine as we do not allow using GPUs
individually anyway.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

Reza/Ryan, could you please add more details about what exactly causes
these UR HMIs? Thanks!
---
 hw/npu2-hw-procedures.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

Comments

Reza Arbab Nov. 29, 2019, 4:47 p.m. UTC | #1
On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>Reza/Ryan, could you please add more details about what exactly causes
>these UR HMIs? Thanks!

Hopefully I've pieced together the bug history correctly. As I 
understand it...

Each GPU has a 640 KB protected region, access to which results in an
"unsupported request" (UR) response. The root bug is that the driver
maps and accidentally accesses that area.

This firmware patch helps with recovery. From our perspective it may seem
redundant to clear the fence on all bricks instead of just the one we're
resetting, but at a hardware level the above UR sends a fence signal to
all the hardware units, so they all need to be cleared.
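
To make the before/after behaviour concrete, here is a small self-contained
sketch (not skiboot code: the fence register is modelled as a plain variable
with the write-1-to-clear semantics that the patch's read-then-write-back
pattern suggests, and PPC_BIT is redefined locally with skiboot's MSB-0 bit
numbering):

#include <stdint.h>
#include <stdio.h>

#define PPC_BIT(bit)	(0x8000000000000000ULL >> (bit))

/* Pretend register: bricks 0 and 3 are currently fenced. */
static uint64_t fence_state = PPC_BIT(0) | PPC_BIT(3);

/* Model write-1-to-clear: writing a set bit clears that fence. */
static void write_fence_state(uint64_t bits)
{
	fence_state &= ~bits;
}

/* Old behaviour: clear only this device's brick. */
static void clear_one_brick(int brick_index)
{
	if (fence_state & PPC_BIT(brick_index))
		write_fence_state(PPC_BIT(brick_index));
}

/* New behaviour: the UR fences every brick, so clear them all. */
static void clear_all_bricks(void)
{
	uint64_t val = fence_state;

	if (val)
		write_fence_state(val);
}

int main(void)
{
	clear_one_brick(0);
	printf("one brick cleared:  0x%016llx\n",
	       (unsigned long long)fence_state);	/* brick 3 still fenced */
	clear_all_bricks();
	printf("all bricks cleared: 0x%016llx\n",
	       (unsigned long long)fence_state);	/* zero */
	return 0;
}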

Acked-by: Reza Arbab <arbab@linux.ibm.com>
Alexey Kardashevskiy Dec. 2, 2019, 2:05 a.m. UTC | #2
On 30/11/2019 03:47, Reza Arbab wrote:
> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Reza/Ryan, could you please add more details about what exactly causes
>> these UR HMIs? Thanks!
> 
> Hopefully I've pieced together the bug history correctly. As I
> understand it...
> 
> Each GPU has a 640 KB protected region, access to which results in an
> "unsupported request" (UR) response. The root bug is that the driver
> maps and accidentally accesses that area.

Oh. Is this address range described anywhere? We could disable mapping
these as a precautionary measure.


> This firmware patch helps with recovery. From our perspective it may seem
> redundant to clear the fence on all bricks instead of just the one we're
> resetting, but at a hardware level the above UR sends a fence signal to
> all the hardware units, so they all need to be cleared.
> 
> Acked-by: Reza Arbab <arbab@linux.ibm.com>


Thanks!
Alexey Kardashevskiy Dec. 5, 2019, 10:58 p.m. UTC | #3
Ping?


On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> 
> 
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>>> Reza/Ryan, could you please add more details about what exactly causes
>>> these UR HMIs? Thanks!
>>
>> Hopefully I've pieced together the bug history correctly. As I
>> understand it...
>>
>> Each GPU has a 640 KB protected region, access to which results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
> 
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.
> 
> 
>> This firmware patch helps with recovery. From our perspective it may seem
>> redundant to clear the fence on all bricks instead of just the one we're
>> resetting, but at a hardware level the above UR sends a fence signal to
>> all the hardware units, so they all need to be cleared.
>>
>> Acked-by: Reza Arbab <arbab@linux.ibm.com>
> 
> 
> Thanks!
> 
>
Reza Arbab Dec. 9, 2019, 10:21 p.m. UTC | #4
On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Each GPU has a 640 KB protected region, access to which results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
>
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.

It's only been communicated to us ad hoc during bug investigation, as 
far as I know. I'm going to try requesting documentation of all 
cpu-access-limited regions so we have something to refer to.
Alistair Popple Dec. 10, 2019, 4:29 a.m. UTC | #5
On Tuesday, 10 December 2019 9:21:18 AM AEDT Reza Arbab wrote:
> On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> > On 30/11/2019 03:47, Reza Arbab wrote:
> >> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
> >> Each GPU has a 640 KB protected region, access to which results in an
> >> "unsupported request" (UR) response. The root bug is that the driver
> >> maps and accidentally accesses that area.
> > 
> > Oh. Is this address range described anywhere? We could disable mapping
> > these as a precautionary measure.
> 
> It's only been communicated to us ad hoc during bug investigation, as
> far as I know. I'm going to try requesting documentation of all
> cpu-access-limited regions so we have something to refer to.

I'm not a fan of this kind of whack-a-mole at all. If the driver is requesting
a mapping of something it shouldn't be mapping, then it's a bug and the driver
needs to be fixed.

Fixing the recovery paths makes sense, but adding arbitrary validation checks
around the place will simply create more hard-to-check co-dependencies and
strange bugs when those locations inevitably change, and it still won't prevent
bugs from causing UR responses or other bad state.

- Alistair
Oliver O'Halloran Dec. 16, 2019, 4:01 a.m. UTC | #6
On Fri, Nov 22, 2019 at 11:13 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> A bug in the NVidia driver can cause a UR HMI which fences bricks
> (links). At the moment we clear the fence status only for the bricks of a
> specific device; however, this does not appear to be enough and we need to
> clear the fences for all bricks. This is fine as we do not allow using GPUs
> individually anyway.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thanks, merged as 9be9a77a8352aee0bb74ac0d79f55e1238f76285

Patch

diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 8379cbbeedcf..8d7082d27dc6 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -264,8 +264,8 @@  static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
 /* Procedure 1.2.1 - Reset NPU/NDL */
 uint32_t reset_ntl(struct npu2_dev *ndev)
 {
-	uint64_t val;
-	int lane;
+	uint64_t val, check;
+	int lane, i;
 
 	set_iovalid(ndev, true);
 
@@ -283,10 +283,17 @@  uint32_t reset_ntl(struct npu2_dev *ndev)
 
 	/* Clear fence state for the brick */
 	val = npu2_read(ndev->npu, NPU2_MISC_FENCE_STATE);
-	if (val & PPC_BIT(ndev->brick_index)) {
-		NPU2DEVINF(ndev, "Clearing brick fence\n");
-		val = PPC_BIT(ndev->brick_index);
+	if (val) {
+		NPU2DEVINF(ndev, "Clearing all bricks fence\n");
 		npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
+		for (i = 0, check = 0; i < 4096; i++) {
+			check = npu2_read(ndev->npu, NPU2_NTL_CQ_FENCE_STATUS(ndev));
+			if (!check)
+				break;
+		}
+		if (check)
+			NPU2DEVERR(ndev, "Clearing NPU2_MISC_FENCE_STATE=0x%llx timeout, current=0x%llx\n",
+					val, check);
 	}
 
 	/* Write PRI */
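
As a closing note on the new hunk: after writing the fence bits back, it polls
NPU2_NTL_CQ_FENCE_STATUS up to 4096 times and logs an error if the fence never
drops. Stripped of the NPU specifics, that bounded-poll pattern looks roughly
like the sketch below (read_status, MAX_POLLS and wait_for_clear are
illustrative stand-ins, not skiboot API):

#include <stdbool.h>
#include <stdint.h>

#define MAX_POLLS 4096

/*
 * Poll a status source until it reads zero or the retry budget runs out.
 * Returns true on success; on failure *last holds the value that was
 * still set, so the caller can log it (as NPU2DEVERR does in the patch).
 */
static bool wait_for_clear(uint64_t (*read_status)(void), uint64_t *last)
{
	uint64_t status = 0;
	int i;

	for (i = 0; i < MAX_POLLS; i++) {
		status = read_status();
		if (!status)
			break;
	}
	if (last)
		*last = status;
	return status == 0;
}

/* Fake status source: pretend the fence drops after a few reads. */
static uint64_t fake_reads_left = 3;

static uint64_t fake_read_status(void)
{
	return fake_reads_left ? fake_reads_left-- : 0;
}

int main(void)
{
	uint64_t last;

	return wait_for_clear(fake_read_status, &last) ? 0 : 1;
}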