
npu2: Clear fence on all bricks

Message ID 20191122000422.49503-1-aik@ozlabs.ru
State Accepted
Headers show
Series npu2: Clear fence on all bricks

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Alexey Kardashevskiy Nov. 22, 2019, 12:04 a.m. UTC
A bug in the NVidia driver can cause a UR HMI which fences bricks
(links). At the moment we clear the fence status only for the bricks of a
specific device; however, this does not appear to be enough and we need to
clear the fences for all bricks. This is fine as we do not allow using GPUs
individually anyway.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

Reza/Ryan, could you please add more details about what exactly causes
these UR HMIs? Thanks!
---
 hw/npu2-hw-procedures.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

Comments

Reza Arbab Nov. 29, 2019, 4:47 p.m. UTC | #1
On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>Reza/Ryan, could you please add more details about what exactly causes
>these UR HMIs? Thanks!

Hopefully I've pieced together the bug history correctly. As I 
understand it...

Each GPU has a 640 KB protected region, access to which results in an
"unsupported request" (UR) response. The root bug is that the driver
maps and accidentally accesses that area.

This firmware patch helps with recovery. From our perspective it may seem
redundant to clear the fence on all bricks instead of just the one we're
resetting, but at a hardware level the above UR sends a fence signal to
all the hardware units, so they all need to be cleared.
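
To make the before/after behaviour concrete, here is a small self-contained
sketch (not skiboot code: the fence register is modelled as a plain variable
with the write-1-to-clear semantics that the patch's read-then-write-back
pattern suggests, and PPC_BIT is redefined locally with skiboot's MSB-0 bit
numbering):

#include <stdint.h>
#include <stdio.h>

#define PPC_BIT(bit)	(0x8000000000000000ULL >> (bit))

/* Pretend register: bricks 0 and 3 are currently fenced. */
static uint64_t fence_state = PPC_BIT(0) | PPC_BIT(3);

/* Model write-1-to-clear: writing a set bit clears that fence. */
static void write_fence_state(uint64_t bits)
{
	fence_state &= ~bits;
}

/* Old behaviour: clear only this device's brick. */
static void clear_one_brick(int brick_index)
{
	if (fence_state & PPC_BIT(brick_index))
		write_fence_state(PPC_BIT(brick_index));
}

/* New behaviour: the UR fences every brick, so clear them all. */
static void clear_all_bricks(void)
{
	uint64_t val = fence_state;

	if (val)
		write_fence_state(val);
}

int main(void)
{
	clear_one_brick(0);
	printf("one brick cleared:  0x%016llx\n",
	       (unsigned long long)fence_state);	/* brick 3 still fenced */
	clear_all_bricks();
	printf("all bricks cleared: 0x%016llx\n",
	       (unsigned long long)fence_state);	/* zero */
	return 0;
}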

Acked-by: Reza Arbab <arbab@linux.ibm.com>
Alexey Kardashevskiy Dec. 2, 2019, 2:05 a.m. UTC | #2
On 30/11/2019 03:47, Reza Arbab wrote:
> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Reza/Ryan, could you please add more details about what exactly causes
>> these UR HMIs? Thanks!
> 
> Hopefully I've pieced together the bug history correctly. As I
> understand it...
> 
> Each GPU has a 640 KB protected region, access to which results in an
> "unsupported request" (UR) response. The root bug is that the driver
> maps and accidentally accesses that area.

Oh. Is this address range described anywhere? We could disable mapping
these as a precautionary measure.


> This firmware patch helps with recovery. From our perspective it may seem
> redundant to clear the fence on all bricks instead of just the one we're
> resetting, but at a hardware level the above UR sends a fence signal to
> all the hardware units, so they all need to be cleared.
> 
> Acked-by: Reza Arbab <arbab@linux.ibm.com>


Thanks!
Alexey Kardashevskiy Dec. 5, 2019, 10:58 p.m. UTC | #3
Ping?


On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> 
> 
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>>> Reza/Ryan, could you please add more details about what exactly causes
>>> these UR HMIs? Thanks!
>>
>> Hopefully I've pieced together the bug history correctly. As I
>> understand it...
>>
>> Each GPU has a 640 KB protected region, access to which results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
> 
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.
> 
> 
>> This firmware patch helps with recovery. From our perspective it may seem
>> redundant to clear the fence on all bricks instead of just the one we're
>> resetting, but at a hardware level the above UR sends a fence signal to
>> all the hardware units, so they all need to be cleared.
>>
>> Acked-by: Reza Arbab <arbab@linux.ibm.com>
> 
> 
> Thanks!
> 
>
Reza Arbab Dec. 9, 2019, 10:21 p.m. UTC | #4
On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Each GPU has a 640 KB protected region, access to which results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
>
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.

It's only been communicated to us ad hoc during bug investigation, as 
far as I know. I'm going to try requesting documentation of all 
cpu-access-limited regions so we have something to refer to.
Alistair Popple Dec. 10, 2019, 4:29 a.m. UTC | #5
On Tuesday, 10 December 2019 9:21:18 AM AEDT Reza Arbab wrote:
> On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> > On 30/11/2019 03:47, Reza Arbab wrote:
> >> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
> >> Each GPU has a 640 KB protected region, access to which results in an
> >> "unsupported request" (UR) response. The root bug is that the driver
> >> maps and accidentally accesses that area.
> > 
> > Oh. Is this address range described anywhere? We could disable mapping
> > these as a precautionary measure.
> 
> It's only been communicated to us ad hoc during bug investigation, as
> far as I know. I'm going to try requesting documentation of all
> cpu-access-limited regions so we have something to refer to.

I'm not a fan of this kind of whack-a-mole at all. If the driver is requesting
a mapping of something it shouldn't be mapping, then it's a bug and the driver
needs to be fixed.

Fixing the recovery paths makes sense, but adding arbitrary validation checks
around the place will simply create more hard-to-check co-dependencies and
strange bugs when those locations inevitably change, and it still won't prevent
bugs from causing UR responses or other bad state.

- Alistair
Oliver O'Halloran Dec. 16, 2019, 4:01 a.m. UTC | #6
On Fri, Nov 22, 2019 at 11:13 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> A bug in the NVidia driver can cause a UR HMI which fences bricks
> (links). At the moment we clear the fence status only for the bricks of a
> specific device; however, this does not appear to be enough and we need to
> clear the fences for all bricks. This is fine as we do not allow using GPUs
> individually anyway.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thanks, merged as 9be9a77a8352aee0bb74ac0d79f55e1238f76285

Patch

diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 8379cbbeedcf..8d7082d27dc6 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -264,8 +264,8 @@  static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
 /* Procedure 1.2.1 - Reset NPU/NDL */
 uint32_t reset_ntl(struct npu2_dev *ndev)
 {
-	uint64_t val;
-	int lane;
+	uint64_t val, check;
+	int lane, i;
 
 	set_iovalid(ndev, true);
 
@@ -283,10 +283,17 @@  uint32_t reset_ntl(struct npu2_dev *ndev)
 
 	/* Clear fence state for the brick */
 	val = npu2_read(ndev->npu, NPU2_MISC_FENCE_STATE);
-	if (val & PPC_BIT(ndev->brick_index)) {
-		NPU2DEVINF(ndev, "Clearing brick fence\n");
-		val = PPC_BIT(ndev->brick_index);
+	if (val) {
+		NPU2DEVINF(ndev, "Clearing all bricks fence\n");
 		npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
+		for (i = 0, check = 0; i < 4096; i++) {
+			check = npu2_read(ndev->npu, NPU2_NTL_CQ_FENCE_STATUS(ndev));
+			if (!check)
+				break;
+		}
+		if (check)
+			NPU2DEVERR(ndev, "Clearing NPU2_MISC_FENCE_STATE=0x%llx timeout, current=0x%llx\n",
+					val, check);
 	}
 
 	/* Write PRI */
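
As a closing note on the new hunk: after writing the fence bits back, it polls
NPU2_NTL_CQ_FENCE_STATUS up to 4096 times and logs an error if the fence never
drops. Stripped of the NPU specifics, that bounded-poll pattern looks roughly
like the sketch below (read_status, MAX_POLLS and wait_for_clear are
illustrative stand-ins, not skiboot API):

#include <stdbool.h>
#include <stdint.h>

#define MAX_POLLS 4096

/*
 * Poll a status source until it reads zero or the retry budget runs out.
 * Returns true on success; on failure *last holds the value that was
 * still set, so the caller can log it (as NPU2DEVERR does in the patch).
 */
static bool wait_for_clear(uint64_t (*read_status)(void), uint64_t *last)
{
	uint64_t status = 0;
	int i;

	for (i = 0; i < MAX_POLLS; i++) {
		status = read_status();
		if (!status)
			break;
	}
	if (last)
		*last = status;
	return status == 0;
}

/* Fake status source: pretend the fence drops after a few reads. */
static uint64_t fake_reads_left = 3;

static uint64_t fake_read_status(void)
{
	return fake_reads_left ? fake_reads_left-- : 0;
}

int main(void)
{
	uint64_t last;

	return wait_for_clear(fake_read_status, &last) ? 0 : 1;
}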