| Message ID | 20191122000422.49503-1-aik@ozlabs.ru |
|---|---|
| State | Accepted |
| Series | npu2: Clear fence on all bricks |
| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469) |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
> Reza/Ryan, could you please add more details about what exactly causes
> these UR HMIs? Thanks!

Hopefully I've pieced together the bug history correctly. As I understand
it...

Each GPU has a 640kb protected region; accessing it results in an
"unsupported request" (UR) response. The root bug is that the driver maps
and accidentally accesses that area.

This firmware patch helps with recovery. From our perspective it may seem
redundant to clear the fence on all bricks instead of just the one we're
resetting, but at a hardware level the above UR sends a fence signal to
all the hardware units, so they all need to be cleared.

Acked-by: Reza Arbab <arbab@linux.ibm.com>
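[Editorial aside: the behavioural change Reza describes, acknowledging every
fenced brick rather than only the brick being reset, comes down to which mask
is written back to NPU2_MISC_FENCE_STATE. A minimal sketch; the helper names
are illustrative, and only the PPC_BIT macro (which numbers bits from the MSB,
POWER-style) mirrors the real skiboot definition:]

```c
#include <stdint.h>

/* PPC bit numbering: bit 0 is the most significant bit of a 64-bit word,
 * matching skiboot's PPC_BIT() macro. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

/* Old behaviour: acknowledge only the current brick's fence bit,
 * and only if that brick is actually fenced. */
static uint64_t fence_clear_mask_old(uint64_t fence_state, int brick_index)
{
	return (fence_state & PPC_BIT(brick_index)) ? PPC_BIT(brick_index) : 0;
}

/* New behaviour: write back the whole fence state, clearing every
 * fenced brick, since the UR fences all of them at once. */
static uint64_t fence_clear_mask_new(uint64_t fence_state)
{
	return fence_state;
}
```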
On 30/11/2019 03:47, Reza Arbab wrote:
> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Reza/Ryan, could you please add more details about what exactly causes
>> these UR HMIs? Thanks!
>
> Hopefully I've pieced together the bug history correctly. As I
> understand it...
>
> Each GPU has a 640kb protected region; accessing it results in an
> "unsupported request" (UR) response. The root bug is that the driver
> maps and accidentally accesses that area.

Oh. Is this address range described anywhere? We could disable mapping
these as a precautionary measure.

> This firmware patch helps with recovery. From our perspective it may seem
> redundant to clear the fence on all bricks instead of just the one we're
> resetting, but at a hardware level the above UR sends a fence signal to
> all the hardware units, so they all need to be cleared.
>
> Acked-by: Reza Arbab <arbab@linux.ibm.com>

Thanks!
Ping?

On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>>> Reza/Ryan, could you please add more details about what exactly causes
>>> these UR HMIs? Thanks!
>>
>> Hopefully I've pieced together the bug history correctly. As I
>> understand it...
>>
>> Each GPU has a 640kb protected region; accessing it results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
>
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.
>
>> This firmware patch helps with recovery. From our perspective it may seem
>> redundant to clear the fence on all bricks instead of just the one we're
>> resetting, but at a hardware level the above UR sends a fence signal to
>> all the hardware units, so they all need to be cleared.
>>
>> Acked-by: Reza Arbab <arbab@linux.ibm.com>
>
> Thanks!
On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> On 30/11/2019 03:47, Reza Arbab wrote:
>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
>> Each GPU has a 640kb protected region; accessing it results in an
>> "unsupported request" (UR) response. The root bug is that the driver
>> maps and accidentally accesses that area.
>
> Oh. Is this address range described anywhere? We could disable mapping
> these as a precautionary measure.

It's only been communicated to us ad hoc during bug investigation, as far
as I know. I'm going to try requesting documentation of all
cpu-access-limited regions so we have something to refer to.
On Tuesday, 10 December 2019 9:21:18 AM AEDT Reza Arbab wrote:
> On 02/12/2019 13:05, Alexey Kardashevskiy wrote:
> > On 30/11/2019 03:47, Reza Arbab wrote:
> >> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:
> >> Each GPU has a 640kb protected region; accessing it results in an
> >> "unsupported request" (UR) response. The root bug is that the driver
> >> maps and accidentally accesses that area.
> >
> > Oh. Is this address range described anywhere? We could disable mapping
> > these as a precautionary measure.
>
> It's only been communicated to us ad hoc during bug investigation, as
> far as I know. I'm going to try requesting documentation of all
> cpu-access-limited regions so we have something to refer to.

I'm not a fan of this kind of whack-a-mole at all. If the driver is
requesting a mapping of something it shouldn't be mapping then it's a bug
and the driver needs to be fixed. Fixing the recovery paths makes sense,
but adding arbitrary validation checks around the place will simply create
more hard-to-check co-dependencies and strange bugs when those locations
inevitably change, and it still won't prevent bugs causing UR responses or
other bad state.

- Alistair
On Fri, Nov 22, 2019 at 11:13 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> A bug in the NVidia driver can cause a UR HMI which fences bricks
> (links). At the moment we clear fence status only for bricks of a
> specific device; however, this does not appear to be enough and we need
> to clear fences for all bricks. This is ok as we do not allow using GPUs
> individually anyway.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thanks, merged as 9be9a77a8352aee0bb74ac0d79f55e1238f76285
diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 8379cbbeedcf..8d7082d27dc6 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -264,8 +264,8 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
 /* Procedure 1.2.1 - Reset NPU/NDL */
 uint32_t reset_ntl(struct npu2_dev *ndev)
 {
-	uint64_t val;
-	int lane;
+	uint64_t val, check;
+	int lane, i;
 
 	set_iovalid(ndev, true);
 
@@ -283,10 +283,17 @@ uint32_t reset_ntl(struct npu2_dev *ndev)
 
 	/* Clear fence state for the brick */
 	val = npu2_read(ndev->npu, NPU2_MISC_FENCE_STATE);
-	if (val & PPC_BIT(ndev->brick_index)) {
-		NPU2DEVINF(ndev, "Clearing brick fence\n");
-		val = PPC_BIT(ndev->brick_index);
+	if (val) {
+		NPU2DEVINF(ndev, "Clearing all bricks fence\n");
 		npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
+		for (i = 0, check = 0; i < 4096; i++) {
+			check = npu2_read(ndev->npu, NPU2_NTL_CQ_FENCE_STATUS(ndev));
+			if (!check)
+				break;
+		}
+		if (check)
+			NPU2DEVERR(ndev, "Clearing NPU2_MISC_FENCE_STATE=0x%llx timeout, current=0x%llx\n",
+				   val, check);
 	}
 
 	/* Write PRI */
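[Editorial aside: the recovery addition above follows a common firmware
pattern, write the clear request, then bound-poll the status register instead
of spinning forever. A standalone sketch of that pattern under stated
assumptions: read_status is a stand-in for npu2_read() of
NPU2_NTL_CQ_FENCE_STATUS, and the function names are illustrative, not
skiboot APIs:]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bounded poll: keep reading a status register until it reads zero or
 * the retry budget is exhausted. Returns true when the status cleared,
 * false on timeout, in which case the caller can log the stuck value.
 */
static bool poll_until_clear(uint64_t (*read_status)(void *ctx), void *ctx,
			     int max_polls)
{
	uint64_t check = 0;
	int i;

	for (i = 0; i < max_polls; i++) {
		check = read_status(ctx);
		if (!check)
			return true;	/* fence status cleared */
	}
	return false;			/* still fenced after max_polls reads */
}
```

The bound (4096 reads in the patch) trades a small recovery delay for a
guarantee that firmware never hangs on hardware that refuses to unfence.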
A bug in the NVidia driver can cause a UR HMI which fences bricks
(links). At the moment we clear fence status only for bricks of a
specific device; however, this does not appear to be enough and we need
to clear fences for all bricks. This is ok as we do not allow using GPUs
individually anyway.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Reza/Ryan, could you please add more details about what exactly causes
these UR HMIs? Thanks!
---
 hw/npu2-hw-procedures.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)