Message ID | 20190909123151.21944-17-fbarrat@linux.ibm.com |
---|---|
State | Superseded |
Series | opencapi: enable card reset and link retraining |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (470ffb5f29d741c3bed600f7bb7bf0cbb270e05a) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
On 09/09/2019 14:31, Frederic Barrat wrote:
> On P9, the NPU doesn't support recovery if the link goes down
> unexpectedly. It was not fully verified. We mark the device as broken
> when we receive an error interrupt from the NPU. However, there's
> nothing to prevent the OS from trying to reset the device; It may or
> may not work, it's unsupported territory, so let's log a message to
> make it clear, as it could help when debugging. We haven't hit any
> cases where the reset goes badly enough that we'd want to prevent it,
> so let it go for now. We can revisit later if we have evidence that
> it's causing more problems than it is worth.
>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> hw/npu2-opencapi.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
> index 46aeb6d3..f044fdbf 100644
> --- a/hw/npu2-opencapi.c
> +++ b/hw/npu2-opencapi.c
> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
>  			OCAPIINF(dev, "no card detected\n");
>  			return OPAL_SUCCESS;
>  		}
> +		if (dev->flags & NPU2_DEV_BROKEN) {
> +			OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
> +			dev->flags &= ~NPU2_DEV_BROKEN;

Removing the "broken" state means that the device is available. You
could update the state only when freset exits without issue.

> +		}
>  		slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
>  		/* fall-through */
>  	case OCAPI_SLOT_FRESET_INIT:
>
On 17/09/2019 at 15:55, christophe lombard wrote:
> On 09/09/2019 14:31, Frederic Barrat wrote:
>> On P9, the NPU doesn't support recovery if the link goes down
>> unexpectedly. It was not fully verified. We mark the device as broken
>> when we receive an error interrupt from the NPU. However, there's
>> nothing to prevent the OS from trying to reset the device; It may or
>> may not work, it's unsupported territory, so let's log a message to
>> make it clear, as it could help when debugging. We haven't hit any
>> cases where the reset goes badly enough that we'd want to prevent it,
>> so let it go for now. We can revisit later if we have evidence that
>> it's causing more problems than it is worth.
>>
>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>> ---
>> hw/npu2-opencapi.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
>> index 46aeb6d3..f044fdbf 100644
>> --- a/hw/npu2-opencapi.c
>> +++ b/hw/npu2-opencapi.c
>> @@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct
>> pci_slot *slot)
>>  			OCAPIINF(dev, "no card detected\n");
>>  			return OPAL_SUCCESS;
>>  		}
>> +		if (dev->flags & NPU2_DEV_BROKEN) {
>> +			OCAPIERR(dev, "Resetting a device which hit a previous
>> error. Device recovery is not supported, so future behavior is
>> undefined\n");
>> +			dev->flags &= ~NPU2_DEV_BROKEN;
>
> Removing the "broken" state means that the device is available. You
> could update the state only when freset exits without issue.

Good point, I don't think the state needs to be reset that early.

  Fred

>> +		}
>>  		slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
>>  		/* fall-through */
>>  	case OCAPI_SLOT_FRESET_INIT:
>>
>
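The review suggestion above (leave the "broken" flag set until the reset has actually completed) could look roughly like the sketch below. This is not skiboot code: `struct mock_dev`, `MOCK_DEV_BROKEN`, the `MOCK_*` return codes and `mock_freset()` are hypothetical stand-ins for `struct npu2_dev`, `NPU2_DEV_BROKEN`, the OPAL return codes and `npu2_opencapi_freset()`, and the real slot state machine is collapsed into a single `link_trains_ok` argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for NPU2_DEV_BROKEN; the bit value is an assumption. */
#define MOCK_DEV_BROKEN 0x1u

/* Hypothetical stand-in for the device structure; only the flags matter here. */
struct mock_dev {
	uint32_t flags;
};

/* Hypothetical return codes, standing in for the OPAL ones. */
enum mock_result {
	MOCK_SUCCESS = 0,
	MOCK_HARDWARE = -1,
};

/*
 * Simplified reset path. The caller tells us whether link training
 * succeeds, which replaces the real multi-step state machine.
 */
static int mock_freset(struct mock_dev *dev, int link_trains_ok)
{
	assert(dev != NULL);

	/* Warn early, as the patch does, but do not touch the flag yet. */
	if (dev->flags & MOCK_DEV_BROKEN)
		fprintf(stderr, "Resetting a device which hit a previous error\n");

	/* On failure the device stays marked as broken. */
	if (!link_trains_ok)
		return MOCK_HARDWARE;

	/* Clear the flag only once the reset has completed without issue. */
	dev->flags &= ~MOCK_DEV_BROKEN;
	return MOCK_SUCCESS;
}
```

With this ordering, a failed reset on a broken device returns an error and leaves the flag set, so the device is never reported as available after an unsuccessful recovery attempt.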
diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
index 46aeb6d3..f044fdbf 100644
--- a/hw/npu2-opencapi.c
+++ b/hw/npu2-opencapi.c
@@ -1246,6 +1246,10 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
 			OCAPIINF(dev, "no card detected\n");
 			return OPAL_SUCCESS;
 		}
+		if (dev->flags & NPU2_DEV_BROKEN) {
+			OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
+			dev->flags &= ~NPU2_DEV_BROKEN;
+		}
 		slot->link_retries = OCAPI_LINK_TRAINING_RETRIES;
 		/* fall-through */
 	case OCAPI_SLOT_FRESET_INIT:
On P9, the NPU doesn't support recovery if the link goes down
unexpectedly. Recovery was not fully verified. We mark the device as
broken when we receive an error interrupt from the NPU. However,
there's nothing to prevent the OS from trying to reset the device. It
may or may not work; this is unsupported territory, so let's log a
message to make that clear, as it could help when debugging. We haven't
hit any cases where the reset goes badly enough that we'd want to
prevent it, so let it go for now. We can revisit later if we have
evidence that it's causing more problems than it is worth.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 hw/npu2-opencapi.c | 4 ++++
 1 file changed, 4 insertions(+)