Message ID | 20180607070607.16037-1-aik@ozlabs.ru (mailing list archive) |
---|---|
State | Superseded |
Series | [kernel] powerpc/ioda/npu2: Call hot reset skiboot hook when disabling NPU |
On Thu, 7 Jun 2018 17:06:07 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This brings NPU2 in a safe mode when it does not throw HMI if GPU
> coherent memory is gone.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Anyone, ping?

> ---
>
> The main aim for this is nvlink2 pass through, helps a lot.
>
>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 66c2804..29f798c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
>  		pnv_ioda_release_pe(pe);
>  }
>
> +void pnv_npu_disable_device(struct pci_dev *pdev)
> +{
> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> +
> +	if (eehpe && eeh_ops && eeh_ops->reset) {
> +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> +	}
> +}
> +
>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>  {
>  	struct pnv_phb *phb = hose->private_data;
> @@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
>  	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
>  	.dma_set_mask = pnv_npu_dma_set_mask,
>  	.shutdown = pnv_pci_ioda_shutdown,
> +	.disable_device = pnv_npu_disable_device,
>  };
>
>  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
> --
> 2.11.0
>

--
Alexey
Hi Alexey,

On Wednesday, 11 July 2018 7:45:10 PM AEST Alexey Kardashevskiy wrote:
> On Thu, 7 Jun 2018 17:06:07 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> > This brings NPU2 in a safe mode when it does not throw HMI if GPU
> > coherent memory is gone.

It might be helpful if you could describe the problem and what you are
trying to solve in a bit more depth. Assuming the memory was online, how are
you offlining it? If the memory has been online, merely fencing/hot-resetting
the NVLink is likely not sufficient, as you also need to flush caches prior to
taking the links down.

- Alistair

> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
>
> Anyone, ping?
>
>
> > ---
> >
> > The main aim for this is nvlink2 pass through, helps a lot.
> >
> >
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 66c2804..29f798c 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
> >  		pnv_ioda_release_pe(pe);
> >  }
> >
> > +void pnv_npu_disable_device(struct pci_dev *pdev)
> > +{
> > +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> > +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> > +
> > +	if (eehpe && eeh_ops && eeh_ops->reset) {
> > +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> > +	}
> > +}
> > +
> >  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
> >  {
> >  	struct pnv_phb *phb = hose->private_data;
> > @@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
> >  	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
> >  	.dma_set_mask = pnv_npu_dma_set_mask,
> >  	.shutdown = pnv_pci_ioda_shutdown,
> > +	.disable_device = pnv_npu_disable_device,
> >  };
> >
> >  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
> >
>
> --
> Alexey
>
On Thu, 12 Jul 2018 11:38:34 +1000
Alistair Popple <alistair@popple.id.au> wrote:

> Hi Alexey,
>
> On Wednesday, 11 July 2018 7:45:10 PM AEST Alexey Kardashevskiy wrote:
> > On Thu, 7 Jun 2018 17:06:07 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> > > This brings NPU2 in a safe mode when it does not throw HMI if GPU
> > > coherent memory is gone.
>
> It might be helpful if you could describe the problem and what you are
> trying to solve in a bit more depth. Assuming the memory was online, how are
> you offlining it?

Fair enough. I am offlining it by simply killing a guest, which triggers a
GPU PCI reset. Before this patch, a PCI reset would trigger an HMI, as PTEs
were still in both the QEMU and guest page tables, which would cause
prefetching and thus kill the host.

> If the memory has been online, merely fencing/hot-resetting
> the NVLink is likely not sufficient, as you also need to flush caches prior to
> taking the links down.

I'd expect the guest driver to take care of this. If this is not enough
and I need to pass some other MMIO (in addition to the ATS/TLB
invalidation thingy which I'll add anyway), then what is it?

>
> - Alistair
>
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >
> > Anyone, ping?
> >
> >
> > > ---
> > >
> > > The main aim for this is nvlink2 pass through, helps a lot.
> > >
> > >
> > > ---
> > >  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > index 66c2804..29f798c 100644
> > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > @@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
> > >  		pnv_ioda_release_pe(pe);
> > >  }
> > >
> > > +void pnv_npu_disable_device(struct pci_dev *pdev)
> > > +{
> > > +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> > > +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> > > +
> > > +	if (eehpe && eeh_ops && eeh_ops->reset) {
> > > +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> > > +	}
> > > +}
> > > +
> > >  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
> > >  {
> > >  	struct pnv_phb *phb = hose->private_data;
> > > @@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
> > >  	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
> > >  	.dma_set_mask = pnv_npu_dma_set_mask,
> > >  	.shutdown = pnv_pci_ioda_shutdown,
> > > +	.disable_device = pnv_npu_disable_device,
> > >  };
> > >
> > >  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
> > >
> >
> > --
> > Alexey
> >

--
Alexey
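The ordering concern raised in this exchange can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `link_state`, `flush_caches`, `invalidate_tlb` and `hot_reset_link` are hypothetical names, not kernel or skiboot interfaces, and the thread does not say who (host or guest driver) performs each step. The only points taken from the discussion are that caches must be flushed before the links go down and that an ATS/TLB invalidation is also planned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one NVLink-attached GPU; not a kernel structure. */
struct link_state {
	bool caches_flushed;   /* coherent GPU memory flushed from CPU caches */
	bool tlb_invalidated;  /* ATS/TLB entries for the GPU invalidated */
	bool link_up;          /* NVLink still active */
};

/* Stand-in for the cache flush; only records that it happened. */
static void flush_caches(struct link_state *s)
{
	assert(s);
	s->caches_flushed = true;
}

/* Stand-in for the ATS/TLB invalidation; only records that it happened. */
static void invalidate_tlb(struct link_state *s)
{
	assert(s);
	s->tlb_invalidated = true;
}

/*
 * Take the link down only once both preconditions hold.
 * Returns 0 on success, -1 if a precondition is missing.
 */
static int hot_reset_link(struct link_state *s)
{
	assert(s);
	if (!s->caches_flushed || !s->tlb_invalidated)
		return -1;
	s->link_up = false;
	return 0;
}
```

In this model a caller that skips either preparatory step gets an error back instead of a link that goes down with stale cache lines or translations still pointing at GPU memory, which is the situation the replies above are worried about.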
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 66c2804..29f798c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
 		pnv_ioda_release_pe(pe);
 }
 
+void pnv_npu_disable_device(struct pci_dev *pdev)
+{
+	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
+	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
+
+	if (eehpe && eeh_ops && eeh_ops->reset) {
+		eeh_ops->reset(eehpe, EEH_RESET_HOT);
+	}
+}
+
 static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
 {
 	struct pnv_phb *phb = hose->private_data;
@@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
 	.dma_set_mask = pnv_npu_dma_set_mask,
 	.shutdown = pnv_pci_ioda_shutdown,
+	.disable_device = pnv_npu_disable_device,
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
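For readers who want to poke at the guard chain in `pnv_npu_disable_device()` outside a kernel tree, the following is a minimal userspace sketch of the same pattern: look up an optional object, then call an optional callback only if every link in the chain is present. All `mock_*` types and functions and the `RESET_*` constants are hypothetical stand-ins invented for this illustration; they are not the kernel's `eeh_dev`, `eeh_pe`, `eeh_ops` or `EEH_RESET_HOT` definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum reset_kind { RESET_DEACTIVATE = 0, RESET_HOT = 1 };

struct mock_pe {
	int reset_count;            /* how many resets were requested */
	enum reset_kind last_reset; /* kind of the most recent request */
};

struct mock_edev {
	struct mock_pe *pe;         /* may be NULL, like edev->pe */
};

struct mock_ops {
	/* Optional hook, like eeh_ops->reset; may be NULL. */
	int (*reset)(struct mock_pe *pe, enum reset_kind kind);
};

/* Records the request instead of touching any hardware. */
static int mock_reset(struct mock_pe *pe, enum reset_kind kind)
{
	assert(pe);
	pe->reset_count++;
	pe->last_reset = kind;
	return 0;
}

/*
 * Same guard chain as the patch: device -> PE -> ops -> ops->reset.
 * Returns 1 if a hot reset was issued, 0 if any link was missing.
 */
static int mock_disable_device(struct mock_edev *edev,
			       const struct mock_ops *ops)
{
	struct mock_pe *pe = edev ? edev->pe : NULL;

	if (pe && ops && ops->reset) {
		ops->reset(pe, RESET_HOT);
		return 1;
	}
	return 0;
}
```

Unlike the function in the patch, this sketch returns a status so that a caller or test can tell whether the reset was actually issued; the patch's hook returns `void` and silently does nothing when a link in the chain is missing.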
This brings NPU2 in a safe mode when it does not throw HMI if GPU
coherent memory is gone.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

The main aim for this is nvlink2 pass through, helps a lot.

---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
 1 file changed, 11 insertions(+)