Message ID | 1460097404-35422-3-git-send-email-aik@ozlabs.ru (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: > When SRIOV is disabled, the existing code presumes there is no > virtual function (VF) in use and destroys all associated PEs. > However it is possible to get into the situation when the user > activated SRIOV disabling while a VF is still in use via VFIO. > For example, unbinding a physical function (PF) while there is a guest > running with a VF passed throuhgh via VFIO will trigger the bug. > > This defines an IODA2-specific IOMMU group release() callback. > This moves all the disposal code from pnv_ioda_release_vf_PE() to this > new callback so the cleanup happens when the last user of an IOMMU > group released the reference. > > As pnv_pci_ioda2_release_dma_pe() was reduced to just calling > iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() > into pnv_ioda_release_vf_PE(). > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ > 1 file changed, 14 insertions(+), 19 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > index ce9f2bf..8108c54 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); > static void pnv_pci_ioda2_group_release(void *iommu_data) > { > struct iommu_table_group *table_group = iommu_data; > + struct pnv_ioda_pe *pe = container_of(table_group, > + struct pnv_ioda_pe, table_group); > + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); > + struct pnv_phb *phb = hose->private_data; > + struct iommu_table *tbl = pe->table_group.tables[0]; > + int64_t rc; > > - table_group->group = NULL; > -} > - > -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) > -{ > - struct iommu_table *tbl; > - int64_t rc; > - > - tbl = pe->table_group.tables[0]; > rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); Is it safe to go manipulating the PE windows, etc. after SR-IOV is disabled? When SR-IOV is disabled, you need to immediately disable the VF (I'm guessing that happens somewhere) and stop all access to the VF "hardware". Only the iommu group structure *has* to stick around until the reference count drops to zero. I think other structures and hardware reconfiguration can be deferred or done immediately, whichever is more convenient. > if (rc) > pe_warn(pe, "OPAL error %ld release DMA window\n", rc); > > pnv_pci_ioda2_set_bypass(pe, false); > - if (pe->table_group.group) { > - iommu_group_put(pe->table_group.group); > - BUG_ON(pe->table_group.group); > - } > + > + BUG_ON(!tbl); > pnv_pci_ioda2_table_free_pages(tbl); > - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); > + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); > + > + pnv_ioda_deconfigure_pe(phb, pe); > + pnv_ioda_free_pe(phb, pe->pe_number); > } > > static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) > @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) > if (pe->parent_dev != pdev) > continue; > > - pnv_pci_ioda2_release_dma_pe(pdev, pe); > - > /* Remove from list */ > mutex_lock(&phb->ioda.pe_list_mutex); > list_del(&pe->list); > mutex_unlock(&phb->ioda.pe_list_mutex); > > - pnv_ioda_deconfigure_pe(phb, pe); > - > - pnv_ioda_free_pe(phb, pe->pe_number); > + if (pe->table_group.group) > + iommu_group_put(pe->table_group.group); > } > } >
On 04/14/2016 11:40 AM, David Gibson wrote: > On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: >> When SRIOV is disabled, the existing code presumes there is no >> virtual function (VF) in use and destroys all associated PEs. >> However it is possible to get into the situation when the user >> activated SRIOV disabling while a VF is still in use via VFIO. >> For example, unbinding a physical function (PF) while there is a guest >> running with a VF passed throuhgh via VFIO will trigger the bug. >> >> This defines an IODA2-specific IOMMU group release() callback. >> This moves all the disposal code from pnv_ioda_release_vf_PE() to this >> new callback so the cleanup happens when the last user of an IOMMU >> group released the reference. >> >> As pnv_pci_ioda2_release_dma_pe() was reduced to just calling >> iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() >> into pnv_ioda_release_vf_PE(). >> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> --- >> arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ >> 1 file changed, 14 insertions(+), 19 deletions(-) >> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c >> index ce9f2bf..8108c54 100644 >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >> @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); >> static void pnv_pci_ioda2_group_release(void *iommu_data) >> { >> struct iommu_table_group *table_group = iommu_data; >> + struct pnv_ioda_pe *pe = container_of(table_group, >> + struct pnv_ioda_pe, table_group); >> + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); >> + struct pnv_phb *phb = hose->private_data; >> + struct iommu_table *tbl = pe->table_group.tables[0]; >> + int64_t rc; >> >> - table_group->group = NULL; >> -} >> - >> -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) >> -{ >> - struct iommu_table *tbl; >> - int64_t rc; >> - >> - tbl = pe->table_group.tables[0]; >> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); > > Is it safe to go manipulating the PE windows, etc. after SR-IOV is > disabled? Manipulating windows in this case is just updating 8 bytes in the TVT. At this point a VF is expected to be destroyed but PE is expected to remain not free so pnv_ioda2_pick_m64_pe() (or pnv_ioda2_reserve_m64_pe()?) won't use it. > > When SR-IOV is disabled, you need to immediately disable the VF (I'm > guessing that happens somewhere) and stop all access to the VF > "hardware". drivers/pci/iov.c === static void sriov_disable(struct pci_dev *dev) { ... for (i = 0; i < iov->num_VFs; i++) pci_iov_remove_virtfn(dev, i, 0); ... pcibios_sriov_disable(dev); === pcibios_sriov_disable() is where pnv_pci_ioda2_release_dma_pe() is called from. > Only the iommu group structure *has* to stick around > until the reference count drops to zero. I think other structures and > hardware reconfiguration can be deferred or done immediately, > whichever is more convenient. I deferred everything because of convenience as iommu_table_group is embedded into pnv_ioda struct, not a pointer. >> if (rc) >> pe_warn(pe, "OPAL error %ld release DMA window\n", rc); >> >> pnv_pci_ioda2_set_bypass(pe, false); >> - if (pe->table_group.group) { >> - iommu_group_put(pe->table_group.group); >> - BUG_ON(pe->table_group.group); >> - } >> + >> + BUG_ON(!tbl); >> pnv_pci_ioda2_table_free_pages(tbl); >> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); >> + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); >> + >> + pnv_ioda_deconfigure_pe(phb, pe); >> + pnv_ioda_free_pe(phb, pe->pe_number); >> } >> >> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >> @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >> if (pe->parent_dev != pdev) >> continue; >> >> - pnv_pci_ioda2_release_dma_pe(pdev, pe); >> - >> /* Remove from list */ >> mutex_lock(&phb->ioda.pe_list_mutex); >> list_del(&pe->list); >> mutex_unlock(&phb->ioda.pe_list_mutex); >> >> - pnv_ioda_deconfigure_pe(phb, pe); >> - >> - pnv_ioda_free_pe(phb, pe->pe_number); >> + if (pe->table_group.group) >> + iommu_group_put(pe->table_group.group); >> } >> } >> >
On Fri, Apr 15, 2016 at 11:29:32AM +1000, Alexey Kardashevskiy wrote: > On 04/14/2016 11:40 AM, David Gibson wrote: > >On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: > >>When SRIOV is disabled, the existing code presumes there is no > >>virtual function (VF) in use and destroys all associated PEs. > >>However it is possible to get into the situation when the user > >>activated SRIOV disabling while a VF is still in use via VFIO. > >>For example, unbinding a physical function (PF) while there is a guest > >>running with a VF passed throuhgh via VFIO will trigger the bug. > >> > >>This defines an IODA2-specific IOMMU group release() callback. > >>This moves all the disposal code from pnv_ioda_release_vf_PE() to this > >>new callback so the cleanup happens when the last user of an IOMMU > >>group released the reference. > >> > >>As pnv_pci_ioda2_release_dma_pe() was reduced to just calling > >>iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() > >>into pnv_ioda_release_vf_PE(). > >> > >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>--- > >> arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ > >> 1 file changed, 14 insertions(+), 19 deletions(-) > >> > >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > >>index ce9f2bf..8108c54 100644 > >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c > >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c > >>@@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); > >> static void pnv_pci_ioda2_group_release(void *iommu_data) > >> { > >> struct iommu_table_group *table_group = iommu_data; > >>+ struct pnv_ioda_pe *pe = container_of(table_group, > >>+ struct pnv_ioda_pe, table_group); > >>+ struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); > >>+ struct pnv_phb *phb = hose->private_data; > >>+ struct iommu_table *tbl = pe->table_group.tables[0]; > >>+ int64_t rc; > >> > >>- table_group->group = NULL; > >>-} > >>- > >>-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) > >>-{ > >>- struct iommu_table *tbl; > >>- int64_t rc; > >>- > >>- tbl = pe->table_group.tables[0]; > >> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); > > > >Is it safe to go manipulating the PE windows, etc. after SR-IOV is > >disabled? > > Manipulating windows in this case is just updating 8 bytes in the TVT. At > this point a VF is expected to be destroyed but PE is expected to remain not > free so pnv_ioda2_pick_m64_pe() (or pnv_ioda2_reserve_m64_pe()?) won't use > it. Ok. > >When SR-IOV is disabled, you need to immediately disable the VF (I'm > >guessing that happens somewhere) and stop all access to the VF > >"hardware". > > drivers/pci/iov.c > === > static void sriov_disable(struct pci_dev *dev) > { > ... > for (i = 0; i < iov->num_VFs; i++) > pci_iov_remove_virtfn(dev, i, 0); > ... > pcibios_sriov_disable(dev); > === > > pcibios_sriov_disable() is where pnv_pci_ioda2_release_dma_pe() is called from. > > >Only the iommu group structure *has* to stick around > >until the reference count drops to zero. I think other structures and > >hardware reconfiguration can be deferred or done immediately, > >whichever is more convenient. > > I deferred everything because of convenience as iommu_table_group is > embedded into pnv_ioda struct, not a pointer. Ok. With those queries answered, Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > >> if (rc) > >> pe_warn(pe, "OPAL error %ld release DMA window\n", rc); > >> > >> pnv_pci_ioda2_set_bypass(pe, false); > >>- if (pe->table_group.group) { > >>- iommu_group_put(pe->table_group.group); > >>- BUG_ON(pe->table_group.group); > >>- } > >>+ > >>+ BUG_ON(!tbl); > >> pnv_pci_ioda2_table_free_pages(tbl); > >>- iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); > >>+ iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); > >>+ > >>+ pnv_ioda_deconfigure_pe(phb, pe); > >>+ pnv_ioda_free_pe(phb, pe->pe_number); > >> } > >> > >> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) > >>@@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) > >> if (pe->parent_dev != pdev) > >> continue; > >> > >>- pnv_pci_ioda2_release_dma_pe(pdev, pe); > >>- > >> /* Remove from list */ > >> mutex_lock(&phb->ioda.pe_list_mutex); > >> list_del(&pe->list); > >> mutex_unlock(&phb->ioda.pe_list_mutex); > >> > >>- pnv_ioda_deconfigure_pe(phb, pe); > >>- > >>- pnv_ioda_free_pe(phb, pe->pe_number); > >>+ if (pe->table_group.group) > >>+ iommu_group_put(pe->table_group.group); > >> } > >> } > >> > > > >
On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: >When SRIOV is disabled, the existing code presumes there is no >virtual function (VF) in use and destroys all associated PEs. >However it is possible to get into the situation when the user >activated SRIOV disabling while a VF is still in use via VFIO. >For example, unbinding a physical function (PF) while there is a guest >running with a VF passed throuhgh via VFIO will trigger the bug. > >This defines an IODA2-specific IOMMU group release() callback. >This moves all the disposal code from pnv_ioda_release_vf_PE() to this >new callback so the cleanup happens when the last user of an IOMMU >group released the reference. > >As pnv_pci_ioda2_release_dma_pe() was reduced to just calling >iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() >into pnv_ioda_release_vf_PE(). > Sorry, I don't understand how it works. When PF's driver disables IOV capability, the VF cannnot work. The guest is unlikely to know that and still continue accessing the VF's resources (e.g. config space and MMIO registers). It would cause EEH errors. >Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >--- > arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ > 1 file changed, 14 insertions(+), 19 deletions(-) > >diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c >index ce9f2bf..8108c54 100644 >--- a/arch/powerpc/platforms/powernv/pci-ioda.c >+++ b/arch/powerpc/platforms/powernv/pci-ioda.c >@@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); > static void pnv_pci_ioda2_group_release(void *iommu_data) > { > struct iommu_table_group *table_group = iommu_data; >+ struct pnv_ioda_pe *pe = container_of(table_group, >+ struct pnv_ioda_pe, table_group); >+ struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); pe->parent_dev would be NULL for non-VF-PEs and it's protected by CONFIG_PCI_IOV in pci.h. >+ struct pnv_phb *phb = hose->private_data; >+ struct iommu_table *tbl = pe->table_group.tables[0]; >+ int64_t rc; > >- table_group->group = NULL; >-} >- >-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) >-{ >- struct iommu_table *tbl; >- int64_t rc; >- >- tbl = pe->table_group.tables[0]; > rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); > if (rc) > pe_warn(pe, "OPAL error %ld release DMA window\n", rc); > > pnv_pci_ioda2_set_bypass(pe, false); >- if (pe->table_group.group) { >- iommu_group_put(pe->table_group.group); >- BUG_ON(pe->table_group.group); >- } >+ >+ BUG_ON(!tbl); > pnv_pci_ioda2_table_free_pages(tbl); >- iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); >+ iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); >+ >+ pnv_ioda_deconfigure_pe(phb, pe); >+ pnv_ioda_free_pe(phb, pe->pe_number); > } It's not correct enough. One PE is comprised of DMA, MMIO, mapping info etc. This function disposes all of them when DMA finishes its job. I don't figure out a better way to represent all of them and their relationship. I guess it's worthy to have something in long term though it's not trival work. > > static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >@@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) > if (pe->parent_dev != pdev) > continue; > >- pnv_pci_ioda2_release_dma_pe(pdev, pe); >- > /* Remove from list */ > mutex_lock(&phb->ioda.pe_list_mutex); > list_del(&pe->list); > mutex_unlock(&phb->ioda.pe_list_mutex); > >- pnv_ioda_deconfigure_pe(phb, pe); >- >- pnv_ioda_free_pe(phb, pe->pe_number); >+ if (pe->table_group.group) >+ iommu_group_put(pe->table_group.group); > } > } > >-- >2.5.0.rc3 >
On 04/21/2016 10:21 AM, Gavin Shan wrote: > On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: >> When SRIOV is disabled, the existing code presumes there is no >> virtual function (VF) in use and destroys all associated PEs. >> However it is possible to get into the situation when the user >> activated SRIOV disabling while a VF is still in use via VFIO. >> For example, unbinding a physical function (PF) while there is a guest >> running with a VF passed throuhgh via VFIO will trigger the bug. >> >> This defines an IODA2-specific IOMMU group release() callback. >> This moves all the disposal code from pnv_ioda_release_vf_PE() to this >> new callback so the cleanup happens when the last user of an IOMMU >> group released the reference. >> >> As pnv_pci_ioda2_release_dma_pe() was reduced to just calling >> iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() >> into pnv_ioda_release_vf_PE(). >> > > Sorry, I don't understand how it works. When PF's driver disables > IOV capability, the VF cannnot work. The guest is unlikely to know > that and still continue accessing the VF's resources (e.g. config > space and MMIO registers). It would cause EEH errors. The host disables IOV which removes VF devices which unbinds vfio_pci driver and does all the cleanup, eventually we get to QEMU's vfio_req_notifier_handler() and PCI hot unplug is initiated and the device disappears from the guest. If the guest cannot do PCI hotunplug, then EEH will make host stop it anyway. Here we do not really care what happens to the guest (it can detect EEH or hotunplug or simply crash), we need to make sure that the _host_ does not crash in any case because the root user did something weird. > >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> --- >> arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ >> 1 file changed, 14 insertions(+), 19 deletions(-) >> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c >> index ce9f2bf..8108c54 100644 >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >> @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); >> static void pnv_pci_ioda2_group_release(void *iommu_data) >> { >> struct iommu_table_group *table_group = iommu_data; >> + struct pnv_ioda_pe *pe = container_of(table_group, >> + struct pnv_ioda_pe, table_group); >> + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); > > pe->parent_dev would be NULL for non-VF-PEs and it's protected by CONFIG_PCI_IOV > in pci.h. Yeah, I'll fix it. > >> + struct pnv_phb *phb = hose->private_data; >> + struct iommu_table *tbl = pe->table_group.tables[0]; >> + int64_t rc; >> >> - table_group->group = NULL; >> -} >> - >> -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) >> -{ >> - struct iommu_table *tbl; >> - int64_t rc; >> - >> - tbl = pe->table_group.tables[0]; >> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); >> if (rc) >> pe_warn(pe, "OPAL error %ld release DMA window\n", rc); >> >> pnv_pci_ioda2_set_bypass(pe, false); >> - if (pe->table_group.group) { >> - iommu_group_put(pe->table_group.group); >> - BUG_ON(pe->table_group.group); >> - } >> + >> + BUG_ON(!tbl); >> pnv_pci_ioda2_table_free_pages(tbl); >> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); >> + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); >> + >> + pnv_ioda_deconfigure_pe(phb, pe); >> + pnv_ioda_free_pe(phb, pe->pe_number); >> } > > It's not correct enough. One PE is comprised of DMA, MMIO, mapping info etc. > This function disposes all of them when DMA finishes its job. I don't figure > out a better way to represent all of them and their relationship. I guess it's > worthy to have something in long term though it's not trival work. Sorry, I am missing your point here. I am not changing the resource deallocation here, I am just doing it slightly later and all I wonder at the moment is if there are races - like having 2 scripts - one doing unbind PF and another doing bind PF - will this crash the host in theory? > >> >> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >> @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >> if (pe->parent_dev != pdev) >> continue; >> >> - pnv_pci_ioda2_release_dma_pe(pdev, pe); >> - >> /* Remove from list */ >> mutex_lock(&phb->ioda.pe_list_mutex); >> list_del(&pe->list); >> mutex_unlock(&phb->ioda.pe_list_mutex); >> >> - pnv_ioda_deconfigure_pe(phb, pe); >> - >> - pnv_ioda_free_pe(phb, pe->pe_number); >> + if (pe->table_group.group) >> + iommu_group_put(pe->table_group.group); >> } >> } >> >> -- >> 2.5.0.rc3 >> >
On 04/21/2016 01:20 PM, Alexey Kardashevskiy wrote: > On 04/21/2016 10:21 AM, Gavin Shan wrote: >> On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: >>> When SRIOV is disabled, the existing code presumes there is no >>> virtual function (VF) in use and destroys all associated PEs. >>> However it is possible to get into the situation when the user >>> activated SRIOV disabling while a VF is still in use via VFIO. >>> For example, unbinding a physical function (PF) while there is a guest >>> running with a VF passed throuhgh via VFIO will trigger the bug. >>> >>> This defines an IODA2-specific IOMMU group release() callback. >>> This moves all the disposal code from pnv_ioda_release_vf_PE() to this >>> new callback so the cleanup happens when the last user of an IOMMU >>> group released the reference. >>> >>> As pnv_pci_ioda2_release_dma_pe() was reduced to just calling >>> iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() >>> into pnv_ioda_release_vf_PE(). >>> >> >> Sorry, I don't understand how it works. When PF's driver disables >> IOV capability, the VF cannnot work. The guest is unlikely to know >> that and still continue accessing the VF's resources (e.g. config >> space and MMIO registers). It would cause EEH errors. > > The host disables IOV which removes VF devices which unbinds vfio_pci > driver and does all the cleanup, eventually we get to QEMU's > vfio_req_notifier_handler() and PCI hot unplug is initiated and the device > disappears from the guest. > > If the guest cannot do PCI hotunplug, then EEH will make host stop it anyway. > > Here we do not really care what happens to the guest (it can detect EEH or > hotunplug or simply crash), we need to make sure that the _host_ does not > crash in any case because the root user did something weird. Ping? > > >> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>> --- >>> arch/powerpc/platforms/powernv/pci-ioda.c | 33 >>> +++++++++++++------------------ >>> 1 file changed, 14 insertions(+), 19 deletions(-) >>> >>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c >>> b/arch/powerpc/platforms/powernv/pci-ioda.c >>> index ce9f2bf..8108c54 100644 >>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c >>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >>> @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct >>> pnv_ioda_pe *pe, bool enable); >>> static void pnv_pci_ioda2_group_release(void *iommu_data) >>> { >>> struct iommu_table_group *table_group = iommu_data; >>> + struct pnv_ioda_pe *pe = container_of(table_group, >>> + struct pnv_ioda_pe, table_group); >>> + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); >> >> pe->parent_dev would be NULL for non-VF-PEs and it's protected by >> CONFIG_PCI_IOV >> in pci.h. > > > Yeah, I'll fix it. > >> >>> + struct pnv_phb *phb = hose->private_data; >>> + struct iommu_table *tbl = pe->table_group.tables[0]; >>> + int64_t rc; >>> >>> - table_group->group = NULL; >>> -} >>> - >>> -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct >>> pnv_ioda_pe *pe) >>> -{ >>> - struct iommu_table *tbl; >>> - int64_t rc; >>> - >>> - tbl = pe->table_group.tables[0]; >>> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); >>> if (rc) >>> pe_warn(pe, "OPAL error %ld release DMA window\n", rc); >>> >>> pnv_pci_ioda2_set_bypass(pe, false); >>> - if (pe->table_group.group) { >>> - iommu_group_put(pe->table_group.group); >>> - BUG_ON(pe->table_group.group); >>> - } >>> + >>> + BUG_ON(!tbl); >>> pnv_pci_ioda2_table_free_pages(tbl); >>> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); >>> + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); >>> + >>> + pnv_ioda_deconfigure_pe(phb, pe); >>> + pnv_ioda_free_pe(phb, pe->pe_number); >>> } >> >> It's not correct enough. One PE is comprised of DMA, MMIO, mapping info etc. >> This function disposes all of them when DMA finishes its job. I don't figure >> out a better way to represent all of them and their relationship. I guess >> it's >> worthy to have something in long term though it's not trival work. > > > Sorry, I am missing your point here. I am not changing the resource > deallocation here, I am just doing it slightly later and all I wonder at > the moment is if there are races - like having 2 scripts - one doing unbind > PF and another doing bind PF - will this crash the host in theory? > > >> >>> >>> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >>> @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct >>> pci_dev *pdev) >>> if (pe->parent_dev != pdev) >>> continue; >>> >>> - pnv_pci_ioda2_release_dma_pe(pdev, pe); >>> - >>> /* Remove from list */ >>> mutex_lock(&phb->ioda.pe_list_mutex); >>> list_del(&pe->list); >>> mutex_unlock(&phb->ioda.pe_list_mutex); >>> >>> - pnv_ioda_deconfigure_pe(phb, pe); >>> - >>> - pnv_ioda_free_pe(phb, pe->pe_number); >>> + if (pe->table_group.group) >>> + iommu_group_put(pe->table_group.group); >>> } >>> } >>> >>> -- >>> 2.5.0.rc3 >>> >> > >
On Thu, Apr 21, 2016 at 01:20:09PM +1000, Alexey Kardashevskiy wrote: >On 04/21/2016 10:21 AM, Gavin Shan wrote: >>On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote: >>>When SRIOV is disabled, the existing code presumes there is no >>>virtual function (VF) in use and destroys all associated PEs. >>>However it is possible to get into the situation when the user >>>activated SRIOV disabling while a VF is still in use via VFIO. >>>For example, unbinding a physical function (PF) while there is a guest >>>running with a VF passed throuhgh via VFIO will trigger the bug. >>> >>>This defines an IODA2-specific IOMMU group release() callback. >>>This moves all the disposal code from pnv_ioda_release_vf_PE() to this >>>new callback so the cleanup happens when the last user of an IOMMU >>>group released the reference. >>> >>>As pnv_pci_ioda2_release_dma_pe() was reduced to just calling >>>iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() >>>into pnv_ioda_release_vf_PE(). >>> >> >>Sorry, I don't understand how it works. When PF's driver disables >>IOV capability, the VF cannnot work. The guest is unlikely to know >>that and still continue accessing the VF's resources (e.g. config >>space and MMIO registers). It would cause EEH errors. > >The host disables IOV which removes VF devices which unbinds vfio_pci driver >and does all the cleanup, eventually we get to QEMU's >vfio_req_notifier_handler() and PCI hot unplug is initiated and the device >disappears from the guest. > >If the guest cannot do PCI hotunplug, then EEH will make host stop it anyway. > >Here we do not really care what happens to the guest (it can detect EEH or >hotunplug or simply crash), we need to make sure that the _host_ does not >crash in any case because the root user did something weird. > Thanks for the detailed explanation. My concern is there should have some race: the guest doesn't receive the signal and unplug it in time. The guest could possibly produces PCI traffic that cannnot be handled by VF. My first impression would be raised EEH as the result. If it's a fenced PHB and the host is going to be affected. Would it be a problem? I think it would be nice if the guest can work without issue under this situation. As you claimed, it's a _weird_ behaviour for user to unbind PF's driver that in turn disables IOV capability when VFs are owned by guest. Why not stop user from doing it? > >> >>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>--- >>>arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ >>>1 file changed, 14 insertions(+), 19 deletions(-) >>> >>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c >>>index ce9f2bf..8108c54 100644 >>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c >>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c >>>@@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); >>>static void pnv_pci_ioda2_group_release(void *iommu_data) >>>{ >>> struct iommu_table_group *table_group = iommu_data; >>>+ struct pnv_ioda_pe *pe = container_of(table_group, >>>+ struct pnv_ioda_pe, table_group); >>>+ struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); >> >>pe->parent_dev would be NULL for non-VF-PEs and it's protected by CONFIG_PCI_IOV >>in pci.h. > > >Yeah, I'll fix it. > >> >>>+ struct pnv_phb *phb = hose->private_data; >>>+ struct iommu_table *tbl = pe->table_group.tables[0]; >>>+ int64_t rc; >>> >>>- table_group->group = NULL; >>>-} >>>- >>>-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) >>>-{ >>>- struct iommu_table *tbl; >>>- int64_t rc; >>>- >>>- tbl = pe->table_group.tables[0]; >>> rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); >>> if (rc) >>> pe_warn(pe, "OPAL error %ld release DMA window\n", rc); >>> >>> pnv_pci_ioda2_set_bypass(pe, false); >>>- if (pe->table_group.group) { >>>- iommu_group_put(pe->table_group.group); >>>- BUG_ON(pe->table_group.group); >>>- } >>>+ >>>+ BUG_ON(!tbl); >>> pnv_pci_ioda2_table_free_pages(tbl); >>>- iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); >>>+ iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); >>>+ >>>+ pnv_ioda_deconfigure_pe(phb, pe); >>>+ pnv_ioda_free_pe(phb, pe->pe_number); >>>} >> >>It's not correct enough. One PE is comprised of DMA, MMIO, mapping info etc. >>This function disposes all of them when DMA finishes its job. I don't figure >>out a better way to represent all of them and their relationship. I guess it's >>worthy to have something in long term though it's not trival work. > > >Sorry, I am missing your point here. I am not changing the resource >deallocation here, I am just doing it slightly later and all I wonder at the >moment is if there are races - like having 2 scripts - one doing unbind PF >and another doing bind PF - will this crash the host in theory? > One PE is comprised of multiple facets like DMA/MMIO/PELTM/EEH as I explained in last reply. pnv_pci_ioda2_release_dma_pe() called to destroy DMA window is going to tear down all (or most) of them. That means DMA and PE are equivalent, but they're not. Furturemore, not all PE have associated enabled DMA windows. >> >>> >>>static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >>>@@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) >>> if (pe->parent_dev != pdev) >>> continue; >>> >>>- pnv_pci_ioda2_release_dma_pe(pdev, pe); >>>- >>> /* Remove from list */ >>> mutex_lock(&phb->ioda.pe_list_mutex); >>> list_del(&pe->list); >>> mutex_unlock(&phb->ioda.pe_list_mutex); >>> >>>- pnv_ioda_deconfigure_pe(phb, pe); >>>- >>>- pnv_ioda_free_pe(phb, pe->pe_number); >>>+ if (pe->table_group.group) >>>+ iommu_group_put(pe->table_group.group); >>> } >>>} >>> >>>-- >>>2.5.0.rc3 >>> >> > > >-- >Alexey >
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index ce9f2bf..8108c54 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); static void pnv_pci_ioda2_group_release(void *iommu_data) { struct iommu_table_group *table_group = iommu_data; + struct pnv_ioda_pe *pe = container_of(table_group, + struct pnv_ioda_pe, table_group); + struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus); + struct pnv_phb *phb = hose->private_data; + struct iommu_table *tbl = pe->table_group.tables[0]; + int64_t rc; - table_group->group = NULL; -} - -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) -{ - struct iommu_table *tbl; - int64_t rc; - - tbl = pe->table_group.tables[0]; rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); pnv_pci_ioda2_set_bypass(pe, false); - if (pe->table_group.group) { - iommu_group_put(pe->table_group.group); - BUG_ON(pe->table_group.group); - } + + BUG_ON(!tbl); pnv_pci_ioda2_table_free_pages(tbl); - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); + iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node)); + + pnv_ioda_deconfigure_pe(phb, pe); + pnv_ioda_free_pe(phb, pe->pe_number); } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) @@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev) if (pe->parent_dev != pdev) continue; - pnv_pci_ioda2_release_dma_pe(pdev, pe); - /* Remove from list */ mutex_lock(&phb->ioda.pe_list_mutex); list_del(&pe->list); mutex_unlock(&phb->ioda.pe_list_mutex); - pnv_ioda_deconfigure_pe(phb, pe); - - pnv_ioda_free_pe(phb, pe->pe_number); + if (pe->table_group.group) + iommu_group_put(pe->table_group.group); } }
When SRIOV is disabled, the existing code presumes there is no virtual function (VF) in use and destroys all associated PEs. However it is possible to get into the situation when the user activated SRIOV disabling while a VF is still in use via VFIO. For example, unbinding a physical function (PF) while there is a guest running with a VF passed throuhgh via VFIO will trigger the bug. This defines an IODA2-specific IOMMU group release() callback. This moves all the disposal code from pnv_ioda_release_vf_PE() to this new callback so the cleanup happens when the last user of an IOMMU group released the reference. As pnv_pci_ioda2_release_dma_pe() was reduced to just calling iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe() into pnv_ioda_release_vf_PE(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++------------------ 1 file changed, 14 insertions(+), 19 deletions(-)