diff mbox series

[v2,1/3] powernv/iov: Ensure the pdn for VFs always contains a valid PE number

Message ID 20191028085424.12006-1-oohall@gmail.com (mailing list archive)
State Accepted
Commit 3b5b9997b331e77ce967eba2c4bc80dc3134a7fe
Headers show
Series [v2,1/3] powernv/iov: Ensure the pdn for VFs always contains a valid PE number | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (6b450d0404ca83dc131dadffd40c5aa6f7a603af)
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 68 lines checked

Commit Message

Oliver O'Halloran Oct. 28, 2019, 8:54 a.m. UTC
On pseries there is a bug with adding hotplugged devices to an IOMMU group.
For a number of dumb reasons fixing that bug first requires re-working how
VFs are configured on PowerNV. For background, on PowerNV we use the
pcibios_sriov_enable() hook to do two things:

1. Create a pci_dn structure for each of the VFs, and
2. Configure the PHB's internal BARs so the MMIO range for each VF
   maps to a unique PE.

Roughly speaking a PE is the hardware counterpart to a Linux IOMMU group
since all the devices in a PE share the same IOMMU table. A PE also defines
the set of devices that should be isolated in response to a PCI error (i.e.
bad DMA, UR/CA, AER events, etc). When isolated all MMIO and DMA traffic to
and from devicein the PE is blocked by the root complex until the PE is
recovered by the OS.

The requirement to block MMIO causes a giant headache because the P8 PHB
generally uses a fixed mapping between MMIO addresses and PEs.  As a result
we need to delay configuring the IOMMU groups for device until after MMIO
resources are assigned. For physical devices (i.e. non-VFs) the PE
assignment is done in pcibios_setup_bridge() which is called immediately
after the MMIO resources for downstream devices (and the bridge's windows)
are assigned. For VFs the setup is more complicated because:

a) pcibios_setup_bridge() is not called again when VFs are activated, and
b) The pci_dev for VFs are created by generic code which runs after
   pcibios_sriov_enable() is called.

The work around for this is a two step process:

1. A fixup in pcibios_add_device() is used to initialised the cached
   pe_number in pci_dn, then
2. A bus notifier then adds the device to the IOMMU group for the PE
   specified in pci_dn->pe_number.

A side effect fixing the pseries bug mentioned in the first paragraph is
moving the fixup out of pcibios_add_device() and into
pcibios_bus_add_device(), which is called much later. This results in step
2. failing because pci_dn->pe_number won't be initialised when the bus
notifier is run.

We can fix this by removing the need for the fixup. The PE for a VF is
known before the VF is even scanned so we can initialise pci_dn->pe_number
pcibios_sriov_enable() instead. Unfortunately, moving the initialisation
causes two problems:

1. We trip the WARN_ON() in the current fixup code, and
2. The EEH core clears pdn->pe_number when recovering a VF and relies
   on the fixup to correctly re-set it.

The only justification for either of these is a comment in eeh_rmv_device()
suggesting that pdn->pe_number *must* be set to IODA_INVALID_PE in order
for the VF to be scanned. However, this comment appears to have no basis
in reality. Both bugs can be fixed by just deleting the code.

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
v2: Re-wrote commit message, got very depressed about the state of things.

    The real fix here is to move the IOMMU group setup for both VFs and
    PFs into pcibios_bus_add_device() and kill the pcibios_setup_bridge()
    hack and the bus notifier hack, but doing that requires some pretty
    gnarly changes. The fix here is much less invasive.
---
 arch/powerpc/kernel/eeh_driver.c          |  6 ------
 arch/powerpc/platforms/powernv/pci-ioda.c | 19 +++++++++++++++----
 arch/powerpc/platforms/powernv/pci.c      |  4 ----
 3 files changed, 15 insertions(+), 14 deletions(-)

Comments

Shawn Anastasio Dec. 3, 2019, 11:45 p.m. UTC | #1
On 10/28/19 3:54 AM, Oliver O'Halloran wrote:
> On pseries there is a bug with adding hotplugged devices to an IOMMU group.
> For a number of dumb reasons fixing that bug first requires re-working how
> VFs are configured on PowerNV.

Are these patches expected to land soon, or is there another fix in the
pipeline? As far as I can tell this is still an issue.
Michael Ellerman Jan. 6, 2020, 11:33 p.m. UTC | #2
On Mon, 2019-10-28 at 08:54:22 UTC, Oliver O'Halloran wrote:
> On pseries there is a bug with adding hotplugged devices to an IOMMU group.
> For a number of dumb reasons fixing that bug first requires re-working how
> VFs are configured on PowerNV. For background, on PowerNV we use the
> pcibios_sriov_enable() hook to do two things:
> 
> 1. Create a pci_dn structure for each of the VFs, and
> 2. Configure the PHB's internal BARs so the MMIO range for each VF
>    maps to a unique PE.
> 
> Roughly speaking a PE is the hardware counterpart to a Linux IOMMU group
> since all the devices in a PE share the same IOMMU table. A PE also defines
> the set of devices that should be isolated in response to a PCI error (i.e.
> bad DMA, UR/CA, AER events, etc). When isolated all MMIO and DMA traffic to
> and from devicein the PE is blocked by the root complex until the PE is
> recovered by the OS.
> 
> The requirement to block MMIO causes a giant headache because the P8 PHB
> generally uses a fixed mapping between MMIO addresses and PEs.  As a result
> we need to delay configuring the IOMMU groups for device until after MMIO
> resources are assigned. For physical devices (i.e. non-VFs) the PE
> assignment is done in pcibios_setup_bridge() which is called immediately
> after the MMIO resources for downstream devices (and the bridge's windows)
> are assigned. For VFs the setup is more complicated because:
> 
> a) pcibios_setup_bridge() is not called again when VFs are activated, and
> b) The pci_dev for VFs are created by generic code which runs after
>    pcibios_sriov_enable() is called.
> 
> The work around for this is a two step process:
> 
> 1. A fixup in pcibios_add_device() is used to initialised the cached
>    pe_number in pci_dn, then
> 2. A bus notifier then adds the device to the IOMMU group for the PE
>    specified in pci_dn->pe_number.
> 
> A side effect fixing the pseries bug mentioned in the first paragraph is
> moving the fixup out of pcibios_add_device() and into
> pcibios_bus_add_device(), which is called much later. This results in step
> 2. failing because pci_dn->pe_number won't be initialised when the bus
> notifier is run.
> 
> We can fix this by removing the need for the fixup. The PE for a VF is
> known before the VF is even scanned so we can initialise pci_dn->pe_number
> pcibios_sriov_enable() instead. Unfortunately, moving the initialisation
> causes two problems:
> 
> 1. We trip the WARN_ON() in the current fixup code, and
> 2. The EEH core clears pdn->pe_number when recovering a VF and relies
>    on the fixup to correctly re-set it.
> 
> The only justification for either of these is a comment in eeh_rmv_device()
> suggesting that pdn->pe_number *must* be set to IODA_INVALID_PE in order
> for the VF to be scanned. However, this comment appears to have no basis
> in reality. Both bugs can be fixed by just deleting the code.
> 
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3b5b9997b331e77ce967eba2c4bc80dc3134a7fe

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index d9279d0..7955fba 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -541,12 +541,6 @@  static void eeh_rmv_device(struct eeh_dev *edev, void *userdata)
 
 		pci_iov_remove_virtfn(edev->physfn, pdn->vf_index);
 		edev->pdev = NULL;
-
-		/*
-		 * We have to set the VF PE number to invalid one, which is
-		 * required to plug the VF successfully.
-		 */
-		pdn->pe_number = IODA_INVALID_PE;
 #endif
 		if (rmv_data)
 			list_add(&edev->rmv_entry, &rmv_data->removed_vf_list);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5e3172d..70508b3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1558,6 +1558,10 @@  static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 
 	/* Reserve PE for each VF */
 	for (vf_index = 0; vf_index < num_vfs; vf_index++) {
+		int vf_devfn = pci_iov_virtfn_devfn(pdev, vf_index);
+		int vf_bus = pci_iov_virtfn_bus(pdev, vf_index);
+		struct pci_dn *vf_pdn;
+
 		if (pdn->m64_single_mode)
 			pe_num = pdn->pe_num_map[vf_index];
 		else
@@ -1570,13 +1574,11 @@  static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		pe->pbus = NULL;
 		pe->parent_dev = pdev;
 		pe->mve_number = -1;
-		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
-			   pci_iov_virtfn_devfn(pdev, vf_index);
+		pe->rid = (vf_bus << 8) | vf_devfn;
 
 		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
 			hose->global_number, pdev->bus->number,
-			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
-			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+			PCI_SLOT(vf_devfn), PCI_FUNC(vf_devfn), pe_num);
 
 		if (pnv_ioda_configure_pe(phb, pe)) {
 			/* XXX What do we do here ? */
@@ -1590,6 +1592,15 @@  static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		list_add_tail(&pe->list, &phb->ioda.pe_list);
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
+		/* associate this pe to it's pdn */
+		list_for_each_entry(vf_pdn, &pdn->parent->child_list, list) {
+			if (vf_pdn->busno == vf_bus &&
+			    vf_pdn->devfn == vf_devfn) {
+				vf_pdn->pe_number = pe_num;
+				break;
+			}
+		}
+
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
 #ifdef CONFIG_IOMMU_API
 		iommu_register_group(&pe->table_group,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 2825d00..b7761e2 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -816,16 +816,12 @@  void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 	struct pnv_phb *phb = hose->private_data;
 #ifdef CONFIG_PCI_IOV
 	struct pnv_ioda_pe *pe;
-	struct pci_dn *pdn;
 
 	/* Fix the VF pdn PE number */
 	if (pdev->is_virtfn) {
-		pdn = pci_get_pdn(pdev);
-		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
 		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
 			if (pe->rid == ((pdev->bus->number << 8) |
 			    (pdev->devfn & 0xff))) {
-				pdn->pe_number = pe->pe_number;
 				pe->pdev = pdev;
 				break;
 			}