[3/3] PCI: vmd: Use managed irq affinities
diff mbox series

Message ID 1573040408-3831-4-git-send-email-jonathan.derrick@intel.com
State Changes Requested
Delegated to: Lorenzo Pieralisi
Headers show
Series
  • PCI: vmd: Reducing tail latency by affining to the storage stack
Related show

Commit Message

Derrick, Jonathan Nov. 6, 2019, 11:40 a.m. UTC
Using managed IRQ affinities sets up the VMD affinities identically to
the child devices when those devices vector counts are limited by VMD.
This promotes better affinity handling as interrupts won't necessarily
need to pass context between non-local CPUs. One pre-vector is reserved
for the slow interrupt and not considered in the affinity algorithm.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

Comments

Keith Busch Nov. 6, 2019, 6:10 p.m. UTC | #1
On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> Using managed IRQ affinities sets up the VMD affinities identically to
> the child devices when those devices vector counts are limited by VMD.
> This promotes better affinity handling as interrupts won't necessarily
> need to pass context between non-local CPUs. One pre-vector is reserved
> for the slow interrupt and not considered in the affinity algorithm.

This only works if all devices have exactly the same number of interrupts
as the parent VMD host bridge. If a child device has less, the device
will stop working if you offline a cpu: the child device may have a
resource affined to other online cpus, but the VMD device affinity is to
that single offline cpu.
Derrick, Jonathan Nov. 6, 2019, 8:14 p.m. UTC | #2
On Thu, 2019-11-07 at 03:10 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> > Using managed IRQ affinities sets up the VMD affinities identically to
> > the child devices when those devices vector counts are limited by VMD.
> > This promotes better affinity handling as interrupts won't necessarily
> > need to pass context between non-local CPUs. One pre-vector is reserved
> > for the slow interrupt and not considered in the affinity algorithm.
> 
> This only works if all devices have exactly the same number of interrupts
> as the parent VMD host bridge. If a child device has less, the device
> will stop working if you offline a cpu: the child device may have a
> resource affined to other online cpus, but the VMD device affinity is to
> that single offline cpu.

Yes that problem exists today and this set limits the exposure as it's
a rare case where you have a child NVMe device with fewer than 32
vectors.
Keith Busch Nov. 6, 2019, 8:27 p.m. UTC | #3
On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> Yes that problem exists today 

Not really, because we're currently using unamanged interrupts which
migrate to online CPUs. It's only the managed ones that don't migrate
because they have a unchangeable affinity.

> and this set limits the exposure as it's
> a rare case where you have a child NVMe device with fewer than 32
> vectors.

I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
queues, yielding 31 vectors for IO services, so that already won't work
with this scheme.

But assuming you wanted to only use devices that have at least 32 IRQ
vectors, the nvme driver also allows users to carve those vectors up
into fully affinitized sets for different services (read vs. write is
the one the block stack supports), which would also break if alignment
to the parent device's IRQ setup is required.
Derrick, Jonathan Nov. 6, 2019, 8:33 p.m. UTC | #4
On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > Yes that problem exists today 
> 
> Not really, because we're currently using unamanged interrupts which
> migrate to online CPUs. It's only the managed ones that don't migrate
> because they have a unchangeable affinity.
> 
> > and this set limits the exposure as it's
> > a rare case where you have a child NVMe device with fewer than 32
> > vectors.
> 
> I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> queues, yielding 31 vectors for IO services, so that already won't work
> with this scheme.
> 
Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
leaving the 31 remaining VMD vectors for NVMe IO queues. This would
have lined up perfectly. Had changed it last minute and that does break
the scheme for P3700....

> But assuming you wanted to only use devices that have at least 32 IRQ
> vectors, the nvme driver also allows users to carve those vectors up
> into fully affinitized sets for different services (read vs. write is
> the one the block stack supports), which would also break if alignment
> to the parent device's IRQ setup is required.

Wasn't aware of this. Fair enough.
Lorenzo Pieralisi Nov. 18, 2019, 10:49 a.m. UTC | #5
On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > Yes that problem exists today 
> > 
> > Not really, because we're currently using unamanged interrupts which
> > migrate to online CPUs. It's only the managed ones that don't migrate
> > because they have a unchangeable affinity.
> > 
> > > and this set limits the exposure as it's
> > > a rare case where you have a child NVMe device with fewer than 32
> > > vectors.
> > 
> > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > queues, yielding 31 vectors for IO services, so that already won't work
> > with this scheme.
> > 
> Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
> leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> have lined up perfectly. Had changed it last minute and that does break
> the scheme for P3700....
> 
> > But assuming you wanted to only use devices that have at least 32 IRQ
> > vectors, the nvme driver also allows users to carve those vectors up
> > into fully affinitized sets for different services (read vs. write is
> > the one the block stack supports), which would also break if alignment
> > to the parent device's IRQ setup is required.
> 
> Wasn't aware of this. Fair enough.

Marked the series with "Changes Requested", waiting for a new
version.

Lorenzo
Derrick, Jonathan Nov. 18, 2019, 4:43 p.m. UTC | #6
On Mon, 2019-11-18 at 10:49 +0000, Lorenzo Pieralisi wrote:
> On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > > Yes that problem exists today 
> > > 
> > > Not really, because we're currently using unamanged interrupts which
> > > migrate to online CPUs. It's only the managed ones that don't migrate
> > > because they have a unchangeable affinity.
> > > 
> > > > and this set limits the exposure as it's
> > > > a rare case where you have a child NVMe device with fewer than 32
> > > > vectors.
> > > 
> > > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > > queues, yielding 31 vectors for IO services, so that already won't work
> > > with this scheme.
> > > 
> > Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
> > leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> > have lined up perfectly. Had changed it last minute and that does break
> > the scheme for P3700....
> > 
> > > But assuming you wanted to only use devices that have at least 32 IRQ
> > > vectors, the nvme driver also allows users to carve those vectors up
> > > into fully affinitized sets for different services (read vs. write is
> > > the one the block stack supports), which would also break if alignment
> > > to the parent device's IRQ setup is required.
> > 
> > Wasn't aware of this. Fair enough.
> 
> Marked the series with "Changes Requested", waiting for a new
> version.
> 
> Lorenzo

This will need an entirely different strategy to be useful, so can be
dropped for now.

Thank you,
Jon

Patch
diff mbox series

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 7aca925..be92076 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -157,22 +157,11 @@  static void vmd_irq_disable(struct irq_data *data)
 	raw_spin_unlock_irqrestore(&list_lock, flags);
 }
 
-/*
- * XXX: Stubbed until we develop acceptable way to not create conflicts with
- * other devices sharing the same vector.
- */
-static int vmd_irq_set_affinity(struct irq_data *data,
-				const struct cpumask *dest, bool force)
-{
-	return -EINVAL;
-}
-
 static struct irq_chip vmd_msi_controller = {
 	.name			= "VMD-MSI",
 	.irq_enable		= vmd_irq_enable,
 	.irq_disable		= vmd_irq_disable,
 	.irq_compose_msi_msg	= vmd_compose_msi_msg,
-	.irq_set_affinity	= vmd_irq_set_affinity,
 };
 
 static irq_hw_number_t vmd_get_hwirq(struct msi_domain_info *info,
@@ -722,6 +711,9 @@  static irqreturn_t vmd_irq(int irq, void *data)
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	struct vmd_dev *vmd;
+	struct irq_affinity affd = {
+		.pre_vectors = 1,
+	};
 	int i, err;
 
 	if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
@@ -749,8 +741,8 @@  static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (vmd->msix_count < 0)
 		return -ENODEV;
 
-	vmd->msix_count = pci_alloc_irq_vectors(dev, 1, vmd->msix_count,
-					PCI_IRQ_MSIX);
+	vmd->msix_count = pci_alloc_irq_vectors_affinity(dev, 1, vmd->msix_count,
+					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
 	if (vmd->msix_count < 0)
 		return vmd->msix_count;