
[PATCHv2,3/4] PCI/MSI: Handle vector reduce and retry

Message ID 20190103225033.11249-4-keith.busch@intel.com
State Not Applicable
Delegated to: Bjorn Helgaas
Series NVMe IRQ sets fixups

Commit Message

Keith Busch Jan. 3, 2019, 10:50 p.m. UTC
The struct irq_affinity nr_sets field forced the driver to handle reducing
the vector count on allocation failures because the set distribution counts
are driver specific. The change to this API requires very different usage
than before and introduces new error corner cases that weren't being
handled. It is also less efficient: the driver doesn't actually know what
vector count it should use because it only sees the error code, so it can
only reduce the count by one at a time instead of going straight to a
possible vector count the way PCI is able to do.

Provide a driver specific callback for managed irq set creation so that
PCI can take a min and max vectors as before to handle the reduce and
retry logic.

The usage is not particularly obvious for this new feature, so append
documentation for driver usage.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
 drivers/pci/msi.c               | 20 ++++++--------------
 include/linux/interrupt.h       |  5 +++++
 3 files changed, 46 insertions(+), 15 deletions(-)
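
To make the intended usage concrete, here is a minimal sketch of a driver
using the callback, assuming two managed sets and an arbitrary even-split
reduction policy; the names my_recalc_sets()/my_setup_irqs() and the set
sizes are made up, only the struct fields and pci_alloc_irq_vectors_affinity()
come from the interface itself:

  #include <linux/interrupt.h>
  #include <linux/pci.h>

  /* Made-up policy: split whatever PCI can actually allocate evenly across
   * the two sets, so that their sum always matches the new vector count. */
  static void my_recalc_sets(struct irq_affinity *affd, unsigned int nvecs)
  {
          affd->sets[0] = nvecs / 2;
          affd->sets[1] = nvecs - affd->sets[0];
  }

  static int my_setup_irqs(struct pci_dev *pdev)
  {
          static int my_sets[2] = { 4, 4 };       /* desired vectors per set */
          struct irq_affinity affd = {
                  .nr_sets     = 2,
                  .sets        = my_sets,
                  .recalc_sets = my_recalc_sets,
          };

          /* a minvec of 2 keeps at least one vector in each set on a reduce */
          return pci_alloc_irq_vectors_affinity(pdev, 2, 8,
                          PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
  }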

Comments

Ming Lei Jan. 4, 2019, 2:45 a.m. UTC | #1
On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> The struct irq_affinity nr_sets forced the driver to handle reducing the
> vector count on allocation failures because the set distribution counts
> are driver specific. The change to this API requires very different usage
> than before, and introduced new error corner cases that weren't being
> handled. It is also less efficient since the driver doesn't actually
> know what a proper vector count it should use since it only sees the
> error code and can only reduce by one instead of going straight to a
> possible vector count like PCI is able to do.
> 
> Provide a driver specific callback for managed irq set creation so that
> PCI can take a min and max vectors as before to handle the reduce and
> retry logic.
> 
> The usage is not particularly obvious for this new feature, so append
> documentation for driver usage.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi.c               | 20 ++++++--------------
>  include/linux/interrupt.h       |  5 +++++
>  3 files changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 618e13d5e276..391b1f369138 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -98,7 +98,41 @@ The flags argument is used to specify which type of interrupt can be used
>  by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
>  A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
>  any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
> -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
> +pci_alloc_irq_vectors() will spread the interrupts around the available
> +CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
> +managed by the kernel, and are not tunable from user space like other
> +vectors.
> +
> +When your driver requires a more complex vector affinity configuration
> +than a default spread of all vectors, the driver may use the following
> +function:
> +
> +  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> +				   unsigned int max_vecs, unsigned int flags,
> +				   const struct irq_affinity *affd);
> +
> +The 'struct irq_affinity *affd' allows a driver to specify additional
> +characteristics for how a driver wants the vector management to occur. The
> +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> +wants to not participate in kernel managed affinities, and whether those
> +special vectors are at the beginning or the end of the vector space.
> +
> +It may also be the case that a driver wants multiple sets of fully
> +affinitized vectors. For example, a single PCI function may provide
> +different high performance services that want full CPU affinity for each
> +service independent of other services. In this case, the driver may use
> +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> +vectors need to be spread across all the CPUs, and fill in the 'sets'
> +array to say how many vectors the driver wants in each set.
> +
> +When using multiple affinity 'sets', the error handling for vector
> +reduction and retry becomes more complicated since the PCI core
> +doesn't know how to redistribute the vector count across the sets. In
> +order to provide this error handling, the driver must also provide the
> +'recalc_sets()' callback and set the 'priv' data needed for the driver
> +specific vector distribution. The driver's callback is responsible to
> +ensure the sum of the vector counts across its sets matches the new
> +vector count that PCI can allocate.
>  
>  To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
>  vectors, use the following function:
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 7a1c8a09efa5..b93ac49be18d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msi_enabled))
>  		return -EINVAL;
>  
> @@ -1061,6 +1054,9 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +

->recalc_sets() should only be called after msi_capability_init() has made
sure that at least 'minvec' vectors are available.

>  		rc = msi_capability_init(dev, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> @@ -1093,13 +1089,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * supported vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msix_enabled))
>  		return -EINVAL;
>  
> @@ -1110,6 +1099,9 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +

->recalc_sets() should only be called after __pci_enable_msix() has made
sure that at least 'minvec' vectors are available.

Thanks,
Ming
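
To make the ordering concern easier to see, below is a condensed (not
verbatim) view of the __pci_enable_msi_range() retry loop as it would look
with this patch applied; setup and error handling outside the loop are
elided, and the comments mark where the callback runs relative to the
allocation that can still fail:

  for (;;) {
          if (affd) {
                  nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
                  if (nvec < minvec)
                          return -ENOSPC;
          }

          /* The patch invokes the callback here, with a candidate 'nvec'
           * that has not been proven allocatable yet ... */
          if (nvec != maxvec && affd && affd->recalc_sets)
                  affd->recalc_sets((struct irq_affinity *)affd, nvec);

          /* ... while msi_capability_init() below can still fail, i.e. it
           * is not yet known that at least 'minvec' vectors are available,
           * which is the ordering Ming objects to above. */
          rc = msi_capability_init(dev, nvec, affd);
          if (rc == 0)
                  return nvec;
          if (rc < 0)
                  return rc;
          if (rc < minvec)
                  return -ENOSPC;
          nvec = rc;
  }
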
Bjorn Helgaas Jan. 4, 2019, 10:35 p.m. UTC | #2
On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> The struct irq_affinity nr_sets forced the driver to handle reducing the
> vector count on allocation failures because the set distribution counts
> are driver specific. The change to this API requires very different usage
> than before, and introduced new error corner cases that weren't being
> handled. It is also less efficient since the driver doesn't actually
> know what a proper vector count it should use since it only sees the
> error code and can only reduce by one instead of going straight to a
> possible vector count like PCI is able to do.
> 
> Provide a driver specific callback for managed irq set creation so that
> PCI can take a min and max vectors as before to handle the reduce and
> retry logic.
> 
> The usage is not particularly obvious for this new feature, so append
> documentation for driver usage.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt | 36 +++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi.c               | 20 ++++++--------------
>  include/linux/interrupt.h       |  5 +++++
>  3 files changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 618e13d5e276..391b1f369138 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -98,7 +98,41 @@ The flags argument is used to specify which type of interrupt can be used
>  by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
>  A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
>  any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
> -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
> +pci_alloc_irq_vectors() will spread the interrupts around the available
> +CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
> +managed by the kernel, and are not tunable from user space like other
> +vectors.
> +
> +When your driver requires a more complex vector affinity configuration
> +than a default spread of all vectors, the driver may use the following
> +function:
> +
> +  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> +				   unsigned int max_vecs, unsigned int flags,
> +				   const struct irq_affinity *affd);
> +
> +The 'struct irq_affinity *affd' allows a driver to specify additional
> +characteristics for how a driver wants the vector management to occur. The
> +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> +wants to not participate in kernel managed affinities, and whether those
> +special vectors are at the beginning or the end of the vector space.

How are the pre_vectors and post_vectors handled?  Do they get
assigned to random CPUs?  Current CPU?  Are their assignments tunable
from user space?

> +It may also be the case that a driver wants multiple sets of fully
> +affinitized vectors. For example, a single PCI function may provide
> +different high performance services that want full CPU affinity for each
> +service independent of other services. In this case, the driver may use
> +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> +vectors need to be spread across all the CPUs, and fill in the 'sets'
> +array to say how many vectors the driver wants in each set.

I think the issue here is IRQ vectors, and "services" and whether
they're high performance are unnecessary concepts.

What does irq_affinity.sets point to?  I guess it's a table of
integers where the table size is the number of sets and each entry is
the number of vectors in the set?

So we'd have something like this:

  pre_vectors     # vectors [0..pre_vectors) (pre_vectors >= 0)
  set 0           # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
  set 1           # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
  ...
  post_vectors    # vectors [pre_vectors+set0+...+setN..pre_vectors+set0+...+setN+post_vectors)

where the vectors in set0 are spread across all CPUs, those in set1
are independently spread across all CPUs, etc?

I would guess there may be device-specific restrictions on the mapping
of these vectors to sets, so the PCI core probably can't assume the
sets can be of arbitrary size, contiguous, etc.
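
As a concrete, made-up example of that layout, with pre_vectors = 1, two
sets of 3 and 5 vectors, and post_vectors = 1:

  int sets[2] = { 3, 5 };
  struct irq_affinity affd = {
          .pre_vectors  = 1,
          .post_vectors = 1,
          .nr_sets      = 2,
          .sets         = sets,
  };

  /* The 10 vectors would then be laid out as:
   *   vector  0      pre_vectors   (not kernel-managed)
   *   vectors 1..3   set 0         (spread across all CPUs)
   *   vectors 4..8   set 1         (independently spread across all CPUs)
   *   vector  9      post_vectors  (not kernel-managed)
   */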

> +When using multiple affinity 'sets', the error handling for vector
> +reduction and retry becomes more complicated since the PCI core
> +doesn't know how to redistribute the vector count across the sets. In
> +order to provide this error handling, the driver must also provide the
> +'recalc_sets()' callback and set the 'priv' data needed for the driver
> +specific vector distribution. The driver's callback is responsible to
> +ensure the sum of the vector counts across its sets matches the new
> +vector count that PCI can allocate.
>  
>  To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
>  vectors, use the following function:
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 7a1c8a09efa5..b93ac49be18d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msi_enabled))
>  		return -EINVAL;
>  
> @@ -1061,6 +1054,9 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +
>  		rc = msi_capability_init(dev, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> @@ -1093,13 +1089,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	/*
> -	 * If the caller is passing in sets, we can't support a range of
> -	 * supported vectors. The caller needs to handle that.
> -	 */
> -	if (affd && affd->nr_sets && minvec != maxvec)
> -		return -EINVAL;
> -
>  	if (WARN_ON_ONCE(dev->msix_enabled))
>  		return -EINVAL;
>  
> @@ -1110,6 +1099,9 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> +		if (nvec != maxvec && affd && affd->recalc_sets)
> +			affd->recalc_sets((struct irq_affinity *)affd, nvec);
> +
>  		rc = __pci_enable_msix(dev, entries, nvec, affd);
>  		if (rc == 0)
>  			return nvec;
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..01c06829ff43 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -249,12 +249,17 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @nr_sets:		Length of passed in *sets array
>   * @sets:		Number of affinitized sets
> + * @recalc_sets:	Recalculate sets if the previously requested allocation
> + *			failed
> + * @priv:		Driver private data
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
>  	int	nr_sets;
>  	int	*sets;
> +	void	(*recalc_sets)(struct irq_affinity *, unsigned int);
> +	void	*priv;
>  };
>  
>  /**
> -- 
> 2.14.4
>
Keith Busch Jan. 4, 2019, 10:56 p.m. UTC | #3
On Fri, Jan 04, 2019 at 04:35:31PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> > +The 'struct irq_affinity *affd' allows a driver to specify additional
> > +characteristics for how a driver wants the vector management to occur. The
> > +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> > +wants to not participate in kernel managed affinities, and whether those
> > +special vectors are at the beginning or the end of the vector space.
> 
> How are the pre_vectors and post_vectors handled?  Do they get
> assigned to random CPUs?  Current CPU?  Are their assignments tunable
> from user space?

Point taken. Those do get assigned a default mask, but they are also
user tunable and kernel migratable when CPUs offline/online.
 
> > +It may also be the case that a driver wants multiple sets of fully
> > +affinitized vectors. For example, a single PCI function may provide
> > +different high performance services that want full CPU affinity for each
> > +service independent of other services. In this case, the driver may use
> > +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> > +vectors need to be spread across all the CPUs, and fill in the 'sets'
> > +array to say how many vectors the driver wants in each set.
> 
> I think the issue here is IRQ vectors, and "services" and whether
> they're high performance are unnecessary concepts.

It's really intended for when your device has resources that are optimally
accessed in a per-cpu manner. I can rephrase the description to make that
clearer.

> What does irq_affinity.sets point to?  I guess it's a table of
> integers where the table size is the number of sets and each entry is
> the number of vectors in the set?
>
> So we'd have something like this:
> 
>   pre_vectors     # vectors [0..pre_vectors) (pre_vectors >= 0)
>   set 0           # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
>   set 1           # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
>   ...
>   post_vectors    # vectors [pre_vectors+set0+...+setN..pre_vectors+set0+...+setN+post_vectors)
> 
> where the vectors in set0 are spread across all CPUs, those in set1
> are independently spread across all CPUs, etc?
>
> I would guess there may be device-specific restrictions on the mapping
> of these vectors to sets, so the PCI core probably can't assume the
> sets can be of arbitrary size, contiguous, etc.

I think it's fair to say the caller wants vectors allocated and each set
affinitized contiguously such that each set starts after the previous
one ends. That works great with how NVMe wants to use it, at least. If
there is really any other way a device driver wants it, I can't see how
that can be easily accommodated.
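
A short sketch of how a driver would rely on that contiguous layout,
assuming an NVMe-style split of one un-managed admin vector followed by a
"read" set and a "write" set; the queue counts, variable names and 'pdev'
are placeholders, only the struct fields and pci_irq_vector() are existing
interfaces:

  unsigned int nr_read_queues = 4, nr_write_queues = 4;  /* made-up sizes */
  int sets[2] = { nr_read_queues, nr_write_queues };
  struct irq_affinity affd = {
          .pre_vectors = 1,               /* un-managed admin vector */
          .nr_sets     = 2,
          .sets        = sets,
  };

  /* After pci_alloc_irq_vectors_affinity(pdev, ..., &affd) succeeds, each
   * set starts where the previous one ends, so the base vector index of
   * every region is simple offset arithmetic: */
  unsigned int first_read_vec  = affd.pre_vectors;             /* 1 */
  unsigned int first_write_vec = affd.pre_vectors + sets[0];   /* 1 + nr_read_queues */

  /* Linux IRQ number of, e.g., the first write-queue vector: */
  int irq = pci_irq_vector(pdev, first_write_vec);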

Patch

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 618e13d5e276..391b1f369138 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -98,7 +98,41 @@  The flags argument is used to specify which type of interrupt can be used
 by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
 A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
 any possible kind of interrupt.  If the PCI_IRQ_AFFINITY flag is set,
-pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
+pci_alloc_irq_vectors() will spread the interrupts around the available
+CPUs. Vector affinities allocated under the PCI_IRQ_AFFINITY flag are
+managed by the kernel, and are not tunable from user space like other
+vectors.
+
+When your driver requires a more complex vector affinity configuration
+than a default spread of all vectors, the driver may use the following
+function:
+
+  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
+				   unsigned int max_vecs, unsigned int flags,
+				   const struct irq_affinity *affd);
+
+The 'struct irq_affinity *affd' allows a driver to specify additional
+characteristics for how a driver wants the vector management to occur. The
+'pre_vectors' and 'post_vectors' fields define how many vectors the driver
+wants to not participate in kernel managed affinities, and whether those
+special vectors are at the beginning or the end of the vector space.
+
+It may also be the case that a driver wants multiple sets of fully
+affinitized vectors. For example, a single PCI function may provide
+different high performance services that want full CPU affinity for each
+service independent of other services. In this case, the driver may use
+the struct irq_affinity's 'nr_sets' field to specify how many groups of
+vectors need to be spread across all the CPUs, and fill in the 'sets'
+array to say how many vectors the driver wants in each set.
+
+When using multiple affinity 'sets', the error handling for vector
+reduction and retry becomes more complicated since the PCI core
+doesn't know how to redistribute the vector count across the sets. In
+order to provide this error handling, the driver must also provide the
+'recalc_sets()' callback and set the 'priv' data needed for the driver
+specific vector distribution. The driver's callback is responsible to
+ensure the sum of the vector counts across its sets matches the new
+vector count that PCI can allocate.
 
 To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
 vectors, use the following function:
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 7a1c8a09efa5..b93ac49be18d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1035,13 +1035,6 @@  static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 	if (maxvec < minvec)
 		return -ERANGE;
 
-	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * vectors. The caller needs to handle that.
-	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(dev->msi_enabled))
 		return -EINVAL;
 
@@ -1061,6 +1054,9 @@  static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 				return -ENOSPC;
 		}
 
+		if (nvec != maxvec && affd && affd->recalc_sets)
+			affd->recalc_sets((struct irq_affinity *)affd, nvec);
+
 		rc = msi_capability_init(dev, nvec, affd);
 		if (rc == 0)
 			return nvec;
@@ -1093,13 +1089,6 @@  static int __pci_enable_msix_range(struct pci_dev *dev,
 	if (maxvec < minvec)
 		return -ERANGE;
 
-	/*
-	 * If the caller is passing in sets, we can't support a range of
-	 * supported vectors. The caller needs to handle that.
-	 */
-	if (affd && affd->nr_sets && minvec != maxvec)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(dev->msix_enabled))
 		return -EINVAL;
 
@@ -1110,6 +1099,9 @@  static int __pci_enable_msix_range(struct pci_dev *dev,
 				return -ENOSPC;
 		}
 
+		if (nvec != maxvec && affd && affd->recalc_sets)
+			affd->recalc_sets((struct irq_affinity *)affd, nvec);
+
 		rc = __pci_enable_msix(dev, entries, nvec, affd);
 		if (rc == 0)
 			return nvec;
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..01c06829ff43 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -249,12 +249,17 @@  struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @nr_sets:		Length of passed in *sets array
  * @sets:		Number of affinitized sets
+ * @recalc_sets:	Recalculate sets if the previously requested allocation
+ *			failed
+ * @priv:		Driver private data
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
 	int	nr_sets;
 	int	*sets;
+	void	(*recalc_sets)(struct irq_affinity *, unsigned int);
+	void	*priv;
 };
 
 /**