diff mbox

[1/3] PCI/MSI: Add pci_enable_msi_partial()

Message ID 4fef62a2e647a7c38e9f2a1ea4244b3506a85e2b.1402405331.git.agordeev@redhat.com
State Rejected
Headers show

Commit Message

Alexander Gordeev June 10, 2014, 1:10 p.m. UTC
There are PCI devices that require a particular value written
to the Multiple Message Enable (MME) register while aligned on
power of 2 boundary value of actually used MSI vectors 'nvec'
is a lesser of that MME value:

	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However the existing pci_enable_msi_block() interface is not
able to configure such devices, since the value written to the
MME register is calculated from the number of requested MSIs
'nvec':

	'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the result written to the MME register may not
satisfy the aforementioned PCI devices requirement and therefore
the PCI functions will not operate in a desired mode.

This update introduces pci_enable_msi_partial() extension to
pci_enable_msi_block() interface that accepts extra 'nvec_mme'
argument which is then written to MME register while the value
of 'nvec' is still used to setup as many interrupts as requested.

As result of this change, architecture-specific callbacks
arch_msi_check_device() and arch_setup_msi_irqs() get an extra
'nvec_mme' parameter as well, but it is ignored for now.
Therefore, this update is a placeholder for architectures that
wish to support pci_enable_msi_partial() function in the future.

Cc: linux-doc@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
 arch/mips/pci/msi-octeon.c      |    2 +-
 arch/powerpc/kernel/msi.c       |    4 +-
 arch/s390/pci/pci.c             |    2 +-
 arch/x86/kernel/x86_init.c      |    2 +-
 drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
 include/linux/msi.h             |    5 +-
 include/linux/pci.h             |    3 +
 8 files changed, 115 insertions(+), 22 deletions(-)

Comments

Alexander Gordeev June 23, 2014, 8:11 p.m. UTC | #1
Hi Bjorn,

Any feedback?

Thanks!
Bjorn Helgaas July 2, 2014, 8:22 p.m. UTC | #2
On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> There are PCI devices that require a particular value written
> to the Multiple Message Enable (MME) register while aligned on
> power of 2 boundary value of actually used MSI vectors 'nvec'
> is a lesser of that MME value:
> 
> 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> 
> However the existing pci_enable_msi_block() interface is not
> able to configure such devices, since the value written to the
> MME register is calculated from the number of requested MSIs
> 'nvec':
> 
> 	'Multiple Message Enable' = roundup_pow_of_two(nvec)

For MSI, software learns how many vectors a device requests by reading
the Multiple Message Capable (MMC) field.  This field is encoded, so a
device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
for a device to request 3 vectors; it would have to round up that up
to a power of two and request 4 vectors.

Software writes similarly encoded values to MME to tell the device how
many vectors have been allocated for its use.  For example, it's
impossible to tell the device that it can use 3 vectors; the OS has to
round that up and tell the device it can use 4 vectors.

So if I understand correctly, the point of this series is to take
advantage of device-specific knowledge, e.g., the device requests 4
vectors via MMC, but we "know" the device is only capable of using 3.
Moreover, we tell the device via MME that 4 vectors are available, but
we've only actually set up 3 of them.

This makes me uneasy because we're lying to the device, and the device
is perfectly within spec to use all 4 of those vectors.  If anything
changes the number of vectors the device uses (new device revision,
firmware upgrade, etc.), this is liable to break.

Can you quantify the benefit of this?  Can't a device already use
MSI-X to request exactly the number of vectors it can use?  (I know
not all devices support MSI-X, but maybe we should just accept MSI for
what it is and encourage the HW guys to use MSI-X if MSI isn't good
enough.)

> In this case the result written to the MME register may not
> satisfy the aforementioned PCI devices requirement and therefore
> the PCI functions will not operate in a desired mode.

I'm not sure what you mean by "will not operate in a desired mode."
I thought this was an optimization to save vectors and that these
changes would be completely invisible to the hardware.

Bjorn

> This update introduces pci_enable_msi_partial() extension to
> pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> argument which is then written to MME register while the value
> of 'nvec' is still used to setup as many interrupts as requested.
> 
> As result of this change, architecture-specific callbacks
> arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> 'nvec_mme' parameter as well, but it is ignored for now.
> Therefore, this update is a placeholder for architectures that
> wish to support pci_enable_msi_partial() function in the future.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-mips@linux-mips.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: x86@kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: iommu@lists.linux-foundation.org
> Cc: linux-ide@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
>  arch/mips/pci/msi-octeon.c      |    2 +-
>  arch/powerpc/kernel/msi.c       |    4 +-
>  arch/s390/pci/pci.c             |    2 +-
>  arch/x86/kernel/x86_init.c      |    2 +-
>  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
>  include/linux/msi.h             |    5 +-
>  include/linux/pci.h             |    3 +
>  8 files changed, 115 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 10a9369..c8a8503 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
>  returns zero in case of success, which indicates MSI interrupts have been
>  successfully allocated.
>  
> -4.2.4 pci_disable_msi
> +4.2.4 pci_enable_msi_partial
> +
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +
> +This variation on pci_enable_msi_exact() call allows a device driver to
> +setup 'nvec_mme' number of multiple MSIs with the PCI function, while
> +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
> +in operating system. The MSI specification only allows 'nvec_mme' to be
> +allocated in powers of two, up to a maximum of 2^5 (32).
> +
> +This function could be used when a PCI function is known to send 'nvec'
> +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
> +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
> +do not waste system resources.
> +
> +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> +interrupts and setting up 'nvec' interrupts. In this case, the function
> +enables MSI on this device and updates dev->irq to be the lowest of the
> +new interrupts assigned to it.  The other interrupts assigned to the
> +device are in the range dev->irq to dev->irq + nvec - 1.
> +
> +If this function returns a negative number, it indicates an error and
> +the driver should not attempt to request any more MSI interrupts for
> +this device.
> +
> +4.2.5 pci_disable_msi
>  
>  void pci_disable_msi(struct pci_dev *dev)
>  
> -This function should be used to undo the effect of pci_enable_msi_range().
> -Calling it restores dev->irq to the pin-based interrupt number and frees
> -the previously allocated MSIs.  The interrupts may subsequently be assigned
> -to another device, so drivers should not cache the value of dev->irq.
> +This function should be used to undo the effect of pci_enable_msi_range()
> +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> +interrupt number and frees the previously allocated MSIs.  The interrupts
> +may subsequently be assigned to another device, so drivers should not cache
> +the value of dev->irq.
>  
>  Before calling this function, a device driver must always call free_irq()
>  on any interrupt for which it previously called request_irq().
> diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> index 2b91b0e..2be7979 100644
> --- a/arch/mips/pci/msi-octeon.c
> +++ b/arch/mips/pci/msi-octeon.c
> @@ -178,7 +178,7 @@ msi_irq_allocated:
>  	return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> index 8bbc12d..c60aee3 100644
> --- a/arch/powerpc/kernel/msi.c
> +++ b/arch/powerpc/kernel/msi.c
> @@ -13,7 +13,7 @@
>  
>  #include <asm/machdep.h>
>  
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
>  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
>          return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return ppc_md.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 9ddc51e..3cf38a8 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
>  	}
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
>  {
>  	struct zpci_dev *zdev = get_zdev(pdev);
>  	unsigned int hwirq, msi_vecs;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index e48b674..b65bf95 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
>  };
>  
>  /* MSI arch specific hooks */
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return x86_msi.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 27a7e67..0410d9b 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
>  	chip->teardown_irq(chip, irq);
>  }
>  
> -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_msi_check_device(struct pci_dev *dev,
> +				 int nvec, int nvec_mme, int type)
>  {
>  	struct msi_chip *chip = dev->bus->msi;
>  
> @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  	return chip->check_device(chip, dev, nvec, type);
>  }
>  
> -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> +			       int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -598,6 +600,7 @@ error_attrs:
>   * msi_capability_init - configure device's MSI capability structure
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: number of interrupts to allocate
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
>   *
>   * Setup the MSI capability structure of the device with the requested
>   * number of interrupts.  A return value of zero indicates the successful
> @@ -605,7 +608,7 @@ error_attrs:
>   * an error, and a positive return value indicates the number of interrupts
>   * which could have been allocated.
>   */
> -static int msi_capability_init(struct pci_dev *dev, int nvec)
> +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
>  	list_add_tail(&entry->list, &dev->msi_list);
>  
>  	/* Configure MSI capability structure */
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
>  	if (ret) {
>  		msi_mask_irq(entry, mask, ~mask);
>  		free_msi_irqs(dev);
> @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
>  	if (ret)
>  		return ret;
>  
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (ret)
>  		goto out_avail;
>  
> @@ -812,13 +816,15 @@ out_free:
>   * pci_msi_check_device - check whether MSI may be enabled on a device
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: how many MSIs have been requested ?
> + * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
>   * @type: are we checking for MSI or MSI-X ?
>   *
>   * Look at global flags, the device itself, and its parent buses
>   * to determine if MSI/-X are supported for the device. If MSI/-X is
>   * supported return 0, else return an error code.
>   **/
> -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +static int pci_msi_check_device(struct pci_dev *dev,
> +				int nvec, int nvec_mme, int type)
>  {
>  	struct pci_bus *bus;
>  	int ret;
> @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
>  			return -EINVAL;
>  
> -	ret = arch_msi_check_device(dev, nvec, type);
> +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
>  	if (ret)
>  		return ret;
>  
> @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msi_vec_count);
>  
> +/**
> + * pci_enable_msi_partial - configure device's MSI capability structure
> + * @dev: device to configure
> + * @nvec: number of interrupts to configure
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> + *
> + * This function tries to allocate @nvec number of interrupts while setup
> + * device's Multiple Message Enable register with @nvec_mme interrupts.
> + * It returns a negative errno if an error occurs. If it succeeds, it returns
> + * zero and updates the @dev's irq member to the lowest new interrupt number;
> + * the other interrupt numbers allocated to this device are consecutive.
> + */
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{
> +	int maxvec;
> +	int rc;
> +
> +	if (dev->current_state != PCI_D0)
> +		return -EINVAL;
> +
> +	WARN_ON(!!dev->msi_enabled);
> +
> +	/* Check whether driver already requested MSI-X irqs */
> +	if (dev->msix_enabled) {
> +		dev_info(&dev->dev, "can't enable MSI "
> +			 "(MSI-X already enabled)\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(nvec_mme))
> +		return -EINVAL;
> +	if (nvec > nvec_mme)
> +		return -EINVAL;
> +
> +	maxvec = pci_msi_vec_count(dev);
> +	if (maxvec < 0)
> +		return maxvec;
> +	else if (nvec_mme > maxvec)
> +		return -EINVAL;
> +
> +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	rc = msi_capability_init(dev, nvec, nvec_mme);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(pci_enable_msi_partial);
> +
>  void pci_msi_shutdown(struct pci_dev *dev)
>  {
>  	struct msi_desc *desc;
> @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
>  		return -EINVAL;
>  
> -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (status)
>  		return status;
>  
> @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  		nvec = maxvec;
>  
>  	do {
> -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> +					  PCI_CAP_ID_MSI);
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  	} while (rc);
>  
>  	do {
> -		rc = msi_capability_init(dev, nvec);
> +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 92a2f99..b9f89ee 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -57,9 +57,10 @@ struct msi_desc {
>   */
>  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
>  void arch_teardown_msi_irq(unsigned int irq);
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
>  void arch_teardown_msi_irqs(struct pci_dev *dev);
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> +int arch_msi_check_device(struct pci_dev *dev,
> +			  int nvec, int nvec_mme, int type);
>  void arch_restore_msi_irqs(struct pci_dev *dev);
>  
>  void default_teardown_msi_irqs(struct pci_dev *dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 71d9673..7360bd2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
>  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
>  int pci_msi_enabled(void);
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
>  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
>  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
>  {
> @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
>  static inline int pci_msi_enabled(void) { return 0; }
> +static int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{ return -ENOSYS; }
>  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
>  				       int maxvec)
>  { return -ENOSYS; }
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight July 3, 2014, 9:20 a.m. UTC | #3
From: Bjorn Helgaas

> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:

> > There are PCI devices that require a particular value written

> > to the Multiple Message Enable (MME) register while aligned on

> > power of 2 boundary value of actually used MSI vectors 'nvec'

> > is a lesser of that MME value:

> >

> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

> >

> > However the existing pci_enable_msi_block() interface is not

> > able to configure such devices, since the value written to the

> > MME register is calculated from the number of requested MSIs

> > 'nvec':

> >

> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)

> 

> For MSI, software learns how many vectors a device requests by reading

> the Multiple Message Capable (MMC) field.  This field is encoded, so a

> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible

> for a device to request 3 vectors; it would have to round up that up

> to a power of two and request 4 vectors.

> 

> Software writes similarly encoded values to MME to tell the device how

> many vectors have been allocated for its use.  For example, it's

> impossible to tell the device that it can use 3 vectors; the OS has to

> round that up and tell the device it can use 4 vectors.

> 

> So if I understand correctly, the point of this series is to take

> advantage of device-specific knowledge, e.g., the device requests 4

> vectors via MMC, but we "know" the device is only capable of using 3.

> Moreover, we tell the device via MME that 4 vectors are available, but

> we've only actually set up 3 of them.

...

Even if you do that, you ought to write valid interrupt information
into the 4th slot (maybe replicating one of the earlier interrupts).
Then, if the device does raise the 'unexpected' interrupt you don't
get a write to a random kernel location.

Plausibly something similar should be done when a smaller number of
interrupts is assigned.

	David
Alexander Gordeev July 4, 2014, 8:57 a.m. UTC | #4
On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device committed via non-MSI specific means to send only 3 vectors
out of 4 available why should we expect it to send 4? The probability of
a firmware sending 4/4 vectors in this case is equal to the probability
of sending 5/4 or 16/4, with the very same reason - a bug in the firmware.
Moreover, even vector 4/4 would be unexpected by the device driver, though
it is perfectly within the spec.

As of new device revision or firmware update etc. - it is just yet another
case of device driver vs the firmware match/mismatch. Not including this
change does not help here at all IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

A Intel AHCI chipset requires 16 vectors written to MME while advertises
(via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
in device's fallback to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > 
> > As result of this change, architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-mips@linux-mips.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: x86@kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linux-ide@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() call allows a device driver to
> > +setup 'nvec_mme' number of multiple MSIs with the PCI function, while
> > +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
> > +in operating system. The MSI specification only allows 'nvec_mme' to be
> > +allocated in powers of two, up to a maximum of 2^5 (32).
> > +
> > +This function could be used when a PCI function is known to send 'nvec'
> > +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
> > +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
> > +do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >  	return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >  	}
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct zpci_dev *zdev = get_zdev(pdev);
> >  	unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >  	chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +				 int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  	return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +			       int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
> >  	list_add_tail(&entry->list, &dev->msi_list);
> >  
> >  	/* Configure MSI capability structure */
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >  	if (ret) {
> >  		msi_mask_irq(entry, mask, ~mask);
> >  		free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (ret)
> >  		goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +				int nvec, int nvec_mme, int type)
> >  {
> >  	struct pci_bus *bus;
> >  	int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >  			return -EINVAL;
> >  
> > -	ret = arch_msi_check_device(dev, nvec, type);
> > +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> > + *
> > + * This function tries to allocate @nvec number of interrupts while setup
> > + * device's Multiple Message Enable register with @nvec_mme interrupts.
> > + * It returns a negative errno if an error occurs. If it succeeds, it returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +	int maxvec;
> > +	int rc;
> > +
> > +	if (dev->current_state != PCI_D0)
> > +		return -EINVAL;
> > +
> > +	WARN_ON(!!dev->msi_enabled);
> > +
> > +	/* Check whether driver already requested MSI-X irqs */
> > +	if (dev->msix_enabled) {
> > +		dev_info(&dev->dev, "can't enable MSI "
> > +			 "(MSI-X already enabled)\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!is_power_of_2(nvec_mme))
> > +		return -EINVAL;
> > +	if (nvec > nvec_mme)
> > +		return -EINVAL;
> > +
> > +	maxvec = pci_msi_vec_count(dev);
> > +	if (maxvec < 0)
> > +		return maxvec;
> > +	else if (nvec_mme > maxvec)
> > +		return -EINVAL;
> > +
> > +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	rc = msi_capability_init(dev, nvec, nvec_mme);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >  	struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> >  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >  		return -EINVAL;
> >  
> > -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (status)
> >  		return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  		nvec = maxvec;
> >  
> >  	do {
> > -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +					  PCI_CAP_ID_MSI);
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  	} while (rc);
> >  
> >  	do {
> > -		rc = msi_capability_init(dev, nvec);
> > +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +			  int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >  				       int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev July 4, 2014, 8:58 a.m. UTC | #5
On Thu, Jul 03, 2014 at 09:20:52AM +0000, David Laight wrote:
> From: Bjorn Helgaas
> > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > > There are PCI devices that require a particular value written
> > > to the Multiple Message Enable (MME) register while aligned on
> > > power of 2 boundary value of actually used MSI vectors 'nvec'
> > > is a lesser of that MME value:
> > >
> > > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > >
> > > However the existing pci_enable_msi_block() interface is not
> > > able to configure such devices, since the value written to the
> > > MME register is calculated from the number of requested MSIs
> > > 'nvec':
> > >
> > > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> > 
> > For MSI, software learns how many vectors a device requests by reading
> > the Multiple Message Capable (MMC) field.  This field is encoded, so a
> > device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> > for a device to request 3 vectors; it would have to round up that up
> > to a power of two and request 4 vectors.
> > 
> > Software writes similarly encoded values to MME to tell the device how
> > many vectors have been allocated for its use.  For example, it's
> > impossible to tell the device that it can use 3 vectors; the OS has to
> > round that up and tell the device it can use 4 vectors.
> > 
> > So if I understand correctly, the point of this series is to take
> > advantage of device-specific knowledge, e.g., the device requests 4
> > vectors via MMC, but we "know" the device is only capable of using 3.
> > Moreover, we tell the device via MME that 4 vectors are available, but
> > we've only actually set up 3 of them.
> ...
> 
> Even if you do that, you ought to write valid interrupt information
> into the 4th slot (maybe replicating one of the earlier interrupts).
> Then, if the device does raise the 'unexpected' interrupt you don't
> get a write to a random kernel location.

I might be missing something, but we are talking of MSI address space
here, aren't we? I am not getting how we could end up with a 'write'
to a random kernel location when a unclaimed MSI vector sent. We could
only expect a spurious interrupt at worst, which is handled and reported.

Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

> Plausibly something similar should be done when a smaller number of
> interrupts is assigned.
> 
> 	David
>
David Laight July 4, 2014, 9:11 a.m. UTC | #6
From: Alexander Gordeev
...
> > Even if you do that, you ought to write valid interrupt information
> > into the 4th slot (maybe replicating one of the earlier interrupts).
> > Then, if the device does raise the 'unexpected' interrupt you don't
> > get a write to a random kernel location.
> 
> I might be missing something, but we are talking of MSI address space
> here, aren't we? I am not getting how we could end up with a 'write'
> to a random kernel location when a unclaimed MSI vector sent. We could
> only expect a spurious interrupt at worst, which is handled and reported.
> 
> Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

I'm thinking of the following - which might be MSI-X ?
1) Hardware requests some interrupts and tells the host the BAR (and offset)
   where the 'vectors' should be written.
2) To raise an interrupt the hardware uses the 'vector' as the address
   of a normal PCIe write cycle.

So if the hardware requests 4 interrupts, but the driver (believing it
will only use 3) only write 3 vectors, and then the hardware uses the
4th vector it can write to a random location.

Debugging that would be hard!

	David



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev July 4, 2014, 9:54 a.m. UTC | #7
On Fri, Jul 04, 2014 at 09:11:50AM +0000, David Laight wrote:
> > I might be missing something, but we are talking of MSI address space
> > here, aren't we? I am not getting how we could end up with a 'write'
> > to a random kernel location when a unclaimed MSI vector sent. We could
> > only expect a spurious interrupt at worst, which is handled and reported.
> > 
> > Anyway, as I described in my reply to Bjorn, this is not a concern IMO.
> 
> I'm thinking of the following - which might be MSI-X ?
> 1) Hardware requests some interrupts and tells the host the BAR (and offset)
>    where the 'vectors' should be written.
> 2) To raise an interrupt the hardware uses the 'vector' as the address
>    of a normal PCIe write cycle.
> 
> So if the hardware requests 4 interrupts, but the driver (believing it
> will only use 3) only write 3 vectors, and then the hardware uses the
> 4th vector it can write to a random location.
> 
> Debugging that would be hard!

MSI base address is kind of hardcoded for a platform. A combination of
MSI base address, PCI function number and MSI vector makes a PCI host to
raise interrupt on a CPU. I might be inaccurate in details, but the scenario
you described is impossible AFAICT.

> 	David
> 
> 
>
Bjorn Helgaas July 7, 2014, 7:26 p.m. UTC | #8
On Fri, Jul 4, 2014 at 2:58 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Thu, Jul 03, 2014 at 09:20:52AM +0000, David Laight wrote:
>> From: Bjorn Helgaas
>> > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
>> > > There are PCI devices that require a particular value written
>> > > to the Multiple Message Enable (MME) register while aligned on
>> > > power of 2 boundary value of actually used MSI vectors 'nvec'
>> > > is a lesser of that MME value:
>> > >
>> > >   roundup_pow_of_two(nvec) < 'Multiple Message Enable'
>> > >
>> > > However the existing pci_enable_msi_block() interface is not
>> > > able to configure such devices, since the value written to the
>> > > MME register is calculated from the number of requested MSIs
>> > > 'nvec':
>> > >
>> > >   'Multiple Message Enable' = roundup_pow_of_two(nvec)
>> >
>> > For MSI, software learns how many vectors a device requests by reading
>> > the Multiple Message Capable (MMC) field.  This field is encoded, so a
>> > device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
>> > for a device to request 3 vectors; it would have to round up that up
>> > to a power of two and request 4 vectors.
>> >
>> > Software writes similarly encoded values to MME to tell the device how
>> > many vectors have been allocated for its use.  For example, it's
>> > impossible to tell the device that it can use 3 vectors; the OS has to
>> > round that up and tell the device it can use 4 vectors.
>> >
>> > So if I understand correctly, the point of this series is to take
>> > advantage of device-specific knowledge, e.g., the device requests 4
>> > vectors via MMC, but we "know" the device is only capable of using 3.
>> > Moreover, we tell the device via MME that 4 vectors are available, but
>> > we've only actually set up 3 of them.
>> ...
>>
>> Even if you do that, you ought to write valid interrupt information
>> into the 4th slot (maybe replicating one of the earlier interrupts).
>> Then, if the device does raise the 'unexpected' interrupt you don't
>> get a write to a random kernel location.
>
> I might be missing something, but we are talking of MSI address space
> here, aren't we? I am not getting how we could end up with a 'write'
> to a random kernel location when a unclaimed MSI vector sent. We could
> only expect a spurious interrupt at worst, which is handled and reported.

Yes, that's how I understand it.  With MSI, the OS specifies the a
single Message Address, e.g., a LAPIC address, and a single Message
Data value, e.g., a vector number that will be written to the LAPIC.
The device is permitted to modify some low-order bits of the Message
Data to send one of several vector numbers (the MME value tells the
device how many bits it can modify).

Bottom line, I think a spurious interrupt is the failure we'd expect
if a device used more vectors than the OS expects it to.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas July 7, 2014, 7:40 p.m. UTC | #9
On Fri, Jul 4, 2014 at 2:57 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
>> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
>> > There are PCI devices that require a particular value written
>> > to the Multiple Message Enable (MME) register while aligned on
>> > power of 2 boundary value of actually used MSI vectors 'nvec'
>> > is a lesser of that MME value:
>> >
>> >     roundup_pow_of_two(nvec) < 'Multiple Message Enable'
>> >
>> > However the existing pci_enable_msi_block() interface is not
>> > able to configure such devices, since the value written to the
>> > MME register is calculated from the number of requested MSIs
>> > 'nvec':
>> >
>> >     'Multiple Message Enable' = roundup_pow_of_two(nvec)
>>
>> For MSI, software learns how many vectors a device requests by reading
>> the Multiple Message Capable (MMC) field.  This field is encoded, so a
>> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
>> for a device to request 3 vectors; it would have to round up that up
>> to a power of two and request 4 vectors.
>>
>> Software writes similarly encoded values to MME to tell the device how
>> many vectors have been allocated for its use.  For example, it's
>> impossible to tell the device that it can use 3 vectors; the OS has to
>> round that up and tell the device it can use 4 vectors.
>
> Nod.
>
>> So if I understand correctly, the point of this series is to take
>> advantage of device-specific knowledge, e.g., the device requests 4
>> vectors via MMC, but we "know" the device is only capable of using 3.
>> Moreover, we tell the device via MME that 4 vectors are available, but
>> we've only actually set up 3 of them.
>
> Exactly.
>
>> This makes me uneasy because we're lying to the device, and the device
>> is perfectly within spec to use all 4 of those vectors.  If anything
>> changes the number of vectors the device uses (new device revision,
>> firmware upgrade, etc.), this is liable to break.
>
> If a device committed via non-MSI specific means to send only 3 vectors
> out of 4 available why should we expect it to send 4? The probability of
> a firmware sending 4/4 vectors in this case is equal to the probability
> of sending 5/4 or 16/4, with the very same reason - a bug in the firmware.
> Moreover, even vector 4/4 would be unexpected by the device driver, though
> it is perfectly within the spec.
>
> As of new device revision or firmware update etc. - it is just yet another
> case of device driver vs the firmware match/mismatch. Not including this
> change does not help here at all IMHO.
>
>> Can you quantify the benefit of this?  Can't a device already use
>> MSI-X to request exactly the number of vectors it can use?  (I know
>
> A Intel AHCI chipset requires 16 vectors written to MME while advertises
> (via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
> in device's fallback to 1 (!).

Is the fact that it uses only 6 vectors documented in the public spec?

Is this a chipset erratum?  Are there newer versions of the chipset
that fix this, e.g., by requesting 8 vectors and using 6, or by also
supporting MSI-X?

I know this conserves vector numbers.  What does that mean in real
user-visible terms?  Are there systems that won't boot because of this
issue, and this patch fixes them?  Does it enable bigger
configurations, e.g., more I/O devices, than before?

Do you know how Windows handles this?  Does it have a similar interface?

As you can tell, I'm a little skeptical about this.  It's a fairly big
change, it affects the arch interface, it seems to be targeted for
only a single chipset (though it's widely used), and we already
support a standard solution (MSI-X, reducing the number of vectors
requested, or even operating with 1 vector).

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev July 7, 2014, 8:42 p.m. UTC | #10
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
> As you can tell, I'm a little skeptical about this.  It's a fairly big
> change, it affects the arch interface, it seems to be targeted for
> only a single chipset (though it's widely used), and we already
> support a standard solution (MSI-X, reducing the number of vectors
> requested, or even operating with 1 vector).

Bjorn,

I surely understand your concerns. I am answering this "summary"
question right away.

Even though an extra parameter is introduced, functionally this update
is rather small. It is only the new pci_enable_msi_partial() function
that could exploit a custom 'nvec_mme' parameter. By contrast, existing
pci_enable_msi_range() function (and therefore all device drivers) is
unaffected - it just rounds up 'nvec' to the nearest power of two and
continues exactly as it has been. All archs besides x86 just ignore it.
And x86 change is fairly small as well - all necessary functionality is
already in.

Thus, at the moment it is only AHCI of concern. And no, AHCI can not do MSI-X..

Thanks!

> Bjorn
Michael Ellerman July 8, 2014, 4:01 a.m. UTC | #11
On Wed, 2014-07-02 at 14:22 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.
> 
> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.
> 
> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

It also adds more complexity into the already complex MSI API, across all
architectures, all so a single Intel chipset can save a couple of MSIs. That
seems like the wrong trade off to me.

cheers


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight July 8, 2014, 8:33 a.m. UTC | #12
From: Bjorn Helgaas

...
> >> Even if you do that, you ought to write valid interrupt information

> >> into the 4th slot (maybe replicating one of the earlier interrupts).

> >> Then, if the device does raise the 'unexpected' interrupt you don't

> >> get a write to a random kernel location.

> >

> > I might be missing something, but we are talking of MSI address space

> > here, aren't we? I am not getting how we could end up with a 'write'

> > to a random kernel location when a unclaimed MSI vector sent. We could

> > only expect a spurious interrupt at worst, which is handled and reported.

> 

> Yes, that's how I understand it.  With MSI, the OS specifies the a

> single Message Address, e.g., a LAPIC address, and a single Message

> Data value, e.g., a vector number that will be written to the LAPIC.

> The device is permitted to modify some low-order bits of the Message

> Data to send one of several vector numbers (the MME value tells the

> device how many bits it can modify).

> 

> Bottom line, I think a spurious interrupt is the failure we'd expect

> if a device used more vectors than the OS expects it to.


So you need to tell the device where to write in order to raise the
'spurious interrupt'.

	David
Alexander Gordeev July 8, 2014, 12:26 p.m. UTC | #13
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
> >> Can you quantify the benefit of this?  Can't a device already use
> >> MSI-X to request exactly the number of vectors it can use?  (I know
> >
> > A Intel AHCI chipset requires 16 vectors written to MME while advertises
> > (via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
> > in device's fallback to 1 (!).
> 
> Is the fact that it uses only 6 vectors documented in the public spec?

Yes, it is documented in ICH specs.

> Is this a chipset erratum?  Are there newer versions of the chipset
> that fix this, e.g., by requesting 8 vectors and using 6, or by also
> supporting MSI-X?

No, this is not an erratum. The value of 8 vectors is reserved and could
cause undefined results if used.

> I know this conserves vector numbers.  What does that mean in real
> user-visible terms?  Are there systems that won't boot because of this
> issue, and this patch fixes them?  Does it enable bigger
> configurations, e.g., more I/O devices, than before?

Visibly, it ceases logging messages ('ahci 0000:00:1f.2: irq 107 for
MSI/MSI-X') for IRQs that are not shown in /proc/interrupts later.

No, it does not enable/fix any existing hardware issue I am aware of.
It just saves a couple of interrupt vectors, as Michael put it (10/16
to be precise). However, interrupt vectors space is pretty much scarce
resource on x86 and a risk of exhausting the vectors (and introducing
quota i.e) has already been raised AFAIR.

> Do you know how Windows handles this?  Does it have a similar interface?

Have no clue, TBH. Can try to investigate if you see it helpful.

> As you can tell, I'm a little skeptical about this.  It's a fairly big
> change, it affects the arch interface, it seems to be targeted for
> only a single chipset (though it's widely used), and we already
> support a standard solution (MSI-X, reducing the number of vectors
> requested, or even operating with 1 vector).

I also do not like the fact the arch interface is getting complicated,
so I happily leave it to your judgement ;) Well, it is low-level and
hidden from drivers at least.

Thanks!

> Bjorn
Bjorn Helgaas July 9, 2014, 4:06 p.m. UTC | #14
On Tue, Jul 8, 2014 at 6:26 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
>> >> Can you quantify the benefit of this?  Can't a device already use
>> >> MSI-X to request exactly the number of vectors it can use?  (I know
>> >
>> > A Intel AHCI chipset requires 16 vectors written to MME while advertises
>> > (via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
>> > in device's fallback to 1 (!).
>>
>> Is the fact that it uses only 6 vectors documented in the public spec?
>
> Yes, it is documented in ICH specs.

Out of curiosity, do you have a pointer to this?  It looks like it
uses one vector per port, and I'm wondering if the reason it requests
16 is because there's some possibility of a part with more than 8
ports.

>> Is this a chipset erratum?  Are there newer versions of the chipset
>> that fix this, e.g., by requesting 8 vectors and using 6, or by also
>> supporting MSI-X?
>
> No, this is not an erratum. The value of 8 vectors is reserved and could
> cause undefined results if used.

As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
(requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
undefined results, I'd say that's a chipset defect.

>> I know this conserves vector numbers.  What does that mean in real
>> user-visible terms?  Are there systems that won't boot because of this
>> issue, and this patch fixes them?  Does it enable bigger
>> configurations, e.g., more I/O devices, than before?
>
> Visibly, it ceases logging messages ('ahci 0000:00:1f.2: irq 107 for
> MSI/MSI-X') for IRQs that are not shown in /proc/interrupts later.
>
> No, it does not enable/fix any existing hardware issue I am aware of.
> It just saves a couple of interrupt vectors, as Michael put it (10/16
> to be precise). However, interrupt vectors space is pretty much scarce
> resource on x86 and a risk of exhausting the vectors (and introducing
> quota i.e) has already been raised AFAIR.

I'm not too concerned about the logging issue.  If necessary, we could
tweak that message somehow.

Interrupt vector space is the issue I would worry about, but I think
I'm going to put this on the back burner until it actually becomes a
problem.

>> Do you know how Windows handles this?  Does it have a similar interface?
>
> Have no clue, TBH. Can try to investigate if you see it helpful.

No, don't worry about investigating.  I was just curious because if
Windows *did* support something like this, that would be an indication
that there's a significant problem here and we might need to solve it,
too.  But it sounds like we can safely ignore it for now.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev July 10, 2014, 10:11 a.m. UTC | #15
On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
> Out of curiosity, do you have a pointer to this?  It looks like it

I.e. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27

> uses one vector per port, and I'm wondering if the reason it requests
> 16 is because there's some possibility of a part with more than 8
> ports.

I doubt that is the reason. The only allowed MME values (powers of two)
are 0b000, 0b001, 0b010 and 0b100. As you can see, only one bit is used -
I would speculate it suits nicely to some hardware logic.

BTW, apart from AHCI, it seems the reason MSI is not going to disappear
(in a decade at least) is it is way cheaper to implement than MSI-X.

> > No, this is not an erratum. The value of 8 vectors is reserved and could
> > cause undefined results if used.
> 
> As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
> (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
> 16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
> undefined results, I'd say that's a chipset defect.

Well, the PCI spec does not prevent devices to have their own specs on top
of it. Undefined results are meant on the device side here. On the MSI side
these results are likely perfectly within the PCI spec. I feel speaking as
a lawer here ;)

> Interrupt vector space is the issue I would worry about, but I think
> I'm going to put this on the back burner until it actually becomes a
> problem.

I plan to try get rid of arch_msi_check_device() hook. Should I repost
this series afterwards?

Thanks!
Bjorn Helgaas July 10, 2014, 5:02 p.m. UTC | #16
On Thu, Jul 10, 2014 at 4:11 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
>> Out of curiosity, do you have a pointer to this?  It looks like it
>
> I.e. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27
>
>> uses one vector per port, and I'm wondering if the reason it requests
>> 16 is because there's some possibility of a part with more than 8
>> ports.
>
> I doubt that is the reason. The only allowed MME values (powers of two)
> are 0b000, 0b001, 0b010 and 0b100. As you can see, only one bit is used -
> I would speculate it suits nicely to some hardware logic.
>
> BTW, apart from AHCI, it seems the reason MSI is not going to disappear
> (in a decade at least) is it is way cheaper to implement than MSI-X.
>
>> > No, this is not an erratum. The value of 8 vectors is reserved and could
>> > cause undefined results if used.
>>
>> As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
>> (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
>> 16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
>> undefined results, I'd say that's a chipset defect.
>
> Well, the PCI spec does not prevent devices to have their own specs on top
> of it. Undefined results are meant on the device side here. On the MSI side
> these results are likely perfectly within the PCI spec. I feel speaking as
> a lawer here ;)

I disagree about this part.  The reason MSI is in the PCI spec is so
the OS can have generic support for it without having to put
device-specific support in every driver.  The PCI spec is clear that
the OS can allocate any number of vectors less than or equal to the
number requested via MMC.  The SATA device requests 16, and it should
be perfectly legal for the OS to give it 8.

It's interesting that the ICH10 spec (sec 14.1.27, thanks for the
reference) says MMC 100b means "8 MSI Capable".  That smells like a
hardware bug.  The PCI spec says:

  000 => 1 vector
  001 => 2 vectors
  010 => 4 vectors
  011 => 8 vectors
  100 => 16 vectors

The ICH10 spec seems to think 100 means 8 vectors (not 16 as the PCI
spec says), and that would fit with the rest of the ICH10 MME info.
If ICH10 was built assuming this table:

  000 => 1 vector
  001 => 2 vectors
  010 => 4 vectors
  100 => 8 vectors

then everything makes sense: the device requests 8 vectors, and the
behavior is defined in all possible MME cases (1, 2, 4, or 8 vectors
assigned).  The "Values '011b' to '111b' are reserved" part is still
slightly wrong, because the 100b value is in that range but is not
reserved, but that's a tangent.

So my guess (speculation, I admit) is that the intent was for ICH SATA
to request only 8 vectors, but because of this error, it requests 16.
Maybe some early MSI proposal used a different encoding for MMC and
MME, and ICH was originally designed using that.

>> Interrupt vector space is the issue I would worry about, but I think
>> I'm going to put this on the back burner until it actually becomes a
>> problem.
>
> I plan to try get rid of arch_msi_check_device() hook. Should I repost
> this series afterwards?

Honestly, I'm still not inclined to pursue this because of the API
complication and lack of concrete benefit.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 10a9369..c8a8503 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -195,14 +195,40 @@  By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
 returns zero in case of success, which indicates MSI interrupts have been
 successfully allocated.
 
-4.2.4 pci_disable_msi
+4.2.4 pci_enable_msi_partial
+
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+
+This variation on pci_enable_msi_exact() call allows a device driver to
+setup 'nvec_mme' number of multiple MSIs with the PCI function, while
+setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
+in operating system. The MSI specification only allows 'nvec_mme' to be
+allocated in powers of two, up to a maximum of 2^5 (32).
+
+This function could be used when a PCI function is known to send 'nvec'
+MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
+initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
+do not waste system resources.
+
+If this function returns 0, it has succeeded in allocating 'nvec_mme'
+interrupts and setting up 'nvec' interrupts. In this case, the function
+enables MSI on this device and updates dev->irq to be the lowest of the
+new interrupts assigned to it.  The other interrupts assigned to the
+device are in the range dev->irq to dev->irq + nvec - 1.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+4.2.5 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range()
+or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
+interrupt number and frees the previously allocated MSIs.  The interrupts
+may subsequently be assigned to another device, so drivers should not cache
+the value of dev->irq.
 
 Before calling this function, a device driver must always call free_irq()
 on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index 2b91b0e..2be7979 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -178,7 +178,7 @@  msi_irq_allocated:
 	return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..c60aee3 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@ 
 
 #include <asm/machdep.h>
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
 		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -32,7 +32,7 @@  int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
         return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9ddc51e..3cf38a8 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -398,7 +398,7 @@  static void zpci_irq_handler(struct airq_struct *airq)
 	}
 }
 
-int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
 {
 	struct zpci_dev *zdev = get_zdev(pdev);
 	unsigned int hwirq, msi_vecs;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index e48b674..b65bf95 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -121,7 +121,7 @@  struct x86_msi_ops x86_msi = {
 };
 
 /* MSI arch specific hooks */
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return x86_msi.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 27a7e67..0410d9b 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,7 +56,8 @@  void __weak arch_teardown_msi_irq(unsigned int irq)
 	chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
+int __weak arch_msi_check_device(struct pci_dev *dev,
+				 int nvec, int nvec_mme, int type)
 {
 	struct msi_chip *chip = dev->bus->msi;
 
@@ -66,7 +67,8 @@  int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
 	return chip->check_device(chip, dev, nvec, type);
 }
 
-int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int __weak arch_setup_msi_irqs(struct pci_dev *dev,
+			       int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -598,6 +600,7 @@  error_attrs:
  * msi_capability_init - configure device's MSI capability structure
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: number of interrupts to allocate
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
  *
  * Setup the MSI capability structure of the device with the requested
  * number of interrupts.  A return value of zero indicates the successful
@@ -605,7 +608,7 @@  error_attrs:
  * an error, and a positive return value indicates the number of interrupts
  * which could have been allocated.
  */
-static int msi_capability_init(struct pci_dev *dev, int nvec)
+static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -640,7 +643,7 @@  static int msi_capability_init(struct pci_dev *dev, int nvec)
 	list_add_tail(&entry->list, &dev->msi_list);
 
 	/* Configure MSI capability structure */
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
 	if (ret) {
 		msi_mask_irq(entry, mask, ~mask);
 		free_msi_irqs(dev);
@@ -758,7 +761,8 @@  static int msix_capability_init(struct pci_dev *dev,
 	if (ret)
 		return ret;
 
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
+	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_avail;
 
@@ -812,13 +816,15 @@  out_free:
  * pci_msi_check_device - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
+ * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
  * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_check_device(struct pci_dev *dev,
+				int nvec, int nvec_mme, int type)
 {
 	struct pci_bus *bus;
 	int ret;
@@ -846,7 +852,7 @@  static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
 		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
 			return -EINVAL;
 
-	ret = arch_msi_check_device(dev, nvec, type);
+	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
 	if (ret)
 		return ret;
 
@@ -878,6 +884,62 @@  int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
+/**
+ * pci_enable_msi_partial - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
+ *
+ * This function tries to allocate @nvec number of interrupts while setup
+ * device's Multiple Message Enable register with @nvec_mme interrupts.
+ * It returns a negative errno if an error occurs. If it succeeds, it returns
+ * zero and updates the @dev's irq member to the lowest new interrupt number;
+ * the other interrupt numbers allocated to this device are consecutive.
+ */
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{
+	int maxvec;
+	int rc;
+
+	if (dev->current_state != PCI_D0)
+		return -EINVAL;
+
+	WARN_ON(!!dev->msi_enabled);
+
+	/* Check whether driver already requested MSI-X irqs */
+	if (dev->msix_enabled) {
+		dev_info(&dev->dev, "can't enable MSI "
+			 "(MSI-X already enabled)\n");
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(nvec_mme))
+		return -EINVAL;
+	if (nvec > nvec_mme)
+		return -EINVAL;
+
+	maxvec = pci_msi_vec_count(dev);
+	if (maxvec < 0)
+		return maxvec;
+	else if (nvec_mme > maxvec)
+		return -EINVAL;
+
+	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	rc = msi_capability_init(dev, nvec, nvec_mme);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	return 0;
+}
+EXPORT_SYMBOL(pci_enable_msi_partial);
+
 void pci_msi_shutdown(struct pci_dev *dev)
 {
 	struct msi_desc *desc;
@@ -957,7 +1019,7 @@  int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
 	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
 		return -EINVAL;
 
-	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (status)
 		return status;
 
@@ -1110,7 +1172,8 @@  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 		nvec = maxvec;
 
 	do {
-		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
+					  PCI_CAP_ID_MSI);
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
@@ -1121,7 +1184,7 @@  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 	} while (rc);
 
 	do {
-		rc = msi_capability_init(dev, nvec);
+		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 92a2f99..b9f89ee 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -57,9 +57,10 @@  struct msi_desc {
  */
 int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_check_device(struct pci_dev *dev,
+			  int nvec, int nvec_mme, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 71d9673..7360bd2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1184,6 +1184,7 @@  void pci_disable_msix(struct pci_dev *dev);
 void msi_remove_pci_irq_vectors(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
 static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
 {
@@ -1215,6 +1216,8 @@  static inline void pci_disable_msix(struct pci_dev *dev) { }
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
 static inline void pci_restore_msi_state(struct pci_dev *dev) { }
 static inline int pci_msi_enabled(void) { return 0; }
+static int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{ return -ENOSYS; }
 static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
 				       int maxvec)
 { return -ENOSYS; }