Patchwork [v3,-tip,x86/apic,1/2] PCI/MSI: Allocate as many multiple-MSIs as requested

Submitter Alexander Gordeev
Date May 13, 2013, 9:05 a.m.
Message ID <8575dc590b819892f366852fe50835efaf579f4f.1368431413.git.agordeev@redhat.com>
Permalink /patch/243337/
State Accepted

Comments

Alexander Gordeev - May 13, 2013, 9:05 a.m.
When multiple MSIs are enabled with pci_enable_msi_block(), the
requested number of interrupts 'nvec' is rounded up to the nearest
power-of-two value. The result is then used for setting up the
number of MSI messages in the PCI device and allocation of
interrupt resources in the operating system (i.e. vector numbers).
Thus, in cases when a device driver requests some number of MSIs
and this number is not a power-of-two value, the extra operating
system resources (allocated as the result of rounding) are wasted.
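
As a rough illustration of the rounding described above, the following
user-space sketch (not kernel code) computes how many vectors end up unused
when a request is rounded up to a power of two. The helper names
round_up_pow2() and wasted_vectors() are made up for this example, and the
1..32 range assumes the usual multiple-MSI limit of 32 messages per function.

```c
#include <assert.h>

/*
 * Round n up to the nearest power of two.
 * Hypothetical user-space helper, valid for 1 <= n <= 32.
 */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1 && n <= 32);
	while (p < n)
		p <<= 1;
	return p;
}

/* Vectors that would be allocated beyond what the driver asked for. */
static unsigned int wasted_vectors(unsigned int nvec)
{
	return round_up_pow2(nvec) - nvec;
}
```

For a request of 6 vectors the rounded value is 8, so 2 vectors would be
allocated without ever being used.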

This fix introduces 'msi_desc::nvec' field to address the above
issue. When non-zero, it will report the actual number of MSIs the
device will send, as requested by the device driver. This value
should be used by architectures to properly set up and tear down
associated interrupt resources.
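
A minimal user-space model of the "when non-zero" rule may help here. It
assumes a simplified structure, struct msi_desc_model, that stands in for the
real 'msi_desc' (where 'multiple' actually lives in 'msi_attrib'); the
function name vectors_in_use() is likewise invented for this sketch.

```c
#include <assert.h>

/* Simplified stand-in for the two relevant msi_desc fields. */
struct msi_desc_model {
	unsigned int multiple;	/* log2 of the number of messages set up in the device */
	unsigned int nvec;	/* MSIs actually requested by the driver; 0 means "not set" */
};

/*
 * Number of vectors to set up or tear down for this entry:
 * prefer the exact count when it is known, otherwise fall back
 * to the power-of-two count implied by 'multiple'.
 */
static unsigned int vectors_in_use(const struct msi_desc_model *entry)
{
	assert(entry != 0);
	if (entry->nvec)
		return entry->nvec;
	return 1u << entry->multiple;
}
```

With multiple == 3 and nvec == 6 this yields 6; with nvec left at zero it
falls back to 1 << 3, that is 8.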

Note, although the existing 'msi_desc::multiple' field might seem
redundant, in fact it is not. In the general case the number of MSIs a
PCI device is initialized with is not necessarily the closest power-
of-two value of the number of MSIs the device will send. Thus, in
theory it would not be always possible to derive the former from the
latter and we need to keep them both, to stress this corner case.
Besides, since 'msi_desc::multiple' is a bitfield, throwing it out
would not save us any space.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 drivers/pci/msi.c   |   10 ++++++++--
 include/linux/msi.h |    1 +
 2 files changed, 9 insertions(+), 2 deletions(-)
Ingo Molnar - May 28, 2013, 9:50 a.m.
* Alexander Gordeev <agordeev@redhat.com> wrote:

> When multiple MSIs are enabled with pci_enable_msi_block(), the
> requested number of interrupts 'nvec' is rounded up to the nearest
> power-of-two value. The result is then used for setting up the
> number of MSI messages in the PCI device and allocation of
> interrupt resources in the operating system (i.e. vector numbers).
> Thus, in cases when a device driver requests some number of MSIs
> and this number is not a power-of-two value, the extra operating
> system resources (allocated as the result of rounding) are wasted.
> 
> This fix introduces 'msi_desc::nvec' field to address the above
> issue. When non-zero, it will report the actual number of MSIs the
> device will send, as requested by the device driver. This value
> should be used by architectures to properly set up and tear down
> associated interrupt resources.
> 
> Note, although the existing 'msi_desc::multiple' field might seem
> redundant, in fact it is not. In the general case the number of MSIs a
> PCI device is initialized with is not necessarily the closest power-
> of-two value of the number of MSIs the device will send. Thus, in
> theory it would not be always possible to derive the former from the
> latter and we need to keep them both, to stress this corner case.
> Besides, since 'msi_desc::multiple' is a bitfield, throwing it out
> would not save us any space.
> 
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>

Would be nice to have an Acked-by from Bjorn for this patch.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Siewior - June 5, 2013, 8:56 p.m.
On Mon, May 13, 2013 at 11:05:48AM +0200, Alexander Gordeev wrote:
> Note, although the existing 'msi_desc::multiple' field might seem
> redundant, in fact it is not. In the general case the number of MSIs a
> PCI device is initialized with is not necessarily the closest power-
> of-two value of the number of MSIs the device will send. Thus, in
> theory it would not be always possible to derive the former from the
> latter and we need to keep them both, to stress this corner case.
> Besides, since 'msi_desc::multiple' is a bitfield, throwing it out
> would not save us any space.

The last paragraph makes me curious. The only place where 'multiple' is set is
in do_setup_msi_irqs() and this uses the next power of two for it. And since a
device is not enabled twice, it is not overridden.
So it should be possible to compute 'multiple' out of 'nvec' but it saves
cycles not to do so. I agree to keep 'multiple' but your argument does not
seem to make sense.
While nitpicking, 'nvec' might deserve a better comment than 'number of
messages' since it holds the number of allocated interrupts. :)

Sebastian
Bjorn Helgaas - June 5, 2013, 9:09 p.m.
On Wed, Jun 5, 2013 at 2:56 PM, Sebastian Andrzej Siewior
<sebastian@breakpoint.cc> wrote:
> On Mon, May 13, 2013 at 11:05:48AM +0200, Alexander Gordeev wrote:
>> Note, although the existing 'msi_desc::multiple' field might seem
>> redundant, in fact it is not. In the general case the number of MSIs a
>> PCI device is initialized with is not necessarily the closest power-
>> of-two value of the number of MSIs the device will send. Thus, in
>> theory it would not be always possible to derive the former from the
>> latter and we need to keep them both, to stress this corner case.
>> Besides, since 'msi_desc::multiple' is a bitfield, throwing it out
>> would not save us any space.
>
> The last paragraph makes me curious. The only place where 'multiple' is set is
> in do_setup_msi_irqs() and this uses the next power of two for it. And since a
> device is not enabled twice, it is not overridden.
> So it should be possible to compute 'multiple' out of 'nvec' but it saves
> cycles not to do so. I agree to keep 'multiple' but your argument does not
> seem to make sense.

Alexander had an example device that advertised 16 vectors, but the
driver knew that it could only generate 6.  That's a case where we
can't compute 'multiple' from 'nvec' (assuming the driver supplies
'nvec == 6').  If we just rounded up to compute 'multiple', I think
we'd compute 8 instead of 16.
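
The arithmetic in this example can be sketched in user space as follows. The
helpers order_from_nvec() and multiple_derivable() are hypothetical names used
only for illustration, and 'multiple' is treated as the log2 of the number of
messages, as in the '1 << entry->msi_attrib.multiple' expression in the patch.

```c
#include <assert.h>

/* Smallest order such that (1 << order) >= nvec, for 1 <= nvec <= 32. */
static unsigned int order_from_nvec(unsigned int nvec)
{
	unsigned int order = 0;

	assert(nvec >= 1 && nvec <= 32);
	while ((1u << order) < nvec)
		order++;
	return order;
}

/*
 * Returns 1 if 'multiple' could be recovered from 'nvec' alone
 * by rounding up, 0 otherwise.
 */
static int multiple_derivable(unsigned int nvec, unsigned int multiple)
{
	return order_from_nvec(nvec) == multiple;
}
```

For nvec == 6 the derived order is 3 (8 messages), which does not match a
device initialized with 16 messages (order 4).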

> While nitpicking, 'nvec' might deserve a better comment than 'number of
> messages' since it holds the number of allocated interrupts. :)

I did change the name 'nvec' to 'nvec_used', which should help a bit.
But I agree that it's still somewhat confusing.

BTW, the patches actually in my tree are at
http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/alexander-msi
(I tweaked this name and some comments slightly).

Bjorn
Sebastian Siewior - June 5, 2013, 9:28 p.m.
-Suresh

On Wed, Jun 05, 2013 at 03:09:34PM -0600, Bjorn Helgaas wrote:
> 
> Alexander had an example device that advertised 16 vectors, but the
> driver knew that it could only generate 6.  That's a case where we
> can't compute 'multiple' from 'nvec' (assuming the driver supplies
> 'nvec == 6').  If we just rounded up to compute 'multiple', I think
> we'd compute 8 instead of 16.

Sure, but as I said: the only place where 'multiple' is computed / written
it is doing the round-up thingy.

> > While nitpicking, 'nvec' might deserve a better comment than 'number of
> > messages' since it holds the number of allocated interrupts. :)
> 
> I did change the name 'nvec' to 'nvec_used', which should help a bit.
> But I agree that it's still somewhat confusing.
> 
> BTW, the patches actually in my tree are at
> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/alexander-msi
> (I tweaked this name and some comments slightly).

'nvec_used' is better, but the comment next to it is still wrong, I think.

> Bjorn

Sebastian
Alexander Gordeev - June 6, 2013, 8:30 a.m.
On Wed, Jun 05, 2013 at 10:56:38PM +0200, Sebastian Andrzej Siewior wrote:
> On Mon, May 13, 2013 at 11:05:48AM +0200, Alexander Gordeev wrote:
> > Note, although the existing 'msi_desc::multiple' field might seem
> > redundant, in fact it is not. In the general case the number of MSIs a
> > PCI device is initialized with is not necessarily the closest power-
> > of-two value of the number of MSIs the device will send. Thus, in
> > theory it would not be always possible to derive the former from the
> > latter and we need to keep them both, to stress this corner case.
> > Besides, since 'msi_desc::multiple' is a bitfield, throwing it out
> > would not save us any space.
> 
> The last paragraph makes me curious. The only place where 'multiple' is set is
> in do_setup_msi_irqs() and this uses the next power of two for it. And since a
> device is not enabled twice, it is not overridden.
> So it should be possible to compute 'multiple' out of 'nvec' but it saves
> cycles not to do so. I agree to keep 'multiple' but your argument does not
> seem to make sense.
> While nitpicking, 'nvec' might deserve a better comment than 'number of
> messages' since it holds the number of allocated interrupts. :)

Sebastian,

I re-read my comment a few times and I admit it might be confusing. You are
right - 'multiple' is set by rounding up only. The part '...not necessarily
the closest power-of-two value...' implied an abstract PCI device rather than
the described code, but the wording is less than perfect, indeed. 

In fact, at the moment of writing I kept in mind a follow-up patch that could
help with aforementioned devices. That would be a new interface:

	int pci_enable_msi_block_partial(struct pci_dev *dev,
					 unsigned int nvec_use,
					 unsigned int nvec_init);

In this case 'nvec_use' would go to 'msi_desc::nvec_used' and 'nvec_init'
would translate to 'msi_desc::multiple' in case 'nvec_init' is not zero.
In case 'nvec_init' is zero, 'msi_desc::multiple' would be initialized
with the maximum possible value for the device (the way it is done now for
pci_enable_msi_block_auto() interface). So, for the AHCI device (Bjorn
mentioned) such a call would conserve 10 of 16 vectors:

	pci_enable_msi_block_partial(pdev, 6, 0);
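
To make the intended mapping concrete, here is a user-space model of the
semantics described above. It is only a sketch: pci_enable_msi_block_partial()
is a proposal in this thread, not an existing kernel function, and the names
struct partial_desc_model, order_of() and enable_partial_model(), the explicit
'maxvec' argument, and the error policy are all assumptions made for the
example.

```c
#include <assert.h>

/* Simplified stand-in for the msi_desc fields discussed in this thread. */
struct partial_desc_model {
	unsigned int multiple;	/* log2 of the number of messages set up in the device */
	unsigned int nvec_used;	/* number of vectors the driver will actually use */
};

/* Smallest order such that (1 << order) >= n, for n >= 1. */
static unsigned int order_of(unsigned int n)
{
	unsigned int order = 0;

	assert(n >= 1);
	while ((1u << order) < n)
		order++;
	return order;
}

/*
 * 'nvec_use' goes to nvec_used; 'nvec_init' (or, when it is zero, the
 * device maximum 'maxvec') determines 'multiple'.
 * Returns 0 on success, -1 on invalid input (assumed error policy).
 */
static int enable_partial_model(struct partial_desc_model *desc,
				unsigned int maxvec,
				unsigned int nvec_use,
				unsigned int nvec_init)
{
	unsigned int init = nvec_init ? nvec_init : maxvec;

	if (maxvec < 1 || maxvec > 32 || init > maxvec)
		return -1;
	if (nvec_use < 1 || nvec_use > init)
		return -1;

	desc->nvec_used = nvec_use;
	desc->multiple = order_of(init);
	return 0;
}
```

For a device advertising 16 vectors, enable_partial_model(&desc, 16, 6, 0)
leaves 'multiple' at 4 (16 messages) and 'nvec_used' at 6, so 10 of the 16
vectors need no operating system resources.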

What I am not sure about is whether we need to read out the maximum possible
number of vectors like pci_enable_msi_block_auto() does:

	int pci_enable_msi_block_partial(struct pci_dev *dev,
					 unsigned int nvec_use,
					 unsigned int nvec_init,
					 unsigned int *maxvec);

I can not think of any use of 'maxvec' with this interface, but the second
variant completes the whole picture about a device...

> Sebastian
Sebastian Siewior - June 6, 2013, 7:51 p.m.
On Thu, Jun 06, 2013 at 10:30:20AM +0200, Alexander Gordeev wrote:
> Sebastian,
Hi Alexander,

> I re-read my comment a few times and I admit it might be confusing. You are
> right - 'multiple' is set by rounding up only. The part '...not necessarily
> the closest power-of-two value...' implied an abstract PCI device rather than
> the described code, but the wording is less than perfect, indeed. 

Good, so it is not just me :)

> In fact, at the moment of writing I kept in mind a follow-up patch that could
> help with aforementioned devices. That would be a new interface:
> 
> 	int pci_enable_msi_block_partial(struct pci_dev *dev,
> 					 unsigned int nvec_use,
> 					 unsigned int nvec_init);
> 
> In this case 'nvec_use' would go to 'msi_desc::nvec_used' and 'nvec_init'
> would translate to 'msi_desc::multiple' in case 'nvec_init' is not zero.
> In case 'nvec_init' is zero, 'msi_desc::multiple' would be initialized
> with the maximum possible value for the device (the way it is done now for
> pci_enable_msi_block_auto() interface). So, for the AHCI device (Bjorn
> mentioned) such a call would conserve 10 of 16 vectors:
> 
> 	pci_enable_msi_block_partial(pdev, 6, 0);

Ah okay, that makes sense.

> 
> What I am not sure about is whether we need to read out the maximum possible
> number of vectors like pci_enable_msi_block_auto() does:
> 
> 	int pci_enable_msi_block_partial(struct pci_dev *dev,
> 					 unsigned int nvec_use,
> 					 unsigned int nvec_init,
> 					 unsigned int *maxvec);
> 
> I can not think of any use of 'maxvec' with this interface, but the second
> variant completes the whole picture about a device...
The user of pci_enable_msi_block_auto() does not know how many vectors it will
get, so the argument seems essential there. Your new function, on the other
hand, says exactly how many it requires. Anything less should be an error.

Sebastian

Patch

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 00cc78c7..014b9d5 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -79,7 +79,10 @@  void default_teardown_msi_irqs(struct pci_dev *dev)
 		int i, nvec;
 		if (entry->irq == 0)
 			continue;
-		nvec = 1 << entry->msi_attrib.multiple;
+		if (entry->nvec)
+			nvec = entry->nvec;
+		else
+			nvec = 1 << entry->msi_attrib.multiple;
 		for (i = 0; i < nvec; i++)
 			arch_teardown_msi_irq(entry->irq + i);
 	}
@@ -340,7 +343,10 @@  static void free_msi_irqs(struct pci_dev *dev)
 		int i, nvec;
 		if (!entry->irq)
 			continue;
-		nvec = 1 << entry->msi_attrib.multiple;
+		if (entry->nvec)
+			nvec = entry->nvec;
+		else
+			nvec = 1 << entry->msi_attrib.multiple;
 #ifdef CONFIG_GENERIC_HARDIRQS
 		for (i = 0; i < nvec; i++)
 			BUG_ON(irq_has_action(entry->irq + i));
diff --git a/include/linux/msi.h b/include/linux/msi.h
index ce93a34..0e20dfc 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -35,6 +35,7 @@  struct msi_desc {
 
 	u32 masked;			/* mask bits */
 	unsigned int irq;
+	unsigned int nvec;		/* number of messages */
 	struct list_head list;
 
 	union {