Patchwork [v2,2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

login
register
mail settings
Submitter Alexander Gordeev
Date Sept. 5, 2013, 12:52 p.m.
Message ID <c5ff8fbf905a3df2b75369da725ab5ebf31544ff.1378383792.git.agordeev@redhat.com>
Download mbox | patch
Permalink /patch/272878/
State Superseded
Headers show

Comments

Alexander Gordeev - Sept. 5, 2013, 12:52 p.m.
Factor out the operation of retrieving the number of MSI
vectors the device requested, that is a value encoded in
the Multiple Message Capable register.

This change is a prerequisite for the forthcoming update
of the AHCI device driver.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 Documentation/PCI/MSI-HOWTO.txt |   12 ++++++++++++
 drivers/pci/msi.c               |   30 ++++++++++++++++++++----------
 include/linux/pci.h             |    6 ++++++
 3 files changed, 38 insertions(+), 10 deletions(-)
Tejun Heo - Sept. 5, 2013, 1:09 p.m.
Hello, Alexander.

On Thu, Sep 05, 2013 at 02:52:47PM +0200, Alexander Gordeev wrote:
> -int pci_enable_msi_block_part(struct pci_dev *dev,
> -			      unsigned int nvec, int nvec_mme)
> +int pci_get_msi_cap(struct pci_dev *dev)
>  {
> -	int status, maxvec;
> +	int ret;
>  	u16 msgctl;
>  
>  	if (!dev->msi_cap)
>  		return -EINVAL;
>  
>  	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
> -	maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
> +	ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(pci_get_msi_cap);

One curiosity - with the above factored out, is
pci_enable_msi_block_part() returning positive number still necessary?
I followed most of code paths in x86 and nothing seems to need it and
positive return seems to be just causing confusion - ie. returning 1
on multiple msi config failure from some functions, which is silly.

Thanks.
Alexander Gordeev - Sept. 5, 2013, 3:03 p.m.
On Thu, Sep 05, 2013 at 09:09:02AM -0400, Tejun Heo wrote:
> Hello, Alexander.
> 
> On Thu, Sep 05, 2013 at 02:52:47PM +0200, Alexander Gordeev wrote:
> One curiosity - with the above factored out, is
> pci_enable_msi_block_part() returning positive number still necessary?
> I followed most of code paths in x86 and nothing seems to need it and
> positive return seems to be just causing confusion - ie. returning 1
> on multiple msi config failure from some functions, which is silly.

You mean we could treat positive numbers returned by architecture as
failures and translate it into negative error codes?
If so, I would prefer not to do this for two reasons:
1. It will not be possible to call pci_enable_msi_block_part() in a loop.
   Although there are no consumers right now I think the very possibility
   is better to keep.
2. The semantics of pci_enable_msi_block_part() is very close to
   pci_enable_msi_block(). I believe having a consisting interface for
   these two helps readability.

> Thanks.
> 
> -- 
> tejun
Tejun Heo - Sept. 5, 2013, 3:04 p.m.
Hello,

On Thu, Sep 05, 2013 at 05:03:00PM +0200, Alexander Gordeev wrote:
> You mean we could treat positive numbers returned by architecture as
> failures and translate it into negative error codes?
> If so, I would prefer not to do this for two reasons:
> 1. It will not be possible to call pci_enable_msi_block_part() in a loop.
>    Although there are no consumers right now I think the very possibility
>    is better to keep.
> 2. The semantics of pci_enable_msi_block_part() is very close to
>    pci_enable_msi_block(). I believe having a consisting interface for
>    these two helps readability.

The thing is, do we even have cases where arch code returns positive
return to indicate possible partial allocation?  If not, the whole
interface is convoluted for no good reason and we can just make
everything return 0 or -errno, which is a lot simpler.  No?

Thanks.
Alexander Gordeev - Sept. 5, 2013, 3:40 p.m.
On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> The thing is, do we even have cases where arch code returns positive
> return to indicate possible partial allocation?  If not, the whole
> interface is convoluted for no good reason and we can just make
> everything return 0 or -errno, which is a lot simpler.  No?

As I mentioned in my other note, at least PPC has a concept of MSI quota,
so exceeding it would be the very case, I believe.

> Thanks.
> 
> -- 
> tejun
Tejun Heo - Sept. 5, 2013, 3:44 p.m.
Hello,

On Thu, Sep 05, 2013 at 05:40:42PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> > The thing is, do we even have cases where arch code returns positive
> > return to indicate possible partial allocation?  If not, the whole
> > interface is convoluted for no good reason and we can just make
> > everything return 0 or -errno, which is a lot simpler.  No?
> 
> As I mentioned in my other note, at least PPC has a concept of MSI quota,
> so exceeding it would be the very case, I believe.

Given that multiple MSI is something which isn't too popular / already
superseded and that the condition is highly unlikely, do we really
care about possible partial success?  This sort of interface is
unnecessarily complex and actively harmful.  It forces all users to
wonder what could possibly happen and implement all sorts of nutty
fallback logic which is highly likely to be error-prone on both the
software and hardware side.  Seriously, how much testing would such
code path would get both on the driver and firmware sides?

It's an operation which isn't too likely to fail with a firm
known-to-work fallback.  It's pointless and error-prone to try to
extract the last point zero zero one percent.

Thanks.
Alexander Gordeev - Sept. 5, 2013, 6:54 p.m.
On Thu, Sep 05, 2013 at 11:44:36AM -0400, Tejun Heo wrote:
> Given that multiple MSI is something which isn't too popular / already
> superseded and that the condition is highly unlikely, do we really
> care about possible partial success?  This sort of interface is
> unnecessarily complex and actively harmful.  It forces all users to
> wonder what could possibly happen and implement all sorts of nutty
> fallback logic which is highly likely to be error-prone on both the
> software and hardware side.  Seriously, how much testing would such
> code path would get both on the driver and firmware sides?
> 
> It's an operation which isn't too likely to fail with a firm
> known-to-work fallback.  It's pointless and error-prone to try to
> extract the last point zero zero one percent.

I assume reasons for having this type of interface at the moment of
taking design decision about pci_enable_msi_block() still hold true.
I do not know what those reasons were, but I think the fact multiple
MSIs are rarely used and MSI-X exists does not invalidate them now.

I did consider the other argument - since pci_enable_msi_block_part()
is explicitly provided with a value of MME the caller will not be
satisfied with any other value and hence a repeated call with a lesser
MME does not make sense for the caller. Therefore we could just fail
in case the architecture returned a positive value. Same result, but
different reasoning.

At the moment I still prefer pci_enable_msi_block_part() to be similar
to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
but I would not dare to scrap it all right away.

Bjorn?

> 
> Thanks.
> 
> -- 
> tejun
Tejun Heo - Sept. 5, 2013, 8:06 p.m.
Hello, Alexander.

On Thu, Sep 05, 2013 at 08:54:40PM +0200, Alexander Gordeev wrote:
> I assume reasons for having this type of interface at the moment of
> taking design decision about pci_enable_msi_block() still hold true.
> I do not know what those reasons were, but I think the fact multiple
> MSIs are rarely used and MSI-X exists does not invalidate them now.

Well, it does change the underlying assumptions to make trade-offs
against.  If something is widely used, expected to continue to expand,
additional complexity to achieve better outcome is likely to be more
justifiable.  Nothing exists in vacuum.  That said, I'm not even sure
whether we want this sort of interface even if multiple MSI were still
the hot thing.

> I did consider the other argument - since pci_enable_msi_block_part()
> is explicitly provided with a value of MME the caller will not be
> satisfied with any other value and hence a repeated call with a lesser
> MME does not make sense for the caller. Therefore we could just fail
> in case the architecture returned a positive value. Same result, but
> different reasoning.

Just making the whole thing including arch methods to return 0/-errno
would probably be a lot cleaner.

> At the moment I still prefer pci_enable_msi_block_part() to be similar
> to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
> but I would not dare to scrap it all right away.

Yeah, of course, pci_enable_msi_block() would need to be updated to
match too.  I understand this is going a bit off the original scope of
the patchset but I can't help but cringing at the interface and the
resulting "fallback" logic it ends up creating in its users.  Bjorn,
what do you think?

Thanks.
Alexander Gordeev - Sept. 6, 2013, 1:17 p.m.
On Thu, Sep 05, 2013 at 05:40:41PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> > The thing is, do we even have cases where arch code returns positive
> > return to indicate possible partial allocation?  If not, the whole
> > interface is convoluted for no good reason and we can just make
> > everything return 0 or -errno, which is a lot simpler.  No?
> 
> As I mentioned in my other note, at least PPC has a concept of MSI quota,
> so exceeding it would be the very case, I believe.

An update here. No other arch supports multiple MSIs besides Intel (+IOMMU).
Yet, ironically, the concept of possible partial allocation was adopted from
MSI-X. I briefly checked all archs and it seems none of them throws out a
possible other number of MSIs > 1 (including PPC I mistakenly read).
Bjorn Helgaas - Sept. 6, 2013, 4:01 p.m.
On Thu, Sep 5, 2013 at 2:06 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Alexander.
>
> On Thu, Sep 05, 2013 at 08:54:40PM +0200, Alexander Gordeev wrote:
>> I assume reasons for having this type of interface at the moment of
>> taking design decision about pci_enable_msi_block() still hold true.
>> I do not know what those reasons were, but I think the fact multiple
>> MSIs are rarely used and MSI-X exists does not invalidate them now.
>
> Well, it does change the underlying assumptions to make trade-offs
> against.  If something is widely used, expected to continue to expand,
> additional complexity to achieve better outcome is likely to be more
> justifiable.  Nothing exists in vacuum.  That said, I'm not even sure
> whether we want this sort of interface even if multiple MSI were still
> the hot thing.
>
>> I did consider the other argument - since pci_enable_msi_block_part()
>> is explicitly provided with a value of MME the caller will not be
>> satisfied with any other value and hence a repeated call with a lesser
>> MME does not make sense for the caller. Therefore we could just fail
>> in case the architecture returned a positive value. Same result, but
>> different reasoning.
>
> Just making the whole thing including arch methods to return 0/-errno
> would probably be a lot cleaner.
>
>> At the moment I still prefer pci_enable_msi_block_part() to be similar
>> to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
>> but I would not dare to scrap it all right away.
>
> Yeah, of course, pci_enable_msi_block() would need to be updated to
> match too.  I understand this is going a bit off the original scope of
> the patchset but I can't help but cringing at the interface and the
> resulting "fallback" logic it ends up creating in its users.  Bjorn,
> what do you think?

Sorry, I haven't jumped in here yet because I saw your discussion and
was hoping you guys would figure something out without my help.  It
will take me a few hours to look into this and come up with anything
constructive to say.

I do remember disliking the complicated interface of
pci_enable_msi_block() (return negative errno, return positive "we
might be able to do this" values, or zero), but I'll have to do some
more research before I can say much more than that.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo - Sept. 6, 2013, 4:06 p.m.
Hello, Bjorn.

On Fri, Sep 06, 2013 at 10:01:38AM -0600, Bjorn Helgaas wrote:
> Sorry, I haven't jumped in here yet because I saw your discussion and
> was hoping you guys would figure something out without my help.  It
> will take me a few hours to look into this and come up with anything
> constructive to say.
> 
> I do remember disliking the complicated interface of
> pci_enable_msi_block() (return negative errno, return positive "we
> might be able to do this" values, or zero), but I'll have to do some
> more research before I can say much more than that.

According to Alexander, it doesn't even seem like we have any actual
use case for the positive return numbers.  I say just rip it out and
do the regular 0/-errno all the way through.

Thanks.
Bjorn Helgaas - Sept. 6, 2013, 11:32 p.m.
On Fri, Sep 06, 2013 at 12:06:21PM -0400, Tejun Heo wrote:
> Hello, Bjorn.
> 
> On Fri, Sep 06, 2013 at 10:01:38AM -0600, Bjorn Helgaas wrote:
> > Sorry, I haven't jumped in here yet because I saw your discussion and
> > was hoping you guys would figure something out without my help.  It
> > will take me a few hours to look into this and come up with anything
> > constructive to say.
> > 
> > I do remember disliking the complicated interface of
> > pci_enable_msi_block() (return negative errno, return positive "we
> > might be able to do this" values, or zero), but I'll have to do some
> > more research before I can say much more than that.
> 
> According to Alexander, it doesn't even seem like we have any actual
> use case for the positive return numbers.  I say just rip it out and
> do the regular 0/-errno all the way through.

I agree, that would be much simpler.

I propose that you rework it that way, and at least find out what
(if anything) would break if we do that.  Or maybe we just give up
some optimization; it would be nice to quantify that, too.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 9, 2013, 3:20 p.m.
On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> I propose that you rework it that way, and at least find out what
> (if anything) would break if we do that.  Or maybe we just give up
> some optimization; it would be nice to quantify that, too.

Hi Bjorn,

The series is what it seems a direction to take.

Looks like we need PPC folks to agree on the quota check update
for pSeries (yes, they do bail out with a positive return value
from arch_msi_check_device()):

	quota = msi_quota_for_device(pdev, nvec);

	if (quota && quota < nvec)
		return quota;

	return 0;

> Bjorn
Tejun Heo - Sept. 9, 2013, 3:37 p.m.
Hello,

On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> The series is what it seems a direction to take.

It can definitely use better descriptions explaining why this is
happening but the general direction seems good to me.

> Looks like we need PPC folks to agree on the quota check update
> for pSeries (yes, they do bail out with a positive return value
> from arch_msi_check_device()):
> 
> 	quota = msi_quota_for_device(pdev, nvec);
> 
> 	if (quota && quota < nvec)
> 		return quota;

I think it should return -ENOSPC.

Thanks.
Alexander Gordeev - Sept. 9, 2013, 3:45 p.m.
On Mon, Sep 09, 2013 at 11:37:54AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> > The series is what it seems a direction to take.
> 
> It can definitely use better descriptions explaining why this is
> happening but the general direction seems good to me.

Oh, those are not even tested. Did not want turn to drivers before
the interface/direction generally agreed.

> > Looks like we need PPC folks to agree on the quota check update
> > for pSeries (yes, they do bail out with a positive return value
> > from arch_msi_check_device()):
> > 
> > 	quota = msi_quota_for_device(pdev, nvec);
> > 
> > 	if (quota && quota < nvec)
> > 		return quota;
> 
> I think it should return -ENOSPC.

Absolutely, in patch #2 :)

> Thanks.
> 
> -- 
> tejun
Alexander Gordeev - Sept. 16, 2013, 10:22 a.m.
On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> > I propose that you rework it that way, and at least find out what
> > (if anything) would break if we do that.  Or maybe we just give up
> > some optimization; it would be nice to quantify that, too.
> 
> Hi Bjorn,
> 
> The series is what it seems a direction to take.
> 
> Looks like we need PPC folks to agree on the quota check update
> for pSeries (yes, they do bail out with a positive return value
> from arch_msi_check_device()):

Hi Ben,

An initiative to simplify MSI/MSI-X allocation interface is brewing.
It seems pSeries quota thing is an obstacle. If it could be given up
(patch 2/9).

Thanks!
Michael Ellerman - Sept. 17, 2013, 2:30 p.m.
On Mon, Sep 16, 2013 at 12:22:11PM +0200, Alexander Gordeev wrote:
> On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> > On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> > > I propose that you rework it that way, and at least find out what
> > > (if anything) would break if we do that.  Or maybe we just give up
> > > some optimization; it would be nice to quantify that, too.
> > 
> > Hi Bjorn,
> > 
> > The series is what it seems a direction to take.
> > 
> > Looks like we need PPC folks to agree on the quota check update
> > for pSeries (yes, they do bail out with a positive return value
> > from arch_msi_check_device()):
> 
> Hi Ben,
> 
> An initiative to simplify MSI/MSI-X allocation interface is brewing.
> It seems pSeries quota thing is an obstacle. If it could be given up
> (patch 2/9).

How about no?

We have a small number of MSIs available, limited by hardware &
firmware, if we don't impose a quota then the first device that probes
will get most/all of the MSIs and other devices miss out.

Anyway I don't see what problem you're trying to solve? I agree the
-ve/0/+ve return value pattern is ugly, but it's hardly the end of the
world.

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 18, 2013, 9:48 a.m.
On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> How about no?
> 
> We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out.

Out of curiosity - how pSeries has had done it without quotas before
448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?

> Anyway I don't see what problem you're trying to solve? I agree the
> -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> world.

Well, the interface recently has been re-classified from "ugly" to
"unnecessarily complex and actively harmful" in Tejun's words ;)

Indeed, I checked most of the drivers and it is incredible how people
are creative in misusing the interface: from innocent pci_disable_msix()
calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
if pci_enable_msix() returned a positive value (apparently untested).

Roughly third of the drivers just do not care and bail out once
pci_enable_msix() has not succeeded. Not sure how many of these are
mandated by the hardware.

Another quite common pattern is a call to pci_enable_msix() to figure out
the number of MSI-Xs available and a repeated call of pci_enable_msix()
to enable those MSI-Xs, this time.

The last pattern makes most of sense to me and could be updated with a more
clear sequence - a call to (bit modified) pci_msix_table_size() followed
by a call to pci_enable_msix(). I think this pattern can effectively
supersede the currently recommended "loop" practice.

But as pSeries quota is still required I propose to introduce a new interface
pci_get_msix_limit() that combines pci_msix_table_size() and (also new)
arch_get_msix_limit(). The latter would check the quota thing in case of
pSeries and none in case of all other architectures.

The recommended practice would be:

	/*
	 * Retrieving 'nvec' by means other than pci_msix_table_size()
	 */

	rc = pci_get_msix_limit(pdev);
	if (rc < 0)
		return rc;

	/*
	 * nvec = min(rc, nvec);
	 */

	for (i = 0; i < nvec; i++)
		msix_entry[i].entry = i;

	rc = pci_enable_msix(pdev, msix_entry, nvec);
	if (rc)
		return rc;

Thoughts?

> cheers
Tejun Heo - Sept. 18, 2013, 2:22 p.m.
Hello,

On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> > 
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
> 
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?

Hmmm... do we need to treat this any differently?  If the platform
can't allocate full range of requested MSIs, just failing should be
enough regardless of why such allocation can't be met, no?

> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
> 
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)

LOL. :)

> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).
> 
> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Yeah, I mean, this type of interface is a trap.  People have to
actively resist to avoid doing silly stuff which is a lot to ask.

> 	/*
> 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> 	 */
> 
> 	rc = pci_get_msix_limit(pdev);
> 	if (rc < 0)
> 		return rc;
> 
> 	/*
> 	 * nvec = min(rc, nvec);
> 	 */
> 
> 	for (i = 0; i < nvec; i++)
> 		msix_entry[i].entry = i;
> 
> 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> 	if (rc)
> 		return rc;

I really think what we should do is

* Determine the number of MSIs the controller wants.  Don't worry
  about quotas or limits or anything.  Just determine the number
  necessary to enable enhanced interrupt handling.
	
* Try allocating that number of MSIs.  If it fails, then just revert
  to single interrupt mode.  It's not the end of the world and mostly
  guaranteed to work.  Let's please not even try to do partial
  multiple interrupts.  I really don't think it's worth the risk or
  complexity.

Thanks.
Alexander Gordeev - Sept. 18, 2013, 4:50 p.m.
On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> > 
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> Hmmm... do we need to treat this any differently?  If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?

That depends from what "full range of requested MSIs" is. If that is
a maximum number of MSIs the controller advertised, then no. As MSI
design essentially allows devices to operate in lower-than-maximum
modes it is responsibility of a caller to decide how many vectors to
request. So in case of pSeries I think it is completely legitimate to
request lessers to overcome the platform limitation and let all devices
work.

> I really think what we should do is
> 
> * Determine the number of MSIs the controller wants.  Don't worry
>   about quotas or limits or anything.  Just determine the number
>   necessary to enable enhanced interrupt handling.

Actually, I do not see much contradiction with what I proposed. The
key words here "determine the number of MSIs the controller wants".

In general case it is not what pci_msix_table_size() returns (or at
least we should not limit ourselves to it) - there could be non-
standard means to report number of MSIs: hardcoded, version-dependant,
device-specific registers etc.

Next, if we opt to determine the number of MSIs by non-MSI standard
means then there is no reason not to call pci_get_msix_limit() (or
whatever) at this step.

The question how I see it - do we want pci_get_msix_limit() interface
as part of the MSI framework or do we want it pSeries-specific?

> * Try allocating that number of MSIs.  If it fails, then just revert
>   to single interrupt mode.  It's not the end of the world and mostly
>   guaranteed to work.  Let's please not even try to do partial
>   multiple interrupts.  I really don't think it's worth the risk or
>   complexity.

Being Captain Obvious here, but it is up to the device driver to handle
a failure. There could be no such option as single MSI mode after all :)

> Thanks.
> 
> -- 
> tejun
Alexander Gordeev - Sept. 20, 2013, 8:24 a.m.
Michael et al.

The identifiable options sounded so far were:

* Do not change anything

* Make pci_enable_msix() return 0/-errno

* Make pci_enable_msix() return 0/-errno and introduce arch_get_msix_limit()/
  arch_get_msi_limit()

* Make pci_enable_msix() return 0/-errno and introduce pci_get_msix_limit()/
  pci_get_msi_limit() and arch_get_msix_limit()/arch_get_msi_limit() so that:
	pci_get_msix_limit() = min(arch_get_msix_limit(), pci_msix_table_size())
	pci_get_msi_limit()  = min(arch_get_msi_limit(),  pci_get_msi_cap())

Can we have a conclusion here?

Thanks!
Tejun Heo - Sept. 20, 2013, 12:26 p.m.
Hello,

On Wed, Sep 18, 2013 at 06:50:45PM +0200, Alexander Gordeev wrote:
> Actually, I do not see much contradiction with what I proposed. The
> key words here "determine the number of MSIs the controller wants".
> 
> In general case it is not what pci_msix_table_size() returns (or at
> least we should not limit ourselves to it) - there could be non-
> standard means to report number of MSIs: hardcoded, version-dependant,
> device-specific registers etc.
> 
> Next, if we opt to determine the number of MSIs by non-MSI standard
> means then there is no reason not to call pci_get_msix_limit() (or
> whatever) at this step.

Yeah, that's all fine.  My point is that we shouldn't try to use
"degraded" multiple MSI mode where the number of MSIs allocated is
smaller than performing full multiple MSI operation.  How that number
is determined doesn't really matter but that number is a property
which is solely decided by the device driver, right?  If a device
needs full multiple MSI mode, given specific configuration, it needs
>= X number of MSIs and that's the number it should request.

> Being Captain Obvious here, but it is up to the device driver to handle
> a failure. There could be no such option as single MSI mode after all :)

I don't think there actually is a mainstream device which can't
fallback to single interrupt.  Anyways, the point is the same, let's
please not try to create an interface which encourages complex retry
logic in its users which are likely to involve less traveled and
tested paths in both the driver and firmware.

Thanks.
Tejun Heo - Sept. 20, 2013, 12:27 p.m.
On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> * Make pci_enable_msix() return 0/-errno

My choice would be this one.

Thanks.
Bjorn Helgaas - Sept. 25, 2013, 6:02 p.m.
On Fri, Sep 20, 2013 at 07:27:36AM -0500, Tejun Heo wrote:
> On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> > * Make pci_enable_msix() return 0/-errno
> 
> My choice would be this one.

I agree; it sounds like you've identified several bugs related to the
current confusing interface, so fixing that seems like the first step.

I hope we can avoid adding a plethora of interfaces to address unusual
corner cases.  But if we do the above and it turns out not to be enough,
we can always extend it later.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 25, 2013, 8:58 p.m.
On Wed, Sep 25, 2013 at 12:02:20PM -0600, Bjorn Helgaas wrote:
> On Fri, Sep 20, 2013 at 07:27:36AM -0500, Tejun Heo wrote:
> > On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> > > * Make pci_enable_msix() return 0/-errno
> > 
> > My choice would be this one.
> 
> I agree; it sounds like you've identified several bugs related to the
> current confusing interface, so fixing that seems like the first step.

Yeah, I am trying to. Turns out to be a nice exercise ;)

> I hope we can avoid adding a plethora of interfaces to address unusual
> corner cases.  But if we do the above and it turns out not to be enough,
> we can always extend it later.

Unfortunately, pSeries is a shows-topper here :( It seems we have to
introduce pci_get_msi{,x}_limit() interfaces to honour the quota
thing.  I just hope the hardware set for pSeries is limited and we
won't need to use it for all drivers.

> Bjorn
Tejun Heo - Sept. 25, 2013, 9 p.m.
Hello,

On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> Unfortunately, pSeries is a shows-topper here :( It seems we have to
> introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> thing.  I just hope the hardware set for pSeries is limited and we
> won't need to use it for all drivers.

Can you please go into a bit of detail on that?  Why does it matter?
Is it because you're worried you might cause performance regression by
forcing prevoius partial multiple allocations to single interrupt
operation?

Thanks.
Alexander Gordeev - Sept. 26, 2013, 7:46 a.m.
On Wed, Sep 25, 2013 at 05:00:16PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> > Unfortunately, pSeries is a shows-topper here :( It seems we have to
> > introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> > thing.  I just hope the hardware set for pSeries is limited and we
> > won't need to use it for all drivers.
> 
> Can you please go into a bit of detail on that?  Why does it matter?

Because otherwise we will re-introduce a problem described by Michael:
"We have a small number of MSIs available, limited by hardware &
firmware, if we don't impose a quota then the first device that probes
will get most/all of the MSIs and other devices miss out."

> Is it because you're worried you might cause performance regression by
> forcing prevoius partial multiple allocations to single interrupt
> operation?

Well, not really. I think it won't be possible to force people not to use
partial allocations anyway. Some controllers just do not care how many MSIs
they are configured with. Some drivers derive the number of MSIs desired
from the number of CPUs online - in such cases allocating more MSIs (i.e.
a number the controller advertised) could cause a performance degradation
even.

So when driver authors will know/measure/believe their hardware performs
better with partial allocations they will stand for it. What we can do is
to prevent those try-and-decrease fallbacks.

> Thanks.
> 
> -- 
> tejun
David Laight - Sept. 26, 2013, 8:58 a.m.
> Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
> 
> On Wed, Sep 25, 2013 at 05:00:16PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> > > Unfortunately, pSeries is a shows-topper here :( It seems we have to
> > > introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> > > thing.  I just hope the hardware set for pSeries is limited and we
> > > won't need to use it for all drivers.
> >
> > Can you please go into a bit of detail on that?  Why does it matter?
> 
> Because otherwise we will re-introduce a problem described by Michael:
> "We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out."

Would it be possible to do some kind of 2-stage allocation.
In the first pass the driver would pass a minimum and desired
number of MSI-X interrupts, but not actually be given any.
Interrupts could then be allocated after it is known how many
are required and how many are available.

	David



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 26, 2013, 10:45 a.m.
On Thu, Sep 26, 2013 at 09:58:53AM +0100, David Laight wrote:
> Would it be possible to do some kind of 2-stage allocation.
> In the first pass the driver would pass a minimum and desired
> number of MSI-X interrupts, but not actually be given any.

Repeated calls to msi_enable_msi/msix() is what we are trying to avoid.

> Interrupts could then be allocated after it is known how many
> are required and how many are available.

Yep, that what we are heading to. So basic pattern I see would be like this:

	int foo_driver_enable_msix(struct pci_dev *pdev, int nvec)
	{
		...

		rc = pci_msix_table_size(pdev);
		if (rc < 0)
			return rc;

		nvec = min(nvec, rc);
		if (nvec < FOO_DRIVER_MINIMUM_NVEC)
			goto single_msi;

		for (i = 0; i < nvec; i++)
			entries[i].entry = i;

		rc = pci_enable_msix(pdev, entries, nvec);
		if (rc)
			goto single_msi;

		return 0;

	single_msi:
		...

	}

But this will break pSeries and we might end up with something like this:

	int foo_driver_enable_msix(struct pci_dev *pdev, int nvec)
	{
		...

		rc = pci_msix_table_size(pdev);
		if (rc < 0)
			return rc;

		nvec = min(nvec, rc);
		if (nvec < FOO_DRIVER_MINIMUM_NVEC)
			goto single_msi;

		rc = pci_get_msix_limit(pdev, nvec);
		if (rc < 0)
			return rc;

		nvec = min(nvec, rc);
		if (nvec < FOO_DRIVER_MINIMUM_NVEC)
			goto single_msi;

		for (i = 0; i < nvec; i++)
			entries[i].entry = i;

		rc = pci_enable_msix(pdev, entries, nvec);
		if (rc)
			goto single_msi;

		return 0;

	single_msi:
		...

	}

> 	David
David Laight - Sept. 26, 2013, 11:34 a.m.
> On Thu, Sep 26, 2013 at 09:58:53AM +0100, David Laight wrote:
> > Would it be possible to do some kind of 2-stage allocation.
> > In the first pass the driver would pass a minimum and desired
> > number of MSI-X interrupts, but not actually be given any.
> 
> Repeated calls to msi_enable_msi/msix() is what we are trying to avoid.

I was thinking that the first call would be done during driver probe
(assuming such a time exists) so that the subsystem could determine
how many interrupts all the drivers would like, so it can then
hand out a smaller number to some of the early drivers in order
to have some left to satisfy the minimum requirement of later
ones.

So all it would do is sum the requirements of all the drivers.

	David



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 26, 2013, 12:13 p.m.
On Thu, Sep 26, 2013 at 12:34:36PM +0100, David Laight wrote:
> I was thinking that the first call would be done during driver probe
> (assuming such a time exists) so that the subsystem could determine
> how many interrupts all the drivers would like, so it can then
> hand out a smaller number to some of the early drivers in order
> to have some left to satisfy the minimum requirement of later
> ones.
> 
> So all it would do is sum the requirements of all the drivers.

It is already implemented - please see commit 448e2ca ("powerpc/pseries:
Implement a quota system for MSIs")

All other archs do not have MSI vector space limitations.

> 	David
Mark Lord - Sept. 26, 2013, 12:32 p.m.
On 13-09-18 05:48 AM, Alexander Gordeev wrote:
>
> The last pattern makes most of sense to me and could be updated with a more
> clear sequence - a call to (bit modified) pci_msix_table_size() followed
> by a call to pci_enable_msix(). I think this pattern can effectively
> supersede the currently recommended "loop" practice.

The loop is still necessary, because there's a race between those two calls,
so that pci_enable_msix() can still fail due to lack of MSIX slots.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Sept. 26, 2013, 1:03 p.m.
On Thu, Sep 26, 2013 at 08:32:53AM -0400, Mark Lord wrote:
> On 13-09-18 05:48 AM, Alexander Gordeev wrote:
> > The last pattern makes most of sense to me and could be updated with a more
> > clear sequence - a call to (bit modified) pci_msix_table_size() followed
> > by a call to pci_enable_msix(). I think this pattern can effectively
> > supersede the currently recommended "loop" practice.
> 
> The loop is still necessary, because there's a race between those two calls,
> so that pci_enable_msix() can still fail due to lack of MSIX slots.

Moreover, the existing loop pattern is racy and could fail just as easily ;)
But (1) that is something drivers should expect and (2) there is basically
nothing to race against - that is probably the reason it has not been a
problem for pSeries. So I think we should not care about this.
Tejun Heo - Sept. 26, 2013, 1:11 p.m.
Hello,

On Thu, Sep 26, 2013 at 09:46:46AM +0200, Alexander Gordeev wrote:
> > Can you please go into a bit of detail on that?  Why does it matter?
> 
> Because otherwise we will re-introduce a problem described by Michael:
> "We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out."

Still not following.  Why wouldn't just letting the drivers request
the optimal number they want and falling back to single interrupt mode
work?  ie. why can't we just have an all or nothing interface?

> > Is it because you're worried you might cause performance regression by
> > forcing prevoius partial multiple allocations to single interrupt
> > operation?
> 
> Well, not really. I think it won't be possible to force people not to use
> partial allocations anyway. Some controllers just do not care how many MSIs
> they are configured with. Some drivers derive the number of MSIs desired
> from the number of CPUs online - in such cases allocating more MSIs (i.e.
> a number the controller advertised) could cause a performance degradation
> even.

Yeah, sure thing but just let the *driver* decide that number without
worrying about how many they can actually get.  Ultimately, what we
want is removing this extra variable which can arbitrarily affect the
number of allocated interrupts so that we only have to worry about
either proper multiple MSI mode or single interrupt mode, not
something random inbetween.  It is possible that there exists a driver
which absolutely requires partial allocation on certain archs, but
that should be a very special case and the interface should look
accordingly ugly / special.  But do we actually have those?

Thanks.
Alexander Gordeev - Sept. 26, 2013, 2:39 p.m.
On Thu, Sep 26, 2013 at 09:11:47AM -0400, Tejun Heo wrote:
> > Because otherwise we will re-introduce a problem described by Michael:
> > "We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out."
> 
> Still not following.  Why wouldn't just letting the drivers request
> the optimal number they want and falling back to single interrupt mode
> work?  ie. why can't we just have an all or nothing interface?

I can imagine a scenario where the first device probes in, requests its
optimal number, acquires that number and exhausts MSIs in pSeries firmware.
The next few devices possibly end up with single MSI, since no MSIs left
to satisfy their optimal numbers. If one of those single-MSI'ed devices
happened to be a high-performance HBA hitting a degraded performance that
alone would force (IBM) to introduce the quotas. Now, if the same/another
device happened does not support the legacy interrupt mode and no MSI
resources have left in the platform firmware at all...

> tejun
Tejun Heo - Sept. 26, 2013, 2:42 p.m.
Hello,

On Thu, Sep 26, 2013 at 10:39 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> I can imagine a scenario where the first device probes in, requests its

Well, we can imagine a lot of thing but usually have to draw the line somewhere.

> optimal number, acquires that number and exhausts MSIs in pSeries firmware.
> The next few devices possibly end up with single MSI, since no MSIs left
> to satisfy their optimal numbers. If one of those single-MSI'ed devices
> happened to be a high-performance HBA hitting a degraded performance that
> alone would force (IBM) to introduce the quotas. Now, if the same/another
> device happened does not support the legacy interrupt mode and no MSI
> resources have left in the platform firmware at all...

If that happens, that's just the platform code being dumb. Quota is
there to prevent that from happening, right? Let's please do something
simple and worry about problems if they actually exist.

Thanks.
Michael Ellerman - Oct. 1, 2013, 7:19 a.m.
On Thu, Sep 26, 2013 at 04:39:02PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 26, 2013 at 09:11:47AM -0400, Tejun Heo wrote:
> > > Because otherwise we will re-introduce a problem described by Michael:
> > > "We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out."
> > 
> > Still not following.  Why wouldn't just letting the drivers request
> > the optimal number they want and falling back to single interrupt mode
> > work?  ie. why can't we just have an all or nothing interface?
> 
> I can imagine a scenario where the first device probes in, requests its
> optimal number, acquires that number and exhausts MSIs in pSeries firmware.
> The next few devices possibly end up with single MSI, since no MSIs left
> to satisfy their optimal numbers. If one of those single-MSI'ed devices
> happened to be a high-performance HBA hitting a degraded performance that
> alone would force (IBM) to introduce the quotas.

Yes that's exactly the scenario, and I didn't imagine it, our test
people actually hit it and yelled at me.

I don't remember exactly which adapters it was, I might be able to find
the details if I looked hard, a quick search through my mail archive
didn't find it - it might have come in via irc / bugzilla etc.

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Ellerman - Oct. 1, 2013, 7:26 a.m.
On Fri, Sep 20, 2013 at 07:26:03AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 18, 2013 at 06:50:45PM +0200, Alexander Gordeev wrote:
> > Actually, I do not see much contradiction with what I proposed. The
> > key words here "determine the number of MSIs the controller wants".
> > 
> > In general case it is not what pci_msix_table_size() returns (or at
> > least we should not limit ourselves to it) - there could be non-
> > standard means to report number of MSIs: hardcoded, version-dependant,
> > device-specific registers etc.
> > 
> > Next, if we opt to determine the number of MSIs by non-MSI standard
> > means then there is no reason not to call pci_get_msix_limit() (or
> > whatever) at this step.
> 
> Yeah, that's all fine.  My point is that we shouldn't try to use
> "degraded" multiple MSI mode where the number of MSIs allocated is
> smaller than performing full multiple MSI operation.  How that number
> is determined doesn't really matter but that number is a property
> which is solely decided by the device driver, right?  If a device
> needs full multiple MSI mode, given specific configuration, it needs
> >= X number of MSIs and that's the number it should request.

Sure, the driver is in full control. If it can ONLY work with N MSIs
then it should try for N, else fallback to 1.

But some drivers are able to work with a range of values for N, and
performance is improved vs using a single MSI.

> > Being Captain Obvious here, but it is up to the device driver to handle
> > a failure. There could be no such option as single MSI mode after all :)
> 
> I don't think there actually is a mainstream device which can't
> fallback to single interrupt.  Anyways, the point is the same, let's
> please not try to create an interface which encourages complex retry
> logic in its users which are likely to involve less traveled and
> tested paths in both the driver and firmware.

Why support > 1 MSI at all? It just adds complex logic and less travelled
paths in the driver and firmware.

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Ellerman - Oct. 1, 2013, 7:35 a.m.
On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> > On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > > How about no?
> > > 
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> > 
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> Hmmm... do we need to treat this any differently?  If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?
> 
> > > Anyway I don't see what problem you're trying to solve? I agree the
> > > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > > world.
> > 
> > Well, the interface recently has been re-classified from "ugly" to
> > "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> LOL. :)
> 
> > Indeed, I checked most of the drivers and it is incredible how people
> > are creative in misusing the interface: from innocent pci_disable_msix()
> > calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> > if pci_enable_msix() returned a positive value (apparently untested).
> > 
> > Roughly third of the drivers just do not care and bail out once
> > pci_enable_msix() has not succeeded. Not sure how many of these are
> > mandated by the hardware.
> 
> Yeah, I mean, this type of interface is a trap.  People have to
> actively resist to avoid doing silly stuff which is a lot to ask.

I really think you're overstating the complexity here.

Functions typically return a boolean   -> nothing to see here
This function returns a tristate value -> brain explosion!


> > 	/*
> > 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> > 	 */
> > 
> > 	rc = pci_get_msix_limit(pdev);
> > 	if (rc < 0)
> > 		return rc;
> > 
> > 	/*
> > 	 * nvec = min(rc, nvec);
> > 	 */
> > 
> > 	for (i = 0; i < nvec; i++)
> > 		msix_entry[i].entry = i;
> > 
> > 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> > 	if (rc)
> > 		return rc;
> 
> I really think what we should do is
> 
> * Determine the number of MSIs the controller wants.  Don't worry
>   about quotas or limits or anything.  Just determine the number
>   necessary to enable enhanced interrupt handling.
> 	
> * Try allocating that number of MSIs.  If it fails, then just revert
>   to single interrupt mode.  It's not the end of the world and mostly
>   guaranteed to work.  Let's please not even try to do partial
>   multiple interrupts.  I really don't think it's worth the risk or
>   complexity.

It will potentially break existing setups on our hardware.

Can I make that any clearer?

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Ellerman - Oct. 1, 2013, 7:51 a.m.
On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> > 
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
> 
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
> 
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).

OK, but we have the source to the drivers, we could just fix them.

We could even add:

  pci_enable_msix_i_am_stupid()

Which swallows the positive return and just gives back -ve/0.

> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Sure, that's fine if those drivers do that, it's up to the drivers after
all.

> Another quite common pattern is a call to pci_enable_msix() to figure out
> the number of MSI-Xs available and a repeated call of pci_enable_msix()
> to enable those MSI-Xs, this time.

Also fine, though as the documentation suggests a loop is the best
construct rather than two explicit calls.

> The recommended practice would be:
> 
> 	/*
> 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> 	 */
> 
> 	rc = pci_get_msix_limit(pdev);
> 	if (rc < 0)
> 		return rc;
> 
> 	/*
> 	 * nvec = min(rc, nvec);
> 	 */
> 
> 	for (i = 0; i < nvec; i++)
> 		msix_entry[i].entry = i;
> 
> 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> 	if (rc)
> 		return rc;
> 
> Thoughts?

We could probably make that work.

The disadvantage is that any restriction imposed on us above the quota
can only be reported as an error from pci_enable_msix().

The quota code, called from pci_get_msix_limit(), can only do so much to
interogate firmware about the limitations. The ultimate way to check if
firmware will give us enough MSIs is to try and allocate them. But we
can't do that from pci_get_msix_limit() because the driver is not asking
us to enable MSIs, just query them.

You'll also need to add another arch hook, for the quota check, and
we'll have to add it to our per-platform indirection as well.

All a lot of bother for no real gain IMHO.

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Oct. 1, 2013, 10:35 a.m.
On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> The disadvantage is that any restriction imposed on us above the quota
> can only be reported as an error from pci_enable_msix().
> 
> The quota code, called from pci_get_msix_limit(), can only do so much to
> interogate firmware about the limitations. The ultimate way to check if
> firmware will give us enough MSIs is to try and allocate them. But we
> can't do that from pci_get_msix_limit() because the driver is not asking
> us to enable MSIs, just query them.

If things are this way then pci_enable_msix() already exposed to this
problem internally on pSeries.

I see that even successful quota checks in rtas_msi_check_device() and
rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
give enough MSIs. Hence, pci_enable_msix() might fail even though the
its quota checks succeeded.

Therefore, nothing will really change if we make pci_get_msix_limit() check
quota and hope the follow-up call to pci_enable_msix() succeeded.

(Of course, we could allocate-deallocate MSIs at check time, but I think it
is an overkill).

> You'll also need to add another arch hook, for the quota check, and
> we'll have to add it to our per-platform indirection as well.

Already, in a branch, hidden from Bjorn & Tejun eyes ;)

> All a lot of bother for no real gain IMHO.

Well, I do not have a strong opinion here. I leave it to the ones who have :)
But few drivers have became clearer as result of this change (and messy ones
are still messy).

> cheers
Tejun Heo - Oct. 1, 2013, 11:55 a.m.
Hello,

On Tue, Oct 01, 2013 at 05:35:48PM +1000, Michael Ellerman wrote:
> > > Roughly third of the drivers just do not care and bail out once
> > > pci_enable_msix() has not succeeded. Not sure how many of these are
> > > mandated by the hardware.
> > 
> > Yeah, I mean, this type of interface is a trap.  People have to
> > actively resist to avoid doing silly stuff which is a lot to ask.
> 
> I really think you're overstating the complexity here.
> 
> Functions typically return a boolean   -> nothing to see here
> This function returns a tristate value -> brain explosion!

It is an interface which forces the driver writers to write
complicated fallback code which won't usually be excercised.  The same
goes for the hardware.  In isolation, it doesn't look like much but
things like this are bound to lead to subtle bugs which are diffiuclt
to trigger.

> > * Determine the number of MSIs the controller wants.  Don't worry
> >   about quotas or limits or anything.  Just determine the number
> >   necessary to enable enhanced interrupt handling.
> > 	
> > * Try allocating that number of MSIs.  If it fails, then just revert
> >   to single interrupt mode.  It's not the end of the world and mostly
> >   guaranteed to work.  Let's please not even try to do partial
> >   multiple interrupts.  I really don't think it's worth the risk or
> >   complexity.
> 
> It will potentially break existing setups on our hardware.

I think it'd be much better to have a separate interface for the
drivers which actually require it *in practice* rather than forcing
everyone to go "oh this interface supports that, I don't know if I
need it but let's implement fallback logic which I won't and have no
means of testing".  Are we talking about some limited number of device
drivers here?  Also, is the quota still necessary for machines in
production today?

Thanks.
Michael Ellerman - Oct. 2, 2013, 2:43 a.m.
On Tue, Oct 01, 2013 at 12:35:27PM +0200, Alexander Gordeev wrote:
> On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> > The disadvantage is that any restriction imposed on us above the quota
> > can only be reported as an error from pci_enable_msix().
> > 
> > The quota code, called from pci_get_msix_limit(), can only do so much to
> > interogate firmware about the limitations. The ultimate way to check if
> > firmware will give us enough MSIs is to try and allocate them. But we
> > can't do that from pci_get_msix_limit() because the driver is not asking
> > us to enable MSIs, just query them.
> 
> If things are this way then pci_enable_msix() already exposed to this
> problem internally on pSeries.
> 
> I see that even successful quota checks in rtas_msi_check_device() and
> rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
> give enough MSIs. Hence, pci_enable_msix() might fail even though the
> its quota checks succeeded.

Yes, but it can report that failure to the caller, which can then retry.

> Therefore, nothing will really change if we make pci_get_msix_limit() check
> quota and hope the follow-up call to pci_enable_msix() succeeded.

No that's not equivalent. Under your scheme if pci_enable_msix() fails
then the caller just bails, it will never try again with a lower number.

> (Of course, we could allocate-deallocate MSIs at check time, but I think it
> is an overkill).

It's not only overkill, it's messing with the device behind the drivers
back, which is definitely a no-no in my opinion.
 
> > You'll also need to add another arch hook, for the quota check, and
> > we'll have to add it to our per-platform indirection as well.
> 
> Already, in a branch, hidden from Bjorn & Tejun eyes ;)
> 
> > All a lot of bother for no real gain IMHO.
> 
> Well, I do not have a strong opinion here. I leave it to the ones who have :)
> But few drivers have became clearer as result of this change (and messy ones
> are still messy).

Amen.

cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord - Oct. 2, 2013, 2:46 a.m.
On 13-09-26 09:03 AM, Alexander Gordeev wrote:
> On Thu, Sep 26, 2013 at 08:32:53AM -0400, Mark Lord wrote:
>> On 13-09-18 05:48 AM, Alexander Gordeev wrote:
>>> The last pattern makes most of sense to me and could be updated with a more
>>> clear sequence - a call to (bit modified) pci_msix_table_size() followed
>>> by a call to pci_enable_msix(). I think this pattern can effectively
>>> supersede the currently recommended "loop" practice.
>>
>> The loop is still necessary, because there's a race between those two calls,
>> so that pci_enable_msix() can still fail due to lack of MSIX slots.
> 
> Moreover, the existing loop pattern is racy and could fail just as easily ;)

Yes, but it then loops again to correct things.

> But (1) that is something drivers should expect and (2) there is basically
> nothing to race against - that is probably the reason it has not been a
> problem for pSeries. So I think we should not care about this.

I always care about race conditions.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Gordeev - Oct. 2, 2013, 7:10 a.m.
On Wed, Oct 02, 2013 at 12:43:24PM +1000, Michael Ellerman wrote:
> On Tue, Oct 01, 2013 at 12:35:27PM +0200, Alexander Gordeev wrote:
> > On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> > > The disadvantage is that any restriction imposed on us above the quota
> > > can only be reported as an error from pci_enable_msix().
> > > 
> > > The quota code, called from pci_get_msix_limit(), can only do so much to
> > > interogate firmware about the limitations. The ultimate way to check if
> > > firmware will give us enough MSIs is to try and allocate them. But we
> > > can't do that from pci_get_msix_limit() because the driver is not asking
> > > us to enable MSIs, just query them.
> > 
> > If things are this way then pci_enable_msix() already exposed to this
> > problem internally on pSeries.
> > 
> > I see that even successful quota checks in rtas_msi_check_device() and
> > rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
> > give enough MSIs. Hence, pci_enable_msix() might fail even though the
> > its quota checks succeeded.
> 
> Yes, but it can report that failure to the caller, which can then retry.

If a driver wants to retry after a failure it is up to the driver (but why?).
The current guidlines state:

"If this function returns a negative number, it indicates an error and
the driver should not attempt to allocate any more MSI-X interrupts for
this device."

Anyway, what number could the driver retry with after it got a negative errno?

> > Therefore, nothing will really change if we make pci_get_msix_limit() check
> > quota and hope the follow-up call to pci_enable_msix() succeeded.
> 
> No that's not equivalent. Under your scheme if pci_enable_msix() fails
> then the caller just bails, it will never try again with a lower number.

Currently under the very same circumstances (the quota check within
rtas_setup_msi_irqs() returned Q vectors while the firmware has only F
vectors to allocate and Q > F) rtas_setup_msi_irqs() fails, pci_enable_msix()
fails, the caller bails and never try again with a lower number.

Am I missing something here?

> cheers
Alexander Gordeev - Oct. 2, 2013, 7:26 a.m.
On Tue, Oct 01, 2013 at 10:46:32PM -0400, Mark Lord wrote:
> >>> The last pattern makes most of sense to me and could be updated with a more
> >>> clear sequence - a call to (bit modified) pci_msix_table_size() followed
> >>> by a call to pci_enable_msix(). I think this pattern can effectively
> >>> supersede the currently recommended "loop" practice.
> >>
> >> The loop is still necessary, because there's a race between those two calls,
> >> so that pci_enable_msix() can still fail due to lack of MSIX slots.
> > 
> > Moreover, the existing loop pattern is racy and could fail just as easily ;)
> 
> Yes, but it then loops again to correct things.

No. If it failed it should exit the loop.
Bjorn Helgaas - Dec. 18, 2013, 6:26 p.m.
On Thu, Sep 26, 2013 at 08:32:53AM -0400, Mark Lord wrote:
> On 13-09-18 05:48 AM, Alexander Gordeev wrote:
> >
> > The last pattern makes most of sense to me and could be updated with a more
> > clear sequence - a call to (bit modified) pci_msix_table_size() followed
> > by a call to pci_enable_msix(). I think this pattern can effectively
> > supersede the currently recommended "loop" practice.
> 
> The loop is still necessary, because there's a race between those two calls,
> so that pci_enable_msix() can still fail due to lack of MSIX slots.

Hi Mark,

Can you elaborate on this race?  My understanding is that
pci_msix_table_size() depends only on the "Table Size" field in the MSI-X
Message Control register.

So if there's a concurrency problem here, it would have to be something
like "pci_enable_msix() may not be able to configure the requested number
of vectors because it has to allocate from a shared pool."

If that's the case, pci_msix_table_size() wouldn't be involved at all, and
the only question is how to coordinate between several drivers that each
call pci_enable_msix().  I think that would have to be resolved in some
arch hook used by the PCI core.

Maybe this is already taken care of; I just want to make sure we don't
overlook an issue here.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 32d7d15..205eb5d 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -209,6 +209,18 @@  on any interrupt for which it previously called request_irq().
 Failure to do so results in a BUG_ON(), leaving the device with
 MSI enabled and thus leaking its vector.
 
+4.2.6 pci_get_msi_cap
+
+int pci_get_msi_cap(struct pci_dev *dev)
+
+This function could be used to retrieve the number of MSI vectors the
+device requested (via the Multiple Message Capable register). The MSI
+specification only allows the returned value to be a power of two,
+up to a maximum of 2^5 (32).
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSIs.
+
 4.3 Using MSI-X
 
 The MSI-X capability is much more flexible than the MSI capability.
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 647e9b1..c50d518 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -802,17 +802,30 @@  static int pci_msi_check_device(struct pci_dev *dev,
 	return 0;
 }
 
-int pci_enable_msi_block_part(struct pci_dev *dev,
-			      unsigned int nvec, int nvec_mme)
+int pci_get_msi_cap(struct pci_dev *dev)
 {
-	int status, maxvec;
+	int ret;
 	u16 msgctl;
 
 	if (!dev->msi_cap)
 		return -EINVAL;
 
 	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
-	maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+	ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+	return ret;
+}
+EXPORT_SYMBOL(pci_get_msi_cap);
+
+int pci_enable_msi_block_part(struct pci_dev *dev,
+			      unsigned int nvec, int nvec_mme)
+{
+	int status, maxvec;
+	u16 msgctl;
+
+	maxvec = pci_get_msi_cap(dev);
+	if (maxvec < 0)
+		return maxvec;
 
 	if (nvec_mme < 0)
 		nvec_mme = maxvec;
@@ -869,13 +882,10 @@  EXPORT_SYMBOL(pci_enable_msi_block);
 int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
 {
 	int ret, nvec;
-	u16 msgctl;
-
-	if (!dev->msi_cap)
-		return -EINVAL;
 
-	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
-	ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+	ret = pci_get_msi_cap(dev);
+	if (ret < 0)
+		return ret;
 
 	if (maxvec)
 		*maxvec = ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6552cee..b866048 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1122,6 +1122,11 @@  struct msix_entry {
 
 
 #ifndef CONFIG_PCI_MSI
+static inline int pci_get_msi_cap(struct pci_dev *dev)
+{
+	return -1;
+}
+
 static inline int
 pci_enable_msi_block_part(struct pci_dev *dev, unsigned int nvec, int nvec_mme)
 {
@@ -1169,6 +1174,7 @@  static inline int pci_msi_enabled(void)
 	return 0;
 }
 #else
+int pci_get_msi_cap(struct pci_dev *dev);
 int pci_enable_msi_block_part(struct pci_dev *dev,
 			      unsigned int nvec, int nvec_mme);
 int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);