[v3] PCI: Check for PCIe downtraining conditions

Message ID 20180604155523.14906-1-mr.nuke.me@gmail.com
State Superseded
Delegated to: Bjorn Helgaas

Commit Message

Alex G. June 4, 2018, 3:55 p.m.
PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor straigh
forward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---

Changes since v2:
 - Check dev->is_virtfn flag

Changes since v1:
 - Use pcie_print_link_status() instead of reimplementing logic
 
 drivers/pci/probe.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Andy Shevchenko June 5, 2018, 12:27 p.m. | #1
On Mon, Jun 4, 2018 at 6:55 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor straigh
> forward.
>
> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.

Have you seen any of my comments?
For your convenience, I am repeating them below.

>
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>
> Changes since v2:
>  - Check dev->is_virtfn flag
>
> Changes since v1:
>  - Use pcie_print_link_status() instead of reimplementing logic
>
>  drivers/pci/probe.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac91b6fd0bcd..a88ec8c25dd5 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>         return dev;
>  }
>
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{

> +

This is a redundant blank line.

> +       if (!pci_is_pcie(dev))
> +               return;
> +
> +       /* Look from the device up to avoid downstream ports with no devices. */
> +       if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> +           (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> +           (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> +               return;

I looked briefly at the use of these calls and perhaps it might make
sense to introduce
pci_is_pcie_type(dev, type) which unifies pci_is_pcie() + pci_pcie_type().
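
A minimal userspace sketch of what such a combined helper could look like.
pci_is_pcie_type() is hypothetical (it is only proposed in this comment), and
struct mock_pci_dev is a stand-in for struct pci_dev that carries just the two
pieces of state the real pci_is_pcie() and pci_pcie_type() would report. The
port type values are the ones from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Port type values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_TYPE_ENDPOINT 0x0
#define PCI_EXP_TYPE_LEG_END  0x1
#define PCI_EXP_TYPE_UPSTREAM 0x5

/* Userspace stand-in for struct pci_dev; only the fields used here. */
struct mock_pci_dev {
	bool is_pcie;   /* stands in for pci_is_pcie(dev) */
	int pcie_type;  /* stands in for pci_pcie_type(dev) */
};

/* Hypothetical helper: true only for a PCIe device of the given type. */
static bool pci_is_pcie_type(const struct mock_pci_dev *dev, int type)
{
	assert(dev != NULL);
	return dev->is_pcie && dev->pcie_type == type;
}
```

One reason to fold the two checks together: in this mock, a device without a
PCIe capability still has pcie_type 0, which equals PCI_EXP_TYPE_ENDPOINT, so
a type comparison alone would not be enough.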

> +
> +       /* Multi-function PCIe share the same link/status. */

> +       if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)

One pair of parens is not needed.

> +               return;
> +
> +       pcie_print_link_status(dev);
> +}
> +
>  static void pci_init_capabilities(struct pci_dev *dev)
>  {
>         /* Enhanced Allocation */
> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>         /* Advanced Error Reporting */
>         pci_aer_init(dev);
>
> +       /* Check link and detect downtrain errors */
> +       pcie_check_upstream_link(dev);
> +
>         if (pci_probe_reset_function(dev) == 0)
>                 dev->reset_fn = 1;
>  }
> --
> 2.14.4
>
Andy Shevchenko June 5, 2018, 1:04 p.m. | #2
On Tue, Jun 5, 2018 at 3:27 PM, Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Mon, Jun 4, 2018 at 6:55 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>
> Have you seen any of my comments?
> For your convenience repeating below.

Ah, found the answer in a pile of emails. OK, I see your point about
the helper, though the rest is still applicable here.
Bjorn Helgaas July 16, 2018, 9:17 p.m. | #3
[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

On Mon, Jun 04, 2018 at 10:55:21AM -0500, Alexandru Gagniuc wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor straigh
> forward.

s/straigh/straight/
In this context, I think "straightforward" should be closed up
(without the space).

> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.

This is an interesting idea.  I have two concerns:

Some drivers already do this on their own, and we probably don't want
duplicate output for those devices.  In most cases (ixgbe and mlx* are
exceptions), the drivers do this unconditionally so we *could* remove
it from the driver if we add it to the core.  The dmesg order would
change, and the message wouldn't be associated with the driver as it
now is.

Also, I think some of the GPU devices might come up at a lower speed,
then download firmware, then reset the device so it comes up at a
higher speed.  I think this patch will make us complain about
the low initial speed, which might confuse users.

So I'm not sure whether it's better to do this in the core for all
devices, or if we should just add it to the high-performance drivers
that really care.

> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
> 
> Changes since v2:
>  - Check dev->is_virtfn flag
> 
> Changes since v1:
>  - Use pcie_print_link_status() instead of reimplementing logic
>  
>  drivers/pci/probe.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac91b6fd0bcd..a88ec8c25dd5 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	return dev;
>  }
>  
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{
> +
> +	if (!pci_is_pcie(dev))
> +		return;
> +
> +	/* Look from the device up to avoid downstream ports with no devices. */
> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> +		return;

Do we care about Upstream Ports here?  I suspect that ultimately we
only care about the bandwidth to Endpoints, and if an Endpoint is
constrained by a slow link farther up the tree,
pcie_print_link_status() is supposed to identify that slow link.

I would find this test easier to read as

  if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
    return;

But maybe I'm the only one that finds the conjunction of inequalities
hard to read.  No big deal either way.
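
For reference, the two styles are equivalent by De Morgan's law as long as
they test the same set of types. The sketch below keeps Upstream Ports in both
forms (the form suggested above drops them, which is a separate question) and
checks every value of the 4-bit port type field. The function names are
made up for this illustration; the type values are from
include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_EXP_TYPE_ENDPOINT   0x0
#define PCI_EXP_TYPE_LEG_END    0x1
#define PCI_EXP_TYPE_ROOT_PORT  0x4
#define PCI_EXP_TYPE_UPSTREAM   0x5
#define PCI_EXP_TYPE_DOWNSTREAM 0x6

/* Form used in the patch: a conjunction of inequalities. */
static bool skip_as_posted(int type)
{
	return (type != PCI_EXP_TYPE_ENDPOINT) &&
	       (type != PCI_EXP_TYPE_LEG_END) &&
	       (type != PCI_EXP_TYPE_UPSTREAM);
}

/* Same predicate written as a negated disjunction. */
static bool skip_rewritten(int type)
{
	return !(type == PCI_EXP_TYPE_ENDPOINT ||
		 type == PCI_EXP_TYPE_LEG_END ||
		 type == PCI_EXP_TYPE_UPSTREAM);
}

/* Assert that both forms agree on every 4-bit port type value. */
static void check_equivalence(void)
{
	int type;

	for (type = 0; type <= 0xf; type++)
		assert(skip_as_posted(type) == skip_rewritten(type));
}
```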

> +	/* Multi-function PCIe share the same link/status. */
> +	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
> +		return;
> +
> +	pcie_print_link_status(dev);
> +}
> +
>  static void pci_init_capabilities(struct pci_dev *dev)
>  {
>  	/* Enhanced Allocation */
> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>  	/* Advanced Error Reporting */
>  	pci_aer_init(dev);
>  
> +	/* Check link and detect downtrain errors */
> +	pcie_check_upstream_link(dev);
> +
>  	if (pci_probe_reset_function(dev) == 0)
>  		dev->reset_fn = 1;
>  }
> -- 
> 2.14.4
>
Alex_Gagniuc@Dellteam.com July 16, 2018, 10:28 p.m. | #4
On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
> [+cc maintainers of drivers that already use pcie_print_link_status()
> and GPU folks]

Thanks for finding them!

[snip]
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
> 
> s/straigh/straight/
> In this context, I think "straightforward" should be closed up
> (without the space).

That's a straightforward edit. Thanks for the feedback!

>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
> 
> This is an interesting idea.  I have two concerns:
> 
> Some drivers already do this on their own, and we probably don't want
> duplicate output for those devices.  In most cases (ixgbe and mlx* are
> exceptions), the drivers do this unconditionally so we *could* remove
> it from the driver if we add it to the core.  The dmesg order would
> change, and the message wouldn't be associated with the driver as it
> now is.

Oh, there are only 8 users of that. Even I could patch up the drivers to 
remove the call, assuming we reach agreement about this change.

> Also, I think some of the GPU devices might come up at a lower speed,
> then download firmware, then reset the device so it comes up at a
> higher speed.  I think this patch will make us complain about
> the low initial speed, which might confuse users.

I spoke to one of the PCIe spec writers. It's allowable for a device to 
downtrain speed or width. It would also be extremely dumb to downtrain 
with the intent to re-train at a higher speed later, but it's possible 
devices do dumb stuff like that. That's why it's an informational 
message, instead of a warning.

Another case: Some devices (lower-end GPUs) use silicon (and marketing) 
that advertises x16, but they're only routed for x8. I'm okay with 
seeing an informational message in this case. In fact, I didn't know 
that my Quadro card for three years is only wired for x8 until I was 
testing this patch.

> So I'm not sure whether it's better to do this in the core for all
> devices, or if we should just add it to the high-performance drivers
> that really care.

You're thinking "do I really need that bandwidth" because I'm using a 
function called "_bandwidth_". The point of the change is very far from 
that: it is to help in system troubleshooting by detecting downtraining 
conditions.

>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
[snip]
>> +	/* Look from the device up to avoid downstream ports with no devices. */
>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> +		return;
> 
> Do we care about Upstream Ports here?  

YES! Switches. e.g. an x16 switch with 4x downstream ports could 
downtrain at 8x and 4x, and we'd never catch it.

> I suspect that ultimately we
> only care about the bandwidth to Endpoints, and if an Endpoint is
> constrained by a slow link farther up the tree,
> pcie_print_link_status() is supposed to identify that slow link.

See above.

> I would find this test easier to read as
> 
>    if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
>      return;
> 
> But maybe I'm the only one that finds the conjunction of inequalities
> hard to read.  No big deal either way.
> 
>> +	/* Multi-function PCIe share the same link/status. */
>> +	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>> +		return;
>> +
>> +	pcie_print_link_status(dev);
>> +}
>> +
>>   static void pci_init_capabilities(struct pci_dev *dev)
>>   {
>>   	/* Enhanced Allocation */
>> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>>   	/* Advanced Error Reporting */
>>   	pci_aer_init(dev);
>>   
>> +	/* Check link and detect downtrain errors */
>> +	pcie_check_upstream_link(dev);
>> +
>>   	if (pci_probe_reset_function(dev) == 0)
>>   		dev->reset_fn = 1;
>>   }
>> -- 
>> 2.14.4
>>
>
Tal Gilboa July 18, 2018, 1:38 p.m. | #5
On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
> [+cc maintainers of drivers that already use pcie_print_link_status()
> and GPU folks]
> 
> On Mon, Jun 04, 2018 at 10:55:21AM -0500, Alexandru Gagniuc wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
> 
> s/straigh/straight/
> In this context, I think "straightforward" should be closed up
> (without the space).
> 
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
> 
> This is an interesting idea.  I have two concerns:
> 
> Some drivers already do this on their own, and we probably don't want
> duplicate output for those devices.  In most cases (ixgbe and mlx* are
> exceptions), the drivers do this unconditionally so we *could* remove
> it from the driver if we add it to the core.  The dmesg order would
> change, and the message wouldn't be associated with the driver as it
> now is.
> 
> Also, I think some of the GPU devices might come up at a lower speed,
> then download firmware, then reset the device so it comes up at a
> higher speed.  I think this patch will make us complain about
> the low initial speed, which might confuse users.
> 
> So I'm not sure whether it's better to do this in the core for all
> devices, or if we should just add it to the high-performance drivers
> that really care.
> 
>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> ---
>>
>> Changes since v2:
>>   - Check dev->is_virtfn flag
>>
>> Changes since v1:
>>   - Use pcie_print_link_status() instead of reimplementing logic
>>   
>>   drivers/pci/probe.c | 22 ++++++++++++++++++++++
>>   1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index ac91b6fd0bcd..a88ec8c25dd5 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>>   	return dev;
>>   }
>>   
>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>> +{
>> +
>> +	if (!pci_is_pcie(dev))
>> +		return;
>> +
>> +	/* Look from the device up to avoid downstream ports with no devices. */
>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> +		return;
> 
> Do we care about Upstream Ports here?  I suspect that ultimately we
> only care about the bandwidth to Endpoints, and if an Endpoint is
> constrained by a slow link farther up the tree,
> pcie_print_link_status() is supposed to identify that slow link.
> 
> I would find this test easier to read as
> 
>    if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
>      return;
> 
> But maybe I'm the only one that finds the conjunction of inequalities
> hard to read.  No big deal either way.
> 
>> +	/* Multi-function PCIe share the same link/status. */
>> +	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>> +		return;
>> +
>> +	pcie_print_link_status(dev);
>> +}

Is this function called by default for every PCIe device? What about 
VFs? We make an exception for them on our driver since a VF doesn't have 
access to the needed information in order to provide a meaningful message.

>> +
>>   static void pci_init_capabilities(struct pci_dev *dev)
>>   {
>>   	/* Enhanced Allocation */
>> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>>   	/* Advanced Error Reporting */
>>   	pci_aer_init(dev);
>>   
>> +	/* Check link and detect downtrain errors */
>> +	pcie_check_upstream_link(dev);
>> +
>>   	if (pci_probe_reset_function(dev) == 0)
>>   		dev->reset_fn = 1;
>>   }
>> -- 
>> 2.14.4
>>
Bjorn Helgaas July 18, 2018, 9:53 p.m. | #6
[+cc Mike (hfi1)]

On Mon, Jul 16, 2018 at 10:28:35PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
> >> ...
> >> The easiest way to detect this is with pcie_print_link_status(),
> >> since the bottleneck is usually the link that is downtrained. It's not
> >> a perfect solution, but it works extremely well in most cases.
> > 
> > This is an interesting idea.  I have two concerns:
> > 
> > Some drivers already do this on their own, and we probably don't want
> > duplicate output for those devices.  In most cases (ixgbe and mlx* are
> > exceptions), the drivers do this unconditionally so we *could* remove
> > it from the driver if we add it to the core.  The dmesg order would
> > change, and the message wouldn't be associated with the driver as it
> > now is.
> 
> Oh, there are only 8 users of that. Even I could patch up the drivers to 
> remove the call, assuming we reach agreement about this change.
> 
> > Also, I think some of the GPU devices might come up at a lower speed,
> > then download firmware, then reset the device so it comes up at a
> > higher speed.  I think this patch will make us complain about
> > the low initial speed, which might confuse users.
> 
> I spoke to one of the PCIe spec writers. It's allowable for a device to 
> downtrain speed or width. It would also be extremely dumb to downtrain 
> with the intent to re-train at a higher speed later, but it's possible 
> devices do dumb stuff like that. That's why it's an informational 
> message, instead of a warning.

FWIW, here's some of the discussion related to hfi1 from [1]:

  > Btw, why is the driver configuring the PCIe link speed?  Isn't
  > this something we should be handling in the PCI core?

  The device comes out of reset at the 5GT/s speed. The driver
  downloads device firmware, programs PCIe registers, and co-ordinates
  the transition to 8GT/s.

  This recipe is device specific and is therefore implemented in the
  hfi1 driver built on top of PCI core functions and macros.

Also several DRM drivers seem to do this (see cik_pcie_gen3_enable(),
si_pcie_gen3_enable()); from [2]:

  My understanding was that some platforms only bring up the link in gen 1
  mode for compatibility reasons. 

[1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
[2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

> Another case: Some devices (lower-end GPUs) use silicon (and marketing) 
> that advertises x16, but they're only routed for x8. I'm okay with 
> seeing an informational message in this case. In fact, I didn't know 
> that my Quadro card for three years is only wired for x8 until I was 
> testing this patch.

Yeah, it's probably OK.  I don't want bug reports from people who
think something's broken when it's really just a hardware limitation
of their system.  But hopefully the message is not alarming.

> > So I'm not sure whether it's better to do this in the core for all
> > devices, or if we should just add it to the high-performance drivers
> > that really care.
> 
> You're thinking "do I really need that bandwidth" because I'm using a 
> function called "_bandwidth_". The point of the change is very far from 
> that: it is to help in system troubleshooting by detecting downtraining 
> conditions.

I'm not sure what you think I'm thinking :)  My question is whether
it's worthwhile to print this extra information for *every* PCIe
device, given that your use case is the tiny percentage of broken
systems.

If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
when the device is capable of more than it's getting, that would make
a lot of sense to me.  The normal case line is more questionable.  I
think the reason that's there is because the network drivers are very
performance sensitive and like to see that info all the time.

Maybe we need something like this:

  pcie_print_link_status(struct pci_dev *dev, int verbose)
  {
    ...
    if (bw_avail >= bw_cap) {
      if (verbose)
        pci_info(dev, "... available PCIe bandwidth ...");
    } else
      pci_info(dev, "... available PCIe bandwidth, limited by ...");
  }

So the core could print only the potential problems with:

  pcie_print_link_status(dev, 0);

and drivers that really care even if there's no problem could do:

  pcie_print_link_status(dev, 1);
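
A runnable userspace mock of the verbose-flag idea above, for experimenting
with the decision logic only. mock_print_link_status() is a made-up name; it
takes the two bandwidths (in Mb/s) as plain arguments instead of reading them
from a struct pci_dev, uses printf() in place of pci_info(), and returns
whether a line was printed so the behavior can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace mock: bw_avail and bw_cap are in Mb/s and are passed in
 * directly. Returns true if a message line was printed.
 */
static bool mock_print_link_status(unsigned int bw_avail,
				   unsigned int bw_cap, bool verbose)
{
	assert(bw_cap > 0);

	if (bw_avail >= bw_cap) {
		/* Normal case: only report when the caller asks for it. */
		if (!verbose)
			return false;
		printf("%u.%03u Gb/s available PCIe bandwidth\n",
		       bw_avail / 1000, bw_avail % 1000);
		return true;
	}

	/* Device is capable of more than it is getting: always report. */
	printf("%u.%03u Gb/s available PCIe bandwidth, limited (capable of %u.%03u Gb/s)\n",
	       bw_avail / 1000, bw_avail % 1000,
	       bw_cap / 1000, bw_cap % 1000);
	return true;
}
```

With this shape, a core caller passing verbose = false stays silent on healthy
links, while a driver passing verbose = true keeps the output it has today.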

> >> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> [snip]
> >> +	/* Look from the device up to avoid downstream ports with no devices. */
> >> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> >> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> >> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> >> +		return;
> > 
> > Do we care about Upstream Ports here?  
> 
> YES! Switches. e.g. an x16 switch with 4x downstream ports could 
> downtrain at 8x and 4x, and we'd never catch it.

OK, I think I see your point: if the upstream port *could* do 16x but
only trains to 4x, and two endpoints below it are both capable of 4x,
the endpoints *think* they're happy but in fact they have to share 4x
when they could use more.

Bjorn
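
The switch scenario above can be modeled in userspace. This is a simplified
sketch, not the kernel's implementation: struct mock_dev, bw_available() and
is_limited() are made-up names, each device carries only the capability and
trained bandwidth of the link above it (in Mb/s), and the downstream port
between the endpoint and the upstream port is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a device and the link directly above it. */
struct mock_dev {
	struct mock_dev *parent;  /* next device toward the root, or NULL */
	unsigned int link_cap;    /* Mb/s this device's link could reach */
	unsigned int link_cur;    /* Mb/s the link actually trained to */
};

/* Slowest trained link on the path from dev up to the root. */
static unsigned int bw_available(const struct mock_dev *dev)
{
	unsigned int bw;

	assert(dev != NULL);
	bw = dev->link_cur;
	for (dev = dev->parent; dev; dev = dev->parent)
		if (dev->link_cur < bw)
			bw = dev->link_cur;
	return bw;
}

/* True when the device could use more than the path provides. */
static bool is_limited(const struct mock_dev *dev)
{
	return bw_available(dev) < dev->link_cap;
}
```

With an upstream port capable of x16 but trained to x4, and an x4 endpoint
below it, is_limited() is false for the endpoint and true for the upstream
port, which is why checking only endpoints would miss this case.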
Alex G. July 19, 2018, 3:46 p.m. | #7
On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:
> [+cc Mike (hfi1)]
> 
> On Mon, Jul 16, 2018 at 10:28:35PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
>>>> ...
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>> a perfect solution, but it works extremely well in most cases.
>>>
>>> This is an interesting idea.  I have two concerns:
>>>
>>> Some drivers already do this on their own, and we probably don't want
>>> duplicate output for those devices.  In most cases (ixgbe and mlx* are
>>> exceptions), the drivers do this unconditionally so we *could* remove
>>> it from the driver if we add it to the core.  The dmesg order would
>>> change, and the message wouldn't be associated with the driver as it
>>> now is.
>>
>> Oh, there are only 8 users of that. Even I could patch up the drivers to
>> remove the call, assuming we reach agreement about this change.
>>
>>> Also, I think some of the GPU devices might come up at a lower speed,
>>> then download firmware, then reset the device so it comes up at a
>>> higher speed.  I think this patch will make us complain about
>>> the low initial speed, which might confuse users.
>>
>> I spoke to one of the PCIe spec writers. It's allowable for a device to
>> downtrain speed or width. It would also be extremely dumb to downtrain
>> with the intent to re-train at a higher speed later, but it's possible
>> devices do dumb stuff like that. That's why it's an informational
>> message, instead of a warning.
> 
> FWIW, here's some of the discussion related to hfi1 from [1]:
> 
>    > Btw, why is the driver configuring the PCIe link speed?  Isn't
>    > this something we should be handling in the PCI core?
> 
>    The device comes out of reset at the 5GT/s speed. The driver
>    downloads device firmware, programs PCIe registers, and co-ordinates
>    the transition to 8GT/s.
> 
>    This recipe is device specific and is therefore implemented in the
>    hfi1 driver built on top of PCI core functions and macros.
> 
> Also several DRM drivers seem to do this (see cik_pcie_gen3_enable(),
> si_pcie_gen3_enable()); from [2]:
> 
>    My understanding was that some platforms only bring up the link in gen 1
>    mode for compatibility reasons.
> 
> [1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
> [2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

Downtraining a link "for compatibility reasons" is one of those dumb 
things that devices do. I'm SURPRISED AMD HW does it, although it is 
perfectly permissible by PCIe spec.

>> Another case: Some devices (lower-end GPUs) use silicon (and marketing)
>> that advertises x16, but they're only routed for x8. I'm okay with
>> seeing an informational message in this case. In fact, I didn't know
>> that my Quadro card for three years is only wired for x8 until I was
>> testing this patch.
> 
> Yeah, it's probably OK.  I don't want bug reports from people who
> think something's broken when it's really just a hardware limitation
> of their system.  But hopefully the message is not alarming.

It looks fairly innocent:

[    0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with 8 GT/s x2 link)
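
The two figures in that message can be reproduced with a short calculation.
This sketch assumes 8b/10b encoding overhead for 2.5 and 5 GT/s links and
128b/130b for 8 GT/s and above, computed with integer arithmetic in Mb/s;
lane_mbps() and link_mbps() are made-up names for illustration.

```c
#include <assert.h>

/* Usable Mb/s per lane after line-encoding overhead (integer math). */
static unsigned int lane_mbps(unsigned int mega_transfers)
{
	assert(mega_transfers > 0);

	if (mega_transfers >= 8000)
		return mega_transfers * 128 / 130;  /* 128b/130b encoding */
	return mega_transfers * 8 / 10;             /* 8b/10b encoding */
}

/* Usable Mb/s for a whole link: per-lane rate times lane count. */
static unsigned int link_mbps(unsigned int mega_transfers, unsigned int width)
{
	return lane_mbps(mega_transfers) * width;
}
```

Here link_mbps(5000, 1) gives 4000 (the 4.000 Gb/s available) and
link_mbps(8000, 2) gives 15752 (the 15.752 Gb/s the device is capable of).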

>>> So I'm not sure whether it's better to do this in the core for all
>>> devices, or if we should just add it to the high-performance drivers
>>> that really care.
>>
>> You're thinking "do I really need that bandwidth" because I'm using a
>> function called "_bandwidth_". The point of the change is very far from
>> that: it is to help in system troubleshooting by detecting downtraining
>> conditions.
> 
> I'm not sure what you think I'm thinking :)  My question is whether
> it's worthwhile to print this extra information for *every* PCIe
> device, given that your use case is the tiny percentage of broken
> systems.

I think this information is a lot more useful than a bunch of other info 
that's printed. Is "type 00 class 0x088000" more valuable? What about 
"reg 0x20: [mem 0x9d950000-0x9d95ffff 64bit pref]", which is also 
available under /proc/iomem for those curious?

> If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
> when the device is capable of more than it's getting, that would make
> a lot of sense to me.  The normal case line is more questionable.  I
> think the reason that's there is because the network drivers are very
> performance sensitive and like to see that info all the time.

I agree that can be an acceptable compromise.

> Maybe we need something like this:
> 
>    pcie_print_link_status(struct pci_dev *dev, int verbose)
>    {
>      ...
>      if (bw_avail >= bw_cap) {
>        if (verbose)
>          pci_info(dev, "... available PCIe bandwidth ...");
>      } else
>        pci_info(dev, "... available PCIe bandwidth, limited by ...");
>    }
> 
> So the core could print only the potential problems with:
> 
>    pcie_print_link_status(dev, 0);
> 
> and drivers that really care even if there's no problem could do:
> 
>    pcie_print_link_status(dev, 1);

Sounds good. I'll try to push out an updated patch early next week.

>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> [snip]
>>>> +	/* Look from the device up to avoid downstream ports with no devices. */
>>>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>>> +		return;
>>>
>>> Do we care about Upstream Ports here?
>>
>> YES! Switches. e.g. an x16 switch with 4x downstream ports could
>> downtrain at 8x and 4x, and we'd never catch it.
> 
> OK, I think I see your point: if the upstream port *could* do 16x but
> only trains to 4x, and two endpoints below it are both capable of 4x,
> the endpoints *think* they're happy but in fact they have to share 4x
> when they could use more.
> 
> Bjorn
>
Alex G. July 19, 2018, 3:49 p.m. | #8
On 07/18/2018 08:38 AM, Tal Gilboa wrote:
> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>> [+cc maintainers of drivers that already use pcie_print_link_status()
>> and GPU folks]
[snip]
>>
>>> +    /* Multi-function PCIe share the same link/status. */
>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>> +        return;
>>> +
>>> +    pcie_print_link_status(dev);
>>> +}
> 
> Is this function called by default for every PCIe device? What about 
> VFs? We make an exception for them on our driver since a VF doesn't have 
> access to the needed information in order to provide a meaningful message.

I'm assuming VF means virtual function. pcie_print_link_status() doesn't 
care if it's passed a virtual function. It will try to do its job. 
That's why I bail out three lines above, with 'dev->is_virtfn' check.

Alex
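
A userspace sketch of the filter being discussed: report only for function 0
of a device, and never for a virtual function. struct mock_pci_dev and
should_report_link() are made-up names; PCI_FUNC() is defined the same way as
the kernel macro (the low three bits of devfn).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same definition as the kernel macro: function number is the low 3 bits. */
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/* Userspace stand-in for struct pci_dev; only the fields used here. */
struct mock_pci_dev {
	unsigned int devfn;  /* encoded slot and function number */
	bool is_virtfn;      /* set for SR-IOV virtual functions */
};

/* Mirrors the bail-out in the patch: skip non-zero functions and VFs. */
static bool should_report_link(const struct mock_pci_dev *dev)
{
	assert(dev != NULL);
	return PCI_FUNC(dev->devfn) == 0 && !dev->is_virtfn;
}
```

The is_virtfn test is needed in addition to the function-number test because
a virtual function can itself have function number 0.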
Tal Gilboa July 23, 2018, 5:21 a.m. | #9
On 7/19/2018 6:49 PM, Alex G. wrote:
> 
> 
> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>> and GPU folks]
> [snip]
>>>
>>>> +    /* Multi-function PCIe share the same link/status. */
>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>> +        return;
>>>> +
>>>> +    pcie_print_link_status(dev);
>>>> +}
>>
>> Is this function called by default for every PCIe device? What about 
>> VFs? We make an exception for them on our driver since a VF doesn't 
>> have access to the needed information in order to provide a meaningful 
>> message.
> 
> I'm assuming VF means virtual function. pcie_print_link_status() doesn't 
> care if it's passed a virtual function. It will try to do its job. 
> That's why I bail out three lines above, with 'dev->is_virtfn' check.
> 
> Alex

That's the point - we don't want to call pcie_print_link_status() for 
virtual functions. We make the distinction in our driver. If you want to 
change the code to call this function by default it shouldn't affect the 
current usage.
Alex G. July 23, 2018, 5:01 p.m. | #10
On 07/23/2018 12:21 AM, Tal Gilboa wrote:
> On 7/19/2018 6:49 PM, Alex G. wrote:
>>
>>
>> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>>> and GPU folks]
>> [snip]
>>>>
>>>>> +    /* Multi-function PCIe share the same link/status. */
>>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>>> +        return;
>>>>> +
>>>>> +    pcie_print_link_status(dev);
>>>>> +}
>>>
>>> Is this function called by default for every PCIe device? What about 
>>> VFs? We make an exception for them on our driver since a VF doesn't 
>>> have access to the needed information in order to provide a 
>>> meaningful message.
>>
>> I'm assuming VF means virtual function. pcie_print_link_status() 
>> doesn't care if it's passed a virtual function. It will try to do its 
>> job. That's why I bail out three lines above, with 'dev->is_virtfn' 
>> check.
>>
>> Alex
> 
> That's the point - we don't want to call pcie_print_link_status() for 
> virtual functions. We make the distinction in our driver. If you want to 
> change the code to call this function by default it shouldn't affect the 
> current usage.

I'm not understanding very well what you're asking. I understand you 
want to avoid printing this message on virtual functions, and that's 
already taken care of. I'm also not changing current behavior.  Let's 
get v2 out and start the discussion again based on that.

Alex
Tal Gilboa July 23, 2018, 9:35 p.m. | #11
On 7/23/2018 8:01 PM, Alex G. wrote:
> On 07/23/2018 12:21 AM, Tal Gilboa wrote:
>> On 7/19/2018 6:49 PM, Alex G. wrote:
>>>
>>>
>>> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>>>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>>>> and GPU folks]
>>> [snip]
>>>>>
>>>>>> +    /* Multi-function PCIe share the same link/status. */
>>>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>>>> +        return;
>>>>>> +
>>>>>> +    pcie_print_link_status(dev);
>>>>>> +}
>>>>
>>>> Is this function called by default for every PCIe device? What about 
>>>> VFs? We make an exception for them on our driver since a VF doesn't 
>>>> have access to the needed information in order to provide a 
>>>> meaningful message.
>>>
>>> I'm assuming VF means virtual function. pcie_print_link_status() 
>>> doesn't care if it's passed a virtual function. It will try to do its 
>>> job. That's why I bail out three lines above, with 'dev->is_virtfn' 
>>> check.
>>>
>>> Alex
>>
>> That's the point - we don't want to call pcie_print_link_status() for 
>> virtual functions. We make the distinction in our driver. If you want 
>> to change the code to call this function by default it shouldn't 
>> affect the current usage.
> 
> I'm not understanding very well what you're asking. I understand you 
> want to avoid printing this message on virtual functions, and that's 
> already taken care of. I'm also not changing current behavior.  Let's 
> get v2 out and start the discussion again based on that.
> 
> Alex

Oh ok I see. In this case, please remove the explicit call in mlx4/5 
drivers so it won't be duplicated.

Patch

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac91b6fd0bcd..a88ec8c25dd5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2146,6 +2146,25 @@  static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
 	return dev;
 }
 
+static void pcie_check_upstream_link(struct pci_dev *dev)
+{
+
+	if (!pci_is_pcie(dev))
+		return;
+
+	/* Look from the device up to avoid downstream ports with no devices. */
+	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
+		return;
+
+	/* Multi-function PCIe share the same link/status. */
+	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+		return;
+
+	pcie_print_link_status(dev);
+}
+
 static void pci_init_capabilities(struct pci_dev *dev)
 {
 	/* Enhanced Allocation */
@@ -2181,6 +2200,9 @@  static void pci_init_capabilities(struct pci_dev *dev)
 	/* Advanced Error Reporting */
 	pci_aer_init(dev);
 
+	/* Check link and detect downtrain errors */
+	pcie_check_upstream_link(dev);
+
 	if (pci_probe_reset_function(dev) == 0)
 		dev->reset_fn = 1;
 }