[1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable

Message ID 20200223122057.6504-2-stanspas@amazon.com
State New
Series Improve PCI device post-reset readiness polling

Commit Message

Stanislav Spassov Feb. 23, 2020, 12:20 p.m. UTC
From: Wei Wang <wawei@amazon.de>

The reasonable value for the maximum time to wait for a PCI device to be
ready after reset varies depending on the platform and the reliability
of its set of devices.

Signed-off-by: Wei Wang <wawei@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 .../admin-guide/kernel-parameters.txt         |  5 +++++
 drivers/pci/pci.c                             | 22 ++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

Comments

Bjorn Helgaas Feb. 24, 2020, 2:15 p.m. UTC | #1
[+cc Ashok, Alex, Sinan, Rajat]

On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> From: Wei Wang <wawei@amazon.de>
> 
> The reasonable value for the maximum time to wait for a PCI device to be
> ready after reset varies depending on the platform and the reliability
> of its set of devices.

There are several mechanisms in the spec for reducing these times,
e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
Firmware Spec r3.2, sec 4.6.8, 4.6.9).

I would much rather use standard mechanisms like those instead of
adding kernel parameters.  A user would have to use trial and error
to figure out the value to use with a parameter like this, and that
doesn't feel like a robust approach.

> Signed-off-by: Wei Wang <wawei@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 +++++
>  drivers/pci/pci.c                             | 22 ++++++++++++++-----
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index dbc22d684627..5e4dade9acc8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3653,6 +3653,11 @@
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcie_reset_ready_poll_ms= [PCI,PCIE]
> +			Specifies timeout for PCI(e) device readiness polling
> +			after device reset (in milliseconds).
> +			Default: 60000 = 60 seconds
> +
>  	pcmv=		[HW,PCMCIA] BadgePAD 4
>  
>  	pd_ignore_unused
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d828ca835a98..db9b58ab6c68 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -149,7 +149,19 @@ static int __init pcie_port_pm_setup(char *str)
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
>  
>  /* Time to wait after a reset for device to become responsive */
> -#define PCIE_RESET_READY_POLL_MS 60000
> +#define PCIE_RESET_READY_POLL_MS_DEFAULT 60000
> +
> +int __read_mostly pcie_reset_ready_poll_ms = PCIE_RESET_READY_POLL_MS_DEFAULT;
> +
> +static int __init pcie_reset_ready_poll_ms_setup(char *str)
> +{
> +	int timeout;
> +
> +	if (!kstrtoint(str, 0, &timeout))
> +		pcie_reset_ready_poll_ms = timeout;
> +	return 1;
> +}
> +__setup("pcie_reset_ready_poll_ms=", pcie_reset_ready_poll_ms_setup);
>  
>  /**
>   * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
> @@ -4506,7 +4518,7 @@ int pcie_flr(struct pci_dev *dev)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
>  }
>  EXPORT_SYMBOL_GPL(pcie_flr);
>  
> @@ -4551,7 +4563,7 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
>  }
>  
>  /**
> @@ -4596,7 +4608,7 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>  	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
>  	pci_dev_d3_sleep(dev);
>  
> -	return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
>  }
>  
>  /**
> @@ -4826,7 +4838,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>  {
>  	pcibios_reset_secondary_bus(dev);
>  
> -	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
>  }
>  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
Spassov, Stanislav Feb. 24, 2020, 5:52 p.m. UTC | #2
On Mon, 2020-02-24 at 08:15 -0600, Bjorn Helgaas wrote:
> [+cc Ashok, Alex, Sinan, Rajat]
> 
> On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> > From: Wei Wang <wawei@amazon.de>
> > 
> > The reasonable value for the maximum time to wait for a PCI device to be
> > ready after reset varies depending on the platform and the reliability
> > of its set of devices.
> 
> There are several mechanisms in the spec for reducing these times,
> e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
> Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
> Firmware Spec r3.2, sec 4.6.8, 4.6.9).
> 
> I would much rather use standard mechanisms like those instead of
> adding kernel parameters.  A user would have to use trial and error
> to figure out the value to use with a parameter like this, and that
> doesn't feel like a robust approach.
> 

I agree that supporting the standard mechanisms is highly desirable,
but some sort of fallback poll timeout is necessary on platforms
where none of those mechanisms are supported. Arguably, some kernel
configurability (whether via a kernel parameter, as proposed here,
or some other means) is still warranted.

I also agree there is no robust method to determine a "good value", but
then again - how was the current value for the constant determined? As
suggested in PATCH 2, the idea is to lower the global timeout to avoid
hung tasks when devices break and never come back after reset.

Linux already (partially) supports the _DSM methods you mention:
- acpi_pci_add_bus() queries Function 8 (described in 4.6.8) to set
  ignore_reset_delay on the host bridge
- pci_acpi_optimize_delay() uses Function 9 to set d3cold_delay and d3_delay
  in struct pci_dev, but does not currently store the DL_Up Time,
  FLR Reset Time and VF Enable Time
I guess we could extend struct pci_dev, akin to what PATCH 2 does, to store
all relevant delay values on a per-device basis, with sensible defaults,
overriding them as appropriate from Readiness Time Reporting, _DSM, or a
quirk - something along the lines sketched below.
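
A rough sketch of what I have in mind (untested; the field names are made
up, and all values are in milliseconds):

struct pci_dev {
	...
	u16	dl_up_time;		/* RTR / _DSM Fn 9 "DL_Up Time" */
	u16	flr_reset_time;		/* "FLR Reset Time" */
	u16	vf_enable_time;		/* "VF Enable Time" */
	int	reset_ready_poll_ms;	/* fallback CRS-poll timeout */
	...
};

The reset paths would then use the per-device value, e.g. pcie_flr() would
end with:

	return pci_dev_wait(dev, "FLR", dev->reset_ready_poll_ms);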

Unfortunately, the hardware and firmware at my disposal support none of
the discussed mechanisms, so I do not feel comfortable making these
changes without being able to test them.

Bjorn Helgaas Feb. 27, 2020, 9:45 p.m. UTC | #3
On Mon, Feb 24, 2020 at 05:52:14PM +0000, Spassov, Stanislav wrote:
> On Mon, 2020-02-24 at 08:15 -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> > > From: Wei Wang <wawei@amazon.de>
> > > 
> > > The reasonable value for the maximum time to wait for a PCI
> > > device to be ready after reset varies depending on the platform
> > > and the reliability of its set of devices.
> > 
> > There are several mechanisms in the spec for reducing these times,
> > e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
> > Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
> > Firmware Spec r3.2, sec 4.6.8, 4.6.9).
> > 
> > I would much rather use standard mechanisms like those instead of
> > adding kernel parameters.  A user would have to use trial and
> > error to figure out the value to use with a parameter like this,
> > and that doesn't feel like a robust approach.
> 
> I agree that supporting the standard mechanisms is highly desirable,
> but some sort of fallback poll timeout is necessary on
> platforms where none of those mechanisms are supported. Arguably,
> some kernel configurability (whether via a kernel parameter, as
> proposed here, or some other means) is still warranted.

IIUC we basically have two issues: 1) the default 60 second timeout is
too long, and 2) you'd like to reduce the delays further à la the
Device Readiness _DSM even for firmware that doesn't support that.

The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
seconds for device to become ready after FLR") and is probably too
long.  We probably should pick a smaller value based on numbers from
the spec and make quirks for devices that needed more time.

As far as reducing delays for specific devices, if we can identify
them via Vendor/Device ID, can we make per-device values that could be
set either by the _DSM or by a quirk?
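
For illustration, such a quirk might look roughly like this (a sketch
only; the reset_ready_poll_ms field is hypothetical and the device ID
is a placeholder):

static void quirk_slow_reset_ready(struct pci_dev *dev)
{
	/* hypothetical FPGA card that needs ~10s to come back after FLR */
	dev->reset_ready_poll_ms = 10 * 1000;
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_XILINX, 0x1234,
			 quirk_slow_reset_ready);

The _DSM path could then fill in the same field, similar to what
pci_acpi_optimize_delay() already does for d3_delay/d3cold_delay.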

I'm trying to wriggle out of adding yet more PCI kernel parameters
because people tend to stumble across them and pass them around on
bulletin boards as ways to "fix" or "speed up" things that really
should be addressed in better ways.

> I also agree there is no robust method to determine a "good value", but
> then again - how was the current value for the constant determined? As
> suggested in PATCH 2, the idea is to lower the global timeout to avoid
> hung tasks when devices break and never come back after reset.

I don't remember exactly how we came up with 60 seconds; I suspect it
was just a convenient large number.

Bjorn
Sinan Kaya Feb. 27, 2020, 11:44 p.m. UTC | #4
On 2/27/2020 4:45 PM, Bjorn Helgaas wrote:
> The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
> seconds for device to become ready after FLR") and is probably too
> long.  We probably should pick a smaller value based on numbers from
> the spec and make quirks for devices that needed more time.

If I remember right, there was no time mentioned about how long to
wait. The spec says the device should send CRS as long as it is not ready.
Ashok Raj Feb. 28, 2020, 2:18 a.m. UTC | #5
On Thu, Feb 27, 2020 at 06:44:56PM -0500, Sinan Kaya wrote:
> On 2/27/2020 4:45 PM, Bjorn Helgaas wrote:
> > The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
> > seconds for device to become ready after FLR") and is probably too
> > long.  We probably should pick a smaller value based on numbers from
> > the spec and make quirks for devices that needed more time.
> 
> If I remember right, there was no time mentioned about how long to
> wait. The spec says the device should send CRS as long as it is not ready.

Not exactly... there are some rules the system is required to follow
after a conventional reset.

Look for "The second set of rules addresses requirements placed on the system"

I'm looking at the 5.0 spec (around page 553) :-).

In general, 1s seems good enough for most cases. For ports that support
greater than 5 GT/s transfer speeds, it says 100ms after link training
completes. I'm not sure if this means 100ms after getting a DLLSC event?
Sinan Kaya March 2, 2020, 4:39 p.m. UTC | #6
On 2/27/2020 9:18 PM, Raj, Ashok wrote:
>> If I remember right, there was no time mentioned about how long to
>> wait. The spec says the device should send CRS as long as it is not ready.
> Not exactly... there are some rules the system is required to follow
> after a conventional reset.

Yes, but CRS happens after function-level reset, D3hot, etc. too, not
just after a conventional reset.

1 second is too aggressive. We already have proof that several PCIe
cards take their time during FLR, especially FPGA cards, on the order
of 10 seconds.

The current code waits up to 60 seconds. If the card shows up before
that, we happily return.
Ashok Raj March 2, 2020, 5:37 p.m. UTC | #7
On Mon, Mar 02, 2020 at 11:39:39AM -0500, Sinan Kaya wrote:
> On 2/27/2020 9:18 PM, Raj, Ashok wrote:
> >> If I remember right, there was no time mentioned about how long to
> >> wait. The spec says the device should send CRS as long as it is not ready.
> > Not exactly... there are some rules the system is required to follow
> > after a conventional reset.
> 
> Yes, but CRS happens after function-level reset, D3hot, etc. too, not
> just after a conventional reset.
> 
> 1 second is too aggressive. We already have proof that several PCIe
> cards take their time during FLR, especially FPGA cards, on the order
> of 10 seconds.

Aren't the rules specified in 7.9.17 Readiness Time Reporting Extended
Capability sufficient to cover this?

If a device doesn't have them, we could let the driver supply a
device-specific value to override the default.

> 
> The current code waits up to 60 seconds. If the card shows up before
> that, we happily return.
> 

But 7.9.17.2 (Readiness Time Reporting 1 Register) says devices can
defer reporting by not setting the Valid bit; if it remains clear
after 60s, system software can assume no valid values will be reported.
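
For reference, probing that capability could look something like this
(a sketch only; Linux does not define these constants today, and the
offset and bit position are my reading of sec 7.9.17, so double-check
against the spec):

#define PCI_EXT_CAP_ID_RTR	0x22		/* Readiness Time Reporting */
#define PCI_RTR_TIME1		0x04		/* Readiness Time Reporting 1 */
#define  PCI_RTR_TIME1_VALID	0x80000000	/* assumed Valid bit */

static bool pci_rtr_times_valid(struct pci_dev *dev)
{
	u32 time1;
	int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_RTR);

	if (!pos)
		return false;

	pci_read_config_dword(dev, pos + PCI_RTR_TIME1, &time1);
	return time1 & PCI_RTR_TIME1_VALID;
}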

Maybe keep the system default at something more reasonable (after some
testing in the community), and reserve this insane 60s for devices that
support the optional reporting capability?

Cheers,
Ashok
Sinan Kaya March 2, 2020, 6:30 p.m. UTC | #8
On 3/2/2020 12:37 PM, Raj, Ashok wrote:
>> 1 second is too aggressive. We already have proof that several PCIe
>> cards take their time during FLR, especially FPGA cards, on the order
>> of 10 seconds.
> Aren't the rules specified in 7.9.17 Readiness Time Reporting Extended
> Capability sufficient to cover this?
> 

Yes, readiness would work too, but it is not a mandatory feature.
Readiness is yet another alternative to CRS, advertised as the new
flashy way of doing things. The problem is backwards compatibility.

Do you want to break all existing/legacy drivers for an optional,
newly spec-defined feature?

> If a device doesn't have them, we could let the driver supply a
> device-specific value to override the default.
> 
>> The current code waits up to 60 seconds. If the card shows up before
>> that, we happily return.
>>
> But 7.9.17.2 (Readiness Time Reporting 1 Register) says devices can
> defer reporting by not setting the Valid bit; if it remains clear
> after 60s, system software can assume no valid values will be reported.
> 
> Maybe keep the system default at something more reasonable (after some
> testing in the community), and reserve this insane 60s for devices that
> support the optional reporting capability?

I think we should add the readiness capability on top of CRS, but not
exclusively.

If readiness is supported, do the spec-defined behavior. If not,
fall back to CRS polling for legacy devices.

I remember Bjorn mentioning the readiness capability. Maybe this is
the time to implement it.

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dbc22d684627..5e4dade9acc8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3653,6 +3653,11 @@ 
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcie_reset_ready_poll_ms= [PCI,PCIE]
+			Specifies timeout for PCI(e) device readiness polling
+			after device reset (in milliseconds).
+			Default: 60000 = 60 seconds
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d828ca835a98..db9b58ab6c68 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -149,7 +149,19 @@  static int __init pcie_port_pm_setup(char *str)
 __setup("pcie_port_pm=", pcie_port_pm_setup);
 
 /* Time to wait after a reset for device to become responsive */
-#define PCIE_RESET_READY_POLL_MS 60000
+#define PCIE_RESET_READY_POLL_MS_DEFAULT 60000
+
+int __read_mostly pcie_reset_ready_poll_ms = PCIE_RESET_READY_POLL_MS_DEFAULT;
+
+static int __init pcie_reset_ready_poll_ms_setup(char *str)
+{
+	int timeout;
+
+	if (!kstrtoint(str, 0, &timeout))
+		pcie_reset_ready_poll_ms = timeout;
+	return 1;
+}
+__setup("pcie_reset_ready_poll_ms=", pcie_reset_ready_poll_ms_setup);
 
 /**
  * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
@@ -4506,7 +4518,7 @@  int pcie_flr(struct pci_dev *dev)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
 
@@ -4551,7 +4563,7 @@  static int pci_af_flr(struct pci_dev *dev, int probe)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
 }
 
 /**
@@ -4596,7 +4608,7 @@  static int pci_pm_reset(struct pci_dev *dev, int probe)
 	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
 	pci_dev_d3_sleep(dev);
 
-	return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
 }
 
 /**
@@ -4826,7 +4838,7 @@  int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
 {
 	pcibios_reset_secondary_bus(dev);
 
-	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
 }
 EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);