
[v2,2/2] PCI: Add missing link delays required by the PCIe spec

Message ID 20191004123947.11087-3-mika.westerberg@linux.intel.com
State Superseded
Delegated to: Bjorn Helgaas
Series PCI: Add missing link delays

Commit Message

Mika Westerberg Oct. 4, 2019, 12:39 p.m. UTC
Currently Linux does not follow the PCIe spec regarding the required delays
after reset. A concrete example is a Thunderbolt add-in-card that
consists of a PCIe switch and two PCIe endpoints:

  +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
                                  +-01.0-[04-36]-- DS hotplug port
                                  +-02.0-[37]----00.0 xHCI controller
                                  \-04.0-[38-6b]-- DS hotplug port

The root port (1b.0) and the PCIe switch downstream ports are all PCIe
gen3, so they support 8 GT/s link speeds.

We wait for the PCIe hierarchy to enter D3cold (runtime):

  pcieport 0000:00:1b.0: power state changed by ACPI to D3cold

When it wakes up from D3cold, according to PCIe 4.0 section 5.8 the
PCIe switch is put into reset and its power is re-applied. This means
that we must follow the rules in PCIe 4.0 section 6.6.1.

For the PCIe gen3 ports we are dealing with here, the following applies:

  With a Downstream Port that supports Link speeds greater than 5.0
  GT/s, software must wait a minimum of 100 ms after Link training
  completes before sending a Configuration Request to the device
  immediately below that Port. Software can determine when Link training
  completes by polling the Data Link Layer Link Active bit or by setting
  up an associated interrupt (see Section 6.7.3.3).

Translating this into the above topology, we would need to do the
following (DLLLA stands for Data Link Layer Link Active; an
illustrative sketch follows the list):

  0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
  0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
  0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
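
As an illustration only (this is not part of the patch), a minimal sketch of
what such a DLLLA-based wait could look like, assuming the port supports Data
Link Layer Link Active Reporting; the helper name and the 1 s poll timeout
are made up for the example:

  /* Sketch: wait for the link to train, then the 100 ms of PCIe 6.6.1 */
  static bool sketch_wait_dllla_then_delay(struct pci_dev *bridge)
  {
          int timeout = 1000;
          u16 lnksta;

          for (;;) {
                  pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &lnksta);
                  if (lnksta & PCI_EXP_LNKSTA_DLLLA)
                          break;
                  if (timeout <= 0)
                          return false;   /* link never trained */
                  msleep(10);
                  timeout -= 10;
          }

          /* >5 GT/s ports: 100 ms after Link training completes */
          msleep(100);
          return true;
  }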

I've instrumented the kernel with some additional logging so we can see
the actual delays performed:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
  pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms

For the switch upstream port (01:00.0, reachable through the 00:1b.0
root port) we wait for 100 ms, but without taking the DLLLA requirement
into account. We then wait 10 ms for the D3hot -> D0 transition of the
root port and the two downstream hotplug ports. This means that we
deviate from what the spec requires.

Performing the same check for system sleep (s2idle) transitions turns
out to be even worse: none of the mandatory delays are performed. If
this were S3 instead of s2idle, then according to PCI FW spec 3.2
section 4.6.8 there is a specific _DSM that allows the OS to skip the
delays, but this platform does not provide the _DSM and does not go to
S3 anyway, so no firmware is involved that could already handle these
delays.

On this particular platform these delays are not actually needed,
because there is an additional delay as part of the ACPI power resource
that is used to turn on power to the hierarchy. But since that
additional delay is not required by any of the standards (PCIe, ACPI),
it is not present on Intel Ice Lake, for example, where missing the
mandatory delays causes pciehp to start tearing down the stack too
early (links are not yet trained). Below is an example of what it looks
like when this happens:

  pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
  pcieport 0000:87:04.0: PME# disabled
  pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
  pcieport 0000:86:00.0: Refused to change power state, currently in D3
  pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
  pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
  ...

There is also one reported case (see the bugzilla link below) where
the missing delay causes the xHCI on a Titan Ridge controller to fail
to runtime resume when a USB-C dock is plugged in. This does not
involve pciehp; instead the PCI core fails to runtime resume the xHCI
device:

  pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
  xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
  xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
  xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
  ...

For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
that is called on the PCI core resume and runtime resume paths if the
bridge entered D3cold (and thus went through reset).

This is the second attempt to add the missing delays. The previous
solution, in commit c2bf1fc212f7 ("PCI: Add missing link delays
required by the PCIe spec"), was reverted because of two issues it
caused:

  1. One system became unresponsive after S3 resume due to the PME
     service spinning in pcie_pme_work_fn(). The root port in question
     reports that the xHCI sent a PME but the xHCI device itself does
     not have its PME status set. The PME status bit is never cleared
     in the root port, resulting in an indefinite loop in
     pcie_pme_work_fn().

  2. Resume slows down if the root/downstream port does not support
     Data Link Layer Active Reporting, because pcie_wait_for_link_delay()
     waits 1100 ms in that case.

This version should avoid the above issues because we restrict the delay
to happen only if the port went into D3cold.
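
Condensed from the pci_pm_resume_noirq() hunk in this patch (the
surrounding resume steps are simplified here), the call-site pattern
looks roughly like this:

  static int pci_pm_resume_noirq(struct device *dev)
  {
          struct pci_dev *pci_dev = to_pci_dev(dev);
          /* Sample this before the device is powered back up to D0 */
          bool d3cold = pci_dev->current_state == PCI_D3cold;

          pci_pm_default_resume_early(pci_dev);

          /* Only a port that was in D3cold went through a conventional reset */
          if (d3cold)
                  pci_bridge_wait_for_secondary_bus(pci_dev);

          return 0;
  }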

Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci-driver.c | 18 ++++++++
 drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
 drivers/pci/pci.h        |  1 +
 3 files changed, 104 insertions(+), 7 deletions(-)

Comments

Bjorn Helgaas Oct. 26, 2019, 2:19 p.m. UTC | #1
On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> Currently Linux does not follow PCIe spec regarding the required delays
> after reset. A concrete example is a Thunderbolt add-in-card that
> consists of a PCIe switch and two PCIe endpoints:
> 
>   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
>                                   +-01.0-[04-36]-- DS hotplug port
>                                   +-02.0-[37]----00.0 xHCI controller
>                                   \-04.0-[38-6b]-- DS hotplug port
> 
> The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> gen3 so they support 8GT/s link speeds.
> 
> We wait for the PCIe hierarchy to enter D3cold (runtime):
> 
>   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> 
> When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> PCIe switch is put to reset and its power is re-applied. This means that
> we must follow the rules in PCIe 4.0 section 6.6.1.

If you have the PCIe 5.0 spec, can you update these references to
that?  If not, I'm happy to do it for you.

> For the PCIe gen3 ports we are dealing with here, the following applies:
> 
>   With a Downstream Port that supports Link speeds greater than 5.0
>   GT/s, software must wait a minimum of 100 ms after Link training
>   completes before sending a Configuration Request to the device
>   immediately below that Port. Software can determine when Link training
>   completes by polling the Data Link Layer Link Active bit or by setting
>   up an associated interrupt (see Section 6.7.3.3).
> 
> Translating this into the above topology we would need to do this (DLLLA
> stands for Data Link Layer Link Active):
> 
>   0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
>   0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
>   0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
> 
> I've instrumented the kernel with some additional logging so we can see
> the actual delays performed:
> 
>   pcieport 0000:00:1b.0: power state changed by ACPI to D0
>   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
>   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> 
> For the switch upstream port (01:00.0 reachable through 00:1b.0 root
> port) we wait for 100 ms but not taking into account the DLLLA
> requirement. We then wait 10 ms for D3hot -> D0 transition of the root
> port and the two downstream hotplug ports. This means that we deviate
> from what the spec requires.
> 
> Performing the same check for system sleep (s2idle) transitions it turns
> out to be even worse. None of the mandatory delays are performed. If
> this would be S3 instead of s2idle then according to PCI FW spec 3.2
> section 4.6.8. there is a specific _DSM that allows the OS to skip the
> delays but this platform does not provide the _DSM and does not go to S3
> anyway so no firmware is involved that could already handle these
> delays.
> 
> On this particular platform these delays are not actually needed because
> there is an additional delay as part of the ACPI power resource that is
> used to turn on power to the hierarchy but since that additional delay
> is not required by any of standards (PCIe, ACPI) it is not present in
> the Intel Ice Lake, for example where missing the mandatory delays
> causes pciehp to start tearing down the stack too early (links are not
> yet trained). Below is an example how it looks like when this happens:
> 
>   pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
>   pcieport 0000:87:04.0: PME# disabled
>   pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
>   pcieport 0000:86:00.0: Refused to change power state, currently in D3
>   pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
>   pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
>   ...
> 
> There is also one reported case (see the bugzilla link below) where the
> missing delay causes xHCI on a Titan Ridge controller fail to runtime
> resume when USB-C dock is plugged. This does not involve pciehp but
> instead it PCI core fails to runtime resume the xHCI device:
> 
>   pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
>   xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
>   xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
>   xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
>   ...
> 
> For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
> that is called on PCI core resume and runtime resume paths accordingly
> if the bridge entered D3cold (and thus went through reset).
> 
> This is second attempt to add the missing delays. The previous solution
> in commit c2bf1fc212f7 ("PCI: Add missing link delays required by the
> PCIe spec") was reverted because of two issues it caused:
> 
>   1. One system become unresponsive after S3 resume due to PME service
>      spinning in pcie_pme_work_fn(). The root port in question reports
>      that the xHCI sent PME but the xHCI device itself does not have PME
>      status set. The PME status bit is never cleared in the root port
>      resulting the indefinite loop in pcie_pme_work_fn().
> 
>   2. Slows down resume if the root/downstream port does not support
>      Data Link Layer Active Reporting because pcie_wait_for_link_delay()
>      waits 1100 ms in that case.
> 
> This version should avoid the above issues because we restrict the delay
> to happen only if the port went into D3cold.
> 
> Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
> Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  drivers/pci/pci-driver.c | 18 ++++++++
>  drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
>  drivers/pci/pci.h        |  1 +
>  3 files changed, 104 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a8124e47bf6e..74a144c9cf4e 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -917,6 +917,7 @@ static int pci_pm_suspend_noirq(struct device *dev)
>  static int pci_pm_resume_noirq(struct device *dev)
>  {
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	bool d3cold = pci_dev->current_state == PCI_D3cold;
>  	struct device_driver *drv = dev->driver;
>  	int error = 0;
>  
> @@ -947,6 +948,14 @@ static int pci_pm_resume_noirq(struct device *dev)
>  
>  	pcie_pme_root_status_cleanup(pci_dev);
>  
> +	/*
> +	 * If the hierarchy went into D3cold wait for the secondary bus to
> +	 * become accessible. This is important for PCIe to prevent pciehp
> +	 * from tearing down the downstream devices too soon.

The pciehp connection isn't obvious to me.

> +	 */
> +	if (d3cold)
> +		pci_bridge_wait_for_secondary_bus(pci_dev);

This will need to be rebased on top of my pci/pm branch, but I think
that's minor.

Can we move this closer to where we initiate the reset?  It's pretty
hard to tell from looking at pci_pm_resume_noirq() that there's a
reset happening here.

For D3cold->D0, I guess that would be somewhere down in
platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
What about the mid_pci_set_power_state() path?  Does that need this
too?

In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
I assume is not for PCI devices themselves) *does* say firmware is
responsible for sequencing delays, so I would tend to assume it's
really firmware's job and we shouldn't need to do this in the kernel
at all.

What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
Port) is in D3hot, I'm not really clear on the state of its link.  If
the link is down, I assume putting the bridge in D0 will bring it up
and we'd have to wait for that?  If so, we'd need to do something in
the reset path, e.g., pci_pm_reset()?

Sketch of control flow for this patch:

  pci_pm_resume_noirq
    if (...)
      pci_pm_default_resume_early
	pci_power_up
	  if (...)
	    platform_pci_set_power_state
	      acpi_pci_set_power_state
	  pci_raw_set_power_state(D0)
+   if (d3cold)
+     pci_bridge_wait_for_secondary_bus
    if (legacy)
      pci_legacy_resume_early
    else
      drv->pm->resume_noirq

>  	if (drv && drv->pm && drv->pm->resume_noirq)
>  		error = drv->pm->resume_noirq(dev);
Mika Westerberg Oct. 28, 2019, 11:28 a.m. UTC | #2
On Sat, Oct 26, 2019 at 09:19:38AM -0500, Bjorn Helgaas wrote:
> On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> > Currently Linux does not follow PCIe spec regarding the required delays
> > after reset. A concrete example is a Thunderbolt add-in-card that
> > consists of a PCIe switch and two PCIe endpoints:
> > 
> >   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> >                                   +-01.0-[04-36]-- DS hotplug port
> >                                   +-02.0-[37]----00.0 xHCI controller
> >                                   \-04.0-[38-6b]-- DS hotplug port
> > 
> > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > gen3 so they support 8GT/s link speeds.
> > 
> > We wait for the PCIe hierarchy to enter D3cold (runtime):
> > 
> >   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> > 
> > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > PCIe switch is put to reset and its power is re-applied. This means that
> > we must follow the rules in PCIe 4.0 section 6.6.1.
> 
> If you have the PCIe 5.0 spec, can you update these references to
> that?  If not, I'm happy to do it for you.

I do have it and sure, I'll update them.

> > For the PCIe gen3 ports we are dealing with here, the following applies:
> > 
> >   With a Downstream Port that supports Link speeds greater than 5.0
> >   GT/s, software must wait a minimum of 100 ms after Link training
> >   completes before sending a Configuration Request to the device
> >   immediately below that Port. Software can determine when Link training
> >   completes by polling the Data Link Layer Link Active bit or by setting
> >   up an associated interrupt (see Section 6.7.3.3).
> > 
> > Translating this into the above topology we would need to do this (DLLLA
> > stands for Data Link Layer Link Active):
> > 
> >   0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
> >   0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
> >   0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
> > 
> > I've instrumented the kernel with some additional logging so we can see
> > the actual delays performed:
> > 
> >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> >   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> >   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> >   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> >   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> > 
> > For the switch upstream port (01:00.0 reachable through 00:1b.0 root
> > port) we wait for 100 ms but not taking into account the DLLLA
> > requirement. We then wait 10 ms for D3hot -> D0 transition of the root
> > port and the two downstream hotplug ports. This means that we deviate
> > from what the spec requires.
> > 
> > Performing the same check for system sleep (s2idle) transitions it turns
> > out to be even worse. None of the mandatory delays are performed. If
> > this would be S3 instead of s2idle then according to PCI FW spec 3.2
> > section 4.6.8. there is a specific _DSM that allows the OS to skip the
> > delays but this platform does not provide the _DSM and does not go to S3
> > anyway so no firmware is involved that could already handle these
> > delays.
> > 
> > On this particular platform these delays are not actually needed because
> > there is an additional delay as part of the ACPI power resource that is
> > used to turn on power to the hierarchy but since that additional delay
> > is not required by any of standards (PCIe, ACPI) it is not present in
> > the Intel Ice Lake, for example where missing the mandatory delays
> > causes pciehp to start tearing down the stack too early (links are not
> > yet trained). Below is an example how it looks like when this happens:
> > 
> >   pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
> >   pcieport 0000:87:04.0: PME# disabled
> >   pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
> >   pcieport 0000:86:00.0: Refused to change power state, currently in D3
> >   pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
> >   pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> >   ...
> > 
> > There is also one reported case (see the bugzilla link below) where the
> > missing delay causes xHCI on a Titan Ridge controller fail to runtime
> > resume when USB-C dock is plugged. This does not involve pciehp but
> > instead it PCI core fails to runtime resume the xHCI device:
> > 
> >   pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> >   pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
> >   xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
> >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
> >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> >   ...
> > 
> > For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
> > that is called on PCI core resume and runtime resume paths accordingly
> > if the bridge entered D3cold (and thus went through reset).
> > 
> > This is second attempt to add the missing delays. The previous solution
> > in commit c2bf1fc212f7 ("PCI: Add missing link delays required by the
> > PCIe spec") was reverted because of two issues it caused:
> > 
> >   1. One system become unresponsive after S3 resume due to PME service
> >      spinning in pcie_pme_work_fn(). The root port in question reports
> >      that the xHCI sent PME but the xHCI device itself does not have PME
> >      status set. The PME status bit is never cleared in the root port
> >      resulting the indefinite loop in pcie_pme_work_fn().
> > 
> >   2. Slows down resume if the root/downstream port does not support
> >      Data Link Layer Active Reporting because pcie_wait_for_link_delay()
> >      waits 1100 ms in that case.
> > 
> > This version should avoid the above issues because we restrict the delay
> > to happen only if the port went into D3cold.
> > 
> > Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
> > Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >  drivers/pci/pci-driver.c | 18 ++++++++
> >  drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
> >  drivers/pci/pci.h        |  1 +
> >  3 files changed, 104 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index a8124e47bf6e..74a144c9cf4e 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -917,6 +917,7 @@ static int pci_pm_suspend_noirq(struct device *dev)
> >  static int pci_pm_resume_noirq(struct device *dev)
> >  {
> >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > +	bool d3cold = pci_dev->current_state == PCI_D3cold;
> >  	struct device_driver *drv = dev->driver;
> >  	int error = 0;
> >  
> > @@ -947,6 +948,14 @@ static int pci_pm_resume_noirq(struct device *dev)
> >  
> >  	pcie_pme_root_status_cleanup(pci_dev);
> >  
> > +	/*
> > +	 * If the hierarchy went into D3cold wait for the secondary bus to
> > +	 * become accessible. This is important for PCIe to prevent pciehp
> > +	 * from tearing down the downstream devices too soon.
> 
> The pciehp connection isn't obvious to me.

I tried to explain it in the changelog but maybe I'll just not mention
it here at all to avoid confusion.

> > +	 */
> > +	if (d3cold)
> > +		pci_bridge_wait_for_secondary_bus(pci_dev);
> 
> This will need to be rebased on top of my pci/pm branch, but I think
> that's minor.

OK.

> Can we move this closer to where we initiate the reset?  It's pretty
> hard to tell from looking at pci_pm_resume_noirq() that there's a
> reset happening here.

Well, we actually don't do an explicit reset but instead we power the
thing on from D3cold.

> For D3cold->D0, I guess that would be somewhere down in
> platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
> What about the mid_pci_set_power_state() path?  Does that need this
> too?

I can take a look at whether it can be placed there. Yes,
mid_pci_set_power_state() may at least in theory need it too, although
I don't remember any MID platforms with real PCIe devices.

> In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
> I assume is not for PCI devices themselves) *does* say firmware is
> responsible for sequencing delays, so I would tend to assume it's
> really firmware's job and we shouldn't need to do this in the kernel
> at all.

_ON is also for the PCI device itself, but all those methods are not
just for PCI, so they don't really talk about any PCI-specific delays.
You need to look at other specs. For example, PCI FW spec v3.2 section
4.6.9 says this about the _DSM that can be used to decrease the delays:

  This function is optional. If the platform does not provide it, the
  operating system must adhere to all timing requirements as described
  in the PCI Express Base specification and/or applicable form factor
  specification, including values contained in Readiness Time Reporting
  capability structure.
  
Relevant PCIe spec section is 6.6.1 (also referenced in the changelog).

[If you have access to the ECN titled "Async Hot-Plug Updates" (you can
find it on the PCI-SIG site), that document has a nice table about the
delays on page 32. It compares surprise hotplug with downstream port
containment for async hotplug.]

> What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
> Port) is in D3hot, I'm not really clear on the state of its link.  If
> the link is down, I assume putting the bridge in D0 will bring it up
> and we'd have to wait for that?  If so, we'd need to do something in
> the reset path, e.g., pci_pm_reset()?

AFAIK the link goes into L1 when the function is programmed to any
D-state other than D0. If we don't put the device into D3cold then I
think it stays in L1, from where it can be brought back by writing D0
to the PM register, which needs no delay other than the D3hot -> D0
recovery time (10 ms).
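
For illustration (not code from this patch), that D3hot -> D0
transition boils down to roughly the following; the helper name is made
up and 10 ms is the recovery time mentioned above:

  static void sketch_d3hot_to_d0(struct pci_dev *dev)
  {
          u16 pmcsr;

          pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
          pmcsr &= ~PCI_PM_CTRL_STATE_MASK;       /* PowerState = 0 -> D0 */
          pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);

          msleep(10);     /* D3hot -> D0 recovery time */
  }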

[There is a table in PCIe spec 5.0 section 5.3.2 that shows the link
state and power management D-state relationship.]
Bjorn Helgaas Oct. 28, 2019, 1:42 p.m. UTC | #3
On Mon, Oct 28, 2019 at 01:28:52PM +0200, Mika Westerberg wrote:
> On Sat, Oct 26, 2019 at 09:19:38AM -0500, Bjorn Helgaas wrote:
> > On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> > > Currently Linux does not follow PCIe spec regarding the required delays
> > > after reset. A concrete example is a Thunderbolt add-in-card that
> > > consists of a PCIe switch and two PCIe endpoints:
> > > 
> > >   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> > >                                   +-01.0-[04-36]-- DS hotplug port
> > >                                   +-02.0-[37]----00.0 xHCI controller
> > >                                   \-04.0-[38-6b]-- DS hotplug port
> > > 
> > > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > > gen3 so they support 8GT/s link speeds.
> > > 
> > > We wait for the PCIe hierarchy to enter D3cold (runtime):
> > > 
> > >   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> > > 
> > > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > > PCIe switch is put to reset and its power is re-applied. This means that
> > > we must follow the rules in PCIe 4.0 section 6.6.1.
> > 
> > If you have the PCIe 5.0 spec, can you update these references to
> > that?  If not, I'm happy to do it for you.
> 
> I do have it and sure, I'll update them.
> 
> > > For the PCIe gen3 ports we are dealing with here, the following applies:
> > > 
> > >   With a Downstream Port that supports Link speeds greater than 5.0
> > >   GT/s, software must wait a minimum of 100 ms after Link training
> > >   completes before sending a Configuration Request to the device
> > >   immediately below that Port. Software can determine when Link training
> > >   completes by polling the Data Link Layer Link Active bit or by setting
> > >   up an associated interrupt (see Section 6.7.3.3).
> > > 
> > > Translating this into the above topology we would need to do this (DLLLA
> > > stands for Data Link Layer Link Active):
> > > 
> > >   0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
> > >   0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
> > >   0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
> > > 
> > > I've instrumented the kernel with some additional logging so we can see
> > > the actual delays performed:
> > > 
> > >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> > >   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> > >   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> > >   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> > >   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> > > 
> > > For the switch upstream port (01:00.0 reachable through 00:1b.0 root
> > > port) we wait for 100 ms but not taking into account the DLLLA
> > > requirement. We then wait 10 ms for D3hot -> D0 transition of the root
> > > port and the two downstream hotplug ports. This means that we deviate
> > > from what the spec requires.
> > > 
> > > Performing the same check for system sleep (s2idle) transitions it turns
> > > out to be even worse. None of the mandatory delays are performed. If
> > > this would be S3 instead of s2idle then according to PCI FW spec 3.2
> > > section 4.6.8. there is a specific _DSM that allows the OS to skip the
> > > delays but this platform does not provide the _DSM and does not go to S3
> > > anyway so no firmware is involved that could already handle these
> > > delays.
> > > 
> > > On this particular platform these delays are not actually needed because
> > > there is an additional delay as part of the ACPI power resource that is
> > > used to turn on power to the hierarchy but since that additional delay
> > > is not required by any of standards (PCIe, ACPI) it is not present in
> > > the Intel Ice Lake, for example where missing the mandatory delays
> > > causes pciehp to start tearing down the stack too early (links are not
> > > yet trained). Below is an example how it looks like when this happens:
> > > 
> > >   pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
> > >   pcieport 0000:87:04.0: PME# disabled
> > >   pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
> > >   pcieport 0000:86:00.0: Refused to change power state, currently in D3
> > >   pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
> > >   pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > >   ...
> > > 
> > > There is also one reported case (see the bugzilla link below) where the
> > > missing delay causes xHCI on a Titan Ridge controller fail to runtime
> > > resume when USB-C dock is plugged. This does not involve pciehp but
> > > instead it PCI core fails to runtime resume the xHCI device:
> > > 
> > >   pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > >   pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
> > >   xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
> > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
> > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > >   ...
> > > 
> > > For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
> > > that is called on PCI core resume and runtime resume paths accordingly
> > > if the bridge entered D3cold (and thus went through reset).
> > > 
> > > This is second attempt to add the missing delays. The previous solution
> > > in commit c2bf1fc212f7 ("PCI: Add missing link delays required by the
> > > PCIe spec") was reverted because of two issues it caused:
> > > 
> > >   1. One system become unresponsive after S3 resume due to PME service
> > >      spinning in pcie_pme_work_fn(). The root port in question reports
> > >      that the xHCI sent PME but the xHCI device itself does not have PME
> > >      status set. The PME status bit is never cleared in the root port
> > >      resulting the indefinite loop in pcie_pme_work_fn().
> > > 
> > >   2. Slows down resume if the root/downstream port does not support
> > >      Data Link Layer Active Reporting because pcie_wait_for_link_delay()
> > >      waits 1100 ms in that case.
> > > 
> > > This version should avoid the above issues because we restrict the delay
> > > to happen only if the port went into D3cold.
> > > 
> > > Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
> > > Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > > ---
> > >  drivers/pci/pci-driver.c | 18 ++++++++
> > >  drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
> > >  drivers/pci/pci.h        |  1 +
> > >  3 files changed, 104 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > > index a8124e47bf6e..74a144c9cf4e 100644
> > > --- a/drivers/pci/pci-driver.c
> > > +++ b/drivers/pci/pci-driver.c
> > > @@ -917,6 +917,7 @@ static int pci_pm_suspend_noirq(struct device *dev)
> > >  static int pci_pm_resume_noirq(struct device *dev)
> > >  {
> > >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > +	bool d3cold = pci_dev->current_state == PCI_D3cold;
> > >  	struct device_driver *drv = dev->driver;
> > >  	int error = 0;
> > >  
> > > @@ -947,6 +948,14 @@ static int pci_pm_resume_noirq(struct device *dev)
> > >  
> > >  	pcie_pme_root_status_cleanup(pci_dev);
> > >  
> > > +	/*
> > > +	 * If the hierarchy went into D3cold wait for the secondary bus to
> > > +	 * become accessible. This is important for PCIe to prevent pciehp
> > > +	 * from tearing down the downstream devices too soon.

> > Can we move this closer to where we initiate the reset?  It's pretty
> > hard to tell from looking at pci_pm_resume_noirq() that there's a
> > reset happening here.
> 
> Well we actually don't do explicit reset but instead we power the thing
> on from D3cold.

The point is that it's too hard to maintain unless we can connect the
delay with the related hardware event.

> > For D3cold->D0, I guess that would be somewhere down in
> > platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
> > What about the mid_pci_set_power_state() path?  Does that need this
> > too?
> 
> I can take a look if it can be placed there. Yes,
> mid_pci_set_power_state() may at least in theory need it too although I
> don't remember any MID platforms with real PCIe devices.

I don't know how the OS is supposed to know if these are real PCIe
devices or not.  If we don't know, we have to assume they work per
spec and may require the delays per spec.

> > In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
> > I assume is not for PCI devices themselves) *does* say firmware is
> > responsible for sequencing delays, so I would tend to assume it's
> > really firmware's job and we shouldn't need to do this in the kernel
> > at all.
> 
> _ON is also for PCI device itself but all those methods are not just for
> PCI so they don't really talk about any PCI specific delays. You need to
> look at other specs. For example PCI FW spec v3.2 section 4.6.9 says
> this about the _DSM that can be used to decrease the delays:
> 
>   This function is optional. If the platform does not provide it, the
>   operating system must adhere to all timing requirements as described
>   in the PCI Express Base specification and/or applicable form factor
>   specification, including values contained in Readiness Time Reporting
>   capability structure.

I don't think this _DSM tells us anything about delays after _ON,
_PS0, etc.  All the delays it mentions are for transitions the OS can
do natively without the _ON, _PS0, etc methods.  It makes no mention
of those methods, or of the D3cold->D0 transition (which would require
them).

> Relevant PCIe spec section is 6.6.1 (also referenced in the changelog).
> 
> [If you have access to ECN titled "Async Hot-Plug Updates" (you can find
> it in PCI-SIG site) that document has a nice table about the delays in
> page 32. It compares surprise hotplug with downstream port containment
> for async hotplug]

Thanks for the pointer, that ECN looks very useful.  It does talk
about delays in general, but I don't see anything that clarifies
whether ACPI methods or the OS is responsible for them.

> > What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
> > Port) is in D3hot, I'm not really clear on the state of its link.  If
> > the link is down, I assume putting the bridge in D0 will bring it up
> > and we'd have to wait for that?  If so, we'd need to do something in
> > the reset path, e.g., pci_pm_reset()?
> 
> AFAIK the link goes into L1 when the function is programmed to any other
> D state than D0. 

Yes, and the "function" here is the one on the *downstream* end, e.g.,
the Endpoint or Switch Upstream Port.  When the upstream bridge (Root
Port or Switch Downstream Port) is in a non-D0 state, the downstream
component is unreachable (memory, I/O, and type 1 config requests are
terminated by the bridge as unsupported requests).

> If we don't put the device into D3cold then I think it
> stays in L1 where it can be brought back by writing D0 to PM register
> which does not need any other delay than the D3hot -> D0 (10ms).

In pci_pm_reset(), we're doing the D0->D3hot->D0 transitions
specifically to do a reset, so No_Soft_Reset is false.  Doesn't 6.6.1
say we need at least 100ms here?

> [There is a table in PCIe spec 5.0 section 5.3.2 that shows the link
> state and power management D-state relationship.]
Mika Westerberg Oct. 28, 2019, 6:06 p.m. UTC | #4
On Mon, Oct 28, 2019 at 08:42:47AM -0500, Bjorn Helgaas wrote:
> On Mon, Oct 28, 2019 at 01:28:52PM +0200, Mika Westerberg wrote:
> > On Sat, Oct 26, 2019 at 09:19:38AM -0500, Bjorn Helgaas wrote:
> > > On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> > > > Currently Linux does not follow PCIe spec regarding the required delays
> > > > after reset. A concrete example is a Thunderbolt add-in-card that
> > > > consists of a PCIe switch and two PCIe endpoints:
> > > > 
> > > >   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> > > >                                   +-01.0-[04-36]-- DS hotplug port
> > > >                                   +-02.0-[37]----00.0 xHCI controller
> > > >                                   \-04.0-[38-6b]-- DS hotplug port
> > > > 
> > > > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > > > gen3 so they support 8GT/s link speeds.
> > > > 
> > > > We wait for the PCIe hierarchy to enter D3cold (runtime):
> > > > 
> > > >   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> > > > 
> > > > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > > > PCIe switch is put to reset and its power is re-applied. This means that
> > > > we must follow the rules in PCIe 4.0 section 6.6.1.
> > > 
> > > If you have the PCIe 5.0 spec, can you update these references to
> > > that?  If not, I'm happy to do it for you.
> > 
> > I do have it and sure, I'll update them.
> > 
> > > > For the PCIe gen3 ports we are dealing with here, the following applies:
> > > > 
> > > >   With a Downstream Port that supports Link speeds greater than 5.0
> > > >   GT/s, software must wait a minimum of 100 ms after Link training
> > > >   completes before sending a Configuration Request to the device
> > > >   immediately below that Port. Software can determine when Link training
> > > >   completes by polling the Data Link Layer Link Active bit or by setting
> > > >   up an associated interrupt (see Section 6.7.3.3).
> > > > 
> > > > Translating this into the above topology we would need to do this (DLLLA
> > > > stands for Data Link Layer Link Active):
> > > > 
> > > >   0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
> > > >   0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
> > > >   0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
> > > > 
> > > > I've instrumented the kernel with some additional logging so we can see
> > > > the actual delays performed:
> > > > 
> > > >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> > > >   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> > > >   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> > > >   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> > > >   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> > > > 
> > > > For the switch upstream port (01:00.0 reachable through 00:1b.0 root
> > > > port) we wait for 100 ms but not taking into account the DLLLA
> > > > requirement. We then wait 10 ms for D3hot -> D0 transition of the root
> > > > port and the two downstream hotplug ports. This means that we deviate
> > > > from what the spec requires.
> > > > 
> > > > Performing the same check for system sleep (s2idle) transitions it turns
> > > > out to be even worse. None of the mandatory delays are performed. If
> > > > this would be S3 instead of s2idle then according to PCI FW spec 3.2
> > > > section 4.6.8. there is a specific _DSM that allows the OS to skip the
> > > > delays but this platform does not provide the _DSM and does not go to S3
> > > > anyway so no firmware is involved that could already handle these
> > > > delays.
> > > > 
> > > > On this particular platform these delays are not actually needed because
> > > > there is an additional delay as part of the ACPI power resource that is
> > > > used to turn on power to the hierarchy but since that additional delay
> > > > is not required by any of standards (PCIe, ACPI) it is not present in
> > > > the Intel Ice Lake, for example where missing the mandatory delays
> > > > causes pciehp to start tearing down the stack too early (links are not
> > > > yet trained). Below is an example how it looks like when this happens:
> > > > 
> > > >   pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
> > > >   pcieport 0000:87:04.0: PME# disabled
> > > >   pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
> > > >   pcieport 0000:86:00.0: Refused to change power state, currently in D3
> > > >   pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
> > > >   pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > > >   ...
> > > > 
> > > > There is also one reported case (see the bugzilla link below) where the
> > > > missing delay causes xHCI on a Titan Ridge controller fail to runtime
> > > > resume when USB-C dock is plugged. This does not involve pciehp but
> > > > instead it PCI core fails to runtime resume the xHCI device:
> > > > 
> > > >   pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > > >   pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
> > > >   xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
> > > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
> > > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > > >   ...
> > > > 
> > > > For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
> > > > that is called on PCI core resume and runtime resume paths accordingly
> > > > if the bridge entered D3cold (and thus went through reset).
> > > > 
> > > > This is second attempt to add the missing delays. The previous solution
> > > > in commit c2bf1fc212f7 ("PCI: Add missing link delays required by the
> > > > PCIe spec") was reverted because of two issues it caused:
> > > > 
> > > >   1. One system become unresponsive after S3 resume due to PME service
> > > >      spinning in pcie_pme_work_fn(). The root port in question reports
> > > >      that the xHCI sent PME but the xHCI device itself does not have PME
> > > >      status set. The PME status bit is never cleared in the root port
> > > >      resulting the indefinite loop in pcie_pme_work_fn().
> > > > 
> > > >   2. Slows down resume if the root/downstream port does not support
> > > >      Data Link Layer Active Reporting because pcie_wait_for_link_delay()
> > > >      waits 1100 ms in that case.
> > > > 
> > > > This version should avoid the above issues because we restrict the delay
> > > > to happen only if the port went into D3cold.
> > > > 
> > > > Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
> > > > Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > > > ---
> > > >  drivers/pci/pci-driver.c | 18 ++++++++
> > > >  drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
> > > >  drivers/pci/pci.h        |  1 +
> > > >  3 files changed, 104 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > > > index a8124e47bf6e..74a144c9cf4e 100644
> > > > --- a/drivers/pci/pci-driver.c
> > > > +++ b/drivers/pci/pci-driver.c
> > > > @@ -917,6 +917,7 @@ static int pci_pm_suspend_noirq(struct device *dev)
> > > >  static int pci_pm_resume_noirq(struct device *dev)
> > > >  {
> > > >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > > +	bool d3cold = pci_dev->current_state == PCI_D3cold;
> > > >  	struct device_driver *drv = dev->driver;
> > > >  	int error = 0;
> > > >  
> > > > @@ -947,6 +948,14 @@ static int pci_pm_resume_noirq(struct device *dev)
> > > >  
> > > >  	pcie_pme_root_status_cleanup(pci_dev);
> > > >  
> > > > +	/*
> > > > +	 * If the hierarchy went into D3cold wait for the secondary bus to
> > > > +	 * become accessible. This is important for PCIe to prevent pciehp
> > > > +	 * from tearing down the downstream devices too soon.
> 
> > > Can we move this closer to where we initiate the reset?  It's pretty
> > > hard to tell from looking at pci_pm_resume_noirq() that there's a
> > > reset happening here.
> > 
> > Well we actually don't do explicit reset but instead we power the thing
> > on from D3cold.
> 
> The point is that it's too hard to maintain unless we can connect the
> delay with the related hardware event.

The related hardware event is resume in this case. Can you point me to
the actual place where you want me to put this?

> > > For D3cold->D0, I guess that would be somewhere down in
> > > platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
> > > What about the mid_pci_set_power_state() path?  Does that need this
> > > too?
> > 
> > I can take a look if it can be placed there. Yes,
> > mid_pci_set_power_state() may at least in theory need it too although I
> > don't remember any MID platforms with real PCIe devices.
> 
> I don't know how the OS is supposed to know if these are real PCIe
> devices or not.  If we don't know, we have to assume they work per
> spec and may require the delays per spec.

Well, MID devices are pretty much "hard-coded"; the OS knows everything
that is connected.

> > > In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
> > > I assume is not for PCI devices themselves) *does* say firmware is
> > > responsible for sequencing delays, so I would tend to assume it's
> > > really firmware's job and we shouldn't need to do this in the kernel
> > > at all.
> > 
> > _ON is also for PCI device itself but all those methods are not just for
> > PCI so they don't really talk about any PCI specific delays. You need to
> > look at other specs. For example PCI FW spec v3.2 section 4.6.9 says
> > this about the _DSM that can be used to decrease the delays:
> > 
> >   This function is optional. If the platform does not provide it, the
> >   operating system must adhere to all timing requirements as described
> >   in the PCI Express Base specification and/or applicable form factor
> >   specification, including values contained in Readiness Time Reporting
> >   capability structure.
> 
> I don't think this _DSM tells us anything about delays after _ON,
> _PS0, etc.  All the delays it mentions are for transitions the OS can
> do natively without the _ON, _PS0, etc methods.  It makes no mention
> of those methods, or of the D3cold->D0 transition (which would require
> them).

The D3cold->D0 transition is explained in PCIe spec 5.0 on page 492
(there is a picture). You can see that D3cold -> D0 involves a
fundamental reset. Section 6.6.1 (page 551) then says that a
fundamental reset is one category of conventional reset. Now, that _DSM
allows lowering the init time after a conventional reset. So to me it
talks exactly about those delays (also, PCIe cannot go into D3cold
without help from the platform, ACPI in this case).

> > Relevant PCIe spec section is 6.6.1 (also referenced in the changelog).
> > 
> > [If you have access to ECN titled "Async Hot-Plug Updates" (you can find
> > it in PCI-SIG site) that document has a nice table about the delays in
> > page 32. It compares surprise hotplug with downstream port containment
> > for async hotplug]
> 
> Thanks for the pointer, that ECN looks very useful.  It does talk
> about delays in general, but I don't see anything that clarifies
> whether ACPI methods or the OS is responsible for them.

No, but the _DSM description above is pretty clear about that. At
least to me it is clear.

> > > What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
> > > Port) is in D3hot, I'm not really clear on the state of its link.  If
> > > the link is down, I assume putting the bridge in D0 will bring it up
> > > and we'd have to wait for that?  If so, we'd need to do something in
> > > the reset path, e.g., pci_pm_reset()?
> > 
> > AFAIK the link goes into L1 when the function is programmed to any other
> > D state than D0. 
> 
> Yes, and the "function" here is the one on the *downstream* end, e.g.,
> the Endpoint or Switch Upstream Port.  When the upstream bridge (Root
> Port or Switch Downstream Port) is in a non-D0 state, the downstream
> component is unreachable (memory, I/O, and type 1 config requests are
> terminated by the bridge as unsupported requests).

Yes, the link is in L1 (its PM state is determined by the D-state of
the downstream component). From there you can get it back to a
functional state by programming the downstream port to D0 (the link is
still in L1), followed by programming the function itself to D0, which
brings the link back to L0. It does not involve a conventional reset
(see the picture on page 492 of the PCIe 5.0 spec). The recovery delays
needed are listed on the same page.

> > If we don't put the device into D3cold then I think it
> > stays in L1 where it can be brought back by writing D0 to PM register
> > which does not need any other delay than the D3hot -> D0 (10ms).
> 
> In pci_pm_reset(), we're doing the D0->D3hot->D0 transitions
> specifically to do a reset, so No_Soft_Reset is false.  Doesn't 6.6.1
> say we need at least 100ms here?

No, since it does not go into D3cold. It just "resets" the thing if it
happens to do an internal reset after D3hot -> D0.

Actually, looking at spec section 5.3.1.4, it seems that pci_pm_reset()
may depend on something that is not guaranteed:

  If the No_Soft_Reset bit is Clear, functional context is not required
  to be maintained by the Function in the D3hot state, however it is not
  guaranteed that functional context will be cleared and software must
  not depend on such behavior.
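
For reference, the No_Soft_Reset bit lives in the PMCSR register; a
minimal sketch of how software can check it (the helper name is made
up, this is not code from the patch):

  static bool sketch_no_soft_reset(struct pci_dev *dev)
  {
          u16 pmcsr;

          if (!dev->pm_cap)
                  return false;

          pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
          /* Set => functional context preserved across D3hot -> D0 */
          return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
  }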
Bjorn Helgaas Oct. 28, 2019, 8:16 p.m. UTC | #5
On Mon, Oct 28, 2019 at 08:06:01PM +0200, Mika Westerberg wrote:
> On Mon, Oct 28, 2019 at 08:42:47AM -0500, Bjorn Helgaas wrote:
> > On Mon, Oct 28, 2019 at 01:28:52PM +0200, Mika Westerberg wrote:
> > > On Sat, Oct 26, 2019 at 09:19:38AM -0500, Bjorn Helgaas wrote:
> > > > On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> > > > > Currently Linux does not follow PCIe spec regarding the required delays
> > > > > after reset. A concrete example is a Thunderbolt add-in-card that
> > > > > consists of a PCIe switch and two PCIe endpoints:
> > > > > 
> > > > >   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> > > > >                                   +-01.0-[04-36]-- DS hotplug port
> > > > >                                   +-02.0-[37]----00.0 xHCI controller
> > > > >                                   \-04.0-[38-6b]-- DS hotplug port
> > > > > 
> > > > > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > > > > gen3 so they support 8GT/s link speeds.
> > > > > 
> > > > > We wait for the PCIe hierarchy to enter D3cold (runtime):
> > > > > 
> > > > >   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> > > > > 
> > > > > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > > > > PCIe switch is put to reset and its power is re-applied. This means that
> > > > > we must follow the rules in PCIe 4.0 section 6.6.1.
> > > > 
> > > > If you have the PCIe 5.0 spec, can you update these references to
> > > > that?  If not, I'm happy to do it for you.
> > > 
> > > I do have it and sure, I'll update them.
> > > 
> > > > > For the PCIe gen3 ports we are dealing with here, the following applies:
> > > > > 
> > > > >   With a Downstream Port that supports Link speeds greater than 5.0
> > > > >   GT/s, software must wait a minimum of 100 ms after Link training
> > > > >   completes before sending a Configuration Request to the device
> > > > >   immediately below that Port. Software can determine when Link training
> > > > >   completes by polling the Data Link Layer Link Active bit or by setting
> > > > >   up an associated interrupt (see Section 6.7.3.3).
> > > > > 
> > > > > Translating this into the above topology we would need to do this (DLLLA
> > > > > stands for Data Link Layer Link Active):
> > > > > 
> > > > >   0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0
> > > > >   0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0
> > > > >   0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0
> > > > > 
> > > > > I've instrumented the kernel with some additional logging so we can see
> > > > > the actual delays performed:
> > > > > 
> > > > >   pcieport 0000:00:1b.0: power state changed by ACPI to D0
> > > > >   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> > > > >   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> > > > >   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> > > > >   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> > > > > 
> > > > > For the switch upstream port (01:00.0 reachable through 00:1b.0 root
> > > > > port) we wait for 100 ms but not taking into account the DLLLA
> > > > > requirement. We then wait 10 ms for D3hot -> D0 transition of the root
> > > > > port and the two downstream hotplug ports. This means that we deviate
> > > > > from what the spec requires.
> > > > > 
> > > > > Performing the same check for system sleep (s2idle) transitions it turns
> > > > > out to be even worse. None of the mandatory delays are performed. If
> > > > > this would be S3 instead of s2idle then according to PCI FW spec 3.2
> > > > > section 4.6.8. there is a specific _DSM that allows the OS to skip the
> > > > > delays but this platform does not provide the _DSM and does not go to S3
> > > > > anyway so no firmware is involved that could already handle these
> > > > > delays.
> > > > > 
> > > > > On this particular platform these delays are not actually needed because
> > > > > there is an additional delay as part of the ACPI power resource that is
> > > > > used to turn on power to the hierarchy but since that additional delay
> > > > > is not required by any of standards (PCIe, ACPI) it is not present in
> > > > > the Intel Ice Lake, for example where missing the mandatory delays
> > > > > causes pciehp to start tearing down the stack too early (links are not
> > > > > yet trained). Below is an example how it looks like when this happens:
> > > > > 
> > > > >   pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
> > > > >   pcieport 0000:87:04.0: PME# disabled
> > > > >   pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
> > > > >   pcieport 0000:86:00.0: Refused to change power state, currently in D3
> > > > >   pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
> > > > >   pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > > > >   ...
> > > > > 
> > > > > There is also one reported case (see the bugzilla link below) where the
> > > > > missing delay causes the xHCI on a Titan Ridge controller to fail to
> > > > > runtime resume when a USB-C dock is plugged in. This does not involve
> > > > > pciehp; instead the PCI core fails to runtime resume the xHCI device:
> > > > > 
> > > > >   pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > > > >   pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
> > > > >   xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
> > > > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
> > > > >   xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
> > > > >   ...
> > > > > 
> > > > > For this reason, introduce a new function pci_bridge_wait_for_secondary_bus()
> > > > > that is called in the PCI core resume and runtime resume paths
> > > > > if the bridge entered D3cold (and thus went through a reset).
> > > > > 
> > > > > This is the second attempt to add the missing delays. The previous solution
> > > > > in commit c2bf1fc212f7 ("PCI: Add missing link delays required by the
> > > > > PCIe spec") was reverted because of two issues it caused:
> > > > > 
> > > > >   1. One system became unresponsive after S3 resume due to PME service
> > > > >      spinning in pcie_pme_work_fn(). The root port in question reports
> > > > >      that the xHCI sent PME but the xHCI device itself does not have PME
> > > > >      status set. The PME status bit is never cleared in the root port
> > > > >      resulting in an indefinite loop in pcie_pme_work_fn().
> > > > > 
> > > > >   2. Slows down resume if the root/downstream port does not support
> > > > >      Data Link Layer Active Reporting because pcie_wait_for_link_delay()
> > > > >      waits 1100 ms in that case.
> > > > > 
> > > > > This version should avoid the above issues because we restrict the delay
> > > > > to happen only if the port went into D3cold.
> > > > > 
> > > > > Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/
> > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885
> > > > > Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > > > > ---
> > > > >  drivers/pci/pci-driver.c | 18 ++++++++
> > > > >  drivers/pci/pci.c        | 92 +++++++++++++++++++++++++++++++++++++---
> > > > >  drivers/pci/pci.h        |  1 +
> > > > >  3 files changed, 104 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > > > > index a8124e47bf6e..74a144c9cf4e 100644
> > > > > --- a/drivers/pci/pci-driver.c
> > > > > +++ b/drivers/pci/pci-driver.c
> > > > > @@ -917,6 +917,7 @@ static int pci_pm_suspend_noirq(struct device *dev)
> > > > >  static int pci_pm_resume_noirq(struct device *dev)
> > > > >  {
> > > > >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > > > +	bool d3cold = pci_dev->current_state == PCI_D3cold;
> > > > >  	struct device_driver *drv = dev->driver;
> > > > >  	int error = 0;
> > > > >  
> > > > > @@ -947,6 +948,14 @@ static int pci_pm_resume_noirq(struct device *dev)
> > > > >  
> > > > >  	pcie_pme_root_status_cleanup(pci_dev);
> > > > >  
> > > > > +	/*
> > > > > +	 * If the hierarchy went into D3cold wait for the secondary bus to
> > > > > +	 * become accessible. This is important for PCIe to prevent pciehp
> > > > > +	 * from tearing down the downstream devices too soon.
> > 
> > > > Can we move this closer to where we initiate the reset?  It's pretty
> > > > hard to tell from looking at pci_pm_resume_noirq() that there's a
> > > > reset happening here.
> > > 
> > > Well we actually don't do explicit reset but instead we power the thing
> > > on from D3cold.
> > 
> > The point is that it's too hard to maintain unless we can connect the
> > delay with the related hardware event.
> 
> The related hardware event is resume in this case. Can you point me to
> the actual point where you want me to put this?

"Resume" is a Linux software concept, so of course the PCIe spec
doesn't say anything about it.  The spec talks about delays related to
resets and device power and link state transitions, so somehow we have
to connect the Linux delay with those hardware events.

Since we're talking about a transition from D3cold, this has to be
done via something external to the device such as power regulators.
For ACPI systems that's probably hidden inside _PS0 or something
similar.  That's opaque, but at least it's a hook that says "here's
where we put the device into D0".  I suggested
acpi_pci_set_power_state() as a possibility since I think that's the
lowest-level point where we have the pci_dev so we know the current
state and the new state.

> > > > For D3cold->D0, I guess that would be somewhere down in
> > > > platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
> > > > What about the mid_pci_set_power_state() path?  Does that need this
> > > > too?
> > > 
> > > I can take a look if it can be placed there. Yes,
> > > mid_pci_set_power_state() may at least in theory need it too although I
> > > don't remember any MID platforms with real PCIe devices.
> > 
> > I don't know how the OS is supposed to know if these are real PCIe
> > devices or not.  If we don't know, we have to assume they work per
> > spec and may require the delays per spec.
> 
> Well MID devices are pretty much "hard-coded" the OS knows everything
> there is connected.

MID seems to be magic in that it wants to use the normal PCI core
without having to abide by all the assumptions in the spec.  That's
OK, but MID needs to be explicit about when it is OK to violate those
assumptions.  In this case, I think it means that if we add the delay
to acpi_pci_set_power_state(), we should at least add a comment to
mid_pci_set_power_state() about why the delay is or is not required
for MID.

> > > > In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
> > > > I assume is not for PCI devices themselves) *does* say firmware is
> > > > responsible for sequencing delays, so I would tend to assume it's
> > > > really firmware's job and we shouldn't need to do this in the kernel
> > > > at all.
> > > 
> > > _ON is also for PCI device itself but all those methods are not just for
> > > PCI so they don't really talk about any PCI specific delays. You need to
> > > look at other specs. For example PCI FW spec v3.2 section 4.6.9 says
> > > this about the _DSM that can be used to decrease the delays:
> > > 
> > >   This function is optional. If the platform does not provide it, the
> > >   operating system must adhere to all timing requirements as described
> > >   in the PCI Express Base specification and/or applicable form factor
> > >   specification, including values contained in Readiness Time Reporting
> > >   capability structure.
> > 
> > I don't think this _DSM tells us anything about delays after _ON,
> > _PS0, etc.  All the delays it mentions are for transitions the OS can
> > do natively without the _ON, _PS0, etc methods.  It makes no mention
> > of those methods, or of the D3cold->D0 transition (which would require
> > them).
> 
> D3cold->D0 transition is explained in PCI spec 5.0 page 492 (there is
> picture). You can see that D3cold -> D0 involves fundamental reset.
> Section 6.6.1 (page 551) then says that fundamental reset is one
> category of conventional reset. Now, that _DSM allows lowering the init
> time after conventional reset. So to me it talks exactly about those
> delays (also PCIe cannot go into D3cold without help from the platform,
> ACPI in this case).

Everything on the _DSM list is something the OS can do natively (even
conventional reset can be done via Secondary Bus Reset), and it says
nothing about a connection with ACPI power management methods (_PS0,
etc), so I think it's ambiguous at best.  A simple "OS is responsible
for any bus-specific delays after a transition" in the ACPI _PS0
documentation would have trivially resolved this.

But it seems that at least some ACPI firmware doesn't do those delays,
so I guess our only alternatives are to always do it in the OS or have
some sort of blacklist.  And it doesn't really seem practical to
maintain a blacklist.

> > > Relevant PCIe spec section is 6.6.1 (also referenced in the changelog).
> > > 
> > > [If you have access to ECN titled "Async Hot-Plug Updates" (you can find
> > > it in PCI-SIG site) that document has a nice table about the delays in
> > > page 32. It compares surprise hotplug with downstream port containment
> > > for async hotplug]
> > 
> > Thanks for the pointer, that ECN looks very useful.  It does talk
> > about delays in general, but I don't see anything that clarifies
> > whether ACPI methods or the OS is responsible for them.
> 
> No but the _DSM description above is pretty clear about that. At least
> for me it is clear.
> 
> > > > What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
> > > > Port) is in D3hot, I'm not really clear on the state of its link.  If
> > > > the link is down, I assume putting the bridge in D0 will bring it up
> > > > and we'd have to wait for that?  If so, we'd need to do something in
> > > > the reset path, e.g., pci_pm_reset()?
> > > 
> > > AFAIK the link goes into L1 when the function is programmed to any other
> > > D state than D0. 
> > 
> > Yes, and the "function" here is the one on the *downstream* end, e.g.,
> > the Endpoint or Switch Upstream Port.  When the upstream bridge (Root
> > Port or Switch Downstream Port) is in a non-D0 state, the downstream
> > component is unreachable (memory, I/O, and type 1 config requests are
> > terminated by the bridge as unsupported requests).
> 
> Yes, the link is in L1 (its PM state is determined by the D-state of the
> downstream component. From there you can get it back to functional state
> by programming the downstream port to D0 (the link is still in L1)
> followed by programming the function itself to D0 which brings the link
> back to L0. It does not involve conventional reset (see picture in page
> 492 of PCIe 5.0 spec). The recovery delays needed are listed in the same
> page.
> 
> > > If we don't put the device into D3cold then I think it
> > > stays in L1 where it can be brought back by writing D0 to PM register
> > > which does not need any other delay than the D3hot -> D0 (10ms).
> > 
> > In pci_pm_reset(), we're doing the D0->D3hot->D0 transitions
> > specifically to do a reset, so No_Soft_Reset is false.  Doesn't 6.6.1
> > say we need at least 100ms here?
> 
> No since it does not go into D3cold. It just "reset" the thing if it
> happens to do internal reset after D3hot -> D0.

Sec 5.8, Figure 5-18 says D3hot->D0uninitialized is a "Soft Reset", which
unfortunately is not defined.

My guess is that in sec 5.9, Table 5-13, the 10ms delay is for the
D3hot->D0active (i.e., No_Soft_Reset=1) transition, and the
D3hot->D0uninitialized (i.e., No_Soft_Reset=0) that does a "soft
reset" (whatever that is) probably requires more and we should handle
it like a conventional reset to be safe.

> Actually looking at the spec 5.3.1.4 it seems that pci_pm_reset() may
> depend on something not guaranteed:
> 
>   If the No_Soft_Reset bit is Clear, functional context is not required
>   to be maintained by the Function in the D3hot state, however it is not
>   guaranteed that functional context will be cleared and software must
>   not depend on such behavior.

Good point.  Sounds like that reset method may not be reliable in
general, although it might work for SR-IOV: 9.6.2 says that PFs with
No_Soft_Reset clear must perform an internal reset on D3hot->D0.

Bjorn
Mika Westerberg Oct. 29, 2019, 11:15 a.m. UTC | #6
On Mon, Oct 28, 2019 at 03:16:53PM -0500, Bjorn Helgaas wrote:
> > The related hardware event is resume in this case. Can you point me to
> > the actual point where you want me to put this?
> 
> "Resume" is a Linux software concept, so of course the PCIe spec
> doesn't say anything about it.  The spec talks about delays related to
> resets and device power and link state transitions, so somehow we have
> to connect the Linux delay with those hardware events.
> 
> Since we're talking about a transition from D3cold, this has to be
> done via something external to the device such as power regulators.
> For ACPI systems that's probably hidden inside _PS0 or something
> similar.  That's opaque, but at least it's a hook that says "here's
> where we put the device into D0".  I suggested
> acpi_pci_set_power_state() as a possibility since I think that's the
> lowest-level point where we have the pci_dev so we know the current
> state and the new state.

I looked at how we could use acpi_pci_set_power_state() but I don't
think it is possible because it is likely that only the root port has
the power resource that is used to bring the link to L2 or L3. However,
we would need to repeat the delay for each downstream/root port if there
are multiple PCIe switches in the topology.

Also the delay needs to be issued after the downstream link is trained
so the downstream/root port needs to be in D0 first.

> > > > > For D3cold->D0, I guess that would be somewhere down in
> > > > > platform_pci_set_power_state()?  Maybe acpi_pci_set_power_state()?
> > > > > What about the mid_pci_set_power_state() path?  Does that need this
> > > > > too?
> > > > 
> > > > I can take a look if it can be placed there. Yes,
> > > > mid_pci_set_power_state() may at least in theory need it too although I
> > > > don't remember any MID platforms with real PCIe devices.
> > > 
> > > I don't know how the OS is supposed to know if these are real PCIe
> > > devices or not.  If we don't know, we have to assume they work per
> > > spec and may require the delays per spec.
> > 
> > Well MID devices are pretty much "hard-coded" the OS knows everything
> > there is connected.
> 
> MID seems to be magic in that it wants to use the normal PCI core
> without having to abide by all the assumptions in the spec.  That's
> OK, but MID needs to be explicit about when it is OK to violate those
> assumptions.  In this case, I think it means that if we add the delay
> to acpi_pci_set_power_state(), we should at least add a comment to
> mid_pci_set_power_state() about why the delay is or is not required
> for MID.
> 
> > > > > In the ACPI spec, _PS0 doesn't say anything about delays.  _ON (which
> > > > > I assume is not for PCI devices themselves) *does* say firmware is
> > > > > responsible for sequencing delays, so I would tend to assume it's
> > > > > really firmware's job and we shouldn't need to do this in the kernel
> > > > > at all.
> > > > 
> > > > _ON is also for PCI device itself but all those methods are not just for
> > > > PCI so they don't really talk about any PCI specific delays. You need to
> > > > look at other specs. For example PCI FW spec v3.2 section 4.6.9 says
> > > > this about the _DSM that can be used to decrease the delays:
> > > > 
> > > >   This function is optional. If the platform does not provide it, the
> > > >   operating system must adhere to all timing requirements as described
> > > >   in the PCI Express Base specification and/or applicable form factor
> > > >   specification, including values contained in Readiness Time Reporting
> > > >   capability structure.
> > > 
> > > I don't think this _DSM tells us anything about delays after _ON,
> > > _PS0, etc.  All the delays it mentions are for transitions the OS can
> > > do natively without the _ON, _PS0, etc methods.  It makes no mention
> > > of those methods, or of the D3cold->D0 transition (which would require
> > > them).
> > 
> > D3cold->D0 transition is explained in PCI spec 5.0 page 492 (there is
> > picture). You can see that D3cold -> D0 involves fundamental reset.
> > Section 6.6.1 (page 551) then says that fundamental reset is one
> > category of conventional reset. Now, that _DSM allows lowering the init
> > time after conventional reset. So to me it talks exactly about those
> > delays (also PCIe cannot go into D3cold without help from the platform,
> > ACPI in this case).
> 
> Everything on the _DSM list is something the OS can do natively (even
> conventional reset can be done via Secondary Bus Reset), and it says
> nothing about a connection with ACPI power management methods (_PS0,
> etc), so I think it's ambiguous at best.  A simple "OS is responsible
> for any bus-specific delays after a transition" in the ACPI _PS0
> documentation would have trivially resolved this.

But I would imagine that is not always the case; that's the reason we
have documents such as the PCI FW spec.

> But it seems that at least some ACPI firmware doesn't do those delays,
> so I guess our only alternatives are to always do it in the OS or have
> some sort of blacklist.  And it doesn't really seem practical to
> maintain a blacklist.

I really think this is crystal clear:

The OS is always responsible for the delays described in the PCIe spec.
However, if the platform implements some of them say in _ON or _PS0
methods then it can notify the OS about this by using the _DSM so the OS
does not need to duplicate all of them.

> > > > Relevant PCIe spec section is 6.6.1 (also referenced in the changelog).
> > > > 
> > > > [If you have access to ECN titled "Async Hot-Plug Updates" (you can find
> > > > it in PCI-SIG site) that document has a nice table about the delays in
> > > > page 32. It compares surprise hotplug with downstream port containment
> > > > for async hotplug]
> > > 
> > > Thanks for the pointer, that ECN looks very useful.  It does talk
> > > about delays in general, but I don't see anything that clarifies
> > > whether ACPI methods or the OS is responsible for them.
> > 
> > No but the _DSM description above is pretty clear about that. At least
> > for me it is clear.
> > 
> > > > > What about D3hot->D0?  When a bridge (Root Port or Switch Downstream
> > > > > Port) is in D3hot, I'm not really clear on the state of its link.  If
> > > > > the link is down, I assume putting the bridge in D0 will bring it up
> > > > > and we'd have to wait for that?  If so, we'd need to do something in
> > > > > the reset path, e.g., pci_pm_reset()?
> > > > 
> > > > AFAIK the link goes into L1 when the function is programmed to any other
> > > > D state than D0. 
> > > 
> > > Yes, and the "function" here is the one on the *downstream* end, e.g.,
> > > the Endpoint or Switch Upstream Port.  When the upstream bridge (Root
> > > Port or Switch Downstream Port) is in a non-D0 state, the downstream
> > > component is unreachable (memory, I/O, and type 1 config requests are
> > > terminated by the bridge as unsupported requests).
> > 
> > Yes, the link is in L1 (its PM state is determined by the D-state of the
> > downstream component. From there you can get it back to functional state
> > by programming the downstream port to D0 (the link is still in L1)
> > followed by programming the function itself to D0 which brings the link
> > back to L0. It does not involve conventional reset (see picture in page
> > 492 of PCIe 5.0 spec). The recovery delays needed are listed in the same
> > page.
> > 
> > > > If we don't put the device into D3cold then I think it
> > > > stays in L1 where it can be brought back by writing D0 to PM register
> > > > which does not need any other delay than the D3hot -> D0 (10ms).
> > > 
> > > In pci_pm_reset(), we're doing the D0->D3hot->D0 transitions
> > > specifically to do a reset, so No_Soft_Reset is false.  Doesn't 6.6.1
> > > say we need at least 100ms here?
> > 
> > No since it does not go into D3cold. It just "reset" the thing if it
> > happens to do internal reset after D3hot -> D0.
> 
> Sec 5.8, Figure 5-18 says D3hot->D0uninitialized is a "Soft Reset", which
> unfortunately is not defined.
> 
> My guess is that in sec 5.9, Table 5-13, the 10ms delay is for the
> D3hot->D0active (i.e., No_Soft_Reset=1) transition, and the
> D3hot->D0uninitialized (i.e., No_Soft_Reset=0) that does a "soft
> reset" (whatever that is) probably requires more and we should handle
> it like a conventional reset to be safe.

I think it simply means the device functional context is lost (there is
more in section 5.3.1.4). Linux handles this properly already (well, at
least according to the minimum timings required by the spec) and restores
the context accordingly after it has waited for the 10 ms.

It is the D3cold (where links go to L2 or L3) where we really need the
delays so that the link gets properly trained before we start poking the
downstream device.
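
To make that concrete, the wait I have in mind for a downstream/root
port coming back from D3cold looks roughly like this (just a sketch,
not the patch itself; the function name is made up and it assumes the
pcie_wait_for_link() and pcie_get_speed_cap() helpers already in the
tree, plus <linux/delay.h>):

  /* Sketch only: delay required below a downstream/root port after D3cold */
  static void wait_for_downstream_link(struct pci_dev *bridge)
  {
          if (pci_pcie_type(bridge) != PCI_EXP_TYPE_ROOT_PORT &&
              pci_pcie_type(bridge) != PCI_EXP_TYPE_DOWNSTREAM)
                  return;

          if (pcie_get_speed_cap(bridge) <= PCIE_SPEED_5_0GT) {
                  /* <= 5 GT/s: a plain 100 ms is enough per PCIe 5.0 sec 6.6.1 */
                  msleep(100);
                  return;
          }

          /* > 5 GT/s: wait for Data Link Layer Link Active first, then 100 ms */
          if (pcie_wait_for_link(bridge, true))
                  msleep(100);
  }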
Bjorn Helgaas Oct. 29, 2019, 8:27 p.m. UTC | #7
On Tue, Oct 29, 2019 at 01:15:20PM +0200, Mika Westerberg wrote:
> On Mon, Oct 28, 2019 at 03:16:53PM -0500, Bjorn Helgaas wrote:
> > > The related hardware event is resume in this case. Can you point
> > > me to the actual point where you want me to put this?
> > 
> > "Resume" is a Linux software concept, so of course the PCIe spec
> > doesn't say anything about it.  The spec talks about delays
> > related to resets and device power and link state transitions, so
> > somehow we have to connect the Linux delay with those hardware
> > events.
> > 
> > Since we're talking about a transition from D3cold, this has to be
> > done via something external to the device such as power
> > regulators.  For ACPI systems that's probably hidden inside _PS0
> > or something similar.  That's opaque, but at least it's a hook
> > that says "here's where we put the device into D0".  I suggested
> > acpi_pci_set_power_state() as a possibility since I think that's
> > the lowest-level point where we have the pci_dev so we know the
> > current state and the new state.
> 
> I looked at how we could use acpi_pci_set_power_state() but I don't
> think it is possible because it is likely that only the root port
> has the power resource that is used to bring the link to L2 or L3.
> However, we would need to repeat the delay for each downstream/root
> port if there are multiple PCIe switches in the topology.

OK, I think I understand why that's a problem (correct me if I'm
wrong):

  We call pci_pm_resume_noirq() for every device, but it only calls
  acpi_pci_set_power_state() for devices that have _PS0 or _PR0
  methods.  So if the delay is in acpi_pci_set_power_state() and we
  have A -> B -> C where only A has _PS0, we would delay for the link
  to B to come up, but not for the link to C.

I do see that we do need both delays.  In acpi_pci_set_power_state()
when we transition A from D3cold->D0, I assume that single _PS0
evaluation on A causes B to transition from D3cold->D3hot, which in
turn causes C to transition from D3cold->D3hot.  Is that your
understanding, too?

We do know that topology in acpi_pci_set_power_state(), since we have
the pci_dev for A, so it seems conceivable that we could descend the
hierarchy and delay for each level.

If the delay is in pci_pm_resume_noirq() (as in your patch), what
happens with a switch with several Downstream Ports?  I assume that
all the Downstream Ports start their transition out of D3cold
basically simultaneously, so we probably don't need N delays, do we?
It seems a little messy to optimize this in pci_pm_resume_noirq().

The outline of the pci_pm_resume_noirq() part of this patch is:

  pci_pm_resume_noirq
    if (!dev->skip_bus_pm ...)   # <-- condition 1
      pci_pm_default_resume_early
        pci_power_up
          if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
            platform_pci_set_power_state
              pci_platform_pm->set_state
                acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
                  acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
+   if (d3cold)                  # <-- condition 2
+     pci_bridge_wait_for_secondary_bus

Another thing that niggles at me here is that the condition for
calling pci_bridge_wait_for_secondary_bus() is completely different
than the condition for changing the power state.  If we didn't change
the power state, there's no reason to wait, is there?

The outline of the pci_pm_runtime_resume() part of this patch is:

  pci_pm_runtime_resume
    pci_restore_standard_config
      if (dev->current_state != PCI_D0)
        pci_set_power_state(PCI_D0)
          __pci_start_power_transition
            pci_platform_power_transition
              if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
                platform_pci_set_power_state
                  pci_platform_pm->set_state
                    acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
                      acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
              pci_raw_set_power_state
          __pci_complete_power_transition
+   if (d3cold)
+     pci_bridge_wait_for_secondary_bus

In this part, the power state change is inside
pci_restore_standard_config(), which calls pci_set_power_state().
There are many other callers of pci_set_power_state(); can we be sure
that none of them need a delay?

> > But it seems that at least some ACPI firmware doesn't do those
> > delays, so I guess our only alternatives are to always do it in
> > the OS or have some sort of blacklist.  And it doesn't really seem
> > practical to maintain a blacklist.
> 
> I really think this is crystal clear:

I am agreeing with you that the OS needs to do the delays.

> The OS is always responsible for the delays described in the PCIe
> spec.

If the ACPI spec contained this statement, it would be useful, but I
haven't seen it.  It's certainly true that some combination of
firmware and the OS is responsible for the delays :)

> However, if the platform implements some of them say in _ON or _PS0
> methods then it can notify the OS about this by using the _DSM so
> the OS does not need to duplicate all of them.

That makes good sense, but there are other reasons for using that
_DSM, e.g., firmware may know that MID or similar devices are not
really PCI devices and don't need delays anywhere.  So the existence
of the _DSM by itself doesn't convince me that the OS is responsible
for the delays.

> > > > In pci_pm_reset(), we're doing the D0->D3hot->D0 transitions
> > > > specifically to do a reset, so No_Soft_Reset is false.
> > > > Doesn't 6.6.1 say we need at least 100ms here?
> > > 
> > > No since it does not go into D3cold. It just "reset" the thing
> > > if it happens to do internal reset after D3hot -> D0.
> > 
> > Sec 5.8, Figure 5-18 says D3hot->D0uninitialized is a "Soft
> > Reset", which unfortunately is not defined.
> > 
> > My guess is that in sec 5.9, Table 5-13, the 10ms delay is for the
> > D3hot->D0active (i.e., No_Soft_Reset=1) transition, and the
> > D3hot->D0uninitialized (i.e., No_Soft_Reset=0) that does a "soft
> > reset" (whatever that is) probably requires more and we should
> > handle it like a conventional reset to be safe.
> 
> I think it simply means the device functional context is lost (there
> is more in section 5.3.1.4). Linux handles this properly already
> (well, at least according to the minimum timings required by the spec)
> and restores the context accordingly after it has waited for the
> 10 ms.
> 
> It is the D3cold (where links go to L2 or L3) where we really need
> the delays so that the link gets properly trained before we start
> poking the downstream device.

I'm already speculating above, so I don't think I can contribute
anything useful here.
Bjorn Helgaas Oct. 29, 2019, 8:54 p.m. UTC | #8
On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> Currently Linux does not follow PCIe spec regarding the required delays
> after reset. A concrete example is a Thunderbolt add-in-card that
> consists of a PCIe switch and two PCIe endpoints:
> ...

> @@ -1025,15 +1025,11 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>  	if (state == PCI_D0) {
>  		pci_platform_power_transition(dev, PCI_D0);
>  		/*
> -		 * Mandatory power management transition delays, see
> -		 * PCI Express Base Specification Revision 2.0 Section
> -		 * 6.6.1: Conventional Reset.  Do not delay for
> -		 * devices powered on/off by corresponding bridge,
> -		 * because have already delayed for the bridge.
> +		 * Mandatory power management transition delays are handled
> +		 * in pci_pm_runtime_resume() of the corresponding
> +		 * downstream/root port.
>  		 */
>  		if (dev->runtime_d3cold) {
> -			if (dev->d3cold_delay && !dev->imm_ready)
> -				msleep(dev->d3cold_delay);

This removes the only use of d3cold_delay.  I assume that's
intentional?  If we no longer need it, we might as well remove it from
the pci_dev and remove the places that set it.  It'd be nice if that
could be a separate patch, even if we waited a little longer than
necessary at that one bisection point.

It also removes one of the three uses of imm_ready, leaving only the
two in FLR.  I suspect there are other places we should use imm_ready,
e.g., transitions to/from D1 and D2, but that would be beyond the
scope of this patch.

> +	/*
> +	 * For PCIe downstream and root ports that do not support speeds
> +	 * greater than 5 GT/s need to wait minimum 100 ms. For higher
> +	 * speeds (gen3) we need to wait first for the data link layer to
> +	 * become active.
> +	 *
> +	 * However, 100 ms is the minimum and the PCIe spec says the
> +	 * software must allow at least 1s before it can determine that the
> +	 * device that did not respond is a broken device. There is
> +	 * evidence that 100 ms is not always enough, for example certain
> +	 * Titan Ridge xHCI controller does not always respond to
> +	 * configuration requests if we only wait for 100 ms (see
> +	 * https://bugzilla.kernel.org/show_bug.cgi?id=203885).
> +	 *
> +	 * Therefore we wait for 100 ms and check for the device presence.
> +	 * If it is still not present give it an additional 100 ms.
> +	 */
> +	if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT &&
> +	    pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)
> +		return;

Shouldn't this be:

  if (!pcie_downstream_port(dev))
    return

so we include PCI/PCI-X to PCIe bridges?
Mika Westerberg Oct. 30, 2019, 11:15 a.m. UTC | #9
On Tue, Oct 29, 2019 at 03:27:09PM -0500, Bjorn Helgaas wrote:
> On Tue, Oct 29, 2019 at 01:15:20PM +0200, Mika Westerberg wrote:
> > On Mon, Oct 28, 2019 at 03:16:53PM -0500, Bjorn Helgaas wrote:
> > > > The related hardware event is resume in this case. Can you point
> > > > me to the actual point where you want me to put this?
> > > 
> > > "Resume" is a Linux software concept, so of course the PCIe spec
> > > doesn't say anything about it.  The spec talks about delays
> > > related to resets and device power and link state transitions, so
> > > somehow we have to connect the Linux delay with those hardware
> > > events.
> > > 
> > > Since we're talking about a transition from D3cold, this has to be
> > > done via something external to the device such as power
> > > regulators.  For ACPI systems that's probably hidden inside _PS0
> > > or something similar.  That's opaque, but at least it's a hook
> > > that says "here's where we put the device into D0".  I suggested
> > > acpi_pci_set_power_state() as a possibility since I think that's
> > > the lowest-level point where we have the pci_dev so we know the
> > > current state and the new state.
> > 
> > I looked at how we could use acpi_pci_set_power_state() but I don't
> > think it is possible because it is likely that only the root port
> > has the power resource that is used to bring the link to L2 or L3.
> > However, we would need to repeat the delay for each downstream/root
> > port if there are multiple PCIe switches in the topology.
> 
> OK, I think I understand why that's a problem (correct me if I'm
> wrong):
> 
>   We call pci_pm_resume_noirq() for every device, but it only calls
>   acpi_pci_set_power_state() for devices that have _PS0 or _PR0
>   methods.  So if the delay is in acpi_pci_set_power_state() and we
>   have A -> B -> C where only A has _PS0, we would delay for the link
>   to B to come up, but not for the link to C.

Yes, that's correct.

> I do see that we do need both delays.  In acpi_pci_set_power_state()
> when we transition A from D3cold->D0, I assume that single _PS0
> evaluation on A causes B to transition from D3cold->D3hot, which in
> turn causes C to transition from D3cold->D3hot.  Is that your
> understanding, too?

Not exactly :)

It is _ON() that causes the links to be retrained and it also causes the
PERST# (reset) to be unasserted for the whole topology transitioning all
devices into D0uninitialized (default value for PMCSR PowerState field is 0).
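
For reference, reading that field back looks like this (illustrative
only; assumes the device has a PM capability):

  u16 pmcsr;

  pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
  /* (pmcsr & PCI_PM_CTRL_STATE_MASK) == 0 here, i.e. D0uninitialized */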

> We do know that topology in acpi_pci_set_power_state(), since we have
> the pci_dev for A, so it seems conceivable that we could descend the
> hierarchy and delay for each level.

Right.

> If the delay is in pci_pm_resume_noirq() (as in your patch), what
> happens with a switch with several Downstream Ports?  I assume that
> all the Downstream Ports start their transition out of D3cold
> basically simultaneously, so we probably don't need N delays, do we?

No. Actually Linux already resumes these in parallel because async
suspend is set for them (for system suspend that is).

> It seems a little messy to optimize this in pci_pm_resume_noirq().

I agree.

> The outline of the pci_pm_resume_noirq() part of this patch is:
> 
>   pci_pm_resume_noirq
>     if (!dev->skip_bus_pm ...)   # <-- condition 1
>       pci_pm_default_resume_early
>         pci_power_up
>           if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
>             platform_pci_set_power_state
>               pci_platform_pm->set_state
>                 acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
>                   acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> +   if (d3cold)                  # <-- condition 2
> +     pci_bridge_wait_for_secondary_bus
> 
> Another thing that niggles at me here is that the condition for
> calling pci_bridge_wait_for_secondary_bus() is completely different
> than the condition for changing the power state.  If we didn't change
> the power state, there's no reason to wait, is there?

Indeed, if you are talking about the dev->skip_bus_pm check there is no
point to wait if we did not change the power state. I would assume that
d3cold is false in that case but we could also do this for clarity:

	if (!dev->skip_bus_pm && d3cold)
		pci_bridge_wait_for_secondary_bus(...)

> The outline of the pci_pm_runtime_resume() part of this patch is:
> 
>   pci_pm_runtime_resume
>     pci_restore_standard_config
>       if (dev->current_state != PCI_D0)
>         pci_set_power_state(PCI_D0)
>           __pci_start_power_transition
>             pci_platform_power_transition
>               if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
>                 platform_pci_set_power_state
>                   pci_platform_pm->set_state
>                     acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
>                       acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
>               pci_raw_set_power_state
>           __pci_complete_power_transition
> +   if (d3cold)
> +     pci_bridge_wait_for_secondary_bus
> 
> In this part, the power state change is inside
> pci_restore_standard_config(), which calls pci_set_power_state().
> There are many other callers of pci_set_power_state(); can we be sure
> that none of them need a delay?

Since we are handling the delay when we resume the downstream port, not
when we resume the device itself, I think the link should be up already
and the device accessible if someone calls pci_set_power_state() for it
(as the parent is always resumed before children).

> > > But it seems that at least some ACPI firmware doesn't do those
> > > delays, so I guess our only alternatives are to always do it in
> > > the OS or have some sort of blacklist.  And it doesn't really seem
> > > practical to maintain a blacklist.
> > 
> > I really think this is crystal clear:
> 
> I am agreeing with you that the OS needs to do the delays.
> 
> > The OS is always responsible for the delays described in the PCIe
> > spec.
> 
> If the ACPI spec contained this statement, it would be useful, but I
> haven't seen it.  It's certainly true that some combination of
> firmware and the OS is responsible for the delays :)
> 
> > However, if the platform implements some of them say in _ON or _PS0
> > methods then it can notify the OS about this by using the _DSM so
> > the OS does not need to duplicate all of them.
> 
> That makes good sense, but there are other reasons for using that
> _DSM, e.g., firmware may know that MID or similar devices are not
> really PCI devices and don't need delays anywhere.  So the existence
> of the _DSM by itself doesn't convince me that the OS is responsible
> for the delays.

Hmm, my interpretation of the specs is that the OS is responsible for
these delays, but if you can't be convinced then how do you propose we
handle this problem? I mean, there are two cases already listed in the
changelog of this patch from real systems that need these delays. I
don't think we can just tell people that unfortunately their system will
not be supported by Linux because we are not convinced that the OS
should do these delays. ;-)
Mika Westerberg Oct. 30, 2019, 11:33 a.m. UTC | #10
On Tue, Oct 29, 2019 at 03:54:56PM -0500, Bjorn Helgaas wrote:
> On Fri, Oct 04, 2019 at 03:39:47PM +0300, Mika Westerberg wrote:
> > Currently Linux does not follow PCIe spec regarding the required delays
> > after reset. A concrete example is a Thunderbolt add-in-card that
> > consists of a PCIe switch and two PCIe endpoints:
> > ...
> 
> > @@ -1025,15 +1025,11 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
> >  	if (state == PCI_D0) {
> >  		pci_platform_power_transition(dev, PCI_D0);
> >  		/*
> > -		 * Mandatory power management transition delays, see
> > -		 * PCI Express Base Specification Revision 2.0 Section
> > -		 * 6.6.1: Conventional Reset.  Do not delay for
> > -		 * devices powered on/off by corresponding bridge,
> > -		 * because have already delayed for the bridge.
> > +		 * Mandatory power management transition delays are handled
> > +		 * in pci_pm_runtime_resume() of the corresponding
> > +		 * downstream/root port.
> >  		 */
> >  		if (dev->runtime_d3cold) {
> > -			if (dev->d3cold_delay && !dev->imm_ready)
> > -				msleep(dev->d3cold_delay);
> 
> This removes the only use of d3cold_delay.  I assume that's
> intentional?  If we no longer need it, we might as well remove it from
> the pci_dev and remove the places that set it.  It'd be nice if that
> could be a separate patch, even if we waited a little longer than
> necessary at that one bisection point.

Yes, it is intentional. In the previous version I had a function,
pcie_get_downstream_delay(), that used both d3cold_delay and imm_ready to
calculate the downstream device delay, but you said:

  I'm not sold on the idea that this delay depends on what's *below* the                                                                                                   
  bridge.  We're using sec 6.6.1 to justify the delay, and that section                                                                                               
  doesn't say anything about downstream devices.

So I dropped it and used 100 ms instead.

Now that you mention it, I think if we want to continue supporting that
_DSM, we should still take d3cold_delay into account in this patch. There
is also one driver (drivers/mfd/intel-lpss-pci.c) that sets it to 0.
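
For example, we could take the largest d3cold_delay of the devices below
the bridge before waiting (just a sketch; the helper name is made up):

  static int max_child_d3cold_delay(struct pci_bus *bus)
  {
          struct pci_dev *child;
          int delay = 0;

          /* Pick the largest platform-provided delay of the child devices */
          list_for_each_entry(child, &bus->devices, bus_list)
                  delay = max(delay, (int)child->d3cold_delay);

          return delay;
  }

A delay of 0 would then mean the platform (or a driver such as
intel-lpss-pci.c) told us no delay is needed at all.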

> It also removes one of the three uses of imm_ready, leaving only the
> two in FLR.  I suspect there are other places we should use imm_ready,
> e.g., transitions to/from D1 and D2, but that would be beyond the
> scope of this patch.

Right, I think imm_ready does not apply here. If I understand correctly,
it is exactly for D1, D2 and D3hot transitions, which we should take into
account in pci_dev_d3_sleep() (which we don't do right now).
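
Something along these lines, I mean (sketch only, using the current
field names d3_delay and imm_ready):

  static void pci_dev_d3_sleep(struct pci_dev *dev)
  {
          unsigned int delay = dev->d3_delay;

          /* Immediate Readiness: no delay needed after D3hot -> D0 */
          if (dev->imm_ready)
                  return;

          if (delay < PCI_PM_D3_WAIT)
                  delay = PCI_PM_D3_WAIT;

          if (delay)
                  msleep(delay);
  }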

> > +	/*
> > +	 * For PCIe downstream and root ports that do not support speeds
> > +	 * greater than 5 GT/s need to wait minimum 100 ms. For higher
> > +	 * speeds (gen3) we need to wait first for the data link layer to
> > +	 * become active.
> > +	 *
> > +	 * However, 100 ms is the minimum and the PCIe spec says the
> > +	 * software must allow at least 1s before it can determine that the
> > +	 * device that did not respond is a broken device. There is
> > +	 * evidence that 100 ms is not always enough, for example certain
> > +	 * Titan Ridge xHCI controller does not always respond to
> > +	 * configuration requests if we only wait for 100 ms (see
> > +	 * https://bugzilla.kernel.org/show_bug.cgi?id=203885).
> > +	 *
> > +	 * Therefore we wait for 100 ms and check for the device presence.
> > +	 * If it is still not present give it an additional 100 ms.
> > +	 */
> > +	if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT &&
> > +	    pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)
> > +		return;
> 
> Shouldn't this be:
> 
>   if (!pcie_downstream_port(dev))
>     return
> 
> so we include PCI/PCI-X to PCIe bridges?

Yes, I'll change it in v3.
Bjorn Helgaas Oct. 31, 2019, 10:31 p.m. UTC | #11
On Wed, Oct 30, 2019 at 01:15:16PM +0200, Mika Westerberg wrote:
> On Tue, Oct 29, 2019 at 03:27:09PM -0500, Bjorn Helgaas wrote:
> > On Tue, Oct 29, 2019 at 01:15:20PM +0200, Mika Westerberg wrote:
> > > On Mon, Oct 28, 2019 at 03:16:53PM -0500, Bjorn Helgaas wrote:
> > > > > The related hardware event is resume in this case. Can you point
> > > > > me to the actual point where you want me to put this?
> > > > 
> > > > "Resume" is a Linux software concept, so of course the PCIe spec
> > > > doesn't say anything about it.  The spec talks about delays
> > > > related to resets and device power and link state transitions, so
> > > > somehow we have to connect the Linux delay with those hardware
> > > > events.
> > > > 
> > > > Since we're talking about a transition from D3cold, this has to be
> > > > done via something external to the device such as power
> > > > regulators.  For ACPI systems that's probably hidden inside _PS0
> > > > or something similar.  That's opaque, but at least it's a hook
> > > > that says "here's where we put the device into D0".  I suggested
> > > > acpi_pci_set_power_state() as a possibility since I think that's
> > > > the lowest-level point where we have the pci_dev so we know the
> > > > current state and the new state.
> > > 
> > > I looked at how we could use acpi_pci_set_power_state() but I don't
> > > think it is possible because it is likely that only the root port
> > > has the power resource that is used to bring the link to L2 or L3.
> > > However, we would need to repeat the delay for each downstream/root
> > > port if there are multiple PCIe switches in the topology.
> > 
> > OK, I think I understand why that's a problem (correct me if I'm
> > wrong):
> > 
> >   We call pci_pm_resume_noirq() for every device, but it only calls
> >   acpi_pci_set_power_state() for devices that have _PS0 or _PR0
> >   methods.  So if the delay is in acpi_pci_set_power_state() and we
> >   have A -> B -> C where only A has _PS0, we would delay for the link
> >   to B to come up, but not for the link to C.
> 
> Yes, that's correct.
> 
> > I do see that we do need both delays.  In acpi_pci_set_power_state()
> > when we transition A from D3cold->D0, I assume that single _PS0
> > evaluation on A causes B to transition from D3cold->D3hot, which in
> > turn causes C to transition from D3cold->D3hot.  Is that your
> > understanding, too?
> 
> Not exactly :)
> 
> It is _ON() that causes the links to be retrained and it also causes the
> PERST# (reset) to be unasserted for the whole topology transitioning all
> devices into D0uninitialized (default value for PMCSR PowerState field is 0).

OK.  I guess the important thing is that a single power-on from D3cold
at any point in the hierarchy can power on the entire subtree rooted
at that point.  So if we have RP -> SUP -> SDP0..SDP7 where SDP0..SDP7
are Switch Downstream Ports, when we evaluate _ON for RP, PERST# will
be deasserted below it, and everything downstream should start the
process of going to D0uninitialized.

And we can't rely on any other hooks like _ON/_PS0 invocations for
SUP, SDPx, etc, where we could do additional delays.

> > If the delay is in pci_pm_resume_noirq() (as in your patch), what
> > happens with a switch with several Downstream Ports?  I assume that
> > all the Downstream Ports start their transition out of D3cold
> > basically simultaneously, so we probably don't need N delays, do we?
> 
> No. Actually Linux already resumes these in parallel because async
> suspend is set for them (for system suspend that is).

So I think we have something like this:

  pci_pm_resume_noirq(RP)
    pdev->current_state == PCI_D3cold  # HW actually in D3cold
    _ON(RP)                            # turns on entire hierarchy
    pci_bridge_wait_for_secondary_bus  # waits only for RP -> SUP link

  pci_pm_resume_noirq(SUP)
    pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
    pci_bridge_wait_for_secondary_bus  # no wait (not a downstream port)

  pci_pm_resume_noirq(SDP0)
    pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
    pci_bridge_wait_for_secondary_bus  # waits for SDP0 -> ? link

  ...

  pci_pm_resume_noirq(SDP7)
    pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
    pci_bridge_wait_for_secondary_bus  # waits for SDP7 -> ? link

and we have 1 delay for the Root Port plus 8 delays (one for each
Switch Downstream Port), and as soon as SUP has been resumed,
SDP0..SDP7 can be resumed simultaneously (assuming async is set for
them)?

I'm not a huge fan of relying on async because the asynchrony is far
removed from this code and really hard to figure out.  Maybe an
alternative would be to figure out in the pci_pm_resume_noirq(RP) path
how many levels of links to wait for.

Ideally someone expert in PCIe but not in Linux would be able to look
at the local code and verify that it matches the spec.  If verification
requires extensive analysis or someone expert in *both* PCIe and
Linux, it makes maintenance much harder.

> > The outline of the pci_pm_resume_noirq() part of this patch is:
> > 
> >   pci_pm_resume_noirq
> >     if (!dev->skip_bus_pm ...)   # <-- condition 1
> >       pci_pm_default_resume_early
> >         pci_power_up
> >           if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> >             platform_pci_set_power_state
> >               pci_platform_pm->set_state
> >                 acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> >                   acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> > +   if (d3cold)                  # <-- condition 2
> > +     pci_bridge_wait_for_secondary_bus
> > 
> > Another thing that niggles at me here is that the condition for
> > calling pci_bridge_wait_for_secondary_bus() is completely different
> > than the condition for changing the power state.  If we didn't change
> > the power state, there's no reason to wait, is there?
> 
> Indeed, if you are talking about the dev->skip_bus_pm check there is no
> point to wait if we did not change the power state. I would assume that
> d3cold is false in that case but we could also do this for clarity:
> 
> 	if (!dev->skip_bus_pm && d3cold)
> 		pci_bridge_wait_for_secondary_bus(...)

Could the wait go in pci_power_up()?  That would at least connect it
directly with a -> D0 transition.  Or, if that doesn't seem the right
place for it, could we do the following?

  if (!(pci_dev->skip_bus_pm && pm_suspend_no_platform())) {
    pci_pm_default_resume_early(pci_dev);
    if (d3cold)
      pci_bridge_wait_for_secondary_bus(pci_dev);
  }

  pci_fixup_device(pci_fixup_resume_early, pci_dev);
  pcie_pme_root_status_cleanup(pci_dev);

  if (pci_has_legacy_pm_support(pci_dev))
    return pci_legacy_resume_early(dev);
  ...

Either way would also fix the problem that with the current patch, if
the device has legacy PM support, we call pci_legacy_resume_early()
but we don't wait for the secondary bus.

> > The outline of the pci_pm_runtime_resume() part of this patch is:
> > 
> >   pci_pm_runtime_resume
> >     pci_restore_standard_config
> >       if (dev->current_state != PCI_D0)
> >         pci_set_power_state(PCI_D0)
> >           __pci_start_power_transition
> >             pci_platform_power_transition
> >               if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> >                 platform_pci_set_power_state
> >                   pci_platform_pm->set_state
> >                     acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> >                       acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> >               pci_raw_set_power_state
> >           __pci_complete_power_transition
> > +   if (d3cold)
> > +     pci_bridge_wait_for_secondary_bus
> > 
> > In this part, the power state change is inside
> > pci_restore_standard_config(), which calls pci_set_power_state().
> > There are many other callers of pci_set_power_state(); can we be sure
> > that none of them need a delay?
> 
> Since we are handling the delay when we resume the downstream port, not
> when we resume the device itself, I think the link should be up already
> and the device accessible if someone calls pci_set_power_state() for it
> (as the parent is always resumed before children).

Ah, yeah, I guess that since all the calls I see are for non-bridge
devices, there would be no delay for a secondary bus.

This is a tangent, but there are ~140 pci_set_power_state(PCI_D0)
calls, mostly from .resume() methods of drivers with legacy PM.  Are
those even necessary?  It looks like the PCI core does this so the
driver wouldn't need to:

  pci_pm_resume_noirq
    pci_pm_default_resume_early
      pci_power_up
        pci_raw_set_power_state(dev, PCI_D0)   # <-- PCI core

  pci_pm_resume
    if (pci_has_legacy_pm_support(pci_dev))
      pci_legacy_resume(dev)
        drv->resume
	  pci_set_power_state(PCI_D0)          # <-- driver .resume()

> > > > But it seems that at least some ACPI firmware doesn't do those
> > > > delays, so I guess our only alternatives are to always do it in
> > > > the OS or have some sort of blacklist.  And it doesn't really seem
> > > > practical to maintain a blacklist.
> > > 
> > > I really think this is crystal clear:
> > 
> > I am agreeing with you that the OS needs to do the delays.

Did you miss this part?  I said below that the existence of the _DSM
*by itself* doesn't convince me.  But I think the lack of clarity and
the fact that at least some firmware doesn't do it means that the OS
must do it.

> > > The OS is always responsible for the delays described in the PCIe
> > > spec.
> > 
> > If the ACPI spec contained this statement, it would be useful, but I
> > haven't seen it.  It's certainly true that some combination of
> > firmware and the OS is responsible for the delays :)
> > 
> > > However, if the platform implements some of them say in _ON or _PS0
> > > methods then it can notify the OS about this by using the _DSM so
> > > the OS does not need to duplicate all of them.
> > 
> > That makes good sense, but there are other reasons for using that
> > _DSM, e.g., firmware may know that MID or similar devices are not
> > really PCI devices and don't need delays anywhere.  So the existence
> > of the _DSM by itself doesn't convince me that the OS is responsible
> > for the delays.
> 
> Hmm, my interpretation of the specs is that the OS is responsible for
> these delays, but if you can't be convinced then how do you propose we
> handle this problem? I mean, there are two cases already listed in the
> changelog of this patch from real systems that need these delays. I
> don't think we can just tell people that unfortunately their system will
> not be supported by Linux because we are not convinced that the OS
> should do these delays. ;-)
Mika Westerberg Nov. 1, 2019, 11:19 a.m. UTC | #12
On Thu, Oct 31, 2019 at 05:31:44PM -0500, Bjorn Helgaas wrote:
> On Wed, Oct 30, 2019 at 01:15:16PM +0200, Mika Westerberg wrote:
> > On Tue, Oct 29, 2019 at 03:27:09PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Oct 29, 2019 at 01:15:20PM +0200, Mika Westerberg wrote:
> > > > On Mon, Oct 28, 2019 at 03:16:53PM -0500, Bjorn Helgaas wrote:
> > > > > > The related hardware event is resume in this case. Can you point
> > > > > > me to the actual point where you want me to put this?
> > > > > 
> > > > > "Resume" is a Linux software concept, so of course the PCIe spec
> > > > > doesn't say anything about it.  The spec talks about delays
> > > > > related to resets and device power and link state transitions, so
> > > > > somehow we have to connect the Linux delay with those hardware
> > > > > events.
> > > > > 
> > > > > Since we're talking about a transition from D3cold, this has to be
> > > > > done via something external to the device such as power
> > > > > regulators.  For ACPI systems that's probably hidden inside _PS0
> > > > > or something similar.  That's opaque, but at least it's a hook
> > > > > that says "here's where we put the device into D0".  I suggested
> > > > > acpi_pci_set_power_state() as a possibility since I think that's
> > > > > the lowest-level point where we have the pci_dev so we know the
> > > > > current state and the new state.
> > > > 
> > > > I looked at how we could use acpi_pci_set_power_state() but I don't
> > > > think it is possible because it is likely that only the root port
> > > > has the power resource that is used to bring the link to L2 or L3.
> > > > However, we would need to repeat the delay for each downstream/root
> > > > port if there are multiple PCIe switches in the topology.
> > > 
> > > OK, I think I understand why that's a problem (correct me if I'm
> > > wrong):
> > > 
> > >   We call pci_pm_resume_noirq() for every device, but it only calls
> > >   acpi_pci_set_power_state() for devices that have _PS0 or _PR0
> > >   methods.  So if the delay is in acpi_pci_set_power_state() and we
> > >   have A -> B -> C where only A has _PS0, we would delay for the link
> > >   to B to come up, but not for the link to C.
> > 
> > Yes, that's correct.
> > 
> > > I do see that we do need both delays.  In acpi_pci_set_power_state()
> > > when we transition A from D3cold->D0, I assume that single _PS0
> > > evaluation on A causes B to transition from D3cold->D3hot, which in
> > > turn causes C to transition from D3cold->D3hot.  Is that your
> > > understanding, too?
> > 
> > Not exactly :)
> > 
> > It is _ON() that causes the links to be retrained and it also causes the
> > PERST# (reset) to be unasserted for the whole topology transitioning all
> > devices into D0uninitialized (default value for PMCSR PowerState field is 0).
> 
> OK.  I guess the important thing is that a single power-on from D3cold
> at any point in the hierarchy can power on the entire subtree rooted
> at that point.  So if we have RP -> SUP -> SDP0..SDP7 where SDP0..SDP7
> are Switch Downstream Ports, when we evaluate _ON for RP, PERST# will
> be deasserted below it, and everything downstream should start the
> process of going to D0uninitialized.
> 
> And we can't rely on any other hooks like _ON/_PS0 invocations for
> SUP, SDPx, etc, where we could do additional delays.

From what I've seen, they don't typically even have a representation in ACPI.

> > > If the delay is in pci_pm_resume_noirq() (as in your patch), what
> > > happens with a switch with several Downstream Ports?  I assume that
> > > all the Downstream Ports start their transition out of D3cold
> > > basically simultaneously, so we probably don't need N delays, do we?
> > 
> > No. Actually Linux already resumes these in parallel because async
> > suspend is set for them (for system suspend that is).
> 
> So I think we have something like this:
> 
>   pci_pm_resume_noirq(RP)
>     pdev->current_state == PCI_D3cold  # HW actually in D3cold
>     _ON(RP)                            # turns on entire hierarchy
>     pci_bridge_wait_for_secondary_bus  # waits only for RP -> SUP link
> 
>   pci_pm_resume_noirq(SUP)
>     pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
>     pci_bridge_wait_for_secondary_bus  # no wait (not a downstream port)
> 
>   pci_pm_resume_noirq(SDP0)
>     pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
>     pci_bridge_wait_for_secondary_bus  # waits for SDP0 -> ? link
> 
>   ...
> 
>   pci_pm_resume_noirq(SDP7)
>     pdev->current_state == PCI_D3cold  # HW probably in D0uninitialized
>     pci_bridge_wait_for_secondary_bus  # waits for SDP7 -> ? link
> 
> and we have 1 delay for the Root Port plus 8 delays (one for each
> Switch Downstream Port), and as soon as SUP has been resumed,
> SDP0..SDP7 can be resumed simultaneously (assuming async is set for
> them)?

Yes.

> I'm not a huge fan of relying on async because the asynchrony is far
> removed from this code and really hard to figure out.  Maybe an
> alternative would be to figure out in the pci_pm_resume_noirq(RP) path
> how many levels of links to wait for.

There is a problem with this. For gen3 and higher speeds we need to wait
for the link (each link) to become active before we delay. If we did this
only in the root port, it would need to enumerate all the ports and
handle this, which complicates things unnecessarily.

You could also "guess" the delay by waiting for the worst-case time,
but I don't think that's something we want to do.
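
For context, the per-port check boils down to polling the Data Link Layer
Link Active bit in the Link Status register of the downstream port in
question, and only then applying the 100 ms delay. A minimal sketch of the
idea (simplified; wait_for_downstream_link() is a made-up name, not the
pcie_wait_for_link_delay() helper the patch below actually uses):

  /* Simplified sketch: wait for DLLLA on one downstream port, then delay */
  static bool wait_for_downstream_link(struct pci_dev *pdev)
  {
    int timeout = 1000;  /* ms, illustrative upper bound */
    u16 lnksta;

    for (;;) {
      pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
      if (lnksta & PCI_EXP_LNKSTA_DLLLA)
        break;
      if (timeout <= 0)
        return false;  /* link did not train */
      msleep(10);
      timeout -= 10;
    }

    /* PCIe 4.0 sec 6.6.1: 100 ms after link training completes */
    msleep(100);
    return true;
  }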

> Ideally someone expert in PCIe but not in Linux would be able to look
> at the local code and verify that it matches the spec.  If verification
> requires extensive analysis or someone expert in *both* PCIe and
> Linux, it makes maintenance much harder.

Well, whoever reads the code needs to be an expert in both anyway.

> > > The outline of the pci_pm_resume_noirq() part of this patch is:
> > > 
> > >   pci_pm_resume_noirq
> > >     if (!dev->skip_bus_pm ...)   # <-- condition 1
> > >       pci_pm_default_resume_early
> > >         pci_power_up
> > >           if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> > >             platform_pci_set_power_state
> > >               pci_platform_pm->set_state
> > >                 acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> > >                   acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> > > +   if (d3cold)                  # <-- condition 2
> > > +     pci_bridge_wait_for_secondary_bus
> > > 
> > > Another thing that niggles at me here is that the condition for
> > > calling pci_bridge_wait_for_secondary_bus() is completely different
> > > than the condition for changing the power state.  If we didn't change
> > > the power state, there's no reason to wait, is there?
> > 
> > Indeed, if you are talking about the dev->skip_bus_pm check, there is no
> > point in waiting if we did not change the power state. I would assume that
> > d3cold is false in that case, but we could also do this for clarity:
> > 
> > 	if (!dev->skip_bus_pm && d3cold)
> > 		pci_bridge_wait_for_secondary_bus(...)
> 
> Could the wait go in pci_power_up()?  That would at least connect it
> directly with a -> D0 transition.  Or, if that doesn't seem the right
> place for it, could we do the following?
> 
>   if (!(pci_dev->skip_bus_pm && pm_suspend_no_platform())) {
>     pci_pm_default_resume_early(pci_dev);
>     if (d3cold)
>       pci_bridge_wait_for_secondary_bus(pci_dev);
>   }
> 
>   pci_fixup_device(pci_fixup_resume_early, pci_dev);
>   pcie_pme_root_status_cleanup(pci_dev);
> 
>   if (pci_has_legacy_pm_support(pci_dev))
>     return pci_legacy_resume_early(dev);
>   ...

The reason why pci_bridge_wait_for_secondary_bus() is called almost last
is that I figured we want to resume the root/downstream port completely
first before we start delaying for the device downstream. It needs to be
called before the port services (pciehp) are resumed, though.

If you think it is fine to do the delay before we have restored
everything I can move it inside pci_power_up() or call it after
pci_pm_default_resume_early() as above. I think at least we should make
sure all the saved registers are restored before so that the link
activation check actually works.

> Either way would also fix the problem that with the current patch, if
> the device has legacy PM support, we call pci_legacy_resume_early()
> but we don't wait for the secondary bus.

True.

> > > The outline of the pci_pm_runtime_resume() part of this patch is:
> > > 
> > >   pci_pm_runtime_resume
> > >     pci_restore_standard_config
> > >       if (dev->current_state != PCI_D0)
> > >         pci_set_power_state(PCI_D0)
> > >           __pci_start_power_transition
> > >             pci_platform_power_transition
> > >               if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> > >                 platform_pci_set_power_state
> > >                   pci_platform_pm->set_state
> > >                     acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> > >                       acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> > >               pci_raw_set_power_state
> > >           __pci_complete_power_transition
> > > +   if (d3cold)
> > > +     pci_bridge_wait_for_secondary_bus
> > > 
> > > In this part, the power state change is inside
> > > pci_restore_standard_config(), which calls pci_set_power_state().
> > > There are many other callers of pci_set_power_state(); can we be sure
> > > that none of them need a delay?
> > 
> > Since we are handling the delay when we resume the downstream port, not
> > when we resume the device itself, I think the link should be up already
> > and the device accessible if someone calls pci_set_power_state() for it
> > (as the parent is always resumed before children).
> 
> Ah, yeah, I guess that since all the calls I see are for non-bridge
> devices, there would be no delay for a secondary bus.
> 
> This is a tangent, but there are ~140 pci_set_power_state(PCI_D0)
> calls, mostly from .resume() methods of drivers with legacy PM.  Are
> those even necessary?  It looks like the PCI core does this so the
> driver wouldn't need to:
> 
>   pci_pm_resume_noirq
>     pci_pm_default_resume_early
>       pci_power_up
>         pci_raw_set_power_state(dev, PCI_D0)   # <-- PCI core
> 
>   pci_pm_resume
>     if (pci_has_legacy_pm_support(pci_dev))
>       pci_legacy_resume(dev)
>         drv->resume
> 	  pci_set_power_state(PCI_D0)          # <-- driver .resume()

I don't think most of them are necessary anymore.
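
Just to illustrate the pattern in question, a hypothetical legacy-PM
driver (foo_resume() and foo_hw_reinit() are made-up names, not from any
real driver) typically does something like:

  static int foo_resume(struct pci_dev *pdev)
  {
    /* Redundant: pci_pm_resume_noirq() already put the device in D0 */
    pci_set_power_state(pdev, PCI_D0);

    /* Typically also handled by the PCI core before .resume() runs */
    pci_restore_state(pdev);

    return foo_hw_reinit(pdev);  /* device specific re-initialization */
  }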

> > > > > But it seems that at least some ACPI firmware doesn't do those
> > > > > delays, so I guess our only alternatives are to always do it in
> > > > > the OS or have some sort of blacklist.  And it doesn't really seem
> > > > > practical to maintain a blacklist.
> > > > 
> > > > I really think this is crystal clear:
> > > 
> > > I am agreeing with you that the OS needs to do the delays.
> 
> Did you miss this part?  I said below that the existence of the _DSM
> *by itself* doesn't convince me.  But I think the lack of clarity and
> the fact that at least some firmware doesn't do it means that the OS
> must do it.

Yes, I missed this part. Sorry about that.
Bjorn Helgaas Nov. 5, 2019, midnight UTC | #13
On Fri, Nov 01, 2019 at 01:19:18PM +0200, Mika Westerberg wrote:
> On Thu, Oct 31, 2019 at 05:31:44PM -0500, Bjorn Helgaas wrote:
> > On Wed, Oct 30, 2019 at 01:15:16PM +0200, Mika Westerberg wrote:
> > > On Tue, Oct 29, 2019 at 03:27:09PM -0500, Bjorn Helgaas wrote:

> > I'm not a huge fan of relying on async because the asynchrony is far
> > removed from this code and really hard to figure out.  Maybe an
> > alternative would be to figure out in the pci_pm_resume_noirq(RP) path
> > how many levels of links to wait for.
> 
> There is a problem with this. For gen3 and higher speeds we need to wait
> for the link (each link) to become active before we delay. If we did this
> only in the root port, it would need to enumerate all the ports and
> handle this, which complicates things unnecessarily.

I agree, that doesn't sound good.  If we're resuming a Downstream
Port, I don't think we should be reading Link Status from other ports
farther downstream.

> > > > The outline of the pci_pm_resume_noirq() part of this patch is:
> > > > 
> > > >   pci_pm_resume_noirq
> > > >     if (!dev->skip_bus_pm ...)   # <-- condition 1
> > > >       pci_pm_default_resume_early
> > > >         pci_power_up
> > > >           if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> > > >             platform_pci_set_power_state
> > > >               pci_platform_pm->set_state
> > > >                 acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> > > >                   acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> > > > +   if (d3cold)                  # <-- condition 2
> > > > +     pci_bridge_wait_for_secondary_bus

> The reason why pci_bridge_wait_for_secondary_bus() is called almost last
> is that I figured we want to resume the root/downstream port completely
> first before we start delaying for the device downstream.

For understandability, I think the wait needs to go in some function
that contains "PCI_D0", e.g., platform_pci_set_power_state() or
pci_power_up(), so it's connected with the transition from D3cold to
D0.

Since pci_pm_default_resume_early() is the only caller of
pci_power_up(), maybe we should just inline pci_power_up(), e.g.,
something like this:

  static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
  {
    pci_power_t prev_state = pci_dev->current_state;

    if (platform_pci_power_manageable(pci_dev))
      platform_pci_set_power_state(pci_dev, PCI_D0);

    pci_raw_set_power_state(pci_dev, PCI_D0);
    pci_update_current_state(pci_dev, PCI_D0);

    pci_restore_state(pci_dev);
    pci_pme_restore(pci_dev);

    if (prev_state == PCI_D3cold)
      pci_bridge_wait_for_secondary_bus(pci_dev);
  }

I don't understand why we call both platform_pci_set_power_state() and
pci_raw_set_power_state().  I thought platform_pci_set_power_state()
should put the device in D0, so we shouldn't need the PCI_PM_CTRL
update in pci_raw_set_power_state(), although we probably do need
things like pci_restore_bars() and pcie_aspm_pm_state_change().

And in fact, it seems wrong that if platform_pci_set_power_state()
puts the device in D0 and the device lacks a PM capability, we bail
out of pci_raw_set_power_state() before calling pci_restore_bars().

Tangent: I think "pci_pm_default_resume_early" is the wrong name for
this because "default" suggests that this is what we fall back to if a
driver or arch doesn't supply a more specific method.  But here we're
doing mandatory things that cannot be overridden, so I think something
like "pci_pm_core_resume_early()" would be more accurate.

> It needs to be called before the port services (pciehp) are resumed, though.

I guess this is because pciehp_resume() looks at PCI_EXP_LNKSTA and
will be confused if the link isn't up yet?

> If you think it is fine to do the delay before we have restored
> everything I can move it inside pci_power_up() or call it after
> pci_pm_default_resume_early() as above. I think at least we should make
> sure all the saved registers are restored before so that the link
> activation check actually works.

What needs to be restored to make pcie_wait_for_link_delay() work?
And what event does the restore need to be ordered with?  I could
imagine needing to restore something like Target Link Speed before
waiting, but that sounds racy unless we force a link retrain after
restoring it.

Bjorn
Mika Westerberg Nov. 5, 2019, 9:54 a.m. UTC | #14
On Mon, Nov 04, 2019 at 06:00:00PM -0600, Bjorn Helgaas wrote:
> > > > > The outline of the pci_pm_resume_noirq() part of this patch is:
> > > > > 
> > > > >   pci_pm_resume_noirq
> > > > >     if (!dev->skip_bus_pm ...)   # <-- condition 1
> > > > >       pci_pm_default_resume_early
> > > > >         pci_power_up
> > > > >           if (platform_pci_power_manageable())   # _PS0 or _PR0 exist?
> > > > >             platform_pci_set_power_state
> > > > >               pci_platform_pm->set_state
> > > > >                 acpi_pci_set_power_state(PCI_D0) # acpi_pci_platform_pm.set_state
> > > > >                   acpi_device_set_power(ACPI_STATE_D0) # <-- eval _PS0
> > > > > +   if (d3cold)                  # <-- condition 2
> > > > > +     pci_bridge_wait_for_secondary_bus
> 
> > The reason why pci_bridge_wait_for_secondary_bus() is called almost last
> > is that I figured we want to resume the root/downstream port completely
> > first before we start delaying for the device downstream.
> 
> For understandability, I think the wait needs to go in some function
> that contains "PCI_D0", e.g., platform_pci_set_power_state() or
> pci_power_up(), so it's connected with the transition from D3cold to
> D0.
> 
> Since pci_pm_default_resume_early() is the only caller of
> pci_power_up(), maybe we should just inline pci_power_up(), e.g.,
> something like this:
> 
>   static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
>   {
>     pci_power_t prev_state = pci_dev->current_state;
> 
>     if (platform_pci_power_manageable(pci_dev))
>       platform_pci_set_power_state(pci_dev, PCI_D0);
> 
>     pci_raw_set_power_state(pci_dev, PCI_D0);
>     pci_update_current_state(pci_dev, PCI_D0);
> 
>     pci_restore_state(pci_dev);
>     pci_pme_restore(pci_dev);
> 
>     if (prev_state == PCI_D3cold)
>       pci_bridge_wait_for_secondary_bus(pci_dev);
>   }

OK, I'll see if this works.

> I don't understand why we call both platform_pci_set_power_state() and
> pci_raw_set_power_state().

platform_pci_set_power_state() deals with the ACPI methods such as
calling _PS0 after D3hot. To transition the device from D3hot to D0 you
need the PMCSR write, which is done in pci_raw_set_power_state().
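
Roughly, that PMCSR write looks like this (a simplified sketch of the
D3hot -> D0 path, not the actual pci_raw_set_power_state()):

  u16 pmcsr;

  /* Clear the PowerState field in PMCSR to request D0 */
  pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
  pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
  pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);

  /* D3hot -> D0 requires a 10 ms transition delay per the PCI PM spec */
  msleep(10);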

> I thought platform_pci_set_power_state()
> should put the device in D0, so we shouldn't need the PCI_PM_CTRL
> update in pci_raw_set_power_state(), although we probably do need
> things like pci_restore_bars() and pcie_aspm_pm_state_change().
> 
> And in fact, it seems wrong that if platform_pci_set_power_state()
> puts the device in D0 and the device lacks a PM capability, we bail
> out of pci_raw_set_power_state() before calling pci_restore_bars().
> 
> Tangent: I think "pci_pm_default_resume_early" is the wrong name for
> this because "default" suggests that this is what we fall back to if a
> driver or arch doesn't supply a more specific method.  But here we're
> doing mandatory things that cannot be overridden, so I think something
> like "pci_pm_core_resume_early()" would be more accurate.
> 
> > It needs to be called before the port services (pciehp) are resumed, though.
> 
> I guess this is because pciehp_resume() looks at PCI_EXP_LNKSTA and
> will be confused if the link isn't up yet?

Yes.

> > If you think it is fine to do the delay before we have restored
> > everything I can move it inside pci_power_up() or call it after
> > pci_pm_default_resume_early() as above. I think at least we should make
> > sure all the saved registers are restored before so that the link
> > activation check actually works.
> 
> What needs to be restored to make pcie_wait_for_link_delay() work?

I'm not entirely sure. I think that pci_restore_state() at least should
be called so that the PCIe capability gets restored. Maybe not even that,
because Data Link Layer Link Active always reflects the DL_Active state
and does not need to be enabled separately.

> And what event does the restore need to be ordered with?

Not sure I follow you here.
Mika Westerberg Nov. 5, 2019, 12:58 p.m. UTC | #15
On Tue, Nov 05, 2019 at 11:54:33AM +0200, Mika Westerberg wrote:
> > For understandability, I think the wait needs to go in some function
> > that contains "PCI_D0", e.g., platform_pci_set_power_state() or
> > pci_power_up(), so it's connected with the transition from D3cold to
> > D0.
> > 
> > Since pci_pm_default_resume_early() is the only caller of
> > pci_power_up(), maybe we should just inline pci_power_up(), e.g.,
> > something like this:
> > 
> >   static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
> >   {
> >     pci_power_t prev_state = pci_dev->current_state;
> > 
> >     if (platform_pci_power_manageable(pci_dev))
> >       platform_pci_set_power_state(pci_dev, PCI_D0);
> > 
> >     pci_raw_set_power_state(pci_dev, PCI_D0);
> >     pci_update_current_state(pci_dev, PCI_D0);
> > 
> >     pci_restore_state(pci_dev);
> >     pci_pme_restore(pci_dev);
> > 
> >     if (prev_state == PCI_D3cold)
> >       pci_bridge_wait_for_secondary_bus(pci_dev);
> >   }
> 
> OK, I'll see if this works.

Well, I think we want to execute pci_fixup_resume_early before we delay
for the downstream component (same for runtime resume path). Currently
nobody is using that for root/downstream ports but in theory at least
some port may need it before it works properly. It is also probably a
good idea to disable wake as soon as possible to avoid extra PME messages
coming from the port.

I feel that the following is the right place to perform the delay but if
you think we can ignore the above, I will just do what you say and do it
in pci_pm_default_resume_early() (and pci_restore_standard_config() for
runtime path).

[The below diff does not have check for pci_dev->skip_bus_pm because I
 was planning to move it inside pci_bridge_wait_for_secondary_bus() itself.]

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 08d3bdbc8c04..3c0e52aaef79 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -890,6 +890,7 @@ static int pci_pm_resume_noirq(struct device *dev)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	pci_power_t prev_state = pci_dev->current_state;
 
 	if (dev_pm_may_skip_resume(dev))
 		return 0;
@@ -914,6 +915,9 @@ static int pci_pm_resume_noirq(struct device *dev)
 	pci_fixup_device(pci_fixup_resume_early, pci_dev);
 	pcie_pme_root_status_cleanup(pci_dev);
 
+	if (prev_state == PCI_D3cold)
+		pci_bridge_wait_for_secondary_bus(pci_dev);
+
 	if (pci_has_legacy_pm_support(pci_dev))
 		return 0;
 
@@ -1299,6 +1303,7 @@ static int pci_pm_runtime_resume(struct device *dev)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	pci_power_t prev_state = pci_dev->current_state;
 	int error = 0;
 
 	/*
@@ -1314,6 +1319,9 @@ static int pci_pm_runtime_resume(struct device *dev)
 	pci_fixup_device(pci_fixup_resume_early, pci_dev);
 	pci_pm_default_resume(pci_dev);
 
+	if (prev_state == PCI_D3cold)
+		pci_bridge_wait_for_secondary_bus(pci_dev);
+
 	if (pm && pm->runtime_resume)
 		error = pm->runtime_resume(dev);
Bjorn Helgaas Nov. 5, 2019, 3 p.m. UTC | #16
On Tue, Nov 05, 2019 at 11:54:28AM +0200, Mika Westerberg wrote:
> On Mon, Nov 04, 2019 at 06:00:00PM -0600, Bjorn Helgaas wrote:

> > > If you think it is fine to do the delay before we have restored
> > > everything I can move it inside pci_power_up() or call it after
> > > pci_pm_default_resume_early() as above. I think at least we should make
> > > sure all the saved registers are restored before so that the link
> > > activation check actually works.
> > 
> > What needs to be restored to make pcie_wait_for_link_delay() work?
> 
> I'm not entirely sure. I think that pci_restore_state() at least should
> be called so that the PCIe capability gets restored. Maybe not even that,
> because Data Link Layer Link Active always reflects the DL_Active state
> and does not need to be enabled separately.
> 
> > And what event does the restore need to be ordered with?
> 
> Not sure I follow you here.

You're suggesting that we should restore saved registers first so
pcie_wait_for_link_delay() works.  If the link activation depends on
something being restored and we don't enforce an ordering, the
activation might succeed or fail depending on whether it happens
before or after the restore.  So if there is a dependency, we should
make it explicit to avoid a race like that.

But I'm not saying we *shouldn't* do the restore before the wait; only
that any dependency should be explicit.  Even if there is no actual
dependency it probably makes sense to do the restore first so it can
overlap with the hardware link training, which may reduce the time
pcie_wait_for_link_delay() has to wait when we do call it, e.g.,

  |-----------------|          link activation
     |-----|                   restore state
           |--------|          pcie_wait_for_link_delay()

whereas if we do the restore after waiting for the link to come up, it
probably takes longer:

  |-----------------|          link activation
     |--------------|          pcie_wait_for_link_delay()
                    |-----|    restore state

I actually suspect there *is* a dependency -- we should respect the
Target Link Speed and and width so the link resumes in the same
configuration it was before suspend.  And I suspect that may require
an explicit retrain after restoring PCI_EXP_LNKCTL2.

Bjorn
Mika Westerberg Nov. 5, 2019, 3:28 p.m. UTC | #17
On Tue, Nov 05, 2019 at 09:00:13AM -0600, Bjorn Helgaas wrote:
> On Tue, Nov 05, 2019 at 11:54:28AM +0200, Mika Westerberg wrote:
> > On Mon, Nov 04, 2019 at 06:00:00PM -0600, Bjorn Helgaas wrote:
> 
> > > > If you think it is fine to do the delay before we have restored
> > > > everything I can move it inside pci_power_up() or call it after
> > > > pci_pm_default_resume_early() as above. I think at least we should make
> > > > sure all the saved registers are restored before so that the link
> > > > activation check actually works.
> > > 
> > > What needs to be restored to make pcie_wait_for_link_delay() work?
> > 
> > I'm not entirely sure. I think that pci_restore_state() at least should
> > be called so that the PCIe capability gets restored. Maybe not even that,
> > because Data Link Layer Link Active always reflects the DL_Active state
> > and does not need to be enabled separately.
> > 
> > > And what event does the restore need to be ordered with?
> > 
> > Not sure I follow you here.
> 
> You're suggesting that we should restore saved registers first so
> pcie_wait_for_link_delay() works.  If the link activation depends on
> something being restored and we don't enforce an ordering, the
> activation might succeed or fail depending on whether it happens
> before or after the restore.  So if there is a dependency, we should
> make it explicit to avoid a race like that.

OK thanks. By explicit you mean document it in the code, right?

> But I'm not saying we *shouldn't* do the restore before the wait; only
> that any dependency should be explicit.  Even if there is no actual
> dependency it probably makes sense to do the restore first so it can
> overlap with the hardware link training, which may reduce the time
> pcie_wait_for_link_delay() has to wait when we do call it, e.g.,
> 
>   |-----------------|          link activation
>      |-----|                   restore state
>            |--------|          pcie_wait_for_link_delay()
> 
> whereas if we do the restore after waiting for the link to come up, it
> probably takes longer:
> 
>   |-----------------|          link activation
>      |--------------|          pcie_wait_for_link_delay()
>                     |-----|    restore state
> 
> I actually suspect there *is* a dependency -- we should respect the
> Target Link Speed and width so the link resumes in the same
> configuration it was in before suspend.  And I suspect that may require
> an explicit retrain after restoring PCI_EXP_LNKCTL2.

According to the PCIe spec the PCI_EXP_LNKCTL2 Target Link Speed field is
marked as RWS (S for sticky), so I suspect its value is retained after
reset in the same way as the PME bits, assuming I understood it correctly.
Bjorn Helgaas Nov. 5, 2019, 4:10 p.m. UTC | #18
On Tue, Nov 05, 2019 at 05:28:32PM +0200, Mika Westerberg wrote:
> On Tue, Nov 05, 2019 at 09:00:13AM -0600, Bjorn Helgaas wrote:
> > On Tue, Nov 05, 2019 at 11:54:28AM +0200, Mika Westerberg wrote:
> > > On Mon, Nov 04, 2019 at 06:00:00PM -0600, Bjorn Helgaas wrote:
> > 
> > > > > If you think it is fine to do the delay before we have restored
> > > > > everything I can move it inside pci_power_up() or call it after
> > > > > pci_pm_default_resume_early() as above. I think at least we should make
> > > > > sure all the saved registers are restored before so that the link
> > > > > activation check actually works.
> > > > 
> > > > What needs to be restored to make pcie_wait_for_link_delay() work?
> > > 
> > > I'm not entirely sure. I think that pci_restore_state() at least should
> > > be called so that the PCIe capability gets restored. Maybe not even that,
> > > because Data Link Layer Link Active always reflects the DL_Active state
> > > and does not need to be enabled separately.
> > > 
> > > > And what event does the restore need to be ordered with?
> > > 
> > > Not sure I follow you here.
> > 
> > You're suggesting that we should restore saved registers first so
> > pcie_wait_for_link_delay() works.  If the link activation depends on
> > something being restored and we don't enforce an ordering, the
> > activation might succeed or fail depending on whether it happens
> > before or after the restore.  So if there is a dependency, we should
> > make it explicit to avoid a race like that.
> 
> OK thanks. By explicit you mean document it in the code, right?

So far all we have is a feeling that maybe we ought to restore before
waiting, but I don't really know why.  If there's an actual
dependency, we should chase down specifically what it is and add a
comment or code (e.g., a link retrain) as appropriate.

> > I actually suspect there *is* a dependency -- we should respect the
> > Target Link Speed and width so the link resumes in the same
> > configuration it was in before suspend.  And I suspect that may require
> > an explicit retrain after restoring PCI_EXP_LNKCTL2.
> 
> According to the PCIe spec the PCI_EXP_LNKCTL2 Target Link Speed field is
> marked as RWS (S for sticky), so I suspect its value is retained after
> reset in the same way as the PME bits, assuming I understood it correctly.

This patch is about coming from D3cold, isn't it?  I don't think we
can assume sticky bits are preserved in D3cold (except maybe when
auxiliary power is enabled).
Bjorn Helgaas Nov. 5, 2019, 8:01 p.m. UTC | #19
On Tue, Nov 05, 2019 at 02:58:18PM +0200, Mika Westerberg wrote:
> On Tue, Nov 05, 2019 at 11:54:33AM +0200, Mika Westerberg wrote:
> > > For understandability, I think the wait needs to go in some function
> > > that contains "PCI_D0", e.g., platform_pci_set_power_state() or
> > > pci_power_up(), so it's connected with the transition from D3cold to
> > > D0.
> > > 
> > > Since pci_pm_default_resume_early() is the only caller of
> > > pci_power_up(), maybe we should just inline pci_power_up(), e.g.,
> > > something like this:
> > > 
> > >   static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
> > >   {
> > >     pci_power_t prev_state = pci_dev->current_state;
> > > 
> > >     if (platform_pci_power_manageable(pci_dev))
> > >       platform_pci_set_power_state(pci_dev, PCI_D0);
> > > 
> > >     pci_raw_set_power_state(pci_dev, PCI_D0);
> > >     pci_update_current_state(pci_dev, PCI_D0);
> > > 
> > >     pci_restore_state(pci_dev);
> > >     pci_pme_restore(pci_dev);
> > > 
> > >     if (prev_state == PCI_D3cold)
> > >       pci_bridge_wait_for_secondary_bus(pci_dev);
> > >   }
> > 
> > OK, I'll see if this works.
> 
> Well, I think we want to execute pci_fixup_resume_early before we delay
> for the downstream component (same for runtime resume path). Currently
> nobody is using that for root/downstream ports but in theory at least
> some port may need it before it works properly. It is also probably a
> good idea to disable wake as soon as possible to avoid extra PME messages
> coming from the port.

OK, I wish we could connect it more closely with the actual power-on,
but I guess that makes sense.

> I feel that the following is the right place to perform the delay but if
> you think we can ignore the above, I will just do what you say and do it
> in pci_pm_default_resume_early() (and pci_restore_standard_config() for
> runtime path).
> 
> [The below diff does not have check for pci_dev->skip_bus_pm because I
>  was planning to move it inside pci_bridge_wait_for_secondary_bus() itself.]

What do you gain by moving it?  IIUC we want them to be the same
condition, and if one is in pci_pm_resume_noirq() and another is
inside pci_bridge_wait_for_secondary_bus(), it's hard to see that
they're connected.  I'd rather have pci_pm_resume_noirq() check it
once, save the result, and test that result before calling
pci_pm_default_resume_early() and pci_bridge_wait_for_secondary_bus().
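
Expressed as code, that would be roughly the following in
pci_pm_resume_noirq() (a sketch only, reusing the condition from the
earlier outline, not a tested patch):

  struct pci_dev *pci_dev = to_pci_dev(dev);
  bool skip_bus_pm = pci_dev->skip_bus_pm && pm_suspend_no_platform();
  bool d3cold = pci_dev->current_state == PCI_D3cold;

  /* Evaluate the skip condition once and reuse it for both calls */
  if (!skip_bus_pm) {
    pci_pm_default_resume_early(pci_dev);
    if (d3cold)
      pci_bridge_wait_for_secondary_bus(pci_dev);
  }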

> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 08d3bdbc8c04..3c0e52aaef79 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -890,6 +890,7 @@ static int pci_pm_resume_noirq(struct device *dev)
>  {
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
>  	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	pci_power_t prev_state = pci_dev->current_state;
>  
>  	if (dev_pm_may_skip_resume(dev))
>  		return 0;
> @@ -914,6 +915,9 @@ static int pci_pm_resume_noirq(struct device *dev)
>  	pci_fixup_device(pci_fixup_resume_early, pci_dev);
>  	pcie_pme_root_status_cleanup(pci_dev);
>  
> +	if (prev_state == PCI_D3cold)
> +		pci_bridge_wait_for_secondary_bus(pci_dev);
> +
>  	if (pci_has_legacy_pm_support(pci_dev))
>  		return 0;
>  
> @@ -1299,6 +1303,7 @@ static int pci_pm_runtime_resume(struct device *dev)
>  {
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
>  	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	pci_power_t prev_state = pci_dev->current_state;
>  	int error = 0;
>  
>  	/*
> @@ -1314,6 +1319,9 @@ static int pci_pm_runtime_resume(struct device *dev)
>  	pci_fixup_device(pci_fixup_resume_early, pci_dev);
>  	pci_pm_default_resume(pci_dev);
>  
> +	if (prev_state == PCI_D3cold)
> +		pci_bridge_wait_for_secondary_bus(pci_dev);
> +
>  	if (pm && pm->runtime_resume)
>  		error = pm->runtime_resume(dev);
>
Mika Westerberg Nov. 6, 2019, 1:29 p.m. UTC | #20
On Tue, Nov 05, 2019 at 10:10:17AM -0600, Bjorn Helgaas wrote:
> > > I actually suspect there *is* a dependency -- we should respect the
> > > Target Link Speed and width so the link resumes in the same
> > > configuration it was in before suspend.  And I suspect that may require
> > > an explicit retrain after restoring PCI_EXP_LNKCTL2.
> > 
> > According to the PCIe spec the PCI_EXP_LNKCTL2 Target Link Speed field is
> > marked as RWS (S for sticky), so I suspect its value is retained after
> > reset in the same way as the PME bits, assuming I understood it correctly.
> 
> This patch is about coming from D3cold, isn't it?  I don't think we
> can assume sticky bits are preserved in D3cold (except maybe when
> auxiliary power is enabled).

Indeed, good point. I see some GPU drivers program the Target Link Speed,
which will not be retained after the hierarchy is put into D3cold and
back. I think this potential problem is not related to the missing link
delays this patch is addressing, though; it already exists in
pci_restore_pcie_state() (where PCI_EXP_LNKCTL2 is restored).

I think this can be solved as a separate patch by doing something
like:

  1. In pci_restore_pcie_state() check if the saved Target Link Speed
     differs from what is in the register currently.

  2. Restore the value as we already do now.

  3. If the speed differs, trigger link retraining.

  4. Restore the rest of the root/downstream port state.

It is not clear if we need to do anything for upstream ports (PCIe 5.0
sec 6.11 talks about doing this on the upstream component, e.g. the
downstream port). After this there will be the link delay (added by this
patch) which takes care of waiting for the downstream component to become
accessible (even after a retrain).

However, I'm not sure how this can be properly tested. Maybe by hacking
some downstream port to a lower speed, entering D3cold, then resuming it
and seeing whether the Target Link Speed gets updated correctly.
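
A rough sketch of steps 1-3 (illustrative only; pcie_restore_target_link_speed()
is a made-up helper and not part of this patch):

  static void pcie_restore_target_link_speed(struct pci_dev *parent,
                                             u16 saved_lnkctl2)
  {
    u16 lnkctl2;
    bool speed_changed;

    /* Step 1: does the saved Target Link Speed differ from the current one? */
    pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &lnkctl2);
    speed_changed = (lnkctl2 & PCI_EXP_LNKCTL2_TLS) !=
                    (saved_lnkctl2 & PCI_EXP_LNKCTL2_TLS);

    /* Step 2: restore the value as we already do now */
    pcie_capability_write_word(parent, PCI_EXP_LNKCTL2, saved_lnkctl2);

    /* Step 3: if the speed differed, trigger link retraining */
    if (speed_changed)
      pcie_capability_set_word(parent, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_RL);
  }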
Mika Westerberg Nov. 6, 2019, 1:31 p.m. UTC | #21
On Tue, Nov 05, 2019 at 02:01:05PM -0600, Bjorn Helgaas wrote:
> > I feel that the following is the right place to perform the delay but if
> > you think we can ignore the above, I will just do what you say and do it
> > in pci_pm_default_resume_early() (and pci_restore_standard_config() for
> > runtime path).
> > 
> > [The below diff does not have check for pci_dev->skip_bus_pm because I
> >  was planning to move it inside pci_bridge_wait_for_secondary_bus() itself.]
> 
> What do you gain by moving it?  IIUC we want them to be the same
> condition, and if one is in pci_pm_resume_noirq() and another is
> inside pci_bridge_wait_for_secondary_bus(), it's hard to see that
> they're connected.  I'd rather have pci_pm_resume_noirq() check it
> once, save the result, and test that result before calling
> pci_pm_default_resume_early() and pci_bridge_wait_for_secondary_bus().

Fair enough :)
diff mbox series

Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index a8124e47bf6e..74a144c9cf4e 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -917,6 +917,7 @@  static int pci_pm_suspend_noirq(struct device *dev)
 static int pci_pm_resume_noirq(struct device *dev)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
+	bool d3cold = pci_dev->current_state == PCI_D3cold;
 	struct device_driver *drv = dev->driver;
 	int error = 0;
 
@@ -947,6 +948,14 @@  static int pci_pm_resume_noirq(struct device *dev)
 
 	pcie_pme_root_status_cleanup(pci_dev);
 
+	/*
+	 * If the hierarchy went into D3cold wait for the secondary bus to
+	 * become accessible. This is important for PCIe to prevent pciehp
+	 * from tearing down the downstream devices too soon.
+	 */
+	if (d3cold)
+		pci_bridge_wait_for_secondary_bus(pci_dev);
+
 	if (drv && drv->pm && drv->pm->resume_noirq)
 		error = drv->pm->resume_noirq(dev);
 
@@ -1329,6 +1338,7 @@  static int pci_pm_runtime_resume(struct device *dev)
 	int rc = 0;
 	struct pci_dev *pci_dev = to_pci_dev(dev);
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	bool d3cold = pci_dev->current_state == PCI_D3cold;
 
 	/*
 	 * Restoring config space is necessary even if the device is not bound
@@ -1344,6 +1354,14 @@  static int pci_pm_runtime_resume(struct device *dev)
 	pci_enable_wake(pci_dev, PCI_D0, false);
 	pci_fixup_device(pci_fixup_resume, pci_dev);
 
+	/*
+	 * If the hierarchy went into D3cold wait for the secondary bus to
+	 * become accessible. This is important for PCIe to prevent pciehp
+	 * from tearing down the downstream devices too soon.
+	 */
+	if (d3cold)
+		pci_bridge_wait_for_secondary_bus(pci_dev);
+
 	if (pm && pm->runtime_resume)
 		rc = pm->runtime_resume(dev);
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bfd92e018925..749c4625dea4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1025,15 +1025,11 @@  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
 	if (state == PCI_D0) {
 		pci_platform_power_transition(dev, PCI_D0);
 		/*
-		 * Mandatory power management transition delays, see
-		 * PCI Express Base Specification Revision 2.0 Section
-		 * 6.6.1: Conventional Reset.  Do not delay for
-		 * devices powered on/off by corresponding bridge,
-		 * because have already delayed for the bridge.
+		 * Mandatory power management transition delays are handled
+		 * in pci_pm_runtime_resume() of the corresponding
+		 * downstream/root port.
 		 */
 		if (dev->runtime_d3cold) {
-			if (dev->d3cold_delay && !dev->imm_ready)
-				msleep(dev->d3cold_delay);
 			/*
 			 * When powering on a bridge from D3cold, the
 			 * whole hierarchy may be powered on into
@@ -4673,6 +4669,88 @@  bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 	return pcie_wait_for_link_delay(pdev, active, 100);
 }
 
+/**
+ * pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
+ * @dev: PCI bridge
+ *
+ * Handle necessary delays before access to the devices on the secondary
+ * side of the bridge are permitted after D3cold to D0 transition.
+ *
+ * For PCIe this means the delays in PCIe 4.0 chapter 6.6.1. For
+ * conventional PCI it means Tpvrh + Trhfa specified in PCI 3.0 chapter
+ * 4.3.2.
+ */
+void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
+{
+	struct pci_dev *child;
+
+	if (pci_dev_is_disconnected(dev))
+		return;
+
+	if (!pci_is_bridge(dev) || !dev->bridge_d3)
+		return;
+
+	/*
+	 * We only deal with devices that are present currently on the bus.
+	 * For any hot-added devices the access delay is handled in pciehp
+	 * board_added(). In case of ACPI hotplug the firmware is expected
+	 * to configure the devices before OS is notified.
+	 */
+	if (!dev->subordinate || list_empty(&dev->subordinate->devices))
+		return;
+
+	/*
+	 * For conventional PCI and PCI-X we need to wait Tpvrh + Trhfa before
+	 * accessing the device after reset (that is 100 ms + 1000 ms). In
+	 * practice this should not be needed because we don't do power
+	 * management for them (see pci_bridge_d3_possible()).
+	 */
+	if (!pci_is_pcie(dev)) {
+		pci_dbg(dev, "waiting 1100 ms for secondary bus\n");
+		msleep(1100);
+		return;
+	}
+
+	/*
+	 * For PCIe downstream and root ports that do not support speeds
+	 * greater than 5 GT/s we need to wait a minimum of 100 ms. For higher
+	 * speeds (gen3) we need to wait first for the data link layer to
+	 * become active.
+	 *
+	 * However, 100 ms is the minimum and the PCIe spec says the
+	 * software must allow at least 1s before it can determine that the
+	 * device that did not respond is a broken device. There is
+	 * evidence that 100 ms is not always enough, for example certain
+	 * Titan Ridge xHCI controller does not always respond to
+	 * configuration requests if we only wait for 100 ms (see
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=203885).
+	 *
+	 * Therefore we wait for 100 ms and check for the device presence.
+	 * If it is still not present give it an additional 100 ms.
+	 */
+	if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT &&
+	    pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)
+		return;
+
+	if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
+		pci_dbg(dev, "waiting 100 ms for downstream link\n");
+		msleep(100);
+	} else {
+		pci_dbg(dev, "waiting 100 ms for downstream link, after activation\n");
+		if (!pcie_wait_for_link_delay(dev, true, 100)) {
+			/* Did not train, no need to wait any further */
+			return;
+		}
+	}
+
+	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
+				 bus_list);
+	if (!pci_device_is_present(child)) {
+		pci_dbg(child, "waiting additional 100 ms to become accessible\n");
+		msleep(100);
+	}
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3f6947ee3324..7ade8f077f6e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -104,6 +104,7 @@  void pci_allocate_cap_save_buffers(struct pci_dev *dev);
 void pci_free_cap_save_buffers(struct pci_dev *dev);
 bool pci_bridge_d3_possible(struct pci_dev *dev);
 void pci_bridge_d3_update(struct pci_dev *dev);
+void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev);
 
 static inline void pci_wakeup_event(struct pci_dev *dev)
 {