PCI: Add CRS timeout for pci_device_is_present()

Message ID 20191005182129.32538-1-vidyas@nvidia.com
State Changes Requested
Series PCI: Add CRS timeout for pci_device_is_present()

Commit Message

Vidya Sagar Oct. 5, 2019, 6:21 p.m. UTC
Add a 60-second timeout to accommodate CRS (Configuration Request Retry
Status) from a downstream device when a Vendor ID read is attempted by
an upstream device. This helps with devices that return CRS during
system resume, and it makes pci_device_is_present() consistent with the
probe path, where a 60-second timeout is already used.

Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
---
 drivers/pci/pci.c   | 3 ++-
 drivers/pci/pci.h   | 2 ++
 drivers/pci/probe.c | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

Comments

Thierry Reding Oct. 14, 2019, 8:20 a.m. UTC | #1
On Sat, Oct 05, 2019 at 11:51:29PM +0530, Vidya Sagar wrote:
> Add a 60-second timeout to accommodate CRS (Configuration Request Retry
> Status) from a downstream device when a Vendor ID read is attempted by
> an upstream device. This helps with devices that return CRS during
> system resume, and it makes pci_device_is_present() consistent with the
> probe path, where a 60-second timeout is already used.
> 
> Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
> ---
>  drivers/pci/pci.c   | 3 ++-
>  drivers/pci/pci.h   | 2 ++
>  drivers/pci/probe.c | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)

I think this makes sense, so:

Reviewed-by: Thierry Reding <treding@nvidia.com>

However, it looks like Sinan has researched this extensively in the past
and gave a presentation on this at Plumbers in 2017:

	https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf

Adding Sinan to see if he has any concerns about this, since resume time
is explicitly mentioned in the above slides.

Thierry

> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 95dc78ebdded..3ab9f6d3c8a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>  
>  	if (pci_dev_is_disconnected(pdev))
>  		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>  }
>  EXPORT_SYMBOL_GPL(pci_device_is_present);
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3f6947ee3324..aa25c5fdc6a5 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -4,6 +4,8 @@
>  
>  #include <linux/pci.h>
>  
> +#define PCI_CRS_TIMEOUT		(60 * 1000)	/* 60 sec */
> +
>  #define PCI_FIND_CAP_TTL	48
>  
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 7c5d68b807ef..6e44a03283c8 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2258,7 +2258,7 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	struct pci_dev *dev;
>  	u32 l;
>  
> -	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
> +	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, PCI_CRS_TIMEOUT))
>  		return NULL;
>  
>  	dev = pci_alloc_dev(bus);
> -- 
> 2.17.1
>
Andrew Murray Oct. 14, 2019, 10:45 a.m. UTC | #2
On Sat, Oct 05, 2019 at 11:51:29PM +0530, Vidya Sagar wrote:
> Add a 60-second timeout to accommodate CRS (Configuration Request Retry
> Status) from a downstream device when a Vendor ID read is attempted by
> an upstream device. This helps with devices that return CRS during
> system resume, and it makes pci_device_is_present() consistent with the
> probe path, where a 60-second timeout is already used.

This looks like a good idea.

> 
> Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
> ---
>  drivers/pci/pci.c   | 3 ++-
>  drivers/pci/pci.h   | 2 ++
>  drivers/pci/probe.c | 2 +-
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 95dc78ebdded..3ab9f6d3c8a6 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>  
>  	if (pci_dev_is_disconnected(pdev))
>  		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>  }
>  EXPORT_SYMBOL_GPL(pci_device_is_present);
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3f6947ee3324..aa25c5fdc6a5 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -4,6 +4,8 @@
>  
>  #include <linux/pci.h>
>  
> +#define PCI_CRS_TIMEOUT		(60 * 1000)	/* 60 sec */
> +

This has the same value as PCIE_RESET_READY_POLL_MS defined in pci.c; when
I look at how PCIE_RESET_READY_POLL_MS is used, it seems that we have two
nearly identical ways to handle the same thing.

pci_bus_wait_crs - this seems to be specifically used for handling SV CRS
when reading the Vendor ID.

pci_dev_wait - this seems to be used after FLR, AF_FLR, bus reset and D3-D0;
in all these use cases the timeout is 60 seconds. This function waits for
the function to return a non-CRS completion - however, it doesn't rely on
the SV value of 0x0001.

What is the reason that pci_dev_wait polls on PCI_COMMAND instead of an SV
CRS value? (Is this a hack to gain some CPU time on RCs without SV? I.e.
rather than the hardware retrying, it just gives up, allowing us to retry.)

Is there any reason why these two functions can't be combined?
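
For reference, the Vendor-ID-based wait behind pci_bus_read_dev_vendor_id()
has roughly this shape (a simplified sketch of pci_bus_wait_crs() in
drivers/pci/probe.c at the time of this thread; abridged, not the verbatim
kernel code):

/*
 * Sketch of the SV CRS wait: *l holds the Vendor ID dword the caller
 * just read.  With CRS Software Visibility enabled, a CRS completion
 * is made visible as 0x0001 in the Vendor ID field.
 */
static bool pci_bus_crs_vendor_id(u32 l)
{
	return (l & 0xffff) == 0x0001;
}

static bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
			     int timeout)
{
	int delay = 1;

	while (pci_bus_crs_vendor_id(*l)) {
		if (delay > timeout)
			return false;	/* device never became ready */
		msleep(delay);		/* needs a sleepable context */
		delay *= 2;		/* exponential backoff */
		pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l);
	}

	return true;
}

With a timeout of 0, the very first CRS-flagged read makes this bail out
immediately, which is the pci_device_is_present() behaviour the patch
changes.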

>  #define PCI_FIND_CAP_TTL	48
>  
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 7c5d68b807ef..6e44a03283c8 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2258,7 +2258,7 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	struct pci_dev *dev;
>  	u32 l;
>  
> -	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
> +	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, PCI_CRS_TIMEOUT))

Should you also fix up acpiphp_add_context in drivers/pci/hotplug/acpiphp_glue.c
to use PCI_CRS_TIMEOUT?

Thanks,

Andrew Murray

>  		return NULL;
>  
>  	dev = pci_alloc_dev(bus);
> -- 
> 2.17.1
>
Sinan Kaya Oct. 14, 2019, 8:21 p.m. UTC | #3
On 10/14/2019 1:20 AM, Thierry Reding wrote:
> I think this makes sense, so:
> 
> Reviewed-by: Thierry Reding <treding@nvidia.com>
> 
> However, it looks like Sinan has researched this extensively in the past
> and gave a presentation on this at Plumbers in 2017:
> 
> 	https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
> 
> Adding Sinan to see if he has any concerns about this, since resume time
> is explicitly mentioned in the above slides.


Thanks for including me. Let me catch up here.

pci_dev_wait() is supposed to handle this case via pci_pm_reset().

/**
 * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
 * @dev: Device to reset.
 * @probe: If set, only check if the device can be reset this way.
 */

Do you know if your execution path hits this function? We might have
missed a use case.
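
For context, pci_dev_wait() polls PCI_COMMAND rather than the Vendor ID;
roughly (a simplified sketch of the function in drivers/pci/pci.c; abridged,
not the verbatim kernel code):

static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
{
	int delay = 1;
	u32 id;

	/*
	 * A CRS completion that is not made visible via CRS SV is
	 * synthesized into all-ones data by the Root Port, so ~0
	 * here means "not ready yet".
	 */
	pci_read_config_dword(dev, PCI_COMMAND, &id);
	while (id == ~0) {
		if (delay > timeout) {
			pci_warn(dev, "not ready %dms after %s; giving up\n",
				 delay - 1, reset_type);
			return -ENOTTY;
		}
		msleep(delay);
		delay *= 2;
		pci_read_config_dword(dev, PCI_COMMAND, &id);
	}

	return 0;
}

The loop is nearly identical to the pci_bus_wait_crs() one above apart from
what it polls, which is what prompts Andrew's question about combining them.
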
Thierry Reding Oct. 15, 2019, 9:30 a.m. UTC | #4
On Mon, Oct 14, 2019 at 01:21:31PM -0700, Sinan Kaya wrote:
> On 10/14/2019 1:20 AM, Thierry Reding wrote:
> > I think this makes sense, so:
> > 
> > Reviewed-by: Thierry Reding <treding@nvidia.com>
> > 
> > However, it looks like Sinan has researched this extensively in the past
> > and gave a presentation on this at Plumbers in 2017:
> > 
> > 	https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
> > 
> > Adding Sinan to see if he has any concerns about this, since resume time
> > is explicitly mentioned in the above slides.
> 
> 
> Thanks for including me. Let me catch up here.
> 
> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
> 
> /**
>  * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>  * @dev: Device to reset.
>  * @probe: If set, only check if the device can be reset this way.
>  */
> 
> Do you know if your execution path hits this function? We might have
> missed a use case.
> 

I see only a couple of callers of pci_device_is_present() in the tree,
this being from next-20191015:

	$ git grep -n pci_device_is_present
	drivers/net/ethernet/broadcom/tg3.c:9070:       if (!pci_device_is_present(tp->pdev))
	drivers/net/ethernet/broadcom/tg3.c:11785:      if (pci_device_is_present(tp->pdev)) {
	drivers/net/ethernet/intel/igb/igb_main.c:8838: if (!pci_device_is_present(pdev))
	drivers/nvme/host/pci.c:2866:   if (!pci_device_is_present(pdev)) {
	drivers/pci/hotplug/acpiphp_glue.c:650:         alive = pci_device_is_present(dev);
	drivers/pci/pci.c:935:      !pci_device_is_present(dev)) {
	drivers/pci/pci.c:5902:bool pci_device_is_present(struct pci_dev *pdev)
	drivers/pci/pci.c:5910:EXPORT_SYMBOL_GPL(pci_device_is_present);
	drivers/thunderbolt/nhi.c:939:  if (!pci_device_is_present(pdev)) {
	include/linux/pci.h:1206:bool pci_device_is_present(struct pci_dev *pdev);

The NVME driver has the call in the ->remove() callback, so I don't
think it's relevant here. Both TG3 and IGB ethernet drivers seem to call
this during resume and so does Thunderbolt.

Vidya, can you clarify for which device you're seeing the issues? Sounds
like adding a call to pci_pm_reset() (via pci_reset_function()) at some
point might help.

Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
very widely. Is that just because most drivers haven't had a need for it
yet? Or am I missing some core functionality that calls this for every
device anyway?

Thierry
Sinan Kaya Oct. 15, 2019, 11:10 a.m. UTC | #5
+Rafael

On 10/15/2019 2:30 AM, Thierry Reding wrote:
> Vidya, can you clarify for which device you're seeing the issues? Sounds
> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
> point.
> 
> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
> very widely. Is that just because most drivers haven't had a need for it
> yet? Or am I missing some core functionality that calls this for every
> device anyway?

pci_pm_reset() is there as an alternative reset path. We are not
supposed to call this function directly. Sorry for giving you the wrong
direction here. pci_reset_function() should call it only if no other
suitable reset method is found.

I think the PCI core should be putting the device back into D0 state as one
of the first actions before enumerating. Wake-up could be a combination
of ACPI and/or PCI wake-up depending on where your device sits in the
topology.

On the other hand, the wake-up code doesn't perform the CRS wait. The CRS
wait is deferred until the first Vendor ID read in pci_scan_device().
I see that it already waits for 60 seconds.

Going back to the patch...

I think we need to find the path that actually needs this sleep and
put pci_dev_wait() there.

+++ b/drivers/pci/pci.c
@@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)

 	if (pci_dev_is_disconnected(pdev))
 		return false;
-	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
+	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
+					  PCI_CRS_TIMEOUT);
 }

pci_device_is_present() is too low-level a function and may not be
allowed to sleep; that is why it uses 0 as the timeout value.
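
Concretely, the only difference the patch makes at this call site is the
timeout argument, but that changes the blocking behaviour (illustration;
PCI_CRS_TIMEOUT is the macro the patch adds):

	u32 v;

	/* Before: a device still returning CRS fails the check immediately. */
	bool present = pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);

	/*
	 * After: the call may msleep() repeatedly for up to 60 seconds
	 * while CRS persists - unsafe if any caller of
	 * pci_device_is_present() runs in a context that cannot sleep.
	 */
	present = pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
					     PCI_CRS_TIMEOUT);
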
Vidya Sagar Oct. 15, 2019, 11:34 a.m. UTC | #6
On 10/15/2019 1:51 AM, Sinan Kaya wrote:
> On 10/14/2019 1:20 AM, Thierry Reding wrote:
>> I think this makes sense, so:
>>
>> Reviewed-by: Thierry Reding <treding@nvidia.com>
>>
>> However, it looks like Sinan has researched this extensively in the past
>> and gave a presentation on this at Plumbers in 2017:
>>
>> 	https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
>>
>> Adding Sinan to see if he has any concerns about this, since resume time
>> is explicitly mentioned in the above slides.
> 
> 
> Thanks for including me. Let me catch up here.
> 
> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
> 
> /**
>   * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>   * @dev: Device to reset.
>   * @probe: If set, only check if the device can be reset this way.
>   */
> 
> Do you know if your execution path hits this function? We might have
> missed a use case.
> 
Nope, it doesn't.
Following is a stack dump showing how pci_update_current_state() is called in the resume() path;
pci_device_is_present() is called from inside the pci_update_current_state() API.
My understanding is that pci_device_is_present() is the API we hit first in the resume() path
for any PCIe device.

[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c

Also, I don't see pci_pm_reset() getting called in the resume() path at all.

Thanks,
Vidya Sagar
Vidya Sagar Oct. 15, 2019, 12:03 p.m. UTC | #7
On 10/15/2019 3:00 PM, Thierry Reding wrote:
> On Mon, Oct 14, 2019 at 01:21:31PM -0700, Sinan Kaya wrote:
>> On 10/14/2019 1:20 AM, Thierry Reding wrote:
>>> I think this makes sense, so:
>>>
>>> Reviewed-by: Thierry Reding <treding@nvidia.com>
>>>
>>> However, it looks like Sinan has researched this extensively in the past
>>> and gave a presentation on this at Plumbers in 2017:
>>>
>>> 	https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4732/original/crs.pdf
>>>
>>> Adding Sinan to see if he has any concerns about this, since resume time
>>> is explicitly mentioned in the above slides.
>>
>>
>> Thanks for including me. Let me catch up here.
>>
>> pci_dev_wait() is supposed to handle this case via pci_pm_reset().
>>
>> /**
>>   * pci_pm_reset - Put device into PCI_D3 and back into PCI_D0.
>>   * @dev: Device to reset.
>>   * @probe: If set, only check if the device can be reset this way.
>>   */
>>
>> Do you know if your execution path hits this function? We might have
>> missed a use case.
>>
> 
> I see only a couple of callers of pci_device_is_present() in the tree,
> this being from next-20191015:
> 
> 	$ git grep -n pci_device_is_present
> 	drivers/net/ethernet/broadcom/tg3.c:9070:       if (!pci_device_is_present(tp->pdev))
> 	drivers/net/ethernet/broadcom/tg3.c:11785:      if (pci_device_is_present(tp->pdev)) {
> 	drivers/net/ethernet/intel/igb/igb_main.c:8838: if (!pci_device_is_present(pdev))
> 	drivers/nvme/host/pci.c:2866:   if (!pci_device_is_present(pdev)) {
> 	drivers/pci/hotplug/acpiphp_glue.c:650:         alive = pci_device_is_present(dev);
> 	drivers/pci/pci.c:935:      !pci_device_is_present(dev)) {
> 	drivers/pci/pci.c:5902:bool pci_device_is_present(struct pci_dev *pdev)
> 	drivers/pci/pci.c:5910:EXPORT_SYMBOL_GPL(pci_device_is_present);
> 	drivers/thunderbolt/nhi.c:939:  if (!pci_device_is_present(pdev)) {
> 	include/linux/pci.h:1206:bool pci_device_is_present(struct pci_dev *pdev);
> 
I think the important one is the following, which is called from inside
the pci_update_current_state() function.

drivers/pci/pci.c:935:      !pci_device_is_present(dev)) {

I've put a dump_stack() in to see how this is called, and the following is the trace:
[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c


> The NVME driver has the call in the ->remove() callback, so I don't
> think it's relevant here. Both TG3 and IGB ethernet drivers seem to call
> this during resume and so does Thunderbolt.
> 
> Vidya, can you clarify for which device you're seeing the issues? Sounds
> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
> point.
With a 0 sec wait, I see the issue with an Intel 750 NVMe card. As I mentioned
above, pci_device_is_present() is called from the PCI PM subsystem, which is
where the timeout is required.

> 
> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
> very widely. Is that just because most drivers haven't had a need for it
> yet? Or am I missing some core functionality that calls this for every
> device anyway?
> 
> Thierry
>
Vidya Sagar Oct. 15, 2019, 12:14 p.m. UTC | #8
On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> +Rafael
> 
> On 10/15/2019 2:30 AM, Thierry Reding wrote:
>> Vidya, can you clarify for which device you're seeing the issues? Sounds
>> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
>> point.
>>
>> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
>> very widely. Is that just because most drivers haven't had a need for it
>> yet? Or am I missing some core functionality that calls this for every
>> device anyway?
> 
> pci_pm_reset() is there as an alternative reset path. We are not
> supposed to call this function. Sorry for giving you wrong direction
> here. pci_reset_function() should call it only if there is no other
> suitable reset function is found.
> 
> I think the PCI core should be putting the device back D0 state as one
> of the first actions before enumerating. Wake up could be a combination
> of ACPI and/or PCI wake up depending on where your device sits in the
> topology.
Yup, it is indeed doing that as part of pci_power_up() in the pci.c file.
But what is confusing to me is the order of the calls.
pci_power_up() makes the following calls, in this order:
	pci_raw_set_power_state(dev, PCI_D0);
	pci_update_current_state(dev, PCI_D0);
But pci_raw_set_power_state() accesses config space without calling
pci_device_is_present(), whereas pci_update_current_state(), which is called
later in the flow, does call pci_device_is_present()...!
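
The ordering in question (rough shape of pci_power_up() in drivers/pci/pci.c
at the time of this thread; abridged):

void pci_power_up(struct pci_dev *dev)
{
	if (platform_pci_power_manageable(dev))
		platform_pci_set_power_state(dev, PCI_D0);

	/* Touches PCI_PM_CTRL in config space with no presence check... */
	pci_raw_set_power_state(dev, PCI_D0);

	/* ...while the pci_device_is_present() check only happens here. */
	pci_update_current_state(dev, PCI_D0);
}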

> 
> On the other hand, wake up code doesn't perform the CRS wait. CRS
> wait is deferred until the first vendor id read in pci_scan_device().
> I see that it already waits for 60 seconds.
> 
> Going back to the patch...
> 
> I think we need to find the path that actually needs this sleep and
> put pci_dev_wait() there.
Following is the path in the resume() flow.
[   36.380726] Call trace:
[   36.383270]  dump_backtrace+0x0/0x158
[   36.386802]  show_stack+0x14/0x20
[   36.389749]  dump_stack+0xb0/0xf8
[   36.393451]  pci_update_current_state+0x58/0xe0
[   36.398178]  pci_power_up+0x60/0x70
[   36.401672]  pci_pm_resume_noirq+0x6c/0x130
[   36.405669]  dpm_run_callback.isra.16+0x20/0x70
[   36.410248]  device_resume_noirq+0x120/0x238
[   36.414364]  async_resume_noirq+0x24/0x58
[   36.418364]  async_run_entry_fn+0x40/0x148
[   36.422418]  process_one_work+0x1e8/0x360
[   36.426525]  worker_thread+0x40/0x488
[   36.430201]  kthread+0x118/0x120
[   36.433843]  ret_from_fork+0x10/0x1c

> 
> +++ b/drivers/pci/pci.c
> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
> 
>   	if (pci_dev_is_disconnected(pdev))
>   		return false;
> -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> +					  PCI_CRS_TIMEOUT);
>   }
> 
> pci_device_is_present() is a too low-level function and it may not
> be allowed to sleep. It uses 0 as timeout value.
> 
>
Vidya Sagar Oct. 25, 2019, 11:58 a.m. UTC | #9
On 10/21/2019 11:13 AM, Vidya Sagar wrote:

Hi Sinan / Rafael,
Apologies for the ping again.
Do you guys have any further comments on this?

-Vidya Sagar

> On 10/15/2019 5:44 PM, Vidya Sagar wrote:
> 
> Hi Sinan / Rafael,
> Do you have any further comments on this patch?
> 
> - Vidya Sagar
> 
>> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
>>> +Rafael
>>>
>>> On 10/15/2019 2:30 AM, Thierry Reding wrote:
>>>> Vidya, can you clarify for which device you're seeing the issues? Sounds
>>>> like adding a call to pci_pm_reset() (via pci_reset_function()) at some
>>>> point.
>>>>
>>>> Sinan, it looks as if pci_pm_reset() (or any of its callers) is not used
>>>> very widely. Is that just because most drivers haven't had a need for it
>>>> yet? Or am I missing some core functionality that calls this for every
>>>> device anyway?
>>>
>>> pci_pm_reset() is there as an alternative reset path. We are not
>>> supposed to call this function. Sorry for giving you wrong direction
>>> here. pci_reset_function() should call it only if there is no other
>>> suitable reset function is found.
>>>
>>> I think the PCI core should be putting the device back D0 state as one
>>> of the first actions before enumerating. Wake up could be a combination
>>> of ACPI and/or PCI wake up depending on where your device sits in the
>>> topology.
>> Yup. It is indeed doing it as part of pci_power_up() in pci.c file.
>> But, what is confusing to me is the order of the calls.
>> pci_power_up() has following calls in the same order.
>>      pci_raw_set_power_state(dev, PCI_D0);
>>      pci_update_current_state(dev, PCI_D0);
>> But, pci_raw_set_power_state() is accessing config space without calling
>> pci_device_is_present() whereas pci_update_current_state() which is called
>> later in the flow is calling pci_device_is_present()...!
>>
>>>
>>> On the other hand, wake up code doesn't perform the CRS wait. CRS
>>> wait is deferred until the first vendor id read in pci_scan_device().
>>> I see that it already waits for 60 seconds.
>>>
>>> Going back to the patch...
>>>
>>> I think we need to find the path that actually needs this sleep and
>>> put pci_dev_wait() there.
>> Following is the path in resume() flow.
>> [   36.380726] Call trace:
>> [   36.383270]  dump_backtrace+0x0/0x158
>> [   36.386802]  show_stack+0x14/0x20
>> [   36.389749]  dump_stack+0xb0/0xf8
>> [   36.393451]  pci_update_current_state+0x58/0xe0
>> [   36.398178]  pci_power_up+0x60/0x70
>> [   36.401672]  pci_pm_resume_noirq+0x6c/0x130
>> [   36.405669]  dpm_run_callback.isra.16+0x20/0x70
>> [   36.410248]  device_resume_noirq+0x120/0x238
>> [   36.414364]  async_resume_noirq+0x24/0x58
>> [   36.418364]  async_run_entry_fn+0x40/0x148
>> [   36.422418]  process_one_work+0x1e8/0x360
>> [   36.426525]  worker_thread+0x40/0x488
>> [   36.430201]  kthread+0x118/0x120
>> [   36.433843]  ret_from_fork+0x10/0x1c
>>
>>>
>>> +++ b/drivers/pci/pci.c
>>> @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
>>>
>>>       if (pci_dev_is_disconnected(pdev))
>>>           return false;
>>> -    return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
>>> +    return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
>>> +                      PCI_CRS_TIMEOUT);
>>>   }
>>>
>>> pci_device_is_present() is a too low-level function and it may not
>>> be allowed to sleep. It uses 0 as timeout value.
>>>
>>>
>>
>
Sinan Kaya Oct. 26, 2019, 1:59 p.m. UTC | #10
On 10/25/2019 7:58 AM, Vidya Sagar wrote:
> On 10/21/2019 11:13 AM, Vidya Sagar wrote:
> 
> Hi Sinan / Rafael,
> Apologies for the ping again.
> Do you guys have any further comments on this?
> 
> -Vidya Sagar

I think you'll need some attention from Bjorn here to see the complete
picture.

As I said, changing pci_device_is_present() is not right. This needs to
be done at one level higher.
Vidya Sagar Nov. 4, 2019, 11:43 a.m. UTC | #11
On 10/26/2019 7:29 PM, Sinan Kaya wrote:
> On 10/25/2019 7:58 AM, Vidya Sagar wrote:
>> On 10/21/2019 11:13 AM, Vidya Sagar wrote:
>>
>> Hi Sinan / Rafael,
>> Apologies for the ping again.
>> Do you guys have any further comments on this?
>>
>> -Vidya Sagar
> 
> I think you'll need some attention from Bjorn here to see the complete
> picture.
> 
> As I said, changing pci_device_is_present() is not right. This needs to
> be done at one level higher.

Hi Bjorn,
Could you please help me understand why this change can't be done in the
pci_device_is_present() API?

- Vidya Sagar

>
Lorenzo Pieralisi Nov. 4, 2019, 4:52 p.m. UTC | #12
On Mon, Nov 04, 2019 at 05:13:50PM +0530, Vidya Sagar wrote:
> On 10/26/2019 7:29 PM, Sinan Kaya wrote:
> > On 10/25/2019 7:58 AM, Vidya Sagar wrote:
> > > On 10/21/2019 11:13 AM, Vidya Sagar wrote:
> > > 
> > > Hi Sinan / Rafael,
> > > Apologies for the ping again.
> > > Do you guys have any further comments on this?
> > > 
> > > -Vidya Sagar
> > 
> > I think you'll need some attention from Bjorn here to see the complete
> > picture.
> > 
> > As I said, changing pci_device_is_present() is not right. This needs to
> > be done at one level higher.
> 
> Hi Bjorn,
> Could you please help me understand why this change can't be done in
> pci_device_is_present() API?

I assume that's because pci_device_is_present() is currently called
in contexts that aren't allowed to sleep, and therefore you would trigger
a regression.

Lorenzo
Bjorn Helgaas Nov. 4, 2019, 5:39 p.m. UTC | #13
[+cc Andrew, Lukas]

On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > ...
> > I think the PCI core should be putting the device back D0 state as one
> > of the first actions before enumerating. Wake up could be a combination
> > of ACPI and/or PCI wake up depending on where your device sits in the
> > topology.
>
> Yup. It is indeed doing it as part of pci_power_up() in pci.c file.
> But, what is confusing to me is the order of the calls.
> pci_power_up() has following calls in the same order.
> 	pci_raw_set_power_state(dev, PCI_D0);
> 	pci_update_current_state(dev, PCI_D0);
> But, pci_raw_set_power_state() is accessing config space without calling
> pci_device_is_present() whereas pci_update_current_state() which is called
> later in the flow is calling pci_device_is_present()...!

A device should always respond to config reads unless it is in D3cold
or it is initializing after a reset.  IIUC you're doing a resume, not
a reset, so I think your device must be coming out of D3cold.  That's
typically done via ACPI, and I think we are missing some kind of delay
before our first config access:

  pci_power_up
    platform_pci_set_power_state(PCI_D0)    # eg, ACPI
    pci_raw_set_power_state
      pci_read_config_word(PCI_PM_CTRL)     # <-- first config access
      pci_write_config_word(PCI_PM_CTRL)
      pci_read_config_word(PCI_PM_CTRL)
    pci_update_current_state
      if (... || !pci_device_is_present())

Mika is working on some delays for the transition out of D3cold [1].
He's more concerned with a secondary bus behind a bridge, so I don't
think his patch addresses this case, but he's certainly familiar with
this area.

Huh, I'm really confused about this, too.  I don't
understand how resume ever works without any delay between
platform_pci_power_manageable() and the config reads in
pci_raw_set_power_state().  I must be missing something.

The pci_device_is_present() call in pci_update_current_state() was
added by a6a64026c0cd ("PCI: Recognize D3cold in
pci_update_current_state()") [2].  The purpose there is not to wait
for a device to become ready; it is to learn whether the device is in
D3cold.  I don't think we'd want a delay in this path because it would
slow down all transitions into D3cold.

[1] https://lore.kernel.org/r/20191004123947.11087-1-mika.westerberg@linux.intel.com
[2] https://git.kernel.org/linus/a6a64026c0cd
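
For reference, the check described above looks roughly like this (a sketch of
pci_update_current_state() around v5.4; abridged, not the verbatim kernel
code):

void pci_update_current_state(struct pci_dev *dev, pci_power_t state)
{
	if (platform_pci_get_power_state(dev) == PCI_D3cold ||
	    !pci_device_is_present(dev)) {
		/* No config access possible: treat the device as D3cold. */
		dev->current_state = PCI_D3cold;
	} else if (dev->pm_cap) {
		u16 pmcsr;

		/* Otherwise read the actual power state from PMCSR. */
		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
		dev->current_state = pmcsr & PCI_PM_CTRL_STATE_MASK;
	} else {
		dev->current_state = state;
	}
}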

> > On the other hand, wake up code doesn't perform the CRS wait. CRS
> > wait is deferred until the first vendor id read in pci_scan_device().
> > I see that it already waits for 60 seconds.
> > 
> > Going back to the patch...
> > 
> > I think we need to find the path that actually needs this sleep and
> > put pci_dev_wait() there.
>
> Following is the path in resume() flow.
> [   36.380726] Call trace:
> [   36.383270]  dump_backtrace+0x0/0x158
> [   36.386802]  show_stack+0x14/0x20
> [   36.389749]  dump_stack+0xb0/0xf8
> [   36.393451]  pci_update_current_state+0x58/0xe0
> [   36.398178]  pci_power_up+0x60/0x70
> [   36.401672]  pci_pm_resume_noirq+0x6c/0x130
> [   36.405669]  dpm_run_callback.isra.16+0x20/0x70
> [   36.410248]  device_resume_noirq+0x120/0x238
> [   36.414364]  async_resume_noirq+0x24/0x58
> [   36.418364]  async_run_entry_fn+0x40/0x148
> [   36.422418]  process_one_work+0x1e8/0x360
> [   36.426525]  worker_thread+0x40/0x488
> [   36.430201]  kthread+0x118/0x120
> [   36.433843]  ret_from_fork+0x10/0x1c
> 
> > 
> > +++ b/drivers/pci/pci.c
> > @@ -5905,7 +5905,8 @@ bool pci_device_is_present(struct pci_dev *pdev)
> > 
> >   	if (pci_dev_is_disconnected(pdev))
> >   		return false;
> > -	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
> > +	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
> > +					  PCI_CRS_TIMEOUT);
> >   }
> > 
> > pci_device_is_present() is a too low-level function and it may not
> > be allowed to sleep. It uses 0 as timeout value.
> > 
> > 
>
Rafael J. Wysocki Nov. 5, 2019, 10:55 a.m. UTC | #14
On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
> [+cc Andrew, Lukas]
> 
> On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> > On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > > ...
> > > I think the PCI core should be putting the device back D0 state as one
> > > of the first actions before enumerating. Wake up could be a combination
> > > of ACPI and/or PCI wake up depending on where your device sits in the
> > > topology.
> >
> > Yup. It is indeed doing it as part of pci_power_up() in pci.c file.
> > But, what is confusing to me is the order of the calls.
> > pci_power_up() has following calls in the same order.
> > 	pci_raw_set_power_state(dev, PCI_D0);
> > 	pci_update_current_state(dev, PCI_D0);
> > But, pci_raw_set_power_state() is accessing config space without calling
> > pci_device_is_present() whereas pci_update_current_state() which is called
> > later in the flow is calling pci_device_is_present()...!
> 
> A device should always respond to config reads unless it is in D3cold
> or it is initializing after a reset.  IIUC you're doing a resume, not
> a reset, so I think your device must be coming out of D3cold.  That's
> typically done via ACPI, and I think we are missing some kind of delay
> before our first config access:
> 
>   pci_power_up
>     platform_pci_set_power_state(PCI_D0)    # eg, ACPI
>     pci_raw_set_power_state
>       pci_read_config_word(PCI_PM_CTRL)     # <-- first config access
>       pci_write_config_word(PCI_PM_CTRL)
>       pci_read_config_word(PCI_PM_CTRL)
>     pci_update_current_state
>       if (... || !pci_device_is_present())
> 
> Mika is working on some delays for the transition out of D3cold [1].
> He's more concerned with a secondary bus behind a bridge, so I don't
> think his patch addresses this case, but he's certainly familiar with
> this area.
> 
> Huh, I'm really confused about this, too.  I don't
> understand how resume ever works without any delay between
> platform_pci_power_manageable() and the config reads in
> pci_raw_set_power_state().  I must be missing something.

There is a delay in the runtime_d3cold case, see __pci_start_power_transition().

But overall platform_pci_power_manageable() only checks whether or not the
platform firmware can change the power state of the device.  If it can, it
is expected to take care of any necessary delays while doing that (because
there may be delays required by this particular instance of the platform
firmware, beyond what is mandated by the PCI spec, or there may not be any
need to wait at all).  If the platform firmware becomes responsible for
setting the device's power state, there is no reason why it should not be
responsible for the delay part too.

In any case, I'm not sure how useful it is to add delays for everyone in the
cases in which a specific system needs a delay because of its own PM
implementation limitations.  It may be better to quirk such systems explicitly
as long as there are not too many quirks in there, or we'll end up adding more
and more *implicit* quirks in the form of general delays.

Cheers!
Bjorn Helgaas Nov. 6, 2019, 4:41 p.m. UTC | #15
On Tue, Nov 05, 2019 at 11:55:45AM +0100, Rafael J. Wysocki wrote:
> On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
> > [+cc Andrew, Lukas]
> > 
> > On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
> > > On 10/15/2019 4:40 PM, Sinan Kaya wrote:
> > > > ...
> > > > I think the PCI core should be putting the device back D0 state as one
> > > > of the first actions before enumerating. Wake up could be a combination
> > > > of ACPI and/or PCI wake up depending on where your device sits in the
> > > > topology.
> > >
> > > Yup. It is indeed doing it as part of pci_power_up() in pci.c file.
> > > But, what is confusing to me is the order of the calls.
> > > pci_power_up() has following calls in the same order.
> > > 	pci_raw_set_power_state(dev, PCI_D0);
> > > 	pci_update_current_state(dev, PCI_D0);
> > > But, pci_raw_set_power_state() is accessing config space without calling
> > > pci_device_is_present() whereas pci_update_current_state() which is called
> > > later in the flow is calling pci_device_is_present()...!
> > 
> > A device should always respond to config reads unless it is in D3cold
> > or it is initializing after a reset.  IIUC you're doing a resume, not
> > a reset, so I think your device must be coming out of D3cold.  That's
> > typically done via ACPI, and I think we are missing some kind of delay
> > before our first config access:
> > 
> >   pci_power_up
> >     platform_pci_set_power_state(PCI_D0)    # eg, ACPI
> >     pci_raw_set_power_state
> >       pci_read_config_word(PCI_PM_CTRL)     # <-- first config access
> >       pci_write_config_word(PCI_PM_CTRL)
> >       pci_read_config_word(PCI_PM_CTRL)
> >     pci_update_current_state
> >       if (... || !pci_device_is_present())
> > 
> > Mika is working on some delays for the transition out of D3cold [1].
> > He's more concerned with a secondary bus behind a bridge, so I don't
> > think his patch addresses this case, but he's certainly familiar with
> > this area.
> > 
> > Huh, I'm really confused about this, too.  I don't
> > understand how resume ever works without any delay between
> > platform_pci_power_manageable() and the config reads in
> > pci_raw_set_power_state().  I must be missing something.
> 
> There is a delay in the runtime_d3cold case, see
> __pci_start_power_transition().

I see the delay in __pci_start_power_transition(), but I don't see how
it's relevant.  It's only called by pci_set_power_state(), and while
many drivers call pci_set_power_state() from legacy .resume() methods,
the pci_pm_resume_noirq() path where Vidya is seeing problems doesn't
use it.

> But overall platform_pci_power_manageable() only checks whether or
> not the platform firmware can change the power state of the device.
> If it can, it is expected to take care of any necessary delays while
> doing that (because there may be delays required by this particular
> instance of the platform firmware, beyond what is mandated by the
> PCI spec, or there may not be any need to wait at all). ...

That sounds like a reasonable argument for why firmware should be
responsible for this delay, but I don't think that's very clear in the
ACPI spec, so I wouldn't be surprised if it got missed.

Based on Vidya's backtrace, I think the resume path with problems is
this:

  pci_pm_resume_noirq
    pci_pm_default_resume_early
      pci_power_up
        if (platform_pci_power_manageable(dev))
          platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
        pci_raw_set_power_state
        pci_update_current_state
          pci_device_is_present        # <-- config read returns CRS

So I think your suggestion is that Vidya's firmware should be doing
the delay inside platform_pci_set_power_state()?

Vidya, you typically work on Tegra, so I assume this is on an arm64
system?  Does it have ACPI?  Do you have access to the firmware
developers to ask about who they expect to do the delays?

> In any case, I'm not sure how useful it is to add delays for
> everyone in the cases in which a specific system needs a delay
> because of its own PM implementation limitations.  It may be better
> to quirk such systems explicitly as long as there are not too many
> quirks in there, or we'll end up adding more and more *implicit*
> quirks in the form of general delays.

I agree, a general delay doesn't sound good.  Are you thinking
something like this?

  void pci_power_up(struct pci_dev *dev)
  {
    if (platform_pci_power_manageable(dev)) {
      platform_pci_set_power_state(dev, PCI_D0);
      if (dev->XXX)
        msleep(dev->XXX);
    }
    ...

We already have dev->d3_delay and d3cold_delay, so it's getting a bit
messy to keep them all straight.

Bjorn
Vidya Sagar Nov. 11, 2019, 6:01 a.m. UTC | #16
On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> On Tue, Nov 05, 2019 at 11:55:45AM +0100, Rafael J. Wysocki wrote:
>> On Monday, November 4, 2019 6:39:04 PM CET Bjorn Helgaas wrote:
>>> [+cc Andrew, Lukas]
>>>
>>> On Tue, Oct 15, 2019 at 05:44:47PM +0530, Vidya Sagar wrote:
>>>> On 10/15/2019 4:40 PM, Sinan Kaya wrote:
>>>>> ...
>>>>> I think the PCI core should be putting the device back D0 state as one
>>>>> of the first actions before enumerating. Wake up could be a combination
>>>>> of ACPI and/or PCI wake up depending on where your device sits in the
>>>>> topology.
>>>>
>>>> Yup. It is indeed doing it as part of pci_power_up() in pci.c file.
>>>> But, what is confusing to me is the order of the calls.
>>>> pci_power_up() has following calls in the same order.
>>>> 	pci_raw_set_power_state(dev, PCI_D0);
>>>> 	pci_update_current_state(dev, PCI_D0);
>>>> But, pci_raw_set_power_state() is accessing config space without calling
>>>> pci_device_is_present() whereas pci_update_current_state() which is called
>>>> later in the flow is calling pci_device_is_present()...!
>>>
>>> A device should always respond to config reads unless it is in D3cold
>>> or it is initializing after a reset.  IIUC you're doing a resume, not
>>> a reset, so I think your device must be coming out of D3cold.  That's
>>> typically done via ACPI, and I think we are missing some kind of delay
>>> before our first config access:
>>>
>>>    pci_power_up
>>>      platform_pci_set_power_state(PCI_D0)    # eg, ACPI
>>>      pci_raw_set_power_state
>>>        pci_read_config_word(PCI_PM_CTRL)     # <-- first config access
>>>        pci_write_config_word(PCI_PM_CTRL)
>>>        pci_read_config_word(PCI_PM_CTRL)
>>>      pci_update_current_state
>>>        if (... || !pci_device_is_present())
>>>
>>> Mika is working on some delays for the transition out of D3cold [1].
>>> He's more concerned with a secondary bus behind a bridge, so I don't
>>> think his patch addresses this case, but he's certainly familiar with
>>> this area.
>>>
>>> Huh, I'm really confused about this, too.  I don't
>>> understand how resume ever works without any delay between
>>> platform_pci_power_manageable() and the config reads in
>>> pci_raw_set_power_state().  I must be missing something.
>>
>> There is a delay in the runtime_d3cold case, see
>> __pci_start_power_transition().
> 
> I see the delay in __pci_start_power_transition(), but I don't see how
> it's relevant.  It's only called by pci_set_power_state(), and while
> many drivers call pci_set_power_state() from legacy .resume() methods,
> the pci_pm_resume_noirq() path where Vidya is seeing problems doesn't
> use it.
> 
>> But overall platform_pci_power_manageable() only checks whether or
>> not the platform firmware can change the power state of the device.
>> If it can, it is expected to take care of any necessary delays while
>> doing that (because there may be delays required by this particular
>> instance of the platform firmware, beyond what is mandated by the
>> PCI spec, or there may not be any need to wait at all). ...
> 
> That sounds like a reasonable argument for why firmware should be
> responsible for this delay, but I don't think that's very clear in the
> ACPI spec, so I wouldn't be surprised if it got missed.
> 
> Based on Vidya's backtrace, I think the resume path with problems is
> this:
> 
>    pci_pm_resume_noirq
>      pci_pm_default_resume_early
>        pci_power_up
>          if (platform_pci_power_manageable(dev))
>            platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
>          pci_raw_set_power_state
>          pci_update_current_state
>            pci_device_is_present        # <-- config read returns CRS
> 
> So I think your suggestion is that Vidya's firmware should be doing
> the delay inside platform_pci_set_power_state()?
> 
> Vidya, you typically work on Tegra, so I assume this is on an arm64
> system?  Does it have ACPI?  Do you have access to the firmware
> developers to ask about who they expect to do the delays?
Yes. This is on arm64 (Tegra) and we don't have any ACPI or any other firmware
for that matter. PCIe is brought up directly in the kernel.

> 
>> In any case, I'm not sure how useful it is to add delays for
>> everyone in the cases in which a specific system needs a delay
>> because of its own PM implementation limitations.  It may be better
>> to quirk such systems explicitly as long as there are not too many
>> quirks in there, or we'll end up adding more and more *implicit*
>> quirks in the form of general delays.
> 
> I agree, a general delay doesn't sound good.  Are you thinking
> something like this?
> 
>    void pci_power_up(struct pci_dev *dev)
>    {
>      if (platform_pci_power_manageable(dev)) {
>        platform_pci_set_power_state(dev, PCI_D0);
>        if (dev->XXX)
>          msleep(dev->XXX);
>      }
>      ...
> 
> We already have dev->d3_delay and d3cold_delay, so it's getting a bit
> messy to keep them all straight.
> 
> Bjorn
>
Bjorn Helgaas Nov. 11, 2019, 10:32 p.m. UTC | #17
On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:

> > Based on Vidya's backtrace, I think the resume path with problems
> > is this:
> > 
> >    pci_pm_resume_noirq
> >      pci_pm_default_resume_early
> >        pci_power_up
> >          if (platform_pci_power_manageable(dev))
> >            platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
> >          pci_raw_set_power_state
> >          pci_update_current_state
> >            pci_device_is_present        # <-- config read returns CRS
> > 
> > So I think your suggestion is that Vidya's firmware should be
> > doing the delay inside platform_pci_set_power_state()?
> > 
> > Vidya, you typically work on Tegra, so I assume this is on an
> > arm64 system?  Does it have ACPI?  Do you have access to the
> > firmware developers to ask about who they expect to do the delays?
>
> Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> other firmware for that matter. PCIe is brought up directly in the
> kernel.

I assume that your device is coming out of D3cold because apparently
you're seeing a CRS status from the config read when
pci_update_current_state() calls pci_device_is_present().  CRS status
should only happen after reset or power-on from D3cold, and you're not
doing a reset.

I'm pretty sure platform_pci_power_manageable() returns false on
your system (can you confirm that?) because the only scenarios with
platform power management are MID (Intel platform) and ACPI (which you
don't have).

Maybe you have some other platform-specific mechanism that controls
power to PCI devices, and it's not integrated into the
platform_pci_*() framework?

Bjorn
Thierry Reding Nov. 12, 2019, 12:59 p.m. UTC | #18
On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> 
> > > Based on Vidya's backtrace, I think the resume path with problems
> > > is this:
> > > 
> > >    pci_pm_resume_noirq
> > >      pci_pm_default_resume_early
> > >        pci_power_up
> > >          if (platform_pci_power_manageable(dev))
> > >            platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
> > >          pci_raw_set_power_state
> > >          pci_update_current_state
> > >            pci_device_is_present        # <-- config read returns CRS
> > > 
> > > So I think your suggestion is that Vidya's firmware should be
> > > doing the delay inside platform_pci_set_power_state()?
> > > 
> > > Vidya, you typically work on Tegra, so I assume this is on an
> > > arm64 system?  Does it have ACPI?  Do you have access to the
> > > firmware developers to ask about who they expect to do the delays?
> >
> > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > other firmware for that matter. PCIe is brought up directly in the
> > kernel.
> 
> I assume that your device is coming out of D3cold because apparently
> you're seeing a CRS status from the config read when
> pci_update_current_state() calls pci_device_is_present().  CRS status
> should only happen after reset or power-on from D3cold, and you're not
> doing a reset.
> 
> I'm pretty sure platform_pci_power_manageable() returns false on
> your system (can you confirm that?) because the only scenarios with
> platform power management are MID (Intel platform) and ACPI (which you
> don't have).
> 
> Maybe you have some other platform-specific mechanism that controls
> power to PCI devices, and it's not integrated into the
> platform_pci_*() framework?

My understanding after reading the PCIe specification is that CRS is a
mechanism that allows an endpoint to signal that it isn't ready yet for
operation after reset or power-on from D3cold. There's nothing in there
that's platform specific; this really depends only on the specific endpoint.

I don't see how adding platform-specific PM code would help in this
case. At a platform level we don't know if users are going to plug in a
PCI endpoint that needs a long delay before it's ready after reset and/or
exit from D3cold.

I do understand that pci_device_is_present() is perhaps not the
best place to do complex CRS handling, but if a mechanism is clearly
described in the specification, isn't it something that should be dealt
with in the core? That way we don't have to quirk this for every device
and platform.

The PCIe specification says that:

	Software that intends to take advantage of this mechanism must
	ensure that the first access made to a device following a valid
	reset condition is a Configuration Read Request accessing both
	bytes of the Vendor ID field in the device's Configuration Space
	header.

So doesn't that mean that pci_device_is_present() is already much too
late because we've potentially made other configuration read requests in
the meantime?

Wouldn't it make more sense to push the CRS handling up a bit? The
existing pci_power_up() function seems like it would be a good place.
For example, adding code to deal with CRS right after the platform PCI
PM calls but before pci_raw_set_power_state() seems like it would fit
the restrictions given in the above quote from the specification.
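
One hypothetical shape of that suggestion (not merged code; PCI_CRS_TIMEOUT
is the macro from the patch under discussion):

void pci_power_up(struct pci_dev *dev)
{
	u32 v;

	if (platform_pci_power_manageable(dev))
		platform_pci_set_power_state(dev, PCI_D0);

	/*
	 * Make the first config access after a possible exit from
	 * D3cold a Vendor ID read, as the spec text quoted above
	 * requires, and absorb any CRS period here.
	 */
	pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &v,
				   PCI_CRS_TIMEOUT);

	pci_raw_set_power_state(dev, PCI_D0);
	pci_update_current_state(dev, PCI_D0);
}

Whether the delay itself belongs here or inside platform_pci_set_power_state()
is the open question in the rest of the thread.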

Thierry
Bjorn Helgaas Nov. 12, 2019, 2:21 p.m. UTC | #19
On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote:
> On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> > On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> > 
> > > > Based on Vidya's backtrace, I think the resume path with problems
> > > > is this:
> > > > 
> > > >    pci_pm_resume_noirq
> > > >      pci_pm_default_resume_early
> > > >        pci_power_up
> > > >          if (platform_pci_power_manageable(dev))
> > > >            platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
> > > >          pci_raw_set_power_state
> > > >          pci_update_current_state
> > > >            pci_device_is_present        # <-- config read returns CRS
> > > > 
> > > > So I think your suggestion is that Vidya's firmware should be
> > > > doing the delay inside platform_pci_set_power_state()?
> > > > 
> > > > Vidya, you typically work on Tegra, so I assume this is on an
> > > > arm64 system?  Does it have ACPI?  Do you have access to the
> > > > firmware developers to ask about who they expect to do the delays?
> > >
> > > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > > other firmware for that matter. PCIe is brought up directly in the
> > > kernel.
> > 
> > I assume that your device is coming out of D3cold because apparently
> > you're seeing a CRS status from the config read when
> > pci_update_current_state() calls pci_device_is_present().  CRS status
> > should only happen after reset or power-on from D3cold, and you're not
> > doing a reset.
> > 
> > I'm pretty sure platform_pci_power_manageable() returns false on
> > your system (can you confirm that?) because the only scenarios with
> > platform power management are MID (Intel platform) and ACPI (which you
> > don't have).
> > 
> > Maybe you have some other platform-specific mechanism that controls
> > power to PCI devices, and it's not integrated into the
> > platform_pci_*() framework?
> 
> My understanding after reading the PCIe specification is that CRS is a
> mechanism that allows an endpoint to signal that it isn't ready yet for
> operation after reset or power-on from D3cold. There's nothing in there
> that's platform specific. This is really only for specific endpoints.
> 
> I don't see how adding platform specific PM code would help in this
> case. At a platform level we don't know if users are going to plug in a
> PCI endpoint that needs a long delay before it's ready after reset and/
> or exit from D3cold.

Right, see below.

> I do understand that perhaps pci_device_is_present() is perhaps not the
> best place to do complex CRS handling, but if a mechanism is clearly
> described in the specification, isn't it something that should be dealt
> with in the core? That way we don't have to quirk this for every device
> and platform.

Definitely; we don't want quirks for endpoints (unless they're
actually broken) or for platforms (unless there's a platform hardware
or firmware defect).

There's no question that we need to delay and handle CRS after
power-on from D3cold.  I'm trying to get at the point that PCI itself
doesn't tell us how to do that power-on.  The mechanisms defined by
PCI rely on config space, which is only accessible in D0-D3hot, not
D3cold.  Power-on from D3cold can only happen via ACPI methods or
other platform-specific mechanisms, and the current design abstracts
those via platform_pci_set_power_state().  This is partly based on
Rafael's response in [1].

> The PCIe specification says that:
> 
> 	Software that intends to take advantage of this mechanism must
> 	ensure that the first access made to a device following a valid
> 	reset condition is a Configuration Read Request accessing both
> 	bytes of the Vendor ID field in the device's Configuration Space
> 	header.
> 
> So doesn't that mean that pci_device_is_present() is already much too
> late because we've potentially made other configuration read requests in
> the meantime?
> 
> Wouldn't it make more sense to push the CRS handling up a bit? The
> existing pci_power_up() function seems like it would be a good place.
> For example, adding code to deal with CRS right after the platform PCI
> PM calls but before pci_raw_set_power_state() seems like it would fit
> the restrictions given in the above quote from the specification.

Yep, I think that's the right point.  I'm trying to figure out how to
integrate it.  Rafael suggests that delays may be platform-specific
and should be in platform_pci_set_power_state(), but the CRS handling
itself isn't platform-specific and maybe could be higher up.

I'm fishing to see if Tegra has some kind of power control for
endpoints that is not related to the platform_pci_*() framework.  How
did the endpoint get put in D3cold in the first place?  I assume
something in the suspend path did that?  Maybe this happens when we
suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()?

  tegra_pcie_pm_suspend
    tegra_pcie_phy_power_off
    tegra_pcie_power_off

  tegra_pcie_pm_resume
    tegra_pcie_power_on
    tegra_pcie_phy_power_on

If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0
transition for the endpoint, I don't think we want to do CRS handling
there because that path shouldn't be touching the endpoint.  But maybe
it should be doing the delays required by PCIe r5.0, sec 6.6.1, before
any config accesses are issued to devices.

[1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
Vidya Sagar Nov. 12, 2019, 5:59 p.m. UTC | #20
On 11/12/2019 4:02 AM, Bjorn Helgaas wrote:
> On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
>> On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> 
>>> Based on Vidya's backtrace, I think the resume path with problems
>>> is this:
>>>
>>>     pci_pm_resume_noirq
>>>       pci_pm_default_resume_early
>>>         pci_power_up
>>>           if (platform_pci_power_manageable(dev))
>>>             platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
>>>           pci_raw_set_power_state
>>>           pci_update_current_state
>>>             pci_device_is_present        # <-- config read returns CRS
>>>
>>> So I think your suggestion is that Vidya's firmware should be
>>> doing the delay inside platform_pci_set_power_state()?
>>>
>>> Vidya, you typically work on Tegra, so I assume this is on an
>>> arm64 system?  Does it have ACPI?  Do you have access to the
>>> firmware developers to ask about who they expect to do the delays?
>>
>> Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
>> other firmware for that matter. PCIe is brought up directly in the
>> kernel.
> 
> I assume that your device is coming out of D3cold because apparently
> you're seeing a CRS status from the config read when
> pci_update_current_state() calls pci_device_is_present().  CRS status
> should only happen after reset or power-on from D3cold, and you're not
> doing a reset.
> 
> I'm pretty sure platform_pci_power_manageable() returns false on
> your system (can you confirm that?) because the only scenarios with
> platform power management are MID (Intel platform) and ACPI (which you
> don't have).
Yes, I can confirm that platform_pci_power_manageable() returns false in the
case of Tegra.

> 
> Maybe you have some other platform-specific mechanism that controls
> power to PCI devices, and it's not integrated into the
> platform_pci_*() framework?
> 
> Bjorn
>
Vidya Sagar Nov. 12, 2019, 5:59 p.m. UTC | #21
On 11/12/2019 7:51 PM, Bjorn Helgaas wrote:
> On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote:
>> On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
>>> [...]
>>>
>>> Maybe you have some other platform-specific mechanism that controls
>>> power to PCI devices, and it's not integrated into the
>>> platform_pci_*() framework?
>>
>> My understanding after reading the PCIe specification is that CRS is a
>> mechanism that allows an endpoint to signal that it isn't ready yet for
>> operation after reset or power-on from D3cold. There's nothing in there
>> that's platform specific. This is really only for specific endpoints.
>>
>> I don't see how adding platform specific PM code would help in this
>> case. At a platform level we don't know if users are going to plug in a
>> PCI endpoint that needs a long delay before it's ready after reset and/
>> or exit from D3cold.
> 
> Right, see below.
> 
>> I do understand that pci_device_is_present() is perhaps not the
>> best place to do complex CRS handling, but if a mechanism is clearly
>> described in the specification, isn't it something that should be dealt
>> with in the core? That way we don't have to quirk this for every device
>> and platform.
> 
> Definitely; we don't want quirks for endpoints (unless they're
> actually broken) or for platforms (unless there's a platform hardware
> or firmware defect).
> 
> There's no question that we need to delay and handle CRS after
> power-on from D3cold.  I'm trying to get at the point that PCI itself
> doesn't tell us how to do that power-on.  The mechanisms defined by
> PCI rely on config space, which is only accessible in D0-D3hot, not
> D3cold.  Power-on from D3cold can only happen via ACPI methods or
> other platform-specific mechanisms, and the current design abstracts
> those via platform_pci_set_power_state().  This is partly based on
> Rafael's response in [1].
> 
>> The PCIe specification says that:
>>
>> 	Software that intends to take advantage of this mechanism must
>> 	ensure that the first access made to a device following a valid
>> 	reset condition is a Configuration Read Request accessing both
>> 	bytes of the Vendor ID field in the device's Configuration Space
>> 	header.
>>
>> So doesn't that mean that pci_device_is_present() is already much too
>> late because we've potentially made other configuration read requests in
>> the meantime?
>>
>> Wouldn't it make more sense to push the CRS handling up a bit? The
>> existing pci_power_up() function seems like it would be a good place.
>> For example, adding code to deal with CRS right after the platform PCI
>> PM calls but before pci_raw_set_power_state() seems like it would fit
>> the restrictions given in the above quote from the specification.
> 
> Yep, I think that's the right point.  I'm trying to figure out how to
> integrate it.  Rafael suggests that delays may be platform-specific
> and should be in platform_pci_set_power_state(), but the CRS handling
> itself isn't platform-specific and maybe could be higher up.
> 
> I'm fishing to see if Tegra has some kind of power control for
> endpoints that is not related to the platform_pci_*() framework.  How
> did the endpoint get put in D3cold in the first place?  I assume
> something in the suspend path did that?  Maybe this happens when we
> suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()?
> 
>    tegra_pcie_pm_suspend
>      tegra_pcie_phy_power_off
>      tegra_pcie_power_off
> 
>    tegra_pcie_pm_resume
>      tegra_pcie_power_on
>      tegra_pcie_phy_power_on
> 
> If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0
> transition for the endpoint, I don't think we want to do CRS handling
> there because that path shouldn't be touching the endpoint.  But maybe
> it should be doing the delays required by PCIe r5.0, sec 6.6.1, before
> any config accesses are issued to devices.
Here, I'm exercising the suspend-to-RAM sequence, and the PCIe endpoint of
concern is an Intel 750 NVMe drive.
The PCIe host controller driver already has a 100ms delay as per the spec,
but this particular device takes 1023ms to become ready to respond to
configuration space requests (until then, it responds with
Configuration Request Retry Status).
I've put a dump_stack() in place to see the path the resume sequence takes, and here it is:

  Call trace:
   dump_backtrace+0x0/0x158
   show_stack+0x14/0x20
   dump_stack+0xb0/0xf4
   pci_bus_generic_read_dev_vendor_id+0x19c/0x1a0
   pci_bus_read_dev_vendor_id+0x48/0x70
   pci_update_current_state+0x68/0xd8
   pci_power_up+0x40/0x50
   pci_pm_resume_noirq+0x7c/0x138
   dpm_run_callback.isra.16+0x20/0x70
   device_resume_noirq+0x120/0x238
   async_resume_noirq+0x24/0x58
   async_run_entry_fn+0x40/0x148
   process_one_work+0x1e8/0x360
   worker_thread+0x40/0x488
   kthread+0x118/0x120
   ret_from_fork+0x10/0x1c
  pci 0005:01:00.0: ready after 1023ms

The spec also mentions the following:
     Unless Readiness Notifications mechanisms are used (see Section 6.23), the Root Complex
     and/or system software must allow at least 1.0 s after a Conventional Reset of a device, before it
     may determine that a device which fails to return a Successful Completion status for a valid
     Configuration Request is a broken device. This period is independent of how quickly Link
     training completes.

My understanding is that this 1s wait is supposed to be done by the PCI
subsystem and not the host controller driver.
FWIW, this particular device takes a little more than 1s (i.e. 1023ms).
I'm now wondering why the CRS handling has a timeout of 60s rather than just 1s.
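
For reference, the way software sees CRS on that first Vendor ID read is the
reserved value 0x0001 in the low 16 bits, which the root complex synthesizes
while CRS is pending (when CRS Software Visibility is enabled). A minimal
sketch of the check (the helper name is made up; not the exact kernel code):

/*
 * Sketch: 0x0001 is a Vendor ID reserved for the PCI SIG, so no real
 * device ever returns it; seeing it on the dword read of Vendor ID +
 * Device ID means CRS is still pending.
 */
static bool vendor_id_read_is_crs(u32 l)
{
	return (l & 0xffff) == 0x0001;
}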

> 
> [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
>
Bjorn Helgaas Nov. 12, 2019, 6:58 p.m. UTC | #22
On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote:
> [...]
>
> Here, I'm exercising the suspend-to-RAM sequence, and the PCIe endpoint of
> concern is an Intel 750 NVMe drive.
> The PCIe host controller driver already has a 100ms delay as per the spec,

To make this more concrete, where specifically is this delay?

> but this particular device takes 1023ms to become ready to respond to
> configuration space requests (until then, it responds with
> Configuration Request Retry Status).
> I've put a dump_stack() in place to see the path the resume sequence takes, and here it is:
> 
>  [...]
>  pci 0005:01:00.0: ready after 1023ms
> 
> The spec also mentions the following:
>     Unless Readiness Notifications mechanisms are used (see Section
>     6.23), the Root Complex and/or system software must allow at
>     least 1.0 s after a Conventional Reset of a device, before it
>     may determine that a device which fails to return a Successful
>     Completion status for a valid Configuration Request is a broken
>     device. This period is independent of how quickly Link training
>     completes.
> 
> My understanding is that this 1s wait is supposed to be done by
> the PCI subsystem and not the host controller driver.

That doesn't say we must wait 1s; it only says we can't decide the
device is broken before 1s.  But regardless, I agree that the CRS
handling doesn't belong in the driver for either the endpoint or the
PCIe host controller.

My question is whether this wait should be connected somehow with
platform_pci_set_power_state().  It sounds like the tegra host
controller driver already does the platform-specific delays, and I'm
not sure it's reasonable for platform_pci_set_power_state() to do the
CRS polling.  Maybe something like this?  I'd really like to get
Rafael's thinking here.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e7982af9a5d8..052fa316c917 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
  */
 void pci_power_up(struct pci_dev *dev)
 {
+	pci_power_t prev_state = dev->current_state;
+
 	if (platform_pci_power_manageable(dev))
 		platform_pci_set_power_state(dev, PCI_D0);
 
+	if (prev_state == PCI_D3cold)
+		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
+
 	pci_raw_set_power_state(dev, PCI_D0);
 	pci_update_current_state(dev, PCI_D0);
 }

> FWIW, this particular device takes a little more than 1s
> (i.e. 1023ms). I'm now wondering why the CRS handling has a
> timeout of 60s rather than just 1s.

pci_bus_wait_crs() does an exponential backoff, i.e., it does reads
after 0ms, 2ms, 4ms, 8ms, ..., 512ms, 1024ms, so we don't know
*exactly* when the device became ready.  All we know is that it
became ready somewhere between 512ms and 1024ms.
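
In outline, that loop is (a simplified sketch with the messages and corner
cases dropped, and a made-up name; see pci_bus_wait_crs() for the real thing):

static bool wait_crs_sketch(struct pci_bus *bus, int devfn, u32 *l,
			    int timeout)
{
	int delay = 1;

	/* re-read Vendor ID with doubling delays until CRS clears */
	while ((*l & 0xffff) == 0x0001) {
		if (delay > timeout)
			return false;		/* still CRS; give up */
		msleep(delay);			/* 1, 2, 4, ..., 1024ms */
		delay *= 2;
		pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l);
	}
	return true;				/* device answered */
}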

But 821cdad5c46c ("PCI: Wait up to 60 seconds for device to become
ready after FLR") does suggest that the Intel 750 NVMe may require
more than 1s.  I think the 60s timeout is just a convenient large
number and I'm not sure it's derived from anything in the spec.

> > [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
Vidya Sagar Nov. 13, 2019, 5:39 a.m. UTC | #23
On 11/13/2019 12:28 AM, Bjorn Helgaas wrote:
> On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote:
>> [...]
>>
>> Here, I'm exercising the suspend-to-RAM sequence, and the PCIe endpoint of
>> concern is an Intel 750 NVMe drive.
>> The PCIe host controller driver already has a 100ms delay as per the spec,
> 
> To make this more concrete, where specifically is this delay?

It is after PERST is deasserted and before the first check for DL_Active:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/dwc/pcie-tegra194.c?h=v5.4-rc7#n816
dw_pcie_wait_for_link() has a maximum timeout of 1s before bailing out.
Anyway, in this case the link comes up within 100ms, but config space is not
accessible because the device returns CRS.
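
Roughly, the sequence around that delay is (a sketch, not the literal driver
code; the PERST helper name is made up):

	/* sketch of the Tegra194 link bring-up around the 100ms delay */
	tegra_pcie_deassert_perst(pcie);   /* made-up name: release PERST# */
	msleep(100);                       /* 100ms before first DL_Active check */
	if (dw_pcie_wait_for_link(pci))    /* polls link-up, ~1s maximum */
		return -ETIMEDOUT;         /* link never trained */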

> 
>> [...]
>>
>> My understanding is that this 1s wait is supposed to be done by
>> the PCI subsystem and not the host controller driver.
> 
> That doesn't say we must wait 1s; it only says we can't decide the
> device is broken before 1s.  But regardless, I agree that the CRS
> handling doesn't belong in the driver for either the endpoint or the
> PCIe host controller.
> 
> My question is whether this wait should be connected somehow with
> platform_pci_set_power_state().  It sounds like the tegra host
> controller driver already does the platform-specific delays, and I'm
> not sure it's reasonable for platform_pci_set_power_state() to do the
> CRS polling.  Maybe something like this?  I'd really like to get
> Rafael's thinking here.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..052fa316c917 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>    */
>   void pci_power_up(struct pci_dev *dev)
>   {
> +	pci_power_t prev_state = dev->current_state;
> +
>   	if (platform_pci_power_manageable(dev))
>   		platform_pci_set_power_state(dev, PCI_D0);
>   
> +	if (prev_state == PCI_D3cold)
> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
> +
>   	pci_raw_set_power_state(dev, PCI_D0);
>   	pci_update_current_state(dev, PCI_D0);
>   }
> 
>> FWIW, this particular device takes a little more than 1s
>> (i.e. 1023ms). I'm now wondering why the CRS handling has a
>> timeout of 60s rather than just 1s.
> 
> pci_bus_wait_crs() does an exponential backoff, i.e., it does reads
> after 0ms, 2ms, 4ms, 8ms, ..., 512ms, 1024ms, so we don't know
> *exactly* when the device became ready.  All we know is that it
> became ready somewhere between 512ms and 1024ms.
> 
> But 821cdad5c46c ("PCI: Wait up to 60 seconds for device to become
> ready after FLR") does suggest that the Intel 750 NVMe may require
> more than 1s.  I think the 60s timeout is just a convenient large
> number and I'm not sure it's derived from anything in the spec.
I tried out the above patch that you came up with.
A couple of points here:
On Tegra194, we explicitly put the endpoint into D0 (instead of leaving it in D3hot)
and then transition the link to L2 (for Tegra-specific reasons).
Even if I comment out the code in pcie-tegra194.c that puts the devices back into D0
(from D3hot) just before the L2 transition, the device ends up in D3hot, not D3cold,
so the "if (prev_state == PCI_D3cold)" check didn't really trigger.
Also, I'm wondering why there should be a check in the first place;
why can't pci_dev_wait() be called unconditionally?

> 
>>> [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
>
Thierry Reding Nov. 13, 2019, 11:20 a.m. UTC | #24
On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
> [...]
> 
> My question is whether this wait should be connected somehow with
> platform_pci_set_power_state().  It sounds like the tegra host
> controller driver already does the platform-specific delays, and I'm
> not sure it's reasonable for platform_pci_set_power_state() to do the
> CRS polling.  Maybe something like this?  I'd really like to get
> Rafael's thinking here.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..052fa316c917 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>   */
>  void pci_power_up(struct pci_dev *dev)
>  {
> +	pci_power_t prev_state = dev->current_state;
> +
>  	if (platform_pci_power_manageable(dev))
>  		platform_pci_set_power_state(dev, PCI_D0);
>  
> +	if (prev_state == PCI_D3cold)
> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);

Is there any reason in particular why you chose to call pci_dev_wait()?
It seems to me like that's a little broader than pci_bus_wait_crs(). The
latter is static in drivers/pci/probe.c so we'd need to change that in
order to use it from drivers/pci/pci.c, but it sounds like the more
explicit thing to do.

Thierry

> +
>  	pci_raw_set_power_state(dev, PCI_D0);
>  	pci_update_current_state(dev, PCI_D0);
>  }
> [...]
>
Bjorn Helgaas Nov. 14, 2019, 6:36 p.m. UTC | #25
On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote:
> On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:

> > My question is whether this wait should be connected somehow with
> > platform_pci_set_power_state().  It sounds like the tegra host
> > controller driver already does the platform-specific delays, and I'm
> > not sure it's reasonable for platform_pci_set_power_state() to do the
> > CRS polling.  Maybe something like this?  I'd really like to get
> > Rafael's thinking here.
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index e7982af9a5d8..052fa316c917 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
> >   */
> >  void pci_power_up(struct pci_dev *dev)
> >  {
> > +	pci_power_t prev_state = dev->current_state;
> > +
> >  	if (platform_pci_power_manageable(dev))
> >  		platform_pci_set_power_state(dev, PCI_D0);
> >  
> > +	if (prev_state == PCI_D3cold)
> > +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
> 
> Is there any reason in particular why you chose to call pci_dev_wait()?
> It seems to me like that's a little broader than pci_bus_wait_crs(). The
> latter is static in drivers/pci/probe.c so we'd need to change that in
> order to use it from drivers/pci/pci.c, but it sounds like the more
> explicit thing to do.

Broader in what sense?  They work essentially identically except that
pci_bus_wait_crs() doesn't need a pci_dev * (because it's used during
enumeration when we don't have a pci_dev yet) and it does dword reads
instead of word reads.

It is a shame that the logic is duplicated, but we don't have to worry
about that here.

I think I would stick with pci_dev_wait() in this case since we do
have a pci_dev * and it's a little simpler, unless I'm missing the
advantage of pci_bus_wait_crs().
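
For reference, pci_dev_wait() is essentially the same exponential backoff,
just keyed off the pci_dev (a simplified sketch with the warning messages
dropped and a made-up name; it polls a config dword and treats all-ones as
"not ready yet"):

static int dev_wait_sketch(struct pci_dev *dev, int timeout)
{
	int delay = 1;
	u32 id;

	pci_read_config_dword(dev, PCI_COMMAND, &id);
	while (id == ~0) {		/* RC synthesized ~0: not ready */
		if (delay > timeout)
			return -ENOTTY;	/* give up; device looks broken */
		msleep(delay);
		delay *= 2;
		pci_read_config_dword(dev, PCI_COMMAND, &id);
	}
	return 0;
}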

Bjorn
Vidya Sagar Nov. 15, 2019, 10:04 a.m. UTC | #26
On 11/15/2019 12:06 AM, Bjorn Helgaas wrote:
> On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote:
>> On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
> 
>>> My question is whether this wait should be connected somehow with
>>> platform_pci_set_power_state().  It sounds like the tegra host
>>> controller driver already does the platform-specific delays, and I'm
>>> not sure it's reasonable for platform_pci_set_power_state() to do the
>>> CRS polling.  Maybe something like this?  I'd really like to get
>>> Rafael's thinking here.
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index e7982af9a5d8..052fa316c917 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>>>    */
>>>   void pci_power_up(struct pci_dev *dev)
>>>   {
>>> +	pci_power_t prev_state = dev->current_state;
>>> +
>>>   	if (platform_pci_power_manageable(dev))
>>>   		platform_pci_set_power_state(dev, PCI_D0);
>>>   
>>> +	if (prev_state == PCI_D3cold)
>>> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
>>
>> Is there any reason in particular why you chose to call pci_dev_wait()?
>> It seems to me like that's a little broader than pci_bus_wait_crs(). The
>> latter is static in drivers/pci/probe.c so we'd need to change that in
>> order to use it from drivers/pci/pci.c, but it sounds like the more
>> explicit thing to do.
> 
> Broader in what sense?  They work essentially identically except that
> pci_bus_wait_crs() doesn't need a pci_dev * (because it's used during
> enumeration when we don't have a pci_dev yet) and it does dword reads
> instead of word reads.
> 
> It is a shame that the logic is duplicated, but we don't have to worry
> about that here.
> 
> I think I would stick with pci_dev_wait() in this case since we do
> have a pci_dev * and it's a little simpler, unless I'm missing the
> advantage of pci_bus_wait_crs().
Is there any specific reason why there should be a check for the state?
On Tegra, I observe that by the time execution reaches this point,
prev_state is PCI_D3hot, and on Tegra194 in particular it is PCI_D0 because the
host controller driver explicitly keeps the downstream devices in PCI_D0
as a workaround for one of the Tegra194-specific issues. So I feel the check
for current_state may not be needed here(?)

- Vidya Sagar
> 
> Bjorn
>
Bjorn Helgaas Nov. 15, 2019, 10:36 p.m. UTC | #27
On Fri, Nov 15, 2019 at 03:34:23PM +0530, Vidya Sagar wrote:
> On 11/15/2019 12:06 AM, Bjorn Helgaas wrote:
> > On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote:
> > > On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
> > >
> > > > My question is whether this wait should be connected somehow with
> > > > platform_pci_set_power_state().  It sounds like the tegra host
> > > > controller driver already does the platform-specific delays, and I'm
> > > > not sure it's reasonable for platform_pci_set_power_state() to do the
> > > > CRS polling.  Maybe something like this?  I'd really like to get
> > > > Rafael's thinking here.
> > > > 
> > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > index e7982af9a5d8..052fa316c917 100644
> > > > --- a/drivers/pci/pci.c
> > > > +++ b/drivers/pci/pci.c
> > > > @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
> > > >    */
> > > >   void pci_power_up(struct pci_dev *dev)
> > > >   {
> > > > +	pci_power_t prev_state = dev->current_state;
> > > > +
> > > >   	if (platform_pci_power_manageable(dev))
> > > >   		platform_pci_set_power_state(dev, PCI_D0);
> > > > +	if (prev_state == PCI_D3cold)
> > > > +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
>
> Is there any specific reason why there should be a check for the
> state?  On Tegra, I observe that by the time execution reaches this
> point, prev_state is PCI_D3hot, and on Tegra194 in particular it is
> PCI_D0 because the host controller driver explicitly keeps the
> downstream devices in PCI_D0 as a workaround for one of the
> Tegra194-specific issues.

I think you're right, we probably should not try to check "prev_state"
here because we can't rely on that being accurate.

On Tegra, I assume suspend puts the PCIe devices in D3hot, then when
we suspend the RC itself, it looks like tegra_pcie_pm_suspend()
actually turns off the power, so the PCIe devices probably go to
D3cold but nothing updates their dev->current_state, so it's probably
still PCI_D3hot.

On Tegra194, the same probably happens, except that when we suspend
the RC itself, tegra_pcie_downstream_dev_to_D0() puts the PCIe devices
back in D0 (updating their dev->current_state to PCI_D0), and then we
turn off the power, so they probably also end up in D3cold but with
dev->current_state still set to PCI_D0.

> So I feel the check for current_state may not be needed here(?)

I think you're right.  We can't tell what dev->current_state is when
we enter pci_power_up().
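
So the sketch would drop the check and always poll, something like this
(still only a sketch for discussion, not a final patch):

void pci_power_up(struct pci_dev *dev)
{
	if (platform_pci_power_manageable(dev))
		platform_pci_set_power_state(dev, PCI_D0);

	/*
	 * dev->current_state may be stale after D3cold, so don't guess;
	 * pci_dev_wait() returns quickly if the device already responds
	 * to config reads.
	 */
	pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);

	pci_raw_set_power_state(dev, PCI_D0);
	pci_update_current_state(dev, PCI_D0);
}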

Bjorn
Vidya Sagar Nov. 18, 2019, 3:18 p.m. UTC | #28
On 11/16/2019 4:06 AM, Bjorn Helgaas wrote:
> On Fri, Nov 15, 2019 at 03:34:23PM +0530, Vidya Sagar wrote:
>> On 11/15/2019 12:06 AM, Bjorn Helgaas wrote:
>>> On Wed, Nov 13, 2019 at 12:20:43PM +0100, Thierry Reding wrote:
>>>> On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
>>>>
>>>>> My question is whether this wait should be connected somehow with
>>>>> platform_pci_set_power_state().  It sounds like the tegra host
>>>>> controller driver already does the platform-specific delays, and I'm
>>>>> not sure it's reasonable for platform_pci_set_power_state() to do the
>>>>> CRS polling.  Maybe something like this?  I'd really like to get
>>>>> Rafael's thinking here.
>>>>>
>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>> index e7982af9a5d8..052fa316c917 100644
>>>>> --- a/drivers/pci/pci.c
>>>>> +++ b/drivers/pci/pci.c
>>>>> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>>>>>     */
>>>>>    void pci_power_up(struct pci_dev *dev)
>>>>>    {
>>>>> +	pci_power_state_t prev_state = dev->current_state;
>>>>> +
>>>>>    	if (platform_pci_power_manageable(dev))
>>>>>    		platform_pci_set_power_state(dev, PCI_D0);
>>>>> +	if (prev_state == PCI_D3cold)
>>>>> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);
>>
>> Is there any specific reason why there should be a check for the
>> state?  On Tegra, I observe that by the time execution reaches this
>> point, prev_state is PCI_D3hot, and on Tegra194 in particular it is
>> PCI_D0 because the host controller driver explicitly keeps the
>> downstream devices in PCI_D0 as a workaround for one of the
>> Tegra194-specific issues.
> 
> I think you're right, we probably should not try to check "prev_state"
> here because we can't rely on that being accurate.
> 
> On Tegra, I assume suspend puts the PCIe devices in D3hot, then when
> we suspend the RC itself, it looks like tegra_pcie_pm_suspend()
> actually turns off the power, so the PCIe devices probably go to
> D3cold but nothing updates their dev->current_state, so it's probably
> still PCI_D3hot.
> 
> On Tegra194, the same probably happens, except that when we suspend
> the RC itself, tegra_pcie_downstream_dev_to_D0() puts the PCIe devices
> back in D0 (updating their dev->current_state to PCI_D0), and then we
> turn off the power, so they probably also end up in D3cold but with
>> dev->current_state still set to PCI_D0.
> 
>> So I feel the check for current_state may not be needed here(?)
> 
> I think you're right.  We can't tell what dev->current_state is when
> we enter pci_power_up().
Thanks,
I'll push a change with your suggested modifications.

- Vidya Sagar
> 
> Bjorn
>
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 95dc78ebdded..3ab9f6d3c8a6 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5905,7 +5905,8 @@  bool pci_device_is_present(struct pci_dev *pdev)
 
 	if (pci_dev_is_disconnected(pdev))
 		return false;
-	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
+	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v,
+					  PCI_CRS_TIMEOUT);
 }
 EXPORT_SYMBOL_GPL(pci_device_is_present);
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3f6947ee3324..aa25c5fdc6a5 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -4,6 +4,8 @@ 
 
 #include <linux/pci.h>
 
+#define PCI_CRS_TIMEOUT		(60 * 1000)	/* 60 sec */
+
 #define PCI_FIND_CAP_TTL	48
 
 #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7c5d68b807ef..6e44a03283c8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2258,7 +2258,7 @@  static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
 	struct pci_dev *dev;
 	u32 l;
 
-	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
+	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, PCI_CRS_TIMEOUT))
 		return NULL;
 
 	dev = pci_alloc_dev(bus);