diff mbox series

[2/3] PCI: Do not poll for PME if the device is in D3cold

Message ID 20190605145820.37169-3-mika.westerberg@linux.intel.com
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Power management improvements | expand

Commit Message

Mika Westerberg June 5, 2019, 2:58 p.m. UTC
PME polling does not take into account that a device that is directly
connected to the host bridge may go into D3cold as well. This leads to a
situation where the PME poll thread reads from a config space of a
device that is in D3cold and gets incorrect information because the
config space is not accessible.

Here is an example from Intel Ice Lake system where two PCIe root ports
are in D3cold (I've instrumented the kernel to log the PMCSR register
contents):

  [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
  [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff

Since 0xffff is interpreted so that PME is pending, the root ports will
be runtime resumed. This repeats over and over again essentially
blocking all runtime power management.

Prevent this from happening by checking whether the device is in D3cold
before its PME status is read.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Lukas Wunner June 5, 2019, 7:05 p.m. UTC | #1
On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.

There's more broken here.  The below patch fixes a PME polling race
and should also fix the issue you're witnessing, could you verify that?

The patch has been rotting on my development branch for several months,
I just didn't get around to posting it, my apologies.

-- >8 --
Subject: [PATCH] PCI / PM: Fix race on PME polling

Since commit df17e62e5bff ("PCI: Add support for polling PME state on
suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
the PME status flag of devices and wakes them up if the bit is set.

The function performs a check whether a device's upstream bridge is in
D0 for otherwise the device is inaccessible, rendering PME polling
impossible.  However the check is racy because it is performed before
polling the device.  If the upstream bridge runtime suspends to D3hot
after pci_pme_list_scan() checks its power state and before it invokes
pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
mistake it for a set PME status flag.  I am seeing this race play out as
a Thunderbolt controller going to D3cold and occasionally immediately
going to D0 again because PM polling was performed at just the wrong
time.

Avoid by checking for an "all ones" PMCSR in pci_check_pme_status().

Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org # v2.6.34+
---
 drivers/pci/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b98a564..2e05348 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
 	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
 	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
 		return false;
+	if (pmcsr == ~0)
+		return false;
 
 	/* Clear PME status. */
 	pmcsr |= PCI_PM_CTRL_PME_STATUS;
Mika Westerberg June 6, 2019, 11:21 a.m. UTC | #2
On Wed, Jun 05, 2019 at 09:05:57PM +0200, Lukas Wunner wrote:
> On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> > PME polling does not take into account that a device that is directly
> > connected to the host bridge may go into D3cold as well. This leads to a
> > situation where the PME poll thread reads from a config space of a
> > device that is in D3cold and gets incorrect information because the
> > config space is not accessible.
> > 
> > Here is an example from Intel Ice Lake system where two PCIe root ports
> > are in D3cold (I've instrumented the kernel to log the PMCSR register
> > contents):
> > 
> >   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
> >   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> > 
> > Since 0xffff is interpreted so that PME is pending, the root ports will
> > be runtime resumed. This repeats over and over again essentially
> > blocking all runtime power management.
> > 
> > Prevent this from happening by checking whether the device is in D3cold
> > before its PME status is read.
> 
> There's more broken here.  The below patch fixes a PME polling race
> and should also fix the issue you're witnessing, could you verify that?

It fixes the issue but I needed to tune it a bit ->

> The patch has been rotting on my development branch for several months,
> I just didn't get around to posting it, my apologies.

Better late than never :)

> -- >8 --
> Subject: [PATCH] PCI / PM: Fix race on PME polling
> 
> Since commit df17e62e5bff ("PCI: Add support for polling PME state on
> suspended legacy PCI devices"), the work item pci_pme_list_scan() polls
> the PME status flag of devices and wakes them up if the bit is set.
> 
> The function performs a check whether a device's upstream bridge is in
> D0 for otherwise the device is inaccessible, rendering PME polling
> impossible.  However the check is racy because it is performed before
> polling the device.  If the upstream bridge runtime suspends to D3hot
> after pci_pme_list_scan() checks its power state and before it invokes
> pci_pme_wakeup(), the latter will read the PMCSR as "all ones" and
> mistake it for a set PME status flag.  I am seeing this race play out as
> a Thunderbolt controller going to D3cold and occasionally immediately
> going to D0 again because PM polling was performed at just the wrong
> time.
> 
> Avoid by checking for an "all ones" PMCSR in pci_check_pme_status().
> 
> Fixes: 58ff463396ad ("PCI PM: Add function for checking PME status of devices")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: stable@vger.kernel.org # v2.6.34+
> ---
>  drivers/pci/pci.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b98a564..2e05348 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1753,6 +1753,8 @@ bool pci_check_pme_status(struct pci_dev *dev)
>  	pci_read_config_word(dev, pmcsr_pos, &pmcsr);
>  	if (!(pmcsr & PCI_PM_CTRL_PME_STATUS))
>  		return false;
> +	if (pmcsr == ~0)

<- Here I needed to do

	if (pmcsr == (u16)~0)

I think it is because pmcsr is u16 so we end up comparing:

	0xffff == 0xffffffff

> +		return false;
>  
>  	/* Clear PME status. */
>  	pmcsr |= PCI_PM_CTRL_PME_STATUS;
> -- 
> 2.20.1
Lukas Wunner June 9, 2019, 6:38 p.m. UTC | #3
On Wed, Jun 05, 2019 at 05:58:19PM +0300, Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
> 
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>
Fixes: 71a83bd727cc ("PCI/PM: add runtime PM support to PCIe port")
Cc: stable@vger.kernel.org # v3.6+

Although the patch I've posted today (which checks for an "all ones" read
from config space) covers the issue fixed herein, your patch still makes
sense to avoid unnecessarily accessing config space in the first place.

Thanks,

Lukas

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
> -- 
> 2.20.1
>
Rafael J. Wysocki June 10, 2019, 11:35 a.m. UTC | #4
On Wednesday, June 5, 2019 4:58:19 PM CEST Mika Westerberg wrote:
> PME polling does not take into account that a device that is directly
> connected to the host bridge may go into D3cold as well. This leads to a
> situation where the PME poll thread reads from a config space of a
> device that is in D3cold and gets incorrect information because the
> config space is not accessible.
> 
> Here is an example from Intel Ice Lake system where two PCIe root ports
> are in D3cold (I've instrumented the kernel to log the PMCSR register
> contents):
> 
>   [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
>   [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff
> 
> Since 0xffff is interpreted so that PME is pending, the root ports will
> be runtime resumed. This repeats over and over again essentially
> blocking all runtime power management.
> 
> Prevent this from happening by checking whether the device is in D3cold
> before its PME status is read.
> 
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/pci/pci.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 87a1f902fa8e..720da09d4d73 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2060,6 +2060,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +			/*
> +			 * If the device is in D3cold it should not be
> +			 * polled either.
> +			 */
> +			if (pme_dev->dev->current_state == PCI_D3cold)
> +				continue;
> +
>  			pci_pme_wakeup(pme_dev->dev, NULL);
>  		} else {
>  			list_del(&pme_dev->list);
>
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 87a1f902fa8e..720da09d4d73 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2060,6 +2060,13 @@  static void pci_pme_list_scan(struct work_struct *work)
 			 */
 			if (bridge && bridge->current_state != PCI_D0)
 				continue;
+			/*
+			 * If the device is in D3cold it should not be
+			 * polled either.
+			 */
+			if (pme_dev->dev->current_state == PCI_D3cold)
+				continue;
+
 			pci_pme_wakeup(pme_dev->dev, NULL);
 		} else {
 			list_del(&pme_dev->list);