[v2,2/2] PCI: Fix runtime PM race with PME polling

Message ID	20230803171233.3810944-3-alex.williamson@redhat.com
State	New
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> From: Alex Williamson <alex.williamson@redhat.com> To: bhelgaas@google.com Cc: Alex Williamson <alex.williamson@redhat.com>, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, eric.auger@redhat.com Subject: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Date: Thu, 3 Aug 2023 11:12:33 -0600 Message-Id: <20230803171233.3810944-3-alex.williamson@redhat.com> In-Reply-To: <20230803171233.3810944-1-alex.williamson@redhat.com> References: <20230803171233.3810944-1-alex.williamson@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	PCI: Protect VPD and PME accesses from power management \| expand [v2,0/2] PCI: Protect VPD and PME accesses from power management [v2,1/2] PCI/VPD: Add runtime power management to sysfs interface [v2,2/2] PCI: Fix runtime PM race with PME polling

Alex Williamson Aug. 3, 2023, 5:12 p.m. UTC

Testing that a device is not currently in a low power state provides no
guarantees that the device is not immenently transitioning to such a state.
We need to increment the PM usage counter before accessing the device.
Since we don't wish to wake the device for PME polling, do so only if the
device is already active by using pm_runtime_get_if_active().

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

Alex Williamson Jan. 18, 2024, 6:50 p.m. UTC | #1

On Thu,  3 Aug 2023 11:12:33 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> Testing that a device is not currently in a low power state provides no
> guarantees that the device is not immenently transitioning to such a state.
> We need to increment the PM usage counter before accessing the device.
> Since we don't wish to wake the device for PME polling, do so only if the
> device is already active by using pm_runtime_get_if_active().
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/pci/pci.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)

Hey folks,

Resurrecting this patch (currently commit d3fcd7360338) for discussion
as it's been identified as the source of a regression in:

https://bugzilla.kernel.org/show_bug.cgi?id=218360

Copying Mika, Lukas, and Rafael as it's related to:

000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")

where we skip devices in D3cold when processing the PME list.

I think the issue in the above bz is that the downstream TB3/USB4 port
is in D3 (presumably D3hot) and I therefore infer the device is in state
RPM_SUSPENDED.  This commit is attempting to make sure the device power
state is stable across the call such that it does not transition into
D3cold while we're accessing it.

To do that I used pm_runtime_get_if_active(), but in retrospect this
requires the device to be in RPM_ACTIVE so we end up skipping anything
suspended or transitioning.

As reported in the above bz, I tried replacing this with:

	pm_runtime_get_noresume(dev);
	pm_runtime_barrier(dev);

The theory here being that the barrier would wait for any transitioning
states such that as far as runtime power management is concerned, the
device power state is stable.

This causes live locks where the barrier never returns.

Instead I'm considering that since we're polling the PME list, maybe we
could just defer devices in transition states, for instance something
that looks like pm_runtime_get_if_active(), but would return zero if
the device was in RPM_SUSPENDING or RPM_RESUMING rather than requiring
RPM_ACTIVE.

I'm not an expert in PME or runtime power management though, so I'm
looking for advice.  Thanks,

Alex

> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 60230da957e0..bc266f290b2c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>  
>  	mutex_lock(&pci_pme_list_mutex);
>  	list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) {
> -		if (pme_dev->dev->pme_poll) {
> -			struct pci_dev *bridge;
> +		struct pci_dev *pdev = pme_dev->dev;
> +
> +		if (pdev->pme_poll) {
> +			struct pci_dev *bridge = pdev->bus->self;
> +			struct device *dev = &pdev->dev;
> +			int pm_status;
>  
> -			bridge = pme_dev->dev->bus->self;
>  			/*
>  			 * If bridge is in low power state, the
>  			 * configuration space of subordinate devices
> @@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +
>  			/*
> -			 * If the device is in D3cold it should not be
> -			 * polled either.
> +			 * If the device is in a low power state it
> +			 * should not be polled either.
>  			 */
> -			if (pme_dev->dev->current_state == PCI_D3cold)
> +			pm_status = pm_runtime_get_if_active(dev, true);
> +			if (!pm_status)
>  				continue;
>  
> -			pci_pme_wakeup(pme_dev->dev, NULL);
> +			if (pdev->current_state != PCI_D3cold)
> +				pci_pme_wakeup(pdev, NULL);
> +
> +			if (pm_status > 0)
> +				pm_runtime_put(dev);
>  		} else {
>  			list_del(&pme_dev->list);
>  			kfree(pme_dev);

Lukas Wunner Jan. 22, 2024, 10:17 p.m. UTC | #2

On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote:
> > Testing that a device is not currently in a low power state provides no
> > guarantees that the device is not immenently transitioning to such a state.
> > We need to increment the PM usage counter before accessing the device.
> > Since we don't wish to wake the device for PME polling, do so only if the
> > device is already active by using pm_runtime_get_if_active().
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/pci/pci.c | 23 ++++++++++++++++-------
> >  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> Resurrecting this patch (currently commit d3fcd7360338) for discussion
> as it's been identified as the source of a regression in:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=218360
> 
> Copying Mika, Lukas, and Rafael as it's related to:
> 
> 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> 
> where we skip devices in D3cold when processing the PME list.
> 
> I think the issue in the above bz is that the downstream TB3/USB4 port
> is in D3 (presumably D3hot) and I therefore infer the device is in state
> RPM_SUSPENDED.  This commit is attempting to make sure the device power
> state is stable across the call such that it does not transition into
> D3cold while we're accessing it.
> 
> To do that I used pm_runtime_get_if_active(), but in retrospect this
> requires the device to be in RPM_ACTIVE so we end up skipping anything
> suspended or transitioning.

How about dropping the calls to pm_runtime_get_if_active() and
pm_runtime_put() and instead simply do:

			if (pm_runtime_suspended(&pdev->dev) &&
			    pdev->current_state != PCI_D3cold)
				pci_pme_wakeup(pdev, NULL);

Thanks,

Lukas

Alex Williamson Jan. 22, 2024, 10:50 p.m. UTC | #3

On Mon, 22 Jan 2024 23:17:30 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote:  
> > > Testing that a device is not currently in a low power state provides no
> > > guarantees that the device is not immenently transitioning to such a state.
> > > We need to increment the PM usage counter before accessing the device.
> > > Since we don't wish to wake the device for PME polling, do so only if the
> > > device is already active by using pm_runtime_get_if_active().
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  drivers/pci/pci.c | 23 ++++++++++++++++-------
> > >  1 file changed, 16 insertions(+), 7 deletions(-)  
> > 
> > Resurrecting this patch (currently commit d3fcd7360338) for discussion
> > as it's been identified as the source of a regression in:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=218360
> > 
> > Copying Mika, Lukas, and Rafael as it's related to:
> > 
> > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> > 
> > where we skip devices in D3cold when processing the PME list.
> > 
> > I think the issue in the above bz is that the downstream TB3/USB4 port
> > is in D3 (presumably D3hot) and I therefore infer the device is in state
> > RPM_SUSPENDED.  This commit is attempting to make sure the device power
> > state is stable across the call such that it does not transition into
> > D3cold while we're accessing it.
> > 
> > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > suspended or transitioning.  
> 
> How about dropping the calls to pm_runtime_get_if_active() and
> pm_runtime_put() and instead simply do:
> 
> 			if (pm_runtime_suspended(&pdev->dev) &&
> 			    pdev->current_state != PCI_D3cold)
> 				pci_pme_wakeup(pdev, NULL);

Hi Lukas,

Do we require that the polled device is in the RPM_SUSPENDED state?
Also pm_runtime_suspended() can also only be trusted while holding the
device power.lock, we need a usage count reference to maintain that
state.

I'm also seeing cases where the bridge is power state D0, but PM state
RPM_SUSPENDING, so config space of the polled device becomes
inaccessible even while we're holding a reference once we allow polling
in RPM_SUSPENDED.

I'm currently working with the below patch, which I believe addresses
all these issues, but I'd welcome review and testing. Thanks,

Alex

commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Tue Jan 16 13:28:33 2024 -0700

    PCI: Fix active state requirement in PME polling
    
    The commit noted in fixes added a bogus requirement that runtime PM
    managed devices need to be in the RPM_ACTIVE state for PME polling.
    In fact, there is no requirement of a specific runtime PM state, it
    is only required that the state is stable such that testing config
    space availability, ie. !D3cold, remains valid across the PME wakeup.
    
    To that effect, defer polling of runtime PM managed devices that are
    not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
    transitional states remain on the pci_pme_list and will be re-queued.
    
    However in allowing polling of devices in the RPM_SUSPENDED state,
    the bridge state requires further refinement as it's possible to poll
    while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
    An asynchronous completion of the bridge transition to a low power
    state can make config space of the subordinate device become
    unavailable.  A runtime PM reference to the bridge is therefore added
    with a supplementary requirement that the bridge is in the RPM_ACTIVE
    state.
    
    Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
    Reported-by: Sanath S <sanath.s@amd.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bdbf8a94b4d0..31dbf1834b07 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
 		if (pdev->pme_poll) {
 			struct pci_dev *bridge = pdev->bus->self;
 			struct device *dev = &pdev->dev;
-			int pm_status;
+			struct device *bdev = bridge ? &bridge->dev : NULL;
 
 			/*
-			 * If bridge is in low power state, the
-			 * configuration space of subordinate devices
-			 * may be not accessible
+			 * If we have a bridge, it should be in an active/D0
+			 * state or the configuration space of subordinate
+			 * devices may not be accessible.
 			 */
-			if (bridge && bridge->current_state != PCI_D0)
-				continue;
+			if (bdev) {
+				spin_lock_irq(&bdev->power.lock);
+				if (!pm_runtime_active(bdev) ||
+				    bridge->current_state != PCI_D0) {
+					spin_unlock_irq(&bdev->power.lock);
+					continue;
+				}
+				pm_runtime_get_noresume(bdev);
+				spin_unlock_irq(&bdev->power.lock);
+			}
 
 			/*
-			 * If the device is in a low power state it
-			 * should not be polled either.
+			 * The device itself may be either in active or
+			 * suspended state, but must not be in D3cold so
+			 * that configuration space is accessible.  The
+			 * transitional resuming and suspending states are
+			 * skipped to avoid D3cold races.
 			 */
-			pm_status = pm_runtime_get_if_active(dev, true);
-			if (!pm_status)
-				continue;
-
-			if (pdev->current_state != PCI_D3cold)
+			spin_lock_irq(&dev->power.lock);
+			if ((pm_runtime_active(dev) ||
+			     pm_runtime_suspended(dev)) &&
+			    pdev->current_state != PCI_D3cold) {
+				pm_runtime_get_noresume(dev);
+				spin_unlock_irq(&dev->power.lock);
 				pci_pme_wakeup(pdev, NULL);
-
-			if (pm_status > 0)
 				pm_runtime_put(dev);
+			} else {
+				spin_unlock_irq(&dev->power.lock);
+			}
+
+			if (bdev)
+				pm_runtime_put(bdev);
 		} else {
 			list_del(&pme_dev->list);
 			kfree(pme_dev);

Alex Williamson Jan. 23, 2024, 4:46 a.m. UTC | #4

On Mon, 22 Jan 2024 15:50:03 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 22 Jan 2024 23:17:30 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
> 
> > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:  
> > > On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote:    
> > > > Testing that a device is not currently in a low power state provides no
> > > > guarantees that the device is not immenently transitioning to such a state.
> > > > We need to increment the PM usage counter before accessing the device.
> > > > Since we don't wish to wake the device for PME polling, do so only if the
> > > > device is already active by using pm_runtime_get_if_active().
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > >  drivers/pci/pci.c | 23 ++++++++++++++++-------
> > > >  1 file changed, 16 insertions(+), 7 deletions(-)    
> > > 
> > > Resurrecting this patch (currently commit d3fcd7360338) for discussion
> > > as it's been identified as the source of a regression in:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=218360
> > > 
> > > Copying Mika, Lukas, and Rafael as it's related to:
> > > 
> > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> > > 
> > > where we skip devices in D3cold when processing the PME list.
> > > 
> > > I think the issue in the above bz is that the downstream TB3/USB4 port
> > > is in D3 (presumably D3hot) and I therefore infer the device is in state
> > > RPM_SUSPENDED.  This commit is attempting to make sure the device power
> > > state is stable across the call such that it does not transition into
> > > D3cold while we're accessing it.
> > > 
> > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > suspended or transitioning.    
> > 
> > How about dropping the calls to pm_runtime_get_if_active() and
> > pm_runtime_put() and instead simply do:
> > 
> > 			if (pm_runtime_suspended(&pdev->dev) &&
> > 			    pdev->current_state != PCI_D3cold)
> > 				pci_pme_wakeup(pdev, NULL);  
> 
> Hi Lukas,
> 
> Do we require that the polled device is in the RPM_SUSPENDED state?
> Also pm_runtime_suspended() can also only be trusted while holding the
> device power.lock, we need a usage count reference to maintain that
> state.
> 
> I'm also seeing cases where the bridge is power state D0, but PM state
> RPM_SUSPENDING, so config space of the polled device becomes
> inaccessible even while we're holding a reference once we allow polling
> in RPM_SUSPENDED.
> 
> I'm currently working with the below patch, which I believe addresses
> all these issues, but I'd welcome review and testing. Thanks,
> 
> Alex
> 
> commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
> Author: Alex Williamson <alex.williamson@redhat.com>
> Date:   Tue Jan 16 13:28:33 2024 -0700
> 
>     PCI: Fix active state requirement in PME polling
>     
>     The commit noted in fixes added a bogus requirement that runtime PM
>     managed devices need to be in the RPM_ACTIVE state for PME polling.
>     In fact, there is no requirement of a specific runtime PM state, it
>     is only required that the state is stable such that testing config
>     space availability, ie. !D3cold, remains valid across the PME wakeup.
>     
>     To that effect, defer polling of runtime PM managed devices that are
>     not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
>     transitional states remain on the pci_pme_list and will be re-queued.
>     
>     However in allowing polling of devices in the RPM_SUSPENDED state,
>     the bridge state requires further refinement as it's possible to poll
>     while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
>     An asynchronous completion of the bridge transition to a low power
>     state can make config space of the subordinate device become
>     unavailable.  A runtime PM reference to the bridge is therefore added
>     with a supplementary requirement that the bridge is in the RPM_ACTIVE
>     state.
>     
>     Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
>     Reported-by: Sanath S <sanath.s@amd.com>
>     Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
>     Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index bdbf8a94b4d0..31dbf1834b07 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
>  		if (pdev->pme_poll) {
>  			struct pci_dev *bridge = pdev->bus->self;
>  			struct device *dev = &pdev->dev;
> -			int pm_status;
> +			struct device *bdev = bridge ? &bridge->dev : NULL;
>  
>  			/*
> -			 * If bridge is in low power state, the
> -			 * configuration space of subordinate devices
> -			 * may be not accessible
> +			 * If we have a bridge, it should be in an active/D0
> +			 * state or the configuration space of subordinate
> +			 * devices may not be accessible.
>  			 */
> -			if (bridge && bridge->current_state != PCI_D0)
> -				continue;
> +			if (bdev) {
> +				spin_lock_irq(&bdev->power.lock);

With the code as shown here I have one system that seems to be getting
contention when reading the vpd sysfs attribute when the endpoints
(QL41000) are bound to vfio-pci and unused, resulting in the root port
and endpoints being suspended.  A vpd read can take over a minute.
Seems to be resolved changing the above spin_lock to a spin_trylock:

				if (!spin_trylock_irq(&bdev->power.lock))
					continue;

The pm_runtime_barrier() as used in the vpd path can be prone to such
issues, I saw similar in the fix I previously proposed in the bz.

I'll continue to do more testing with this change and hopefully Sanath
can verify this resolves the bug report.  Thanks,

Alex

> +				if (!pm_runtime_active(bdev) ||
> +				    bridge->current_state != PCI_D0) {
> +					spin_unlock_irq(&bdev->power.lock);
> +					continue;
> +				}
> +				pm_runtime_get_noresume(bdev);
> +				spin_unlock_irq(&bdev->power.lock);
> +			}
>  
>  			/*
> -			 * If the device is in a low power state it
> -			 * should not be polled either.
> +			 * The device itself may be either in active or
> +			 * suspended state, but must not be in D3cold so
> +			 * that configuration space is accessible.  The
> +			 * transitional resuming and suspending states are
> +			 * skipped to avoid D3cold races.
>  			 */
> -			pm_status = pm_runtime_get_if_active(dev, true);
> -			if (!pm_status)
> -				continue;
> -
> -			if (pdev->current_state != PCI_D3cold)
> +			spin_lock_irq(&dev->power.lock);
> +			if ((pm_runtime_active(dev) ||
> +			     pm_runtime_suspended(dev)) &&
> +			    pdev->current_state != PCI_D3cold) {
> +				pm_runtime_get_noresume(dev);
> +				spin_unlock_irq(&dev->power.lock);
>  				pci_pme_wakeup(pdev, NULL);
> -
> -			if (pm_status > 0)
>  				pm_runtime_put(dev);
> +			} else {
> +				spin_unlock_irq(&dev->power.lock);
> +			}
> +
> +			if (bdev)
> +				pm_runtime_put(bdev);
>  		} else {
>  			list_del(&pme_dev->list);
>  			kfree(pme_dev);

Sanath S Jan. 23, 2024, 4:48 a.m. UTC | #5

On 1/23/2024 10:16 AM, Alex Williamson wrote:
> On Mon, 22 Jan 2024 15:50:03 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
>> On Mon, 22 Jan 2024 23:17:30 +0100
>> Lukas Wunner <lukas@wunner.de> wrote:
>>
>>> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
>>>> On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote:
>>>>> Testing that a device is not currently in a low power state provides no
>>>>> guarantees that the device is not immenently transitioning to such a state.
>>>>> We need to increment the PM usage counter before accessing the device.
>>>>> Since we don't wish to wake the device for PME polling, do so only if the
>>>>> device is already active by using pm_runtime_get_if_active().
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> ---
>>>>>   drivers/pci/pci.c | 23 ++++++++++++++++-------
>>>>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>>> Resurrecting this patch (currently commit d3fcd7360338) for discussion
>>>> as it's been identified as the source of a regression in:
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=218360
>>>>
>>>> Copying Mika, Lukas, and Rafael as it's related to:
>>>>
>>>> 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
>>>>
>>>> where we skip devices in D3cold when processing the PME list.
>>>>
>>>> I think the issue in the above bz is that the downstream TB3/USB4 port
>>>> is in D3 (presumably D3hot) and I therefore infer the device is in state
>>>> RPM_SUSPENDED.  This commit is attempting to make sure the device power
>>>> state is stable across the call such that it does not transition into
>>>> D3cold while we're accessing it.
>>>>
>>>> To do that I used pm_runtime_get_if_active(), but in retrospect this
>>>> requires the device to be in RPM_ACTIVE so we end up skipping anything
>>>> suspended or transitioning.
>>> How about dropping the calls to pm_runtime_get_if_active() and
>>> pm_runtime_put() and instead simply do:
>>>
>>> 			if (pm_runtime_suspended(&pdev->dev) &&
>>> 			    pdev->current_state != PCI_D3cold)
>>> 				pci_pme_wakeup(pdev, NULL);
>> Hi Lukas,
>>
>> Do we require that the polled device is in the RPM_SUSPENDED state?
>> Also pm_runtime_suspended() can also only be trusted while holding the
>> device power.lock, we need a usage count reference to maintain that
>> state.
>>
>> I'm also seeing cases where the bridge is power state D0, but PM state
>> RPM_SUSPENDING, so config space of the polled device becomes
>> inaccessible even while we're holding a reference once we allow polling
>> in RPM_SUSPENDED.
>>
>> I'm currently working with the below patch, which I believe addresses
>> all these issues, but I'd welcome review and testing. Thanks,
>>
>> Alex
>>
>> commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
>> Author: Alex Williamson <alex.williamson@redhat.com>
>> Date:   Tue Jan 16 13:28:33 2024 -0700
>>
>>      PCI: Fix active state requirement in PME polling
>>      
>>      The commit noted in fixes added a bogus requirement that runtime PM
>>      managed devices need to be in the RPM_ACTIVE state for PME polling.
>>      In fact, there is no requirement of a specific runtime PM state, it
>>      is only required that the state is stable such that testing config
>>      space availability, ie. !D3cold, remains valid across the PME wakeup.
>>      
>>      To that effect, defer polling of runtime PM managed devices that are
>>      not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
>>      transitional states remain on the pci_pme_list and will be re-queued.
>>      
>>      However in allowing polling of devices in the RPM_SUSPENDED state,
>>      the bridge state requires further refinement as it's possible to poll
>>      while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
>>      An asynchronous completion of the bridge transition to a low power
>>      state can make config space of the subordinate device become
>>      unavailable.  A runtime PM reference to the bridge is therefore added
>>      with a supplementary requirement that the bridge is in the RPM_ACTIVE
>>      state.
>>      
>>      Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
>>      Reported-by: Sanath S <sanath.s@amd.com>
>>      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
>>      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index bdbf8a94b4d0..31dbf1834b07 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
>>   		if (pdev->pme_poll) {
>>   			struct pci_dev *bridge = pdev->bus->self;
>>   			struct device *dev = &pdev->dev;
>> -			int pm_status;
>> +			struct device *bdev = bridge ? &bridge->dev : NULL;
>>   
>>   			/*
>> -			 * If bridge is in low power state, the
>> -			 * configuration space of subordinate devices
>> -			 * may be not accessible
>> +			 * If we have a bridge, it should be in an active/D0
>> +			 * state or the configuration space of subordinate
>> +			 * devices may not be accessible.
>>   			 */
>> -			if (bridge && bridge->current_state != PCI_D0)
>> -				continue;
>> +			if (bdev) {
>> +				spin_lock_irq(&bdev->power.lock);
> With the code as shown here I have one system that seems to be getting
> contention when reading the vpd sysfs attribute when the endpoints
> (QL41000) are bound to vfio-pci and unused, resulting in the root port
> and endpoints being suspended.  A vpd read can take over a minute.
> Seems to be resolved changing the above spin_lock to a spin_trylock:
>
> 				if (!spin_trylock_irq(&bdev->power.lock))
> 					continue;
>
> The pm_runtime_barrier() as used in the vpd path can be prone to such
> issues, I saw similar in the fix I previously proposed in the bz.
>
> I'll continue to do more testing with this change and hopefully Sanath
> can verify this resolves the bug report.  Thanks,
>
> Alex
I'll verify it today and let you know the observations.
>> +				if (!pm_runtime_active(bdev) ||
>> +				    bridge->current_state != PCI_D0) {
>> +					spin_unlock_irq(&bdev->power.lock);
>> +					continue;
>> +				}
>> +				pm_runtime_get_noresume(bdev);
>> +				spin_unlock_irq(&bdev->power.lock);
>> +			}
>>   
>>   			/*
>> -			 * If the device is in a low power state it
>> -			 * should not be polled either.
>> +			 * The device itself may be either in active or
>> +			 * suspended state, but must not be in D3cold so
>> +			 * that configuration space is accessible.  The
>> +			 * transitional resuming and suspending states are
>> +			 * skipped to avoid D3cold races.
>>   			 */
>> -			pm_status = pm_runtime_get_if_active(dev, true);
>> -			if (!pm_status)
>> -				continue;
>> -
>> -			if (pdev->current_state != PCI_D3cold)
>> +			spin_lock_irq(&dev->power.lock);
>> +			if ((pm_runtime_active(dev) ||
>> +			     pm_runtime_suspended(dev)) &&
>> +			    pdev->current_state != PCI_D3cold) {
>> +				pm_runtime_get_noresume(dev);
>> +				spin_unlock_irq(&dev->power.lock);
>>   				pci_pme_wakeup(pdev, NULL);
>> -
>> -			if (pm_status > 0)
>>   				pm_runtime_put(dev);
>> +			} else {
>> +				spin_unlock_irq(&dev->power.lock);
>> +			}
>> +
>> +			if (bdev)
>> +				pm_runtime_put(bdev);
>>   		} else {
>>   			list_del(&pme_dev->list);
>>   			kfree(pme_dev);

Lukas Wunner Jan. 23, 2024, 10:45 a.m. UTC | #6

On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote:
> On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > suspended or transitioning.  
> > 
> > How about dropping the calls to pm_runtime_get_if_active() and
> > pm_runtime_put() and instead simply do:
> > 
> > 			if (pm_runtime_suspended(&pdev->dev) &&
> > 			    pdev->current_state != PCI_D3cold)
> > 				pci_pme_wakeup(pdev, NULL);
> 
> Do we require that the polled device is in the RPM_SUSPENDED state?

If the device is RPM_SUSPENDING, why immediately resume it for polling?
It's sufficient to poll it the next time around, i.e. 1 second later.

Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
to poll PME.

This leaves RPM_SUSPENDED as the only state in which it makes sense to
poll.

> Also pm_runtime_suspended() can also only be trusted while holding the
> device power.lock, we need a usage count reference to maintain that
> state.

Why?  Let's say there's a race and the device resumes immediately after
we call pm_runtime_suspended() here.  So we might call pci_pme_wakeup()
gratuitouly.  So what?  No biggie.

> +			if (bdev) {
> +				spin_lock_irq(&bdev->power.lock);

Hm, I'd expect that lock to be internal to the PM core,
although there *are* a few stray users outside of it.

Thanks,

Lukas

Alex Williamson Jan. 23, 2024, 3:55 p.m. UTC | #7

On Tue, 23 Jan 2024 11:45:19 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote:
> > On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote:  
> > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:  
> > > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > > suspended or transitioning.    
> > > 
> > > How about dropping the calls to pm_runtime_get_if_active() and
> > > pm_runtime_put() and instead simply do:
> > > 
> > > 			if (pm_runtime_suspended(&pdev->dev) &&
> > > 			    pdev->current_state != PCI_D3cold)
> > > 				pci_pme_wakeup(pdev, NULL);  
> > 
> > Do we require that the polled device is in the RPM_SUSPENDED state?  
> 
> If the device is RPM_SUSPENDING, why immediately resume it for polling?
> It's sufficient to poll it the next time around, i.e. 1 second later.
> 
> Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> to poll PME.

I'm clearly not an expert on PME, but this is not obvious to me and
before the commit that went in through this thread, PME wakeup was
triggered regardless of the PM state.  I was trying to restore the
behavior of not requiring a specific PM state other than deferring
polling across transition states.

> This leaves RPM_SUSPENDED as the only state in which it makes sense to
> poll.
>
> > Also pm_runtime_suspended() can also only be trusted while holding the
> > device power.lock, we need a usage count reference to maintain that
> > state.  
> 
> Why?  Let's say there's a race and the device resumes immediately after
> we call pm_runtime_suspended() here.  So we might call pci_pme_wakeup()
> gratuitouly.  So what?  No biggie.

The issue I'm trying to address is that config space of the device can
become inaccessible while calling pci_pme_wakeup() on it, causing a
system fault on some hardware.  So a gratuitous pci_pme_wakeup() can be
detrimental.

We require the device config space to remain accessible, therefore the
instantaneous test against D3cold and that the parent bridge is in D0
is not sufficient.  I see traces where the parent bridge is in D0, but
the PM state is RPM_SUSPENDING and the endpoint device transitions to
D3cold while we're executing pci_pme_wakeup().

Therefore at a minimum, I think we need to enforce that the bridge is
in RPM_ACTIVE and remains in that state across pci_pme_wakeup(), which
means we need to hold a usage count reference, and that usage count
reference must be acquired under power.lock in RPM_ACTIVE state to be
effective.

> > +			if (bdev) {
> > +				spin_lock_irq(&bdev->power.lock);  
> 
> Hm, I'd expect that lock to be internal to the PM core,
> although there *are* a few stray users outside of it.

Right, there are.  It's possible that if we only need to hold a
reference on the bridge we can abstract this through
pm_runtime_get_if_active(), the semantics worked better to essentially
open code it in this iteration though.  Thanks,

Alex

Lukas Wunner Jan. 23, 2024, 4:12 p.m. UTC | #8

On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote:
> On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > If the device is RPM_SUSPENDING, why immediately resume it for polling?
> > It's sufficient to poll it the next time around, i.e. 1 second later.
> > 
> > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> > to poll PME.
> 
> I'm clearly not an expert on PME, but this is not obvious to me and
> before the commit that went in through this thread, PME wakeup was
> triggered regardless of the PM state.  I was trying to restore the
> behavior of not requiring a specific PM state other than deferring
> polling across transition states.

There are broken devices which are incapable of signaling PME.
As a workaround, the kernel polls these devices once per second.
The first time the device signals PME, the kernel stops polling
that particular device because PME is clearly working.

So this is just a best-effort way to support PME for broken devices.
If it takes a little longer to detect that PME was signaled, it's not
a big deal.

> The issue I'm trying to address is that config space of the device can
> become inaccessible while calling pci_pme_wakeup() on it, causing a
> system fault on some hardware.  So a gratuitous pci_pme_wakeup() can be
> detrimental.
> 
> We require the device config space to remain accessible, therefore the
> instantaneous test against D3cold and that the parent bridge is in D0
> is not sufficient.  I see traces where the parent bridge is in D0, but
> the PM state is RPM_SUSPENDING and the endpoint device transitions to
> D3cold while we're executing pci_pme_wakeup().

We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent
of a device is in D0 so that the device's config space is accessible.
So you may need to use that in pci_pme_wakeup().

Thanks,

Lukas

Alex Williamson Jan. 23, 2024, 4:47 p.m. UTC | #9

On Tue, 23 Jan 2024 17:12:39 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote:
> > On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote:  
> > > If the device is RPM_SUSPENDING, why immediately resume it for polling?
> > > It's sufficient to poll it the next time around, i.e. 1 second later.
> > > 
> > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> > > to poll PME.  
> > 
> > I'm clearly not an expert on PME, but this is not obvious to me and
> > before the commit that went in through this thread, PME wakeup was
> > triggered regardless of the PM state.  I was trying to restore the
> > behavior of not requiring a specific PM state other than deferring
> > polling across transition states.  
> 
> There are broken devices which are incapable of signaling PME.
> As a workaround, the kernel polls these devices once per second.
> The first time the device signals PME, the kernel stops polling
> that particular device because PME is clearly working.
> 
> So this is just a best-effort way to support PME for broken devices.
> If it takes a little longer to detect that PME was signaled, it's not
> a big deal.
> 
> > The issue I'm trying to address is that config space of the device can
> > become inaccessible while calling pci_pme_wakeup() on it, causing a
> > system fault on some hardware.  So a gratuitous pci_pme_wakeup() can be
> > detrimental.
> > 
> > We require the device config space to remain accessible, therefore the
> > instantaneous test against D3cold and that the parent bridge is in D0
> > is not sufficient.  I see traces where the parent bridge is in D0, but
> > the PM state is RPM_SUSPENDING and the endpoint device transitions to
> > D3cold while we're executing pci_pme_wakeup().  
> 
> We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent
> of a device is in D0 so that the device's config space is accessible.
> So you may need to use that in pci_pme_wakeup().

pci_config_pm_runtime_get() doesn't seem to align with our current
philosophy to defer polling devices that aren't in the correct power
state.  We require the bridge to be in D0, but we defer polling rather
than resume it otherwise.  We also defer device polling if the device
is in D3cold, whereas the above function would resume a device in that
state.  I think our bridge D0 test could be reliable if it were done
holding a reference acquired via pm_runtime_get_if_active().  Thanks,

Alex

[v2,2/2] PCI: Fix runtime PM race with PME polling

Commit Message

Comments

Patch