[v2] PCI: Fix no-op wait after secondary bus reset

Message ID: 20220518115432.76183-1-windy.bi.enflame@gmail.com
State: New

Commit Message

Sheng Bi May 18, 2022, 11:54 a.m. UTC
pci_bridge_secondary_bus_reset() triggers an SBR followed by a 1 second
sleep, and then uses pci_dev_wait() to wait for the device to become
ready. However, the dev parameter passed to the wait function is
currently the bridge itself, not the device that was reset.

If we call pci_bridge_secondary_bus_reset() to trigger an SBR on a
device, we get the 1 second sleep but no real wait for the device to
become ready, since the bridge itself stays ready while its downstream
devices are being reset. The pci_dev_wait() call here is effectively a
no-op. This is risky when the device takes more than 1 second to become
ready, especially with hotplug enabled: a hotplug event arriving after
the 1 second sleep causes the hotplug module to remove and re-insert
the device.

Instead of waiting for the bridge itself to become ready, wait for all
downstream devices to become ready within PCIE_RESET_READY_POLL_MS
after the SBR, since all downstream devices are affected by the reset.
If any device does not reappear within the timeout, return -ENOTTY to
indicate that the SBR did not complete successfully.

Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
Signed-off-by: Sheng Bi <windy.bi.enflame@gmail.com>
---
 drivers/pci/pci.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)


base-commit: 617c8a1e527fadaaec3ba5bafceae7a922ebef7e

Comments

Alex Williamson May 19, 2022, 5:06 p.m. UTC | #1
On Wed, 18 May 2022 19:54:32 +0800
Sheng Bi <windy.bi.enflame@gmail.com> wrote:

> pci_bridge_secondary_bus_reset() triggers SBR followed by 1 second sleep,
> and then uses pci_dev_wait() for waiting device ready. The dev parameter
> passes to the wait function is currently the bridge itself, but not the
> device been reset.
> 
> If we call pci_bridge_secondary_bus_reset() to trigger SBR to a device,
> there is 1 second sleep but not waiting device ready, since the bridge
> is always ready while resetting downstream devices. pci_dev_wait() here
> is a no-op actually. This would be risky in the case which the device
> becomes ready after more than 1 second, especially while hotplug enabled.
> The late coming hotplug event after 1 second will trigger hotplug module
> to remove/re-insert the device.
> 
> Instead of waiting ready of bridge itself, changing to wait all the
> downstream devices become ready with timeout PCIE_RESET_READY_POLL_MS
> after SBR, considering all downstream devices are affected during SBR.
> Once one of the devices doesn't reappear within the timeout, return
> -ENOTTY to indicate SBR doesn't complete successfully.
> 
> Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
> Signed-off-by: Sheng Bi <windy.bi.enflame@gmail.com>
> ---
>  drivers/pci/pci.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index eb7c0a08ff57..32b7a5c1fa3a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5049,6 +5049,34 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	}
>  }
>  
> +static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
> +{
> +	struct pci_dev *dev;
> +	int delay = 0;
> +
> +	if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
> +		return 0;
> +
> +	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
> +		while (!pci_device_is_present(dev)) {
> +			if (delay > timeout) {
> +				pci_warn(dev, "not ready %dms after secondary bus reset; giving up\n",
> +					delay);
> +				return -ENOTTY;
> +			}
> +
> +			msleep(20);
> +			delay += 20;

Your previous version used the same exponential back-off as used in
pci_dev_wait(), why the change here to poll at 20ms intervals?  Thanks,

Alex

> +		}
> +
> +		if (delay > 1000)
> +			pci_info(dev, "ready %dms after secondary bus reset\n",
> +				delay);
> +	}
> +
> +	return 0;
> +}
> +
>  void pci_reset_secondary_bus(struct pci_dev *dev)
>  {
>  	u16 ctrl;
> @@ -5092,7 +5120,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>  {
>  	pcibios_reset_secondary_bus(dev);
>  
> -	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> +	return pci_bridge_secondary_bus_wait(dev, PCIE_RESET_READY_POLL_MS);
>  }
>  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
>  
> 
> base-commit: 617c8a1e527fadaaec3ba5bafceae7a922ebef7e
Sheng Bi May 20, 2022, 3 a.m. UTC | #2
On Fri, May 20, 2022 at 1:06 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Wed, 18 May 2022 19:54:32 +0800
> Sheng Bi <windy.bi.enflame@gmail.com> wrote:
>
> > pci_bridge_secondary_bus_reset() triggers SBR followed by 1 second sleep,
> > and then uses pci_dev_wait() for waiting device ready. The dev parameter
> > passes to the wait function is currently the bridge itself, but not the
> > device been reset.
> >
> > If we call pci_bridge_secondary_bus_reset() to trigger SBR to a device,
> > there is 1 second sleep but not waiting device ready, since the bridge
> > is always ready while resetting downstream devices. pci_dev_wait() here
> > is a no-op actually. This would be risky in the case which the device
> > becomes ready after more than 1 second, especially while hotplug enabled.
> > The late coming hotplug event after 1 second will trigger hotplug module
> > to remove/re-insert the device.
> >
> > Instead of waiting ready of bridge itself, changing to wait all the
> > downstream devices become ready with timeout PCIE_RESET_READY_POLL_MS
> > after SBR, considering all downstream devices are affected during SBR.
> > Once one of the devices doesn't reappear within the timeout, return
> > -ENOTTY to indicate SBR doesn't complete successfully.
> >
> > Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
> > Signed-off-by: Sheng Bi <windy.bi.enflame@gmail.com>
> > ---
> >  drivers/pci/pci.c | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index eb7c0a08ff57..32b7a5c1fa3a 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5049,6 +5049,34 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
> >       }
> >  }
> >
> > +static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
> > +{
> > +     struct pci_dev *dev;
> > +     int delay = 0;
> > +
> > +     if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
> > +             return 0;
> > +
> > +     list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
> > +             while (!pci_device_is_present(dev)) {
> > +                     if (delay > timeout) {
> > +                             pci_warn(dev, "not ready %dms after secondary bus reset; giving up\n",
> > +                                     delay);
> > +                             return -ENOTTY;
> > +                     }
> > +
> > +                     msleep(20);
> > +                     delay += 20;
>
> Your previous version used the same exponential back-off as used in
> pci_dev_wait(), why the change here to poll at 20ms intervals?  Thanks,
>
> Alex

Many thanks for your time. The change is meant to give a more accurate
timeout, in line with the earlier statement that "we shouldn't incur
any extra delay once timeout has passed". The previous binary
exponential back-off could incur an unexpected extra delay: for
example, a 60,000 ms timeout actually waits up to 65,535 ms, and the
difference can get worse as the timeout setting changes (a rough
illustration of the overshoot follows below). Thanks,

windy
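
For reference, a rough illustration of the overshoot described above,
assuming pci_dev_wait()'s back-off starts at 1 ms, sleeps, and then
doubles until the delay exceeds the timeout (hypothetical helper, not
kernel code):

/* Total time the doubling back-off sleeps before giving up */
static int backoff_total_ms(int timeout)
{
	int delay = 1, slept = 0;

	while (delay <= timeout) {
		slept += delay;		/* stands in for msleep(delay) */
		delay *= 2;
	}

	return slept;	/* backoff_total_ms(60000) == 65535, ~5.5 s past the timeout */
}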

>
> > +             }
> > +
> > +             if (delay > 1000)
> > +                     pci_info(dev, "ready %dms after secondary bus reset\n",
> > +                             delay);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> >  void pci_reset_secondary_bus(struct pci_dev *dev)
> >  {
> >       u16 ctrl;
> > @@ -5092,7 +5120,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
> >  {
> >       pcibios_reset_secondary_bus(dev);
> >
> > -     return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> > +     return pci_bridge_secondary_bus_wait(dev, PCIE_RESET_READY_POLL_MS);
> >  }
> >  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
> >
> >
> > base-commit: 617c8a1e527fadaaec3ba5bafceae7a922ebef7e
>
Lukas Wunner May 20, 2022, 6:41 a.m. UTC | #3
On Wed, May 18, 2022 at 07:54:32PM +0800, Sheng Bi wrote:
> +static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
> +{
> +	struct pci_dev *dev;
> +	int delay = 0;
> +
> +	if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
> +		return 0;
> +
> +	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
> +		while (!pci_device_is_present(dev)) {
> +			if (delay > timeout) {
> +				pci_warn(dev, "not ready %dms after secondary bus reset; giving up\n",
> +					delay);
> +				return -ENOTTY;
> +			}
> +
> +			msleep(20);
> +			delay += 20;
> +		}
> +
> +		if (delay > 1000)
> +			pci_info(dev, "ready %dms after secondary bus reset\n",
> +				delay);
> +	}
> +
> +	return 0;
> +}

An alternative approach you may want to consider is to call
pci_dev_wait() in the list_for_each_entry loop, but instead of
passing it a constant timeout you'd pass the remaining time.

Get the current time before and after each pci_dev_wait() call
from "jiffies", calculate the difference, convert to msecs with
jiffies_to_msecs() and subtract from the "timeout" parameter
passed in by the caller, then simply pass "timeout" to each
pci_dev_wait() call.
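
A minimal sketch of that approach (illustrative only; it assumes
pci_dev_wait() keeps its current (dev, reset_type, timeout) signature
as used elsewhere in pci.c):

static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
{
	struct pci_dev *dev;

	if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
		return 0;

	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
		unsigned long start = jiffies;
		int ret;

		if (timeout <= 0)
			return -ENOTTY;

		ret = pci_dev_wait(dev, "bus reset", timeout);
		if (ret)
			return ret;

		/* Charge the time spent on this device against the shared budget */
		timeout -= jiffies_to_msecs(jiffies - start);
	}

	return 0;
}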

As a side note, traversing the bus list normally requires
holding the pci_bus_sem for reading.  But it's probably unlikely
that devices are added/removed concurrently to a bus reset
and we're doing it wrong pretty much everywhere in the
PCI reset code, so...
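
For illustration, the locked traversal would look roughly like this
(sketch only; pci_bus_sem is the rwsem guarding the bus device lists):

down_read(&pci_bus_sem);
list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
	/* ... wait for dev to reappear ... */
}
up_read(&pci_bus_sem);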

(I fixed up one of the reset functions with 10791141a6cf,
but plenty of others remain...)

Thanks,

Lukas
Sheng Bi May 21, 2022, 8:36 a.m. UTC | #4
On Fri, May 20, 2022 at 2:41 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Wed, May 18, 2022 at 07:54:32PM +0800, Sheng Bi wrote:
> > +static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
> > +{
> > +     struct pci_dev *dev;
> > +     int delay = 0;
> > +
> > +     if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
> > +             return 0;
> > +
> > +     list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
> > +             while (!pci_device_is_present(dev)) {
> > +                     if (delay > timeout) {
> > +                             pci_warn(dev, "not ready %dms after secondary bus reset; giving up\n",
> > +                                     delay);
> > +                             return -ENOTTY;
> > +                     }
> > +
> > +                     msleep(20);
> > +                     delay += 20;
> > +             }
> > +
> > +             if (delay > 1000)
> > +                     pci_info(dev, "ready %dms after secondary bus reset\n",
> > +                             delay);
> > +     }
> > +
> > +     return 0;
> > +}
>
> An alternative approach you may want to consider is to call
> pci_dev_wait() in the list_for_each_entry loop, but instead of
> passing it a constant timeout you'd pass the remaining time.
>
> Get the current time before and after each pci_dev_wait() call
> from "jiffies", calculate the difference, convert to msecs with
> jiffies_to_msecs() and subtract from the "timeout" parameter
> passed in by the caller, then simply pass "timeout" to each
> pci_dev_wait() call.

Thanks for your proposal, which avoids duplicating what pci_dev_wait()
already does.

If we go that way, I would also like to settle the polling question
Alex raised, since pci_dev_wait() is used by reset functions other than
SBR as well. Bjorn, Alex, Lukas: what do you think, should we change
the polling in pci_dev_wait() to fixed 20 ms intervals, or keep the
binary exponential back-off with its possible unexpected extra delay
past the timeout?

>
> As a side note, traversing the bus list normally requires
> holding the pci_bus_sem for reading.  But it's probably unlikely
> that devices are added/removed concurrently to a bus reset
> and we're doing it wrong pretty much everywhere in the
> PCI reset code, so...

Yeah... I think that is why I saw inconsistent locking in that code. I
would prefer a separate thread to work out which of those cases are
real risks.

>
> (I fixed up one of the reset functions with 10791141a6cf,
> but plenty of others remain...)
>
> Thanks,
>
> Lukas
Lukas Wunner May 21, 2022, 12:49 p.m. UTC | #5
On Sat, May 21, 2022 at 04:36:10PM +0800, Sheng Bi wrote:
> If so, I also want to align the polling things mentioned in the
> question from Alex, since pci_dev_wait() is also used for reset
> functions other than SBR. To Bjorn, Alex, Lucas, how do you think if
> we need to change the polling in pci_dev_wait() to 20ms intervals, or
> keep binary exponential back-off with probable unexpected extra
> timeout delay.

The exponential backoff should probably be capped at some point
to avoid excessive wait delays.  I guess the rationale for
exponential backoff is to not poll too frequently.
Capping at 20 msec or 100 msec may be reasonable, i.e.:

-		delay *= 2;
+		delay = min(delay * 2, 100);
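
One way that cap could sit in a pci_dev_wait()-style wait loop (sketch
only; note that with a capped delay the loop needs to track the total
time waited, since the per-iteration delay by itself can no longer
exceed the timeout):

	int delay = 1, waited = 0;

	while (!pci_device_is_present(dev)) {
		if (waited > timeout)
			return -ENOTTY;
		msleep(delay);
		waited += delay;
		delay = min(delay * 2, 100);	/* never back off beyond 100 ms */
	}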

Thanks,

Lukas
Sheng Bi May 21, 2022, 5:37 p.m. UTC | #6
On Sat, May 21, 2022 at 8:49 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Sat, May 21, 2022 at 04:36:10PM +0800, Sheng Bi wrote:
> > If so, I also want to align the polling things mentioned in the
> > question from Alex, since pci_dev_wait() is also used for reset
> > functions other than SBR. To Bjorn, Alex, Lucas, how do you think if
> > we need to change the polling in pci_dev_wait() to 20ms intervals, or
> > keep binary exponential back-off with probable unexpected extra
> > timeout delay.
>
> The exponential backoff should probably be capped at some point
> to avoid excessive wait delays.  I guess the rationale for
> exponential backoff is to not poll too frequently.
> Capping at 20 msec or 100 msec may be reasonable, i.e.:
>
> -               delay *= 2;
> +               delay = min(delay * 2, 100);
>
> Thanks,
>
> Lukas

Capping at 20 or 100 msec seems reasonable to me. Btw, since 20 msec
is not a long time in these scenarios, how about changing to a fixed
20 msec interval? Thanks,

windy
Lukas Wunner May 23, 2022, 2:20 p.m. UTC | #7
On Sun, May 22, 2022 at 01:37:50AM +0800, Sheng Bi wrote:
> On Sat, May 21, 2022 at 8:49 PM Lukas Wunner <lukas@wunner.de> wrote:
> > On Sat, May 21, 2022 at 04:36:10PM +0800, Sheng Bi wrote:
> > > If so, I also want to align the polling things mentioned in the
> > > question from Alex, since pci_dev_wait() is also used for reset
> > > functions other than SBR. To Bjorn, Alex, Lucas, how do you think if
> > > we need to change the polling in pci_dev_wait() to 20ms intervals, or
> > > keep binary exponential back-off with probable unexpected extra
> > > timeout delay.
> >
> > The exponential backoff should probably be capped at some point
> > to avoid excessive wait delays.  I guess the rationale for
> > exponential backoff is to not poll too frequently.
> > Capping at 20 msec or 100 msec may be reasonable, i.e.:
> >
> > -               delay *= 2;
> > +               delay = min(delay * 2, 100);
> 
> Capping at 20 or 100 msec seems reasonable to me. Btw, since 20 msec
> is not a long time in these scenarios, how about changing to a fixed
> 20 msec interval?

The callers of pci_dev_wait() seem to wait for the spec-defined
delay and only call pci_dev_wait() to allow for an additional period
that non-compliant devices may need.  That extra delay can be expected
to be low, which is why it makes sense to start with a short poll interval
and gradually extend it.  So the algorithm seems to be reasonable and
I wouldn't recommend changing it to a constant interval unless that
fixes something which is currently broken.

Thanks,

Lukas
Sheng Bi May 23, 2022, 3:59 p.m. UTC | #8
On Mon, May 23, 2022 at 10:20 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Sun, May 22, 2022 at 01:37:50AM +0800, Sheng Bi wrote:
> > On Sat, May 21, 2022 at 8:49 PM Lukas Wunner <lukas@wunner.de> wrote:
> > > On Sat, May 21, 2022 at 04:36:10PM +0800, Sheng Bi wrote:
> > > > If so, I also want to align the polling things mentioned in the
> > > > question from Alex, since pci_dev_wait() is also used for reset
> > > > functions other than SBR. To Bjorn, Alex, Lucas, how do you think if
> > > > we need to change the polling in pci_dev_wait() to 20ms intervals, or
> > > > keep binary exponential back-off with probable unexpected extra
> > > > timeout delay.
> > >
> > > The exponential backoff should probably be capped at some point
> > > to avoid excessive wait delays.  I guess the rationale for
> > > exponential backoff is to not poll too frequently.
> > > Capping at 20 msec or 100 msec may be reasonable, i.e.:
> > >
> > > -               delay *= 2;
> > > +               delay = min(delay * 2, 100);
> >
> > Capping at 20 or 100 msec seems reasonable to me. Btw, since 20 msec
> > is not a long time in these scenarios, how about changing to a fixed
> > 20 msec interval?
>
> The callers of pci_dev_wait() seem to wait for the spec-defined
> delay and only call pci_dev_wait() to allow for an additional period
> that non-compliant devices may need.  That extra delay can be expected
> to be low, which is why it makes sense to start with a short poll interval
> and gradually extend it.  So the algorithm seems to be reasonable and
> I wouldn't recommend changing it to a constant interval unless that
> fixes something which is currently broken.
>
> Thanks,
>
> Lukas

Thanks Lukas!

From my perspective, nothing is actually broken so far, only a
theoretical extra delay after the timeout has passed. So in this patch
I will keep the exponential backoff in pci_dev_wait() as it is, and
rework pci_bridge_secondary_bus_wait() to use "jiffies" and
pci_dev_wait().

Thanks,

windy

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index eb7c0a08ff57..32b7a5c1fa3a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5049,6 +5049,34 @@  void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	}
 }
 
+static int pci_bridge_secondary_bus_wait(struct pci_dev *bridge, int timeout)
+{
+	struct pci_dev *dev;
+	int delay = 0;
+
+	if (!bridge->subordinate || list_empty(&bridge->subordinate->devices))
+		return 0;
+
+	list_for_each_entry(dev, &bridge->subordinate->devices, bus_list) {
+		while (!pci_device_is_present(dev)) {
+			if (delay > timeout) {
+				pci_warn(dev, "not ready %dms after secondary bus reset; giving up\n",
+					delay);
+				return -ENOTTY;
+			}
+
+			msleep(20);
+			delay += 20;
+		}
+
+		if (delay > 1000)
+			pci_info(dev, "ready %dms after secondary bus reset\n",
+				delay);
+	}
+
+	return 0;
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
@@ -5092,7 +5120,7 @@  int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
 {
 	pcibios_reset_secondary_bus(dev);
 
-	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
+	return pci_bridge_secondary_bus_wait(dev, PCIE_RESET_READY_POLL_MS);
 }
 EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);