[3/3] PCI/DPC: Await readiness of secondary bus after reset

Message ID	a2ff8481c3f08458dcd2b4028a838730e965c72f.1672511017.git.lukas@wunner.de
State	New
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> Message-Id: <a2ff8481c3f08458dcd2b4028a838730e965c72f.1672511017.git.lukas@wunner.de> In-Reply-To: <cover.1672511016.git.lukas@wunner.de> References: <cover.1672511016.git.lukas@wunner.de> From: Lukas Wunner <lukas@wunner.de> Date: Sat, 31 Dec 2022 19:33:39 +0100 Subject: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset To: Bjorn Helgaas <helgaas@kernel.org>, linux-pci@vger.kernel.org Cc: Keith Busch <kbusch@kernel.org>, Ashok Raj <ashok.raj@intel.com>, Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>, Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>, Mika Westerberg <mika.westerberg@linux.intel.com>, Sheng Bi <windy.bi.enflame@gmail.com>, Stanislav Spassov <stanspas@amazon.de>, Yang Su <yang.su@linux.alibaba.com> Precedence: bulk
Series	PCI reset delay fixes \| expand [0/3] PCI reset delay fixes [1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 [2/3] PCI: Unify delay handling for reset and resume [3/3] PCI/DPC: Await readiness of secondary bus after reset

Lukas Wunner Dec. 31, 2022, 6:33 p.m. UTC

We're calling pci_bridge_wait_for_secondary_bus() after performing a
Secondary Bus Reset, but neglect to do the same after coming out of a
DPC-induced Hot Reset.  As a result, we're not observing the delays
prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
secondary bus before they're ready.  Fix it.

Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
---
 drivers/pci/pci.c      | 3 ---
 drivers/pci/pci.h      | 3 +++
 drivers/pci/pcie/dpc.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

Kuppuswamy Sathyanarayanan Jan. 3, 2023, 7:49 p.m. UTC | #1

HI,

On 12/31/22 10:33 AM, Lukas Wunner wrote:
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.

Please remove "we" usage. Otherwise, looks good.

> 
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/pci.c      | 3 ---
>  drivers/pci/pci.h      | 3 +++
>  drivers/pci/pcie/dpc.c | 4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0b49243a908..19fe0ef0e583 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -167,9 +167,6 @@ static int __init pcie_port_pm_setup(char *str)
>  }
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
>  
> -/* Time to wait after a reset for device to become responsive */
> -#define PCIE_RESET_READY_POLL_MS 60000
> -
>  /**
>   * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
>   * @bus: pointer to PCI bus structure to search
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 40758248dd80..b72fd888379b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -7,6 +7,9 @@
>  /* Number of possible devfns: 0.0 to 1f.7 inclusive */
>  #define MAX_NR_DEVFNS 256
>  
> +/* Time to wait after a reset for device to become responsive */
> +#define PCIE_RESET_READY_POLL_MS 60000
> +
>  #define PCI_FIND_CAP_TTL	48
>  
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index f5ffea17c7f8..a5d7c69b764e 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -170,8 +170,8 @@ pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
>  			      PCI_EXP_DPC_STATUS_TRIGGER);
>  
> -	if (!pcie_wait_for_link(pdev, true)) {
> -		pci_info(pdev, "Data Link Layer Link Active not set in 1000 msec\n");
> +	if (pci_bridge_wait_for_secondary_bus(pdev, "DPC",
> +					      PCIE_RESET_READY_POLL_MS)) {
>  		clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>  		ret = PCI_ERS_RESULT_DISCONNECT;
>  	} else {

Bjorn Helgaas Jan. 12, 2023, 10:35 p.m. UTC | #2

On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.
> 
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>

I assume this patch is the one that makes the difference for the
Intel Ponte Vecchio HPC GPU?  Is there a URL to a problem report, or
at least a sentence or two we can include here to connect the patch
with the problem users may see?  Most people won't know how to
recognize accesses to devices on the secondary bus before they're
ready.

Bjorn

Bjorn Helgaas Jan. 13, 2023, 3:06 a.m. UTC | #3

On Thu, Jan 12, 2023 at 8:39 PM Yang Su <yang.su@linux.alibaba.com> wrote:
>
> Hi Bjorn,
>
> This email is I discussed with Lukas previously, but now I found I forget to CC you.
>
> I also test pci_bridge_wait_for_secondary_bus in NVIDIA GPU T4 which bind vfio
>
> passthrough in VMM, I found the pci_bridge_wait_for_secondary_bus can not wait
>
> the enough time as pci spec requires, the reasons are described as below.

Hi Yang,

When you're discussing Linux patches, please always keep the cc list
so the rest of us can see what's happening.  Also please use plain
ASCII email (not HTML, etc) because the list archives often discard
fancy emails.  Hints:

http://vger.kernel.org/majordomo-info.html
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
https://people.kernel.org/tglx/notes-about-netiquette

I'll wait for you and Lukas to continue this conversation on the mailing list.

Bjorn

> -------- 转发的消息 --------
> 主题: Re: [PATCH 2/3] PCI: Unify delay handling for reset and resume
> 日期: Wed, 4 Jan 2023 12:46:17 +0800
> From: Yang Su <yang.su@linux.alibaba.com>
> 收件人: Lukas Wunner <lukas@wunner.de>
>
>
> Hi Lukas,
>
> When some device such as NVIDIA GPU T4 bind vfio to passthrough in VMM, the vfio will do reset for the divice, but some device
>
> such as NVIDIA GPU T4 only support secondary bus reset (SBR), and then will call pci_reset_secondary_bus.
>
> The call path is below as figure shows,
>
>
> There are there reasons I think your modify for pci_reset_secondary_bus is not ok.
>
> 1. As I describe for you, if the device not bridge call pci_reset_secondary_bus,
>
> the device will enter pci_bridge_wait_for_secondary_bus, but in pci_bridge_wait_for_secondary_bus
>
> the device is not judged as pci bridge, will directly return.
>
> if (!pci_is_bridge(dev) || !dev->bridge_d3) {
>
> above code will do the judge.
>
>
> 2. In pci_bridge_wait_for_secondary_bus, pci_bus_max_d3cold_delay will take count of wrong time delay,
>
> such as NVIDIA GPU T4 is not pci bridge, so the subordinate is none, pci_bus_max_d3cold_delay
>
> set the min_delay is 100, max_delay is 0, here is the bug, after list_for_each_entry() in pci_bus_max_d3cold_delay,
>
> the min_delay will be 0, the max_delay also 0, the pci_bus_max_d3cold_delay return is surely 0.
>
>
> 3. pci_dev_wait must be saved, and cannot be replaced as pci_bridge_wait_for_secondary_bus.
>
> Because the pcie spec requires the device to check CRS, and pci_dev_wait do this thing.
>
> "
>
> Note: Software should use 100 ms wait periods only if software enables CRS Software Visibility.
>
> Otherwise, Completion timeouts, platform timeouts, or lengthy processor instruction stalls may result.
>
> See the Configuration Request Retry Status Implementation Note in Section 2.3.1.
>
> "
>
>
> void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
> {
>     struct pci_dev *child;
>     int delay;
>
>     if (pci_dev_is_disconnected(dev))
>         return;
>
>     if (!pci_is_bridge(dev) || !dev->bridge_d3)
>         return;
>
> ...
>
>     /* Take d3cold_delay requirements into account */
>     delay = pci_bus_max_d3cold_delay(dev->subordinate);
>
> }
>
>
>   void pci_reset_secondary_bus(struct pci_dev *dev)
> @@ -5058,15 +5060,6 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
>
>   ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>   pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> -
> - /*
> - * Trhfa for conventional PCI is 2^25 clock cycles.
> - * Assuming a minimum 33MHz clock this results in a 1s
> - * delay before we can consider subordinate devices to
> - * be re-initialized.  PCIe has some ways to shorten this,
> - * but we don't make use of them yet.
> - */
> - ssleep(1);
>   }
>
>   void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
> @@ -5085,7 +5078,8 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>   {
>   pcibios_reset_secondary_bus(dev);
>
> - return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> + return pci_bridge_wait_for_secondary_bus(dev, "bus reset",
> + PCIE_RESET_READY_POLL_MS);
>   }
>
>
> 在 2023/1/4 01:25, Lukas Wunner 写道:
>
> Hi Yang Su!
>
> Thanks for reaching out.
>
> On Tue, Jan 03, 2023 at 07:45:25PM +0800, Yang Su wrote:
>
> I think your modify in pci_reset_secondary_bus is not ok. I test pci_bridge_secondary_bus_reset used in pci_reset_secondary_bus,
> there will be an error is a device is not pci bridge device in pci_reset_secondary_buscall pci_bridge_secondary_bus_reset,
> pci_bridge_secondary_bus_reset judge the device is not pci bridge device and just return and will do nothing.
> So just use pci_bridge_secondary_bus_reset to replace ssleep(1) is not ok.
>
> Hm, I don't quite understand.
>
> How do you invoke pci_bridge_secondary_bus_reset() with a non-bridge device?
> I followed all the code paths leading to pci_bridge_secondary_bus_reset()
> and it seems to me none of them is calling the function with an endpoint
> device.  What am I missing?
>
> Thanks,
>
> Lukas
>
> 在 2023/1/13 06:35, Bjorn Helgaas 写道:
>
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
>
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.
>
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
>
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?  Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?  Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.
>
> Bjorn

Lukas Wunner Jan. 13, 2023, 9:10 a.m. UTC | #4

On Thu, Jan 12, 2023 at 04:35:33PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> > We're calling pci_bridge_wait_for_secondary_bus() after performing a
> > Secondary Bus Reset, but neglect to do the same after coming out of a
> > DPC-induced Hot Reset.  As a result, we're not observing the delays
> > prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> > secondary bus before they're ready.  Fix it.
> > 
> > Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> 
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?

Right.

> Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?

There's no public problem report.  My understanding is that Ponte Vecchio
was formally launched this Tuesday and mass distribution starts only now:

https://www.tomshardware.com/news/intel-launches-sapphire-rapids-fourth-gen-xeon-cpus-and-ponte-vecchio-max-gpu-series

The idea is to get the issue in the kernel fixed early so that users will
never even see it.

> Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.

With Ponte Vecchio, the GPU is located below a PCIe switch and the
Downstream Port Containment happens at the Root Port.  So the Root
Port needs to wait for the Switch Upstream Port to re-appear.

Because config space is currently restored too early on the Switch
Upstream Port, it remains in D0uninitialized once it comes out of
reset, so all its registers, in particular the bridge windows,
are in power-on reset state.  As a result, anything downstream of it
(including the GPU) remains inaccessible and the user-visible
error messages look like this:

i915 0000:8c:00.0: can't change power state from D3cold to D0 (config space inaccessible)
intel_vsec 0000:8e:00.1: can't change power state from D3cold to D0 (config space inaccessible)

Where intel_vsec is a sibling of the GPU which is used for
telemetry I believe.

I'll be sure to include that additional information in the commit
message when respinning.

Thanks,

Lukas

Lukas Wunner Jan. 13, 2023, 10:18 a.m. UTC | #5

On Thu, Jan 12, 2023 at 09:06:15PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 12, 2023 at 8:39 PM Yang Su <yang.su@linux.alibaba.com> wrote:
> > I also test pci_bridge_wait_for_secondary_bus in NVIDIA GPU T4
> > which bind vfio passthrough in VMM, I found the
> > pci_bridge_wait_for_secondary_bus can not wait the enough time
> > as pci spec requires, the reasons are described as below.
[...]
> I'll wait for you and Lukas to continue this conversation on the
> mailing list.

Yang Su reports that pci_bridge_secondary_bus_reset() is called for
an Nvidia T4 GPU.  That's an endpoint device but the function should
only ever be called for bridges.  It's unclear to me how that can happen.

The call stack looks like this:

vfio_pci_open()
  pci_set_power_state()
  pci_clear_master()
  pci_enable_device()
  pci_try_reset_function()
    pci_dev_trylock()
    pci_dev_save_and_disable()
    pci_dev_specific_reset()
    pcie_has_flr()
    pcie_af_flr()
    pci_read_config_word()
    pci_dev_reset_slot_function()
    pci_parent_bus_reset()
      pci_bridge_secondary_bus_reset()
        pcibios_reset_secondary_bus()
	  pci_reset_secondary_bus()
	    pci_read_config_word()
	    pci_write_config_word()
	    pci_write_config_word()
	    pcie_wait_for_link_delay()
	pci_dev_wait()
    pci_dev_restore()
    pci_cfg_access_unlock()
  pci_save_state()
  pci_store_saved_state()
  vfio_config_init()

Note that vfio_pci_open() was renamed to vfio_pci_open_device() with
commit 2cd8b14aaa66, which went into v5.15.  So apparently this call
stack is from an earlier kernel.

The GPU is located below a PEX9797 switch, which is connected to a
SkyLake-E Root Port.  So pci_bridge_secondary_bus_reset() should be
called for the Switch Downstream Port but Yang Su says it's called
for the GPU endpoint device.

pci_parent_bus_reset() finds the parent by following dev->bus->self.
I've suggested to Yang Su to replace that with pci_upstream_bridge(dev)
and see if it fixes the issue.  It does make a difference in SR-IOV
scenarios (see the comment in include/linux/pci.h above pci_is_root_bus())
though Yang Su reports that SR-IOV is not used on this machine,
only vfio passthrough.

I believe that endpoint devices don't have a Bridge Control Register
(Type 0 Config Space has Max_Lat / Min_Gnt instead), so setting the
Secondary Bus Reset bit should have no effect.  But apparently it
does have an effect because Yang Su is witnessing issues with the delay
after reset.

Specifically, Yang Su says that pci_bridge_wait_for_secondary_bus()
bails out because the GPU endpoint device fails the !pci_is_bridge()
check at the top of the function.  Also, the calculation of
pci_bus_max_d3cold_delay() fails because the GPU lacks a subordinate bus.
Apparently the unconditional 1 second delay in pci_reset_secondary_bus()
papers over the issue because it waits long enough for the GPU endpoint
device to come out of reset.

Maybe the information that pci_bridge_secondary_bus_reset() is executed
for an endpoint device is not correct and it's in fact executed for
the Downstream Port of the PEX9797 switch, but perhaps config space
of the port is broken and contains an incorrect Header Type field,
which would cause pci_is_bridge() to return false for it.  Though we
wouldn't scan the secondary bus below the switch in that case,
so the GPU wouldn't be enumerated.  I've requested "lspci -vvv" output
for the GPU and the Downstream Port off-list but haven't received it yet.

Thanks,

Lukas

[3/3] PCI/DPC: Await readiness of secondary bus after reset

Commit Message

Comments

Patch