PCI: Extend D3hot delay for NVIDIA HDA controllers

Message ID 20230413194042.605768-1-alex.williamson@redhat.com
State New
Series PCI: Extend D3hot delay for NVIDIA HDA controllers

Commit Message

Alex Williamson April 13, 2023, 7:40 p.m. UTC
Assignment of NVIDIA Ampere-based GPUs has seen a regression since the
below referenced commit, where the reduced D3hot transition delay appears
to introduce a small window where a D3hot->D0 transition followed by a bus
reset can wedge the device.  The entire device is subsequently unavailable,
returning -1 on config space reads, and is unrecoverable without a host reset.

This has been observed with RTX A2000 and A5000 GPU and audio functions
assigned to a Windows VM, where shutdown of the VM places the devices in
D3hot prior to vfio-pci performing a bus reset when userspace releases the
devices.  The issue has roughly a 2-3% chance of occurring per shutdown.

Restoring the HDA controller d3hot_delay to the effective value before the
below commit has been shown to resolve the issue.  NVIDIA confirms this
change should be safe for all of their HDA controllers.

Cc: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Reviewed-by: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Unfortunately Tarun's reply with confirmation doesn't show up on lore,
possibly due to html email, or else I'd provide that as a Link:.

 drivers/pci/quirks.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
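
The "returning -1 on config space read" symptom in the commit message can be
checked from userspace. The sketch below assumes the usual PCI behavior that a
read which gets no response completes as all ones (0xffffffff, i.e. -1), and it
reads the first config dword through the sysfs "config" file of a device. The
helper names are hypothetical and are not part of this patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * A config read that the device does not answer completes as all ones.
 * The first dword holds the vendor and device IDs, and 0xffff is not a
 * valid vendor ID, so all ones here means the function is not responding.
 */
static bool config_read_failed(uint32_t dword0)
{
	return dword0 == 0xffffffffu;
}

/*
 * Hypothetical helper: read the first config dword from a path such as
 * /sys/bus/pci/devices/<BDF>/config.  Returns 0 on success, -1 on error.
 */
static int read_config_dword0(const char *path, uint32_t *val)
{
	uint8_t b[4];
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	if (fread(b, 1, sizeof(b), f) != sizeof(b)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* PCI config space is little-endian. */
	*val = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
	return 0;
}
```

Calling read_config_dword0() on both the GPU and the audio function after the
VM shuts down, then passing the result to config_read_failed(), would show
whether the whole device has dropped off the bus.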

Comments

Bjorn Helgaas April 17, 2023, 9:14 p.m. UTC | #1
[+cc Mika, Sathy, Lukas since they've been looking at similar delays]

On Thu, Apr 13, 2023 at 01:40:42PM -0600, Alex Williamson wrote:
> Assignment of NVIDIA Ampere-based GPUs has seen a regression since the
> below referenced commit, where the reduced D3hot transition delay appears
> to introduce a small window where a D3hot->D0 transition followed by a bus
> reset can wedge the device.  The entire device is subsequently unavailable,
> returning -1 on config space reads, and is unrecoverable without a host reset.
> 
> This has been observed with RTX A2000 and A5000 GPU and audio functions
> assigned to a Windows VM, where shutdown of the VM places the devices in
> D3hot prior to vfio-pci performing a bus reset when userspace releases the
> devices.  The issue has roughly a 2-3% chance of occurring per shutdown.
> 
> Restoring the HDA controller d3hot_delay to the effective value before the
> below commit has been shown to resolve the issue.  NVIDIA confirms this
> change should be safe for all of their HDA controllers.
> 
> Cc: Abhishek Sahu <abhsahu@nvidia.com>
> Cc: Tarun Gupta <targupta@nvidia.com>
> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> Reported-by: Zhiyi Guo <zhguo@redhat.com>
> Reviewed-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Applied to pci/reset for v6.4, thanks, Alex!

I guess there's no real risk here since we're waiting *longer*.  It
only makes NVIDIA GPU resets take longer.

Mika has some patches in flight that increase delays generically in
some cases, but I think that applies to D3cold -> D0 transitions,
which I don't *think* you're doing here.

> ---
> 
> Unfortunately Tarun's reply with confirmation doesn't show up on lore,
> possibly due to html email, or else I'd provide that as a Link:.
> 
>  drivers/pci/quirks.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..f4e2a88729fd 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
>  
> +/*
> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> + * reset is performed too soon after transition to D0.  Extend d3hot_delay
> + * to the previous effective default for all NVIDIA HDA controllers.
> + */
> +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> +{
> +	quirk_d3hot_delay(dev, 20);
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> +			      quirk_nvidia_hda_pm);
> +
>  /*
>   * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
>   * https://bugzilla.kernel.org/show_bug.cgi?id=205587
> -- 
> 2.39.2
>
Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..f4e2a88729fd 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1939,6 +1939,19 @@  static void quirk_radeon_pm(struct pci_dev *dev)
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
 
+/*
+ * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
+ * reset is performed too soon after transition to D0.  Extend d3hot_delay
+ * to the previous effective default for all NVIDIA HDA controllers.
+ */
+static void quirk_nvidia_hda_pm(struct pci_dev *dev)
+{
+	quirk_d3hot_delay(dev, 20);
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
+			      quirk_nvidia_hda_pm);
+
 /*
  * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
  * https://bugzilla.kernel.org/show_bug.cgi?id=205587
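
To illustrate what the quirk in the patch does, here is a minimal userspace
model, not kernel code. It assumes that class fixups compare the fixup class
against the device's 24-bit class code shifted right by class_shift (so a shift
of 8 ignores the programming interface byte), and that quirk_d3hot_delay() only
ever raises the delay. The struct and function names are hypothetical, as are
the device ID and the 10 ms starting delay in the usage note.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VENDOR_NVIDIA  0x10de       /* value of PCI_VENDOR_ID_NVIDIA */
#define MODEL_ANY_ID         0xffffffffu  /* stand-in for PCI_ANY_ID */
#define MODEL_CLASS_HD_AUDIO 0x0403       /* PCI_CLASS_MULTIMEDIA_HD_AUDIO */

/* Hypothetical, reduced stand-in for struct pci_dev. */
struct model_dev {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;       /* 24 bits: class, subclass, prog-if */
	unsigned int d3hot_delay;  /* milliseconds */
};

/* Hypothetical stand-in for one DECLARE_PCI_FIXUP_CLASS_FINAL() entry. */
struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;
	unsigned int class_shift;
};

/* Assumed matching rule: each field matches exactly or is a wildcard. */
static bool fixup_matches(const struct model_fixup *f,
			  const struct model_dev *dev)
{
	return (f->class_code == (dev->class_code >> f->class_shift) ||
		f->class_code == MODEL_ANY_ID) &&
	       (f->vendor == dev->vendor ||
		f->vendor == (uint16_t)MODEL_ANY_ID) &&
	       (f->device == dev->device ||
		f->device == (uint16_t)MODEL_ANY_ID);
}

/* Models quirk_d3hot_delay(): raise the delay, never lower it. */
static void extend_d3hot_delay(struct model_dev *dev, unsigned int delay)
{
	if (dev->d3hot_delay < delay)
		dev->d3hot_delay = delay;
}

/* Models the new quirk: any NVIDIA HD audio function gets 20 ms. */
static void apply_nvidia_hda_quirk(struct model_dev *dev)
{
	static const struct model_fixup f = {
		MODEL_VENDOR_NVIDIA, (uint16_t)MODEL_ANY_ID,
		MODEL_CLASS_HD_AUDIO, 8,
	};

	if (fixup_matches(&f, dev))
		extend_d3hot_delay(dev, 20);
}
```

In this model, an NVIDIA function with class code 0x040300 that starts with a
10 ms delay ends up at 20 ms, while the GPU function itself (a display class
code) and HD audio functions from other vendors are left unchanged.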