diff mbox series

[RFC] PCI: avoid SBR for NVIDIA T4

Message ID 388bc353a5f88edb502ec04c0dc53ab62a526020.1680090885.git.wuzongyong@linux.alibaba.com
State New
Headers show
Series [RFC] PCI: avoid SBR for NVIDIA T4 | expand

Commit Message

Wu Zongyong March 29, 2023, 11:58 a.m. UTC
Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
root port.
So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.

Maybe NVIDIA guys can do some detailed explanation abount the SBR
behaviour of GPUs.

Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
---
 drivers/pci/quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Bjorn Helgaas March 29, 2023, 5:05 p.m. UTC | #1
On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> root port.

Blank line between paragraphs.  Rewrap to fill 75 columns if it's a
single paragraph.

Is this only a problem when direct attached to a Root Port?  Why would
that be?  If it's *not* related to being directly under a Root Port,
don't mention that at all.

> So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.
> 
> Maybe NVIDIA guys can do some detailed explanation abount the SBR
> behaviour of GPUs.

This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to
avoid bus reset"), so probably should have a Fixes: tag so it goes
whereever that commit goes.

Also copy the subject line from 4c207e7121fa, e.g.,

  PCI: Mark NVIDIA T4 GPUs to avoid bus reset

Are there any problem reports or bugzilla issues you can include a URL
to?

> Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> ---
>  drivers/pci/quirks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..be86b93b9e38 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
>   */
>  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
>  {
> -	if ((dev->device & 0xffc0) == 0x2340)
> +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
>  		quirk_no_bus_reset(dev);
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> -- 
> 2.34.3
>
Wu Zongyong March 30, 2023, 2:10 a.m. UTC | #2
On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > root port.
> 
> Blank line between paragraphs.  Rewrap to fill 75 columns if it's a
> single paragraph.
Will be fixed.
> 
> Is this only a problem when direct attached to a Root Port?  Why would
> that be?  If it's *not* related to being directly under a Root Port,
> don't mention that at all.
Yes, this problem occurs only when the T4 card is direct attached to a
Root Port.
I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
and it works well.

> 
> > So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.
> > 
> > Maybe NVIDIA guys can do some detailed explanation abount the SBR
> > behaviour of GPUs.
> 
> This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to
> avoid bus reset"), so probably should have a Fixes: tag so it goes
> whereever that commit goes.
> 
> Also copy the subject line from 4c207e7121fa, e.g.,
> 
>   PCI: Mark NVIDIA T4 GPUs to avoid bus reset
Will be fixed too.
> 
> Are there any problem reports or bugzilla issues you can include a URL
> to?
No, I just find the problem in our test environment and I didn't find a
similar report.
> 
> > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com>
> > ---
> >  drivers/pci/quirks.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44cab813bf95..be86b93b9e38 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
> >   */
> >  static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> >  {
> > -	if ((dev->device & 0xffc0) == 0x2340)
> > +	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
> >  		quirk_no_bus_reset(dev);
> >  }
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > -- 
> > 2.34.3
> >
Lukas Wunner March 30, 2023, 2:41 a.m. UTC | #3
On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> root port.
> So avoid to do bus reset,  pci_parent_bus_reset() works nomarlly.

If you cherry-pick commits

  8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3")
  ac91e6980563 ("PCI: Unify delay handling for reset and resume")

from v6.3-rc1, does the issue still persist?

Thanks,

Lukas
Bjorn Helgaas March 30, 2023, 3:49 p.m. UTC | #4
On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > root port.
> > 
> > Is this only a problem when direct attached to a Root Port?  Why would
> > that be?  If it's *not* related to being directly under a Root Port,
> > don't mention that at all.
>
> Yes, this problem occurs only when the T4 card is direct attached to a
> Root Port.
> I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> and it works well.

From an electrical and protocol point of view, the device should not
be able to tell the difference, so Lukas' suggestion that it may be
related to reset delays seems very pertinent.

Bjorn
Wu Zongyong March 31, 2023, 2:11 a.m. UTC | #5
On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > > root port.
> > > 
> > > Is this only a problem when direct attached to a Root Port?  Why would
> > > that be?  If it's *not* related to being directly under a Root Port,
> > > don't mention that at all.
> >
> > Yes, this problem occurs only when the T4 card is direct attached to a
> > Root Port.
> > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> > and it works well.
> 
> From an electrical and protocol point of view, the device should not
> be able to tell the difference, so Lukas' suggestion that it may be
> related to reset delays seems very pertinent.
I will test it with the commits mentioned above.
But it may take some time since it is not easy to replace kernel in our
environment.
> 
> Bjorn
Wu Zongyong April 3, 2023, 4:02 a.m. UTC | #6
On Fri, Mar 31, 2023 at 10:11:15AM +0800, Wu Zongyong wrote:
> On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote:
> > On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote:
> > > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote:
> > > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a
> > > > > root port.
> > > > 
> > > > Is this only a problem when direct attached to a Root Port?  Why would
> > > > that be?  If it's *not* related to being directly under a Root Port,
> > > > don't mention that at all.
> > >
> > > Yes, this problem occurs only when the T4 card is direct attached to a
> > > Root Port.
> > > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge,
> > > and it works well.
> > 
> > From an electrical and protocol point of view, the device should not
> > be able to tell the difference, so Lukas' suggestion that it may be
> > related to reset delays seems very pertinent.
> I will test it with the commits mentioned above.
> But it may take some time since it is not easy to replace kernel in our
> environment.

I have tested it with Lukas' suggestion and it didn't work for T4 cards.

My base kernel is v5.10, and I cherry-picked the following patches:

  730643d33e2d ("PCI/PM: Resume subordinate bus in bus type callbacks")
  8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3")
  ac91e6980563 ("PCI: Unify delay handling for reset and resume")

Any other necessary patches I should apply?

> > 
> > Bjorn
diff mbox series

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..be86b93b9e38 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3618,7 +3618,7 @@  static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
 {
-	if ((dev->device & 0xffc0) == 0x2340)
+	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
 		quirk_no_bus_reset(dev);
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,