Message ID | 388bc353a5f88edb502ec04c0dc53ab62a526020.1680090885.git.wuzongyong@linux.alibaba.com |
---|---|
State | New |
Headers | show |
Series | [RFC] PCI: avoid SBR for NVIDIA T4 | expand |
On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > root port. Blank line between paragraphs. Rewrap to fill 75 columns if it's a single paragraph. Is this only a problem when direct attached to a Root Port? Why would that be? If it's *not* related to being directly under a Root Port, don't mention that at all. > So avoid to do bus reset, pci_parent_bus_reset() works nomarlly. > > Maybe NVIDIA guys can do some detailed explanation abount the SBR > behaviour of GPUs. This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset"), so probably should have a Fixes: tag so it goes whereever that commit goes. Also copy the subject line from 4c207e7121fa, e.g., PCI: Mark NVIDIA T4 GPUs to avoid bus reset Are there any problem reports or bugzilla issues you can include a URL to? > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com> > --- > drivers/pci/quirks.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 44cab813bf95..be86b93b9e38 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev) > */ > static void quirk_nvidia_no_bus_reset(struct pci_dev *dev) > { > - if ((dev->device & 0xffc0) == 0x2340) > + if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8) > quirk_no_bus_reset(dev); > } > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > -- > 2.34.3 >
On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote: > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > > root port. > > Blank line between paragraphs. Rewrap to fill 75 columns if it's a > single paragraph. Will be fixed. > > Is this only a problem when direct attached to a Root Port? Why would > that be? If it's *not* related to being directly under a Root Port, > don't mention that at all. Yes, this problem occurs only when the T4 card is direct attached to a Root Port. I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge, and it works well. > > > So avoid to do bus reset, pci_parent_bus_reset() works nomarlly. > > > > Maybe NVIDIA guys can do some detailed explanation abount the SBR > > behaviour of GPUs. > > This is a follow-on to 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to > avoid bus reset"), so probably should have a Fixes: tag so it goes > whereever that commit goes. > > Also copy the subject line from 4c207e7121fa, e.g., > > PCI: Mark NVIDIA T4 GPUs to avoid bus reset Will be fixed too. > > Are there any problem reports or bugzilla issues you can include a URL > to? No, I just find the problem in our test environment and I didn't find a similar report. > > > Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com> > > --- > > drivers/pci/quirks.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index 44cab813bf95..be86b93b9e38 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev) > > */ > > static void quirk_nvidia_no_bus_reset(struct pci_dev *dev) > > { > > - if ((dev->device & 0xffc0) == 0x2340) > > + if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8) > > quirk_no_bus_reset(dev); > > } > > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > > -- > > 2.34.3 > >
On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > root port. > So avoid to do bus reset, pci_parent_bus_reset() works nomarlly. If you cherry-pick commits 8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3") ac91e6980563 ("PCI: Unify delay handling for reset and resume") from v6.3-rc1, does the issue still persist? Thanks, Lukas
On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote: > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote: > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > > > root port. > > > > Is this only a problem when direct attached to a Root Port? Why would > > that be? If it's *not* related to being directly under a Root Port, > > don't mention that at all. > > Yes, this problem occurs only when the T4 card is direct attached to a > Root Port. > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge, > and it works well. From an electrical and protocol point of view, the device should not be able to tell the difference, so Lukas' suggestion that it may be related to reset delays seems very pertinent. Bjorn
On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote: > On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote: > > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote: > > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > > > > root port. > > > > > > Is this only a problem when direct attached to a Root Port? Why would > > > that be? If it's *not* related to being directly under a Root Port, > > > don't mention that at all. > > > > Yes, this problem occurs only when the T4 card is direct attached to a > > Root Port. > > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge, > > and it works well. > > From an electrical and protocol point of view, the device should not > be able to tell the difference, so Lukas' suggestion that it may be > related to reset delays seems very pertinent. I will test it with the commits mentioned above. But it may take some time since it is not easy to replace kernel in our environment. > > Bjorn
On Fri, Mar 31, 2023 at 10:11:15AM +0800, Wu Zongyong wrote: > On Thu, Mar 30, 2023 at 10:49:26AM -0500, Bjorn Helgaas wrote: > > On Thu, Mar 30, 2023 at 10:10:16AM +0800, Wu Zongyong wrote: > > > On Wed, Mar 29, 2023 at 12:05:15PM -0500, Bjorn Helgaas wrote: > > > > On Wed, Mar 29, 2023 at 07:58:45PM +0800, Wu Zongyong wrote: > > > > > Secondary bus reset will fail if NVIDIA T4 card is direct attached to a > > > > > root port. > > > > > > > > Is this only a problem when direct attached to a Root Port? Why would > > > > that be? If it's *not* related to being directly under a Root Port, > > > > don't mention that at all. > > > > > > Yes, this problem occurs only when the T4 card is direct attached to a > > > Root Port. > > > I have test it with a T4 card attached to a PCIe Switch or a PCI Bridge, > > > and it works well. > > > > From an electrical and protocol point of view, the device should not > > be able to tell the difference, so Lukas' suggestion that it may be > > related to reset delays seems very pertinent. > I will test it with the commits mentioned above. > But it may take some time since it is not easy to replace kernel in our > environment. I have tested it with Lukas' suggestion and it didn't work for T4 cards. My base kernel is v5.10, and I cherry-picked the following patches: 730643d33e2d ("PCI/PM: Resume subordinate bus in bus type callbacks") 8ef0217227b4 ("PCI/PM: Observe reset delay irrespective of bridge_d3") ac91e6980563 ("PCI: Unify delay handling for reset and resume") Any other necessary patches I should apply? > > > > Bjorn
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 44cab813bf95..be86b93b9e38 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3618,7 +3618,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev) */ static void quirk_nvidia_no_bus_reset(struct pci_dev *dev) { - if ((dev->device & 0xffc0) == 0x2340) + if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8) quirk_no_bus_reset(dev); } DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
Secondary bus reset will fail if NVIDIA T4 card is direct attached to a root port. So avoid to do bus reset, pci_parent_bus_reset() works nomarlly. Maybe NVIDIA guys can do some detailed explanation abount the SBR behaviour of GPUs. Signed-off-by: Wu Zongyong <wuzongyong@linux.alibaba.com> --- drivers/pci/quirks.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)