PCI/portdrv: Skip enabling AER on external facing ports

Message ID 20220105060643.822111-1-kai.heng.feng@canonical.com
State New

Commit Message

Kai-Heng Feng Jan. 5, 2022, 6:06 a.m. UTC
The Thunderbolt root ports may constantly spew out uncorrected errors
from AER service:
[   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
[   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
[   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
[   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
[   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

The link may not be reliable on external facing ports, so don't enable
AER on those ports.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/portdrv_core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
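
The "error status/mask=00100000/00004000" line in the log above can be read bit by bit. The following is a minimal user-space sketch, not kernel code: the helper names (unc_bit_name, unc_unmasked, unc_dump) are made up for illustration, the table lists only a subset of bits, and the names are the PCIe spec's descriptive ones rather than the short strings the kernel prints.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Bit positions in the AER Uncorrectable Error Status and Mask registers.
 * Illustrative subset only; helper names below are hypothetical.
 */
struct unc_bit {
	unsigned int bit;
	const char *name;
};

static const struct unc_bit unc_bits[] = {
	{  4, "Data Link Protocol Error" },
	{  5, "Surprise Down Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/* Return the descriptive name for one bit, or "Unknown" if not listed. */
static const char *unc_bit_name(unsigned int bit)
{
	size_t i;

	for (i = 0; i < sizeof(unc_bits) / sizeof(unc_bits[0]); i++)
		if (unc_bits[i].bit == bit)
			return unc_bits[i].name;
	return "Unknown";
}

/* Errors that are logged in status and not suppressed by the mask. */
static uint32_t unc_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/* Print every set status bit and note whether the mask suppresses it. */
static void unc_dump(uint32_t status, uint32_t mask)
{
	unsigned int bit;

	for (bit = 0; bit < 32; bit++) {
		if (!(status & (UINT32_C(1) << bit)))
			continue;
		printf("[%2u] %s%s\n", bit, unc_bit_name(bit),
		       (mask & (UINT32_C(1) << bit)) ? " (masked)" : "");
	}
}
```

With the values from the log, status 0x00100000 has only bit 20 (Unsupported Request) set, and mask 0x00004000 only suppresses bit 14 (Completion Timeout), so the error is unmasked and gets reported. This matches the "[20] UnsupReq" line.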

Comments

Bjorn Helgaas Jan. 5, 2022, 8:12 p.m. UTC | #1
On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> The Thunderbolt root ports may constantly spew out uncorrected errors
> from AER service:
> [   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> [   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> [   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> [   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
> [   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> [   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

No timestamps needed here; they don't add to understanding the
problem.

> The link may not be reliable on external facing ports, so don't enable
> AER on those ports.

I'm not sure what you want to accomplish here.  If the errors are
legitimate and the result of some hardware issue like a bad cable, why
should we ignore them?  If they're caused by a software problem, we
should figure that out and fix it.

Does this occur on a specific instance of possibly flaky hardware?

You mention a spew of errors; do you think this is a single error that
we fail to clear correctly?  Or is it really many separate errors?

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/portdrv_core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index bda630889f955..d464d00ade8f2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
>  
>  #ifdef CONFIG_PCIEAER
>  	if (dev->aer_cap && pci_aer_available() &&
> -	    (pcie_ports_native || host->native_aer)) {
> +	    (pcie_ports_native || host->native_aer) &&
> +	    !dev->external_facing) {
>  		services |= PCIE_PORT_SERVICE_AER;
>  
>  		/*
> -- 
> 2.33.1
>
Kai-Heng Feng Jan. 7, 2022, 4:09 a.m. UTC | #2
On Thu, Jan 6, 2022 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jan 05, 2022 at 02:06:41PM +0800, Kai-Heng Feng wrote:
> > The Thunderbolt root ports may constantly spew out uncorrected errors
> > from AER service:
> > [   30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > [   30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > [   30.100256] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > [   30.100262] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > [   30.100267] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 08000052 00000000 00000000
> > [   30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)
> > [   30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
> > [   30.100427] pcieport 0000:00:1d.0: AER: device recovery failed
>
> No timestamps needed here; they don't add to understanding the
> problem.

Got it. Will remove them in the next iteration.

>
> > The link may not be reliable on external facing ports, so don't enable
> > AER on those ports.
>
> I'm not sure what you want to accomplish here.  If the errors are
> legitimate and the result of some hardware issue like a bad cable, why
> should we ignore them?  If they're caused by a software problem, we
> should figure that out and fix it.
>
> Does this occur on a specific instance of possibly flaky hardware?

Only from root ports of Thunderbolt devices.

The error occurs as soon as the root port is runtime suspended to D3cold.

Runtime suspending the AER service resolves the issue. I wonder if
it's the right thing to do here?
D3cold should also mean the PCI link is gone, so disabling AER seems to
be a reasonable approach.

Kai-Heng

>
> You mention a spew of errors; do you think this is a single error that
> we fail to clear correctly?  Or is it really many separate errors?
>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/portdrv_core.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index bda630889f955..d464d00ade8f2 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> >
> >  #ifdef CONFIG_PCIEAER
> >       if (dev->aer_cap && pci_aer_available() &&
> > -         (pcie_ports_native || host->native_aer)) {
> > +         (pcie_ports_native || host->native_aer) &&
> > +         !dev->external_facing) {
> >               services |= PCIE_PORT_SERVICE_AER;
> >
> >               /*
> > --
> > 2.33.1
> >
Mika Westerberg Jan. 21, 2022, 10:55 a.m. UTC | #3
Hi Kai-Heng,

On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> Only from root ports of thunderbolt devices.
> 
> The error occurs as soon as the root port is runtime suspended to D3cold.
> 
> Runtime suspend the AER service can resolve the issue. I wonder if
> it's the right thing to do here?

I think you are right here. It seems that the AER "service driver" is
completely missing PM hooks, probably because it is mostly used in
server-type systems where power management is not a priority.

> D3cold should also mean the PCI link is gone, disabling AER seems to
> be a reasonable approach.

Indeed - I think AER might trigger here because the link goes "down" /
into a low power state while AER is left enabled and the root port enters
D3. Something like the hack below should disable it over low power
transitions:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..64138cf82db8 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
 
+static int aer_suspend(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_disable_rootport(rpc);
+	return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_enable_rootport(rpc);
+	return 0;
+}
+
 static struct pcie_port_service_driver aerdriver = {
 	.name		= "aer",
 	.port_type	= PCIE_ANY_PORT,
@@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
 
 	.probe		= aer_probe,
 	.remove		= aer_remove,
+	.suspend	= aer_suspend,
+	.resume		= aer_resume,
+	.runtime_suspend = aer_suspend,
+	.runtime_resume	= aer_resume,
 };
 
 /**
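
For readers unfamiliar with the port service driver model, the effect of the four new callbacks above can be mimicked in user space. This is only a sketch under stated assumptions: every type and function here (mock_port, mock_service_driver, mock_core_runtime_suspend, and so on) is hypothetical and stands in for the kernel's pcie_device, pcie_port_service_driver and port driver core, which are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a root port; tracks only AER reporting. */
struct mock_port {
	bool aer_reporting;
};

/* Hypothetical stand-in for a port service driver with optional PM hooks. */
struct mock_service_driver {
	const char *name;
	int (*suspend)(struct mock_port *port);
	int (*resume)(struct mock_port *port);
	int (*runtime_suspend)(struct mock_port *port);
	int (*runtime_resume)(struct mock_port *port);
};

/* Mirrors the idea of aer_suspend(): stop error reporting on the port. */
static int mock_aer_suspend(struct mock_port *port)
{
	port->aer_reporting = false;
	return 0;
}

/* Mirrors the idea of aer_resume(): turn error reporting back on. */
static int mock_aer_resume(struct mock_port *port)
{
	port->aer_reporting = true;
	return 0;
}

/* Same pair of functions serves both system sleep and runtime PM. */
static const struct mock_service_driver mock_aer = {
	.name            = "aer",
	.suspend         = mock_aer_suspend,
	.resume          = mock_aer_resume,
	.runtime_suspend = mock_aer_suspend,
	.runtime_resume  = mock_aer_resume,
};

/* A core would call a hook only if the service provides one. */
static int mock_core_runtime_suspend(const struct mock_service_driver *drv,
				     struct mock_port *port)
{
	return drv->runtime_suspend ? drv->runtime_suspend(port) : 0;
}

static int mock_core_runtime_resume(const struct mock_service_driver *drv,
				    struct mock_port *port)
{
	return drv->runtime_resume ? drv->runtime_resume(port) : 0;
}
```

The point of the model is that a service without hooks is silently skipped on suspend, which is the situation the AER service is in before the hack: reporting stays enabled while the port goes to a low power state.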
Kai-Heng Feng Jan. 21, 2022, 12:31 p.m. UTC | #4
Hi Mika,

On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi Kai-Heng,
>
> On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > Only from root ports of thunderbolt devices.
> >
> > The error occurs as soon as the root port is runtime suspended to D3cold.
> >
> > Runtime suspend the AER service can resolve the issue. I wonder if
> > it's the right thing to do here?
>
> I think you are right here. It seems that AER "service driver" is
> completely missing PM hooks. Probably because it is more used in server
> type of systems where power management is not priority.

Here is my previous attempt to suspend AER:
https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

>
> > D3cold should also mean the PCI link is gone, disabling AER seems to
> > be a reasonable approach.
>
> Indeed - I think AER might trigger here because the link does "down" /
> low power state if left enabled while the root port enters D3. Something
> like below hack should disable it over low power transitions:

Ubuntu kernel has been carrying the patch for quite some time:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/unstable/commit/?id=e82f15f1a26273b004054a81ef45937fb1b632e5

>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..64138cf82db8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1432,6 +1432,22 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>         return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +       struct aer_rpc *rpc = get_service_data(dev);
> +
> +       aer_disable_rootport(rpc);
> +       return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +       struct aer_rpc *rpc = get_service_data(dev);
> +
> +       aer_enable_rootport(rpc);
> +       return 0;
> +}
> +
>  static struct pcie_port_service_driver aerdriver = {
>         .name           = "aer",
>         .port_type      = PCIE_ANY_PORT,
> @@ -1439,6 +1455,10 @@ static struct pcie_port_service_driver aerdriver = {
>
>         .probe          = aer_probe,
>         .remove         = aer_remove,
> +       .suspend        = aer_suspend,
> +       .resume         = aer_resume,
> +       .runtime_suspend = aer_suspend,
> +       .runtime_resume = aer_resume,
>  };

This patch is exactly what I tested.

Maybe only suspend/runtime_suspend AER when the target PM state is D3cold?
The PCIe spec doesn't say how to handle AER in Link L2/L3Ready/L3, but I
think it's reasonable to suspend AER when power is lost.

Let me come up with a patch with that idea.

Kai-Heng

>
>  /**
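
The idea of suspending AER only when the target state is D3cold can be sketched as a small user-space model. Everything here is hypothetical and for illustration only: the enum, the struct and the function names are not kernel API, and how a real service driver would learn the target state is left open, as it is in the message above.

```c
#include <stdbool.h>

/* Hypothetical power states, ordered from fully on to power removed. */
enum mock_pm_state {
	MOCK_D0,
	MOCK_D1,
	MOCK_D2,
	MOCK_D3HOT,
	MOCK_D3COLD,
};

/* Hypothetical stand-in for a root port; tracks only AER reporting. */
struct mock_root_port {
	bool aer_reporting;
};

/* Policy from the proposal: only a transition to D3cold disables AER. */
static bool mock_should_disable_aer(enum mock_pm_state target)
{
	return target == MOCK_D3COLD;
}

/* Suspend hook variant that leaves AER alone for shallower states. */
static int mock_aer_suspend_to(struct mock_root_port *port,
			       enum mock_pm_state target)
{
	if (mock_should_disable_aer(target))
		port->aer_reporting = false;
	return 0;
}
```

Compared with disabling AER on every suspend, this keeps error reporting active in states such as D3hot, where the port still has power.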
Mika Westerberg Jan. 21, 2022, 12:44 p.m. UTC | #5
Hi,

On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> Hi Mika,
> 
> On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Hi Kai-Heng,
> >
> > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > Only from root ports of thunderbolt devices.
> > >
> > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > >
> > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > it's the right thing to do here?
> >
> > I think you are right here. It seems that AER "service driver" is
> > completely missing PM hooks. Probably because it is more used in server
> > type of systems where power management is not priority.
> 
> Here is my previous attempt to suspend AER:
> https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/

That's great!

I think we should do the same for runtime PM paths too, though. Will you
take care of that as well? :)
Kai-Heng Feng Jan. 21, 2022, 2:25 p.m. UTC | #6
On Fri, Jan 21, 2022 at 8:44 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi,
>
> On Fri, Jan 21, 2022 at 08:31:27PM +0800, Kai-Heng Feng wrote:
> > Hi Mika,
> >
> > On Fri, Jan 21, 2022 at 6:55 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Hi Kai-Heng,
> > >
> > > On Fri, Jan 07, 2022 at 12:09:57PM +0800, Kai-Heng Feng wrote:
> > > > Only from root ports of thunderbolt devices.
> > > >
> > > > The error occurs as soon as the root port is runtime suspended to D3cold.
> > > >
> > > > Runtime suspend the AER service can resolve the issue. I wonder if
> > > > it's the right thing to do here?
> > >
> > > I think you are right here. It seems that AER "service driver" is
> > > completely missing PM hooks. Probably because it is more used in server
> > > type of systems where power management is not priority.
> >
> > Here is my previous attempt to suspend AER:
> > https://lore.kernel.org/linux-pci/20210127173101.446940-1-kai.heng.feng@canonical.com/
>
> That's great!
>
> I think we should do the same for runtime PM paths too, though. Will you
> take care of that as well? :)

Yes, that's the plan. I hope I can persuade Bjorn this time...

Kai-Heng
Patch

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index bda630889f955..d464d00ade8f2 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -219,7 +219,8 @@ static int get_port_device_capability(struct pci_dev *dev)
 
 #ifdef CONFIG_PCIEAER
 	if (dev->aer_cap && pci_aer_available() &&
-	    (pcie_ports_native || host->native_aer)) {
+	    (pcie_ports_native || host->native_aer) &&
+	    !dev->external_facing) {
 		services |= PCIE_PORT_SERVICE_AER;
 
 		/*
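
To see what the extra !dev->external_facing term changes, the condition can be modeled outside the kernel. This is a sketch only: struct port_model and both functions are hypothetical, and each boolean simply stands in for the kernel expression named in its comment.

```c
#include <stdbool.h>

/* User-space stand-ins for the state the condition reads. */
struct port_model {
	bool aer_cap;         /* dev->aer_cap is non-zero */
	bool aer_available;   /* result of pci_aer_available() */
	bool ports_native;    /* pcie_ports_native */
	bool host_native_aer; /* host->native_aer */
	bool external_facing; /* dev->external_facing */
};

/* Condition before the patch. */
static bool aer_service_before(const struct port_model *p)
{
	return p->aer_cap && p->aer_available &&
	       (p->ports_native || p->host_native_aer);
}

/* Condition after the patch: external facing ports are excluded. */
static bool aer_service_after(const struct port_model *p)
{
	return aer_service_before(p) && !p->external_facing;
}
```

The two functions differ only for ports marked external facing: such a port that previously got the AER service no longer does, and every other port is unaffected.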