PCI/portdrv: Avoid enabling AER on Thunderbolt devices

Message ID 20221226153048.1208359-1-kai.heng.feng@canonical.com
State New
Series PCI/portdrv: Avoid enabling AER on Thunderbolt devices

Commit Message

Kai-Heng Feng Dec. 26, 2022, 3:30 p.m. UTC
We are seeing an igc Ethernet device on a Thunderbolt dock stop working
after S3 resume because of AER errors; sometimes S3 resume even freezes:
pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
pcieport 0000:00:1d.0:    [15] HeaderOF
pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
pcieport 0000:04:01.0:    [20] UnsupReq               (First)
pcieport 0000:04:01.0:    [21] ACSViol
pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)

This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
PTM for all devices during suspend"), but somehow it doesn't work for
this case.

By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
already flipped on by either the Thunderbolt dock firmware or the host
BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.

Windows, however, is not affected by this issue. WinDbg's !pci command
shows that AER is not enabled for devices connected via a Thunderbolt
port, which is why Windows doesn't exhibit the issue.

So turn a blind eye to external Thunderbolt devices, as Windows does, by
disabling AER.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/portdrv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Pali Rohár Dec. 26, 2022, 3:46 p.m. UTC | #1
On Monday 26 December 2022 23:30:31 Kai-Heng Feng wrote:
> We are seeing igc ethernet device on Thunderbolt dock stops working
> after S3 resume because of AER error, or even make S3 resume freeze:

Hello! Is the igc Ethernet device the only one that does not work after
resume? Or do you have more devices to test for this issue?

I'm asking just to find out whether we are dealing with a single device
or with many. If it is just one device, it could be better to disable
AER only for that device instead of for all of them. Error reporting is
a useful feature that may help detect a broken HW unit.

> pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> pcieport 0000:00:1d.0:    [15] HeaderOF
> pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> pcieport 0000:04:01.0:    [21] ACSViol
> pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> 
> This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> PTM for all devices during suspend"), but somehow it doesn't work for
> this case.
> 
> By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> already flipped on by either the Thunderbolt dock firmware or the host
> BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
> 
> Windows is however not affected by this issue, by using WinDbg's !pci
> command, it shows that AER is not enabled for devices connected via
> Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> issue.

Could you try manually enabling AER on Windows (by writing the PCIe
config registers) to check whether Windows triggers this issue too?

> So turn a blind eye on external Thunderbolt devices like Windows does by
> disabling AER.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/portdrv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 2cc2e60bcb396..59d00e20e57bf 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -237,7 +237,8 @@ static int get_port_device_capability(struct pci_dev *dev)
>  	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
>               pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
>  	    dev->aer_cap && pci_aer_available() &&
> -	    (pcie_ports_native || host->native_aer))
> +	    (pcie_ports_native || host->native_aer) &&
> +	    !dev_is_removable(&dev->dev))
>  		services |= PCIE_PORT_SERVICE_AER;
>  #endif
>  
> -- 
> 2.34.1
>
Bjorn Helgaas Dec. 26, 2022, 10:50 p.m. UTC | #2
[+cc David]

Hi Kai-Heng,

Thanks for the report and the debugging!

On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> We are seeing igc ethernet device on Thunderbolt dock stops working
> after S3 resume because of AER error, or even make S3 resume freeze:
> pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> pcieport 0000:00:1d.0:    [15] HeaderOF
> pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000

From a very quick look, I think 34...... ......52 is a PTM message (as
you suggest below).
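
The reading of the logged header can be checked by hand. The following is a minimal sketch, assuming the standard PCIe Message TLP layout (Fmt in DW0 bits 31:29, Type in bits 28:24, Requester ID in DW1 bits 31:16, Message Code in DW1 bits 7:0) and the PTM Request message code 0x52. The struct and function names are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of a PCIe Message TLP header, taken from its first two DWORDs. */
struct tlp_msg {
	unsigned int fmt;   /* DW0 bits 31:29 */
	unsigned int type;  /* DW0 bits 28:24 */
	unsigned int bus;   /* Requester ID bits 15:8 */
	unsigned int dev;   /* Requester ID bits 7:3 */
	unsigned int fn;    /* Requester ID bits 2:0 */
	unsigned int code;  /* Message Code, DW1 bits 7:0 */
};

static struct tlp_msg tlp_decode(uint32_t dw0, uint32_t dw1)
{
	struct tlp_msg m;

	m.fmt  = (dw0 >> 29) & 0x7;
	m.type = (dw0 >> 24) & 0x1f;
	m.bus  = (dw1 >> 24) & 0xff;
	m.dev  = (dw1 >> 19) & 0x1f;
	m.fn   = (dw1 >> 16) & 0x7;
	m.code = dw1 & 0xff;
	return m;
}

/*
 * Fmt 001b (4DW header, no data), Type 10100b (Msg, routed locally),
 * Message Code 0x52 (PTM Request).
 */
static int tlp_is_ptm_request(const struct tlp_msg *m)
{
	return m->fmt == 0x1 && m->type == 0x14 && m->code == 0x52;
}

/* Format a one-line summary into buf; returns snprintf()'s result. */
static int tlp_describe(uint32_t dw0, uint32_t dw1, char *buf, size_t len)
{
	struct tlp_msg m = tlp_decode(dw0, dw1);

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "requester %02x:%02x.%u code 0x%02x%s",
			m.bus, m.dev, m.fn, m.code,
			tlp_is_ptm_request(&m) ? " (PTM Request)" : "");
}
```

Feeding it the two DWORDs from the log line above (34000000 0a000052) gives Fmt 001b, Type 10100b, requester 0a:00.0 and message code 0x52, which is consistent with a PTM Request.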

> pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> pcieport 0000:04:01.0:    [21] ACSViol
> pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> 
> This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> PTM for all devices during suspend"), but somehow it doesn't work for
> this case.
> 
> By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> already flipped on by either the Thunderbolt dock firmware or the host
> BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.

Can you share your debug patch and corresponding dmesg log in the
bugzilla?
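
As a reference for reading such a dump, the value of PCI_PTM_CTRL can be interpreted with a few masks. This is a minimal userspace sketch, not the debug patch requested above. It assumes the PTM Control register layout from the PCIe specification (bit 0 PTM Enable, bit 1 Root Select, bits 15:8 Effective Granularity). The macro and function names are hypothetical, and the example values in the note below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PTM_CTRL_ENABLE       0x00000001u  /* bit 0: PTM Enable */
#define PTM_CTRL_ROOT_SELECT  0x00000002u  /* bit 1: Root Select */
#define PTM_CTRL_GRAN_MASK    0x0000ff00u  /* bits 15:8: Effective Granularity */

static int ptm_enabled(uint32_t ctrl)
{
	return (ctrl & PTM_CTRL_ENABLE) != 0;
}

static int ptm_root_selected(uint32_t ctrl)
{
	return (ctrl & PTM_CTRL_ROOT_SELECT) != 0;
}

static unsigned int ptm_granularity(uint32_t ctrl)
{
	return (ctrl & PTM_CTRL_GRAN_MASK) >> 8;
}

/* Format a one-line summary into buf; returns snprintf()'s result. */
static int ptm_ctrl_format(uint32_t ctrl, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "PTM %s, root select %d, granularity 0x%02x",
			ptm_enabled(ctrl) ? "enabled" : "disabled",
			ptm_root_selected(ctrl), ptm_granularity(ctrl));
}
```

A register value with bit 0 set on resume, before Linux has written to it, would match the observation that something other than the kernel turned PTM on.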

> Windows is however not affected by this issue, by using WinDbg's !pci
> command, it shows that AER is not enabled for devices connected via
> Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> issue.
> 
> So turn a blind eye on external Thunderbolt devices like Windows does by
> disabling AER.

Unless there's something in the PCIe or Thunderbolt spec that says AER
shouldn't be used on external devices, I think we need to figure out
the root cause before disabling AER on all removable devices.

The dmesg in the bugzilla below is from an HP ZBook Fury 16.  Do you
see this on any other platforms?  Do you have any HP BIOS contacts to
ask about this?

It seems like a firmware defect to enable PTM without knowing whether
upstream devices have PTM enabled.

We could leave PTM enabled on upstream devices when suspending, but
that apparently prevents some low-power states.  Adding David since he
worked on that.

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/portdrv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 2cc2e60bcb396..59d00e20e57bf 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -237,7 +237,8 @@ static int get_port_device_capability(struct pci_dev *dev)
>  	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
>               pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
>  	    dev->aer_cap && pci_aer_available() &&
> -	    (pcie_ports_native || host->native_aer))
> +	    (pcie_ports_native || host->native_aer) &&
> +	    !dev_is_removable(&dev->dev))
>  		services |= PCIE_PORT_SERVICE_AER;
>  #endif
>  
> -- 
> 2.34.1
>
Kai-Heng Feng Dec. 29, 2022, 3:45 a.m. UTC | #3
Hi Pali,


On Mon, Dec 26, 2022 at 11:46 PM Pali Rohár <pali@kernel.org> wrote:
>
> On Monday 26 December 2022 23:30:31 Kai-Heng Feng wrote:
> > We are seeing igc ethernet device on Thunderbolt dock stops working
> > after S3 resume because of AER error, or even make S3 resume freeze:
>
> Hello! Is igc ethernet the only device which does not work after resume?

Seems so.
A Thunderbolt NVMe enclosure plugged into the dock doesn't exhibit this
issue. I don't have an eGPU to try.

> Or do you have also more devices to test and check for this issue?
>
> I'm asking it just because to know if we are dealing with one device or
> there are lot of more. Because if it is just one device then it could be
> better to disable AER only for one targeted device instead of all. Error
> reporting is a feature which may help to detect broken HW unit and be useful.
>
> > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > pcieport 0000:00:1d.0:    [15] HeaderOF
> > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > pcieport 0000:04:01.0:    [21] ACSViol
> > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> >
> > This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> > PTM for all devices during suspend"), but somehow it doesn't work for
> > this case.
> >
> > By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> > already flipped on by either the Thunderbolt dock firmware or the host
> > BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
> >
> > Windows is however not affected by this issue, by using WinDbg's !pci
> > command, it shows that AER is not enabled for devices connected via
> > Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> > issue.
>
> Could you try to manually enable AER on Windows (via touching PCIe
> config registers) if Windows can trigger this issue too, or not?

Actually, I misread the output of the WinDbg !pci command; AER is also
enabled under Windows.
The !pci command also shows the same PTM error in the Header Log, and I
can find the AER warnings in the Windows Event Viewer.

I am asking the hardware vendor whether it's possible to fix this on the firmware side.

However, on Windows the ACS is disabled for all downstream ports in
the dock, so unlike Linux there's no ACS violation.
That can be the reason why the igc device continues to work on Windows
despite AER errors.
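
The role of the ACS bit can be read directly from the status/mask pairs in the log. The following is a minimal sketch, assuming that an uncorrectable error is reported only when its bit is set in the status register and clear in the mask register. The bit numbers are the ones printed in the log ("[20] UnsupReq", "[21] ACSViol"). The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as printed in the log: "[20] UnsupReq", "[21] ACSViol". */
enum { AER_UNC_UNSUP_REQ = 20, AER_UNC_ACS_VIOL = 21 };

/* An error is reported only if it is latched in status and not masked. */
static int aer_bit_reported(uint32_t status, uint32_t mask, unsigned int bit)
{
	assert(bit < 32);	/* shifting by 32 or more would be undefined */
	return ((status & ~mask) >> bit) & 1;
}
```

With the values from the log, 0000:04:01.0 (00300000/00000000) reports both UnsupReq and ACSViol, while 0000:00:1d.0 (00100000/00004000) reports only UnsupReq.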

So yes, this patch is wrong. Let me dig this issue a bit more.

Kai-Heng

>
> > So turn a blind eye on external Thunderbolt devices like Windows does by
> > disabling AER.
> >
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> > Cc: Mario Limonciello <mario.limonciello@amd.com>
> > Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/portdrv.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 2cc2e60bcb396..59d00e20e57bf 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -237,7 +237,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> >       if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
> >               pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
> >           dev->aer_cap && pci_aer_available() &&
> > -         (pcie_ports_native || host->native_aer))
> > +         (pcie_ports_native || host->native_aer) &&
> > +         !dev_is_removable(&dev->dev))
> >               services |= PCIE_PORT_SERVICE_AER;
> >  #endif
> >
> > --
> > 2.34.1
> >
Kai-Heng Feng Dec. 29, 2022, 4:26 a.m. UTC | #4
Hi Bjorn,

On Tue, Dec 27, 2022 at 6:50 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc David]
>
> Hi Kai-Heng,
>
> Thanks for the report and the debugging!
>
> On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> > We are seeing igc ethernet device on Thunderbolt dock stops working
> > after S3 resume because of AER error, or even make S3 resume freeze:
> > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > pcieport 0000:00:1d.0:    [15] HeaderOF
> > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
>
> From a very quick look, I think 34...... ......52 is a PTM message (as
> you suggest below).
>
> > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > pcieport 0000:04:01.0:    [21] ACSViol
> > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> >
> > This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> > PTM for all devices during suspend"), but somehow it doesn't work for
> > this case.
> >
> > By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> > already flipped on by either the Thunderbolt dock firmware or the host
> > BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
>
> Can you share your debug patch and corresponding dmesg log in the
> bugzilla?

Actually, Windows has the same PTM issue too, as I mentioned in my reply to
Pali's message.

>
> > Windows is however not affected by this issue, by using WinDbg's !pci
> > command, it shows that AER is not enabled for devices connected via
> > Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> > issue.
> >
> > So turn a blind eye on external Thunderbolt devices like Windows does by
> > disabling AER.
>
> Unless there's something in the PCIe or Thunderbolt spec that says AER
> shouldn't be used on external devices, I think we need to figure out
> the root cause before disabling AER on all removable devices.

You are right.

The most notable difference I can find is that, under Windows, ACS is
disabled for all of the TBT dock's downstream ports.
So the ACS violation probably doesn't happen under Windows.

However, the PTM message is still treated as an Uncorrected error, so
the AER reset still happens on device resume.

I think when the reset happens (i.e. pcie_do_recovery()), the device
resume should be skipped. I'm not sure how to achieve that in a non-racy
way, though.

>
> The dmesg in the bugzilla below is from an HP ZBook Fury 16.  Do you
> see this on any other platforms?  Do you have any HP BIOS contacts to
> ask about this?
>
> It seems like a firmware defect to enable PTM without knowing whether
> upstream devices have PTM enabled.

Yes, just raised the issue to HP.

>
> We could leave PTM enabled on upstream devices when suspending, but
> that apparently prevents some low-power states.  Adding David since he
> worked on that.

Leaving PTM enabled makes the system unable to suspend.

Kai-Heng

>
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> > Cc: Mario Limonciello <mario.limonciello@amd.com>
> > Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/portdrv.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 2cc2e60bcb396..59d00e20e57bf 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -237,7 +237,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> >       if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
> >               pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
> >           dev->aer_cap && pci_aer_available() &&
> > -         (pcie_ports_native || host->native_aer))
> > +         (pcie_ports_native || host->native_aer) &&
> > +         !dev_is_removable(&dev->dev))
> >               services |= PCIE_PORT_SERVICE_AER;
> >  #endif
> >
> > --
> > 2.34.1
> >
Pali Rohár Dec. 29, 2022, 12:02 p.m. UTC | #5
Hello!

On Thursday 29 December 2022 11:45:51 Kai-Heng Feng wrote:
> Hi Pali,
> 
> 
> On Mon, Dec 26, 2022 at 11:46 PM Pali Rohár <pali@kernel.org> wrote:
> >
> > On Monday 26 December 2022 23:30:31 Kai-Heng Feng wrote:
> > > We are seeing igc ethernet device on Thunderbolt dock stops working
> > > after S3 resume because of AER error, or even make S3 resume freeze:
> >
> > Hello! Is igc ethernet the only device which does not work after resume?
> 
> Seems so.
> A Thunderbolt NVMe enclosure plugged to the dock doesn't exhibit this
> issue. I don't have an eGPU to try.

Ok! So at least this test means that the issue does not affect all
devices. That is good to know.

> > Or do you have also more devices to test and check for this issue?
> >
> > I'm asking it just because to know if we are dealing with one device or
> > there are lot of more. Because if it is just one device then it could be
> > better to disable AER only for one targeted device instead of all. Error
> > reporting is a feature which may help to detect broken HW unit and be useful.
> >
> > > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > > pcieport 0000:00:1d.0:    [15] HeaderOF
> > > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> > > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > > pcieport 0000:04:01.0:    [21] ACSViol
> > > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> > >
> > > This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> > > PTM for all devices during suspend"), but somehow it doesn't work for
> > > this case.
> > >
> > > By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> > > already flipped on by either the Thunderbolt dock firmware or the host
> > > BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
> > >
> > > Windows is however not affected by this issue, by using WinDbg's !pci
> > > command, it shows that AER is not enabled for devices connected via
> > > Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> > > issue.
> >
> > Could you try to manually enable AER on Windows (via touching PCIe
> > config registers) if Windows can trigger this issue too, or not?
> 
> Actually I misread the output of WinDbg !pci command, the AER is also
> enabled under Windows.
> !pci command also shows the same PTM error in Header Log. I can also
> find the AER warnings in Windows' Event Viewer.

This is interesting. Maybe Windows can recover from that error?

Anyway, you can also use lspci and setpci on Windows; the latest version of
pciutils has better support for it.

> I am asking hardware vendor to see if it's possible to fix it at firmware side.
> 
> However, on Windows the ACS is disabled for all downstream ports in
> the dock, so unlike Linux there's no ACS violation.
> That can be the reason why the igc device continues to work on Windows
> despite AER errors.

Could you try enabling ACS on Windows? Does igc continue to work, or does
it also crash on Windows?

> So yes, this patch is wrong. Let me dig this issue a bit more.
> 
> Kai-Heng
> 
> >
> > > So turn a blind eye on external Thunderbolt devices like Windows does by
> > > disabling AER.
> > >
> > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> > > Cc: Mario Limonciello <mario.limonciello@amd.com>
> > > Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > >  drivers/pci/pcie/portdrv.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > index 2cc2e60bcb396..59d00e20e57bf 100644
> > > --- a/drivers/pci/pcie/portdrv.c
> > > +++ b/drivers/pci/pcie/portdrv.c
> > > @@ -237,7 +237,8 @@ static int get_port_device_capability(struct pci_dev *dev)
> > >       if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
> > >               pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
> > >           dev->aer_cap && pci_aer_available() &&
> > > -         (pcie_ports_native || host->native_aer))
> > > +         (pcie_ports_native || host->native_aer) &&
> > > +         !dev_is_removable(&dev->dev))
> > >               services |= PCIE_PORT_SERVICE_AER;
> > >  #endif
> > >
> > > --
> > > 2.34.1
> > >
Bjorn Helgaas Dec. 29, 2022, 4:51 p.m. UTC | #6
On Thu, Dec 29, 2022 at 11:45:51AM +0800, Kai-Heng Feng wrote:
> On Mon, Dec 26, 2022 at 11:46 PM Pali Rohár <pali@kernel.org> wrote:
> > On Monday 26 December 2022 23:30:31 Kai-Heng Feng wrote:
> > > We are seeing igc ethernet device on Thunderbolt dock stops working
> > > after S3 resume because of AER error, or even make S3 resume freeze:

> > > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > > pcieport 0000:00:1d.0:    [15] HeaderOF
> > > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> > > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > > pcieport 0000:04:01.0:    [21] ACSViol
> > > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> > >
> > > This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> > > PTM for all devices during suspend"), but somehow it doesn't work for
> > > this case.
> > >
> > > By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> > > already flipped on by either the Thunderbolt dock firmware or the host
> > > BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
> > >
> > > Windows is however not affected by this issue, by using WinDbg's !pci
> > > command, it shows that AER is not enabled for devices connected via
> > > Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> > > issue.
> >
> > Could you try to manually enable AER on Windows (via touching PCIe
> > config registers) if Windows can trigger this issue too, or not?
> 
> Actually I misread the output of WinDbg !pci command, the AER is also
> enabled under Windows.
> !pci command also shows the same PTM error in Header Log. I can also
> find the AER warnings in Windows' Event Viewer.

I suspected a Linux problem (e.g., we messed up disabling/restoring
PTM).  That's why I was asking about your debug patch, to see if we
could find something wrong with Linux.

But if you also see the Unsupported Request errors on Windows, that
makes it more likely that it's a firmware issue.

> I am asking hardware vendor to see if it's possible to fix it at
> firmware side.

I assume PTM was not enabled by firmware at boot-time (you might be
able to confirm this by tweaking early_dump_pci_device() to dump more
space and using "pci=earlydump").  If that's the case, it seems
strange that firmware would enable PTM at resume-time.

Linux *should* be disabling PTM at suspend-time, so firmware should
never see the fact that it had been enabled, so I don't know how it
could conclude that it's safe to enable PTM at resume-time.

Bjorn
Bjorn Helgaas Jan. 17, 2023, 11:14 p.m. UTC | #7
On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> We are seeing igc ethernet device on Thunderbolt dock stops working
> after S3 resume because of AER error, or even make S3 resume freeze:
> pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> pcieport 0000:00:1d.0:    [15] HeaderOF
> pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> pcieport 0000:04:01.0:    [21] ACSViol
> pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)

Is this a regression?  E.g., is this something that started after
f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native") or
something similar?

Bjorn
Kai-Heng Feng Feb. 8, 2023, 1:33 p.m. UTC | #8
Hi Bjorn,

Sorry for the belated response.

On Wed, Jan 18, 2023 at 7:14 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> > We are seeing igc ethernet device on Thunderbolt dock stops working
> > after S3 resume because of AER error, or even make S3 resume freeze:
> > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > pcieport 0000:00:1d.0:    [15] HeaderOF
> > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > pcieport 0000:04:01.0:    [21] ACSViol
> > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
>
> Is this a regression?  E.g., is this something that started after
> f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native") or
> something similar?

Reverting the commit doesn't help, because AER on 0000:00:1d.0 is
already native, so error reporting is already enabled.

Kai-Heng

>
> Bjorn
Bjorn Helgaas Feb. 14, 2023, 12:10 a.m. UTC | #9
On Wed, Feb 08, 2023 at 09:33:18PM +0800, Kai-Heng Feng wrote:
> Hi Bjorn,
> 
> Sorry for the belated response.
> 
> On Wed, Jan 18, 2023 at 7:14 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> > > We are seeing igc ethernet device on Thunderbolt dock stops working
> > > after S3 resume because of AER error, or even make S3 resume freeze:
> > > pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> > > pcieport 0000:00:1d.0:    [15] HeaderOF
> > > pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> > > pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> > > pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> > > pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> > > pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> > > pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> > > pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> > > pcieport 0000:04:01.0:    [21] ACSViol
> > > pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> > > thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> >
> > Is this a regression?  E.g., is this something that started after
> > f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native") or
> > something similar?
> 
> Reverting the commit doesn't help, because AER on 0000:00:1d.0 is
> already native, so error reporting is already enabled.

OK.  Unless I missed it, we don't really have a root cause or a good
reason to disable AER on removable devices.  I don't want to disable
AER indiscriminately.  The fact that we see errors doesn't seem like a
good enough reason.

Bjorn
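
To make the objection concrete, the condition in the posted patch can be modelled outside the kernel. This is only a sketch under stated assumptions: struct port_model and both functions are hypothetical stand-ins, and each boolean replaces the kernel expression named in its comment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the values the real check reads. */
struct port_model {
	bool is_root_port_or_rcec;	/* pci_pcie_type() is ROOT_PORT or RC_EC */
	bool has_aer_cap;		/* dev->aer_cap is non-zero */
	bool aer_available;		/* pci_aer_available() */
	bool ports_native;		/* pcie_ports_native */
	bool host_native_aer;		/* host->native_aer */
	bool removable;			/* dev_is_removable(&dev->dev) */
};

/* Condition for claiming the AER service without the patch. */
static bool aer_service_before(const struct port_model *p)
{
	return p->is_root_port_or_rcec &&
	       p->has_aer_cap && p->aer_available &&
	       (p->ports_native || p->host_native_aer);
}

/* Condition with the patch: the same, minus every removable port. */
static bool aer_service_after(const struct port_model *p)
{
	return aer_service_before(p) && !p->removable;
}
```

In this model the added term depends only on the removable flag, so a removable port loses the AER service whether or not it ever reports an error.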
Bagas Sanjaya May 16, 2023, 2:14 p.m. UTC | #10
On Mon, Dec 26, 2022 at 11:30:31PM +0800, Kai-Heng Feng wrote:
> We are seeing igc ethernet device on Thunderbolt dock stops working
> after S3 resume because of AER error, or even make S3 resume freeze:
> pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00008000/00002000
> pcieport 0000:00:1d.0:    [15] HeaderOF
> pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
> pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
> pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
> pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
> pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
> pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
> pcieport 0000:04:01.0:    [20] UnsupReq               (First)
> pcieport 0000:04:01.0:    [21] ACSViol
> pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
> thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
> 
> This supposedly should be fixed by commit c01163dbd1b8 ("PCI/PM: Always disable
> PTM for all devices during suspend"), but somehow it doesn't work for
> this case.
> 
> By dumping the PCI_PTM_CTRL register on resume, it turns out PTM is
> already flipped on by either the Thunderbolt dock firmware or the host
> BIOS. Writing 0 to PCI_PTM_CTRL yields the same result.
> 
> Windows is however not affected by this issue, by using WinDbg's !pci
> command, it shows that AER is not enabled for devices connected via
> Thunderbolt port, and that's the reason why Windows doesn't exhibit the
> issue.
> 
> So turn a blind eye on external Thunderbolt devices like Windows does by
> disabling AER.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>

Hi,

I noticed a similar regression on Bugzilla [1]. I asked the reporter
to test your patch, and the regression still occurred. See the
Bugzilla entry for full details.

Thanks.

Reported-by: Pengyu Ma <mapengyu@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217446 [1]

Patch

diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 2cc2e60bcb396..59d00e20e57bf 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -237,7 +237,8 @@  static int get_port_device_capability(struct pci_dev *dev)
 	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
              pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
 	    dev->aer_cap && pci_aer_available() &&
-	    (pcie_ports_native || host->native_aer))
+	    (pcie_ports_native || host->native_aer) &&
+	    !dev_is_removable(&dev->dev))
 		services |= PCIE_PORT_SERVICE_AER;
 #endif