[v6,0/7] pci: Work around ASMedia ASM2824 PCIe link training failures

Message ID	alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> Date: Sun, 5 Feb 2023 15:48:54 +0000 (GMT) From: "Maciej W. Rozycki" <macro@orcam.me.uk> To: Bjorn Helgaas <bhelgaas@google.com>, Mahesh J Salgaonkar <mahesh@linux.ibm.com>, Oliver O'Halloran <oohall@gmail.com>, Michael Ellerman <mpe@ellerman.id.au>, Nicholas Piggin <npiggin@gmail.com>, Christophe Leroy <christophe.leroy@csgroup.eu>, Saeed Mahameed <saeedm@nvidia.com>, Leon Romanovsky <leon@kernel.org>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com> cc: Alex Williamson <alex.williamson@redhat.com>, Lukas Wunner <lukas@wunner.de>, Mika Westerberg <mika.westerberg@linux.intel.com>, Stefan Roese <sr@denx.de>, Jim Wilson <wilson@tuliptree.org>, David Abdurachmanov <david.abdurachmanov@gmail.com>, =?utf-8?q?Pali_Roh?= =?utf-8?q?=C3=A1r?= <pali@kernel.org>, linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-rdma@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v6 0/7] pci: Work around ASMedia ASM2824 PCIe link training failures Message-ID: <alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk> User-Agent: Alpine 2.21 (DEB 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	pci: Work around ASMedia ASM2824 PCIe link training failures \| expand [v6,0/7] pci: Work around ASMedia ASM2824 PCIe link training failures [v6,1/7] PCI: Export PCI link retrain timeout [v6,2/7] PCI: Execute `quirk_enable_clear_retrain_link' earlier [v6,3/7] PCI: Initialize `link_active_reporting' earlier [v6,4/7] powerpc/eeh: Rely on `link_active_reporting' [v6,5/7] net/mlx5: Rely on `link_active_reporting' [v6,6/7] PCI: pciehp: Rely on `link_active_reporting' [v6,7/7] PCI: Work around PCIe link training failures

Maciej W. Rozycki Feb. 5, 2023, 3:48 p.m. UTC

Hi,

This is v6 of the change to work around a PCIe link training phenomenon
where a pair of devices both capable of operating at a link speed above
2.5GT/s seems unable to negotiate the link speed and continues training
indefinitely with the Link Training bit switching on and off repeatedly
and the data link layer never reaching the active state.

Following Bjorn's suggestion from the previous iteration:
<https://lore.kernel.org/lkml/20221109050418.GA529724@bhelgaas/> I have
moved the workaround into the PCI core. I have kept the part specific to
ASMedia (to lift the speed restriction after a successful retrain) within,
although I find it a good candidate for a standalone quirk. It seems to
me we'd have to add additional classes of fixups however to move this part
to drivers/pci/quirks.c, which I think would be an overkill. So I've only
made it explicitly guarded by CONFIG_PCI_QUIRKS; I can see there's prior
art with this approach.

In the course of the update I have realised that commit 6b2f1351af56
("PCI: Wait for device to become ready after secondary bus reset") makes
no sense and was about to figure out what to do here about it, but then
found Lukas's recent patch series addressing this issue (thanks, Lukas,
you made my life easier!), so I have rebased my patch set on top of
Lukas's:
<https://lore.kernel.org/all/da77c92796b99ec568bd070cbe4725074a117038.1673769517.git.lukas@wunner.de/>.

This has resulted in mild ugliness in that `pcie_downstream_link_retrain'
may be called from `pci_bridge_wait_for_secondary_bus' twice, first time
via `pcie_wait_for_link_delay' and second time via `pci_dev_wait'. This
second call to `pcie_downstream_link_retrain' will do nothing, because for
`link_active_reporting' devices `pcie_wait_for_link_delay' will have
ensured the link has gone up or the second call won't have been reached.

I have also decided to move the initialisation of `link_active_reporting'
earlier on, so as to have a single way to check for the feature. This has
brought an extra patch and its 3 clean-up dependencies into the series.

This was originally observed in a configuration featuring a downstream
port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the
Pericom PI7C9X2G304 Gen 2 switch. However in the course of review I have
come to the conclusion that similarly to the earlier similar change to
U-Boot it is indeed expected to be safe to apply this workaround to any
downstream port that has failed link negotiation provided that:

1. the port is capable of reporting the data link layer link active
status (because unlike U-Boot we cannot busy-loop continuously polling
the link training bit),

and:

2. we don't attempt to lift the 2.5GT/s speed restriction, imposed as the
basis of the workaround, for devices not explicitly known to continue
working in that case.

It is expected to be safe because the workaround is applied to a failed
link, that is one that does not (at the time this code is executed) work
anyway, so trying to bring it up cannot make the situation worse.

This has been verified with a SiFive HiFive unmatched board, with and
without CONFIG_PCI_QUIRKS enabled, booting with or without the workaround
activated in U-Boot, which covered both the link retraining part of the
quirk and the lifting of speed restriction already imposed by U-Boot.

I have also issued resets via sysfs to see how this change behaves. For
the problematic link this required a hack to remove a `dev->subordinate'
check from `pci_parent_bus_reset', which in turn triggered the workaround
as expected and brought the link up (but otherwise clobbered downstream
devices as one would inevitably expect).

I have no way to verify these patches with power management or hot-plug
events, but owing to Lukas's effort they get into the same infrastructure,
so I expect the workaround to do its job as expected. I note that there
is an extra call to `pcie_wait_for_link' from `pciehp_check_link_status',
but I expect it to work too. For `link_active_reporting' devices it will
call `pcie_downstream_link_retrain' and for`!link_active_reporting' ones
we have no means to do anything anyway.

The 3 extra clean-ups were only compile-tested (with PowerPC and x86-64
configurations, as appropriate), because I have no suitable hardware
available.

Please see individual change descriptions for further details.

Let me know if this is going in the right direction.

Maciej

Maciej W. Rozycki Feb. 19, 2023, 6:52 p.m. UTC | #1

On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:

>  This is v6 of the change to work around a PCIe link training phenomenon 
> where a pair of devices both capable of operating at a link speed above 
> 2.5GT/s seems unable to negotiate the link speed and continues training 
> indefinitely with the Link Training bit switching on and off repeatedly 
> and the data link layer never reaching the active state.

 Ping for: 
<https://lore.kernel.org/linux-pci/alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk/>.

  Maciej

Lukas Wunner Feb. 19, 2023, 7:46 p.m. UTC | #2

[+cc Philipp]

On Sun, Feb 19, 2023 at 06:52:21PM +0000, Maciej W. Rozycki wrote:
> On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:
> >  This is v6 of the change to work around a PCIe link training phenomenon 
> > where a pair of devices both capable of operating at a link speed above 
> > 2.5GT/s seems unable to negotiate the link speed and continues training 
> > indefinitely with the Link Training bit switching on and off repeatedly 
> > and the data link layer never reaching the active state.
> 
>  Ping for: 
> <https://lore.kernel.org/linux-pci/alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk/>.

Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
switch below a Broadcom STB host controller:  On some rare occasions,
when booting the system the link trains correctly at 5 GT/s and the
switch is accessible without any issues.  But most of the time,
the switch is inaccessible on boot.  The Broadcom STB host controller
claims not to support Link Active Reporting, but in reality has a
link status indicator in a custom register.  It indicates that the
link is up, the link trains to 2.5 GT/s but the switch is inaccessible.
Due to a quirk of the Broadcom STB host controller, ECAM access to
the inaccessible switch raises an unhandled CPU exception and thus
causes a kernel panic, making the issue difficult to debug.

The switch works fine 100% when plugged into a TI Sitara AM64 board
(contains a DesignWare-derived PCIe host controller).

The switch is the same as yours, only with 4 instead of 3 ports.
Perhaps the issue you're seeing isn't specific to the ASMedia switch,
but is due to an oddity of the Pericom switch, that happens to be
triggered by the Broadcom STB host controller as well?

I've cooked up a modified version of patch 7 in your series which
performs the link retraining in the pci-brcmstb.c driver before
performing the first access to the switch.  Unfortunately it
didn't result in any kind of improvement.  Next step is to hook up
a Teledyne T28 analyzer to see what's going on.

Thanks,

Lukas

Pali Rohár Feb. 21, 2023, 9:46 p.m. UTC | #3

Hello!

On Sunday 19 February 2023 20:46:19 Lukas Wunner wrote:
> [+cc Philipp]
> 
> On Sun, Feb 19, 2023 at 06:52:21PM +0000, Maciej W. Rozycki wrote:
> > On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:
> > >  This is v6 of the change to work around a PCIe link training phenomenon 
> > > where a pair of devices both capable of operating at a link speed above 
> > > 2.5GT/s seems unable to negotiate the link speed and continues training 
> > > indefinitely with the Link Training bit switching on and off repeatedly 
> > > and the data link layer never reaching the active state.
> > 
> >  Ping for: 
> > <https://lore.kernel.org/linux-pci/alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk/>.
> 
> Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
> switch below a Broadcom STB host controller:  On some rare occasions,
> when booting the system the link trains correctly at 5 GT/s and the
> switch is accessible without any issues.  But most of the time,
> the switch is inaccessible on boot.  The Broadcom STB host controller
> claims not to support Link Active Reporting, but in reality has a
> link status indicator in a custom register.  It indicates that the
> link is up, the link trains to 2.5 GT/s but the switch is inaccessible.

This is interesting. Do you know which layer it indicates that is up?
I can image that PCIe physical layer or data link layer is up but
PCIe transaction layer not yet up and so sending config requests fail.
Existence of custom register may explain that it indicates different
"link up" meaning.

> Due to a quirk of the Broadcom STB host controller, ECAM access to
> the inaccessible switch raises an unhandled CPU exception and thus
> causes a kernel panic, making the issue difficult to debug.

Is this ARM Cortex A53 core and unhandled exception is asynchronous one
with syndrome 0xbf000002?

> The switch works fine 100% when plugged into a TI Sitara AM64 board
> (contains a DesignWare-derived PCIe host controller).

It is really DesignWare? I had an impression that TI uses PCIe IPs from
Cadence, not from DesignWare. And Cadence controllers behave in some
cases different from Designware controllers.

> The switch is the same as yours, only with 4 instead of 3 ports.
> Perhaps the issue you're seeing isn't specific to the ASMedia switch,
> but is due to an oddity of the Pericom switch, that happens to be
> triggered by the Broadcom STB host controller as well?
> 
> I've cooked up a modified version of patch 7 in your series which
> performs the link retraining in the pci-brcmstb.c driver before
> performing the first access to the switch.  Unfortunately it
> didn't result in any kind of improvement.  Next step is to hook up
> a Teledyne T28 analyzer to see what's going on.

Can you use Teledyne T28 for debugging this issue? Because this is
something which can finally show what is happing there.

I would really like to see what switch and controller are sending that
they can result in such buggy and incredible state.

> Thanks,
> 
> Lukas

Lukas Wunner Feb. 22, 2023, 8:40 a.m. UTC | #4

On Tue, Feb 21, 2023 at 10:46:11PM +0100, Pali Rohár wrote:
> On Sunday 19 February 2023 20:46:19 Lukas Wunner wrote:
> > > On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:
> > > >  This is v6 of the change to work around a PCIe link training phenomenon
> > > > where a pair of devices both capable of operating at a link speed above
> > > > 2.5GT/s seems unable to negotiate the link speed and continues training
> > > > indefinitely with the Link Training bit switching on and off repeatedly
> > > > and the data link layer never reaching the active state.
> > 
> > Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
> > switch below a Broadcom STB host controller:  On some rare occasions,
> > when booting the system the link trains correctly at 5 GT/s and the
> > switch is accessible without any issues.  But most of the time,
> > the switch is inaccessible on boot.  The Broadcom STB host controller
> > claims not to support Link Active Reporting, but in reality has a
> > link status indicator in a custom register.  It indicates that the
> > link is up, the link trains to 2.5 GT/s but the switch is inaccessible.
> 
> This is interesting. Do you know which layer it indicates that is up?
> I can image that PCIe physical layer or data link layer is up but
> PCIe transaction layer not yet up and so sending config requests fail.
> Existence of custom register may explain that it indicates different
> "link up" meaning.

drivers/pci/controller/pcie-brcmstb.c defines the following bits:

#define  PCIE_MISC_PCIE_STATUS_PCIE_PORT_MASK           0x80
#define  PCIE_MISC_PCIE_STATUS_PCIE_DL_ACTIVE_MASK      0x20
#define  PCIE_MISC_PCIE_STATUS_PCIE_PHYLINKUP_MASK      0x10
#define  PCIE_MISC_PCIE_STATUS_PCIE_LINK_IN_L23_MASK    0x40

And brcm_pcie_link_up() checks that both DL_ACTIVE and PHYLINKUP are set.

A public spec for the Broadcom STB PCIe controller does not seem to exist,
so I do not know what the register bits mean exactly.


> > Due to a quirk of the Broadcom STB host controller, ECAM access to
> > the inaccessible switch raises an unhandled CPU exception and thus
> > causes a kernel panic, making the issue difficult to debug.
> 
> Is this ARM Cortex A53 core and unhandled exception is asynchronous one
> with syndrome 0xbf000002?

It's a Cortex A72 and yes the exception looks like this:

SError Interrupt on CPU1, code 0x00000000bf000002 -- SError

I was wondering why we're not checking in the exception handler whether
the accessed address is in ECAM space, and just return from the handler
since such exceptions could be handled by returning "all ones" in
software from the PCI core.

Then again, perhaps there's a method to stop the controller from
raising an exception on ECAM access to an inaccessible device.
If such a method exists (e.g. some register bit), that would
obviously be preferred.


> > The switch works fine 100% when plugged into a TI Sitara AM64 board
> > (contains a DesignWare-derived PCIe host controller).
> 
> It is really DesignWare? I had an impression that TI uses PCIe IPs from
> Cadence, not from DesignWare. And Cadence controllers behave in some
> cases different from Designware controllers.

You're right, I was mistaken, it's indeed a Cadence.


> > Next step is to hook up
> > a Teledyne T28 analyzer to see what's going on.
> 
> Can you use Teledyne T28 for debugging this issue? Because this is
> something which can finally show what is happing there.

Yes it should be possible to debug this, the analyzer is capable of
logging the link training sequence and present it in a Wireshark-esque
interface.

Thanks,

Lukas

Pali Rohár Feb. 22, 2023, 9:17 a.m. UTC | #5

On Wednesday 22 February 2023 09:40:33 Lukas Wunner wrote:
> On Tue, Feb 21, 2023 at 10:46:11PM +0100, Pali Rohár wrote:
> > On Sunday 19 February 2023 20:46:19 Lukas Wunner wrote:
> > > > On Sun, 5 Feb 2023, Maciej W. Rozycki wrote:
> > > > >  This is v6 of the change to work around a PCIe link training phenomenon
> > > > > where a pair of devices both capable of operating at a link speed above
> > > > > 2.5GT/s seems unable to negotiate the link speed and continues training
> > > > > indefinitely with the Link Training bit switching on and off repeatedly
> > > > > and the data link layer never reaching the active state.
> > > 
> > > Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
> > > switch below a Broadcom STB host controller:  On some rare occasions,
> > > when booting the system the link trains correctly at 5 GT/s and the
> > > switch is accessible without any issues.  But most of the time,
> > > the switch is inaccessible on boot.  The Broadcom STB host controller
> > > claims not to support Link Active Reporting, but in reality has a
> > > link status indicator in a custom register.  It indicates that the
> > > link is up, the link trains to 2.5 GT/s but the switch is inaccessible.
> > 
> > This is interesting. Do you know which layer it indicates that is up?
> > I can image that PCIe physical layer or data link layer is up but
> > PCIe transaction layer not yet up and so sending config requests fail.
> > Existence of custom register may explain that it indicates different
> > "link up" meaning.
> 
> drivers/pci/controller/pcie-brcmstb.c defines the following bits:
> 
> #define  PCIE_MISC_PCIE_STATUS_PCIE_PORT_MASK           0x80
> #define  PCIE_MISC_PCIE_STATUS_PCIE_DL_ACTIVE_MASK      0x20
> #define  PCIE_MISC_PCIE_STATUS_PCIE_PHYLINKUP_MASK      0x10
> #define  PCIE_MISC_PCIE_STATUS_PCIE_LINK_IN_L23_MASK    0x40
> 
> And brcm_pcie_link_up() checks that both DL_ACTIVE and PHYLINKUP are set.
> 
> A public spec for the Broadcom STB PCIe controller does not seem to exist,
> so I do not know what the register bits mean exactly.

Ok, so then this is question for Broadcom. Without spec we probably
cannot do anything.

> 
> > > Due to a quirk of the Broadcom STB host controller, ECAM access to
> > > the inaccessible switch raises an unhandled CPU exception and thus
> > > causes a kernel panic, making the issue difficult to debug.
> > 
> > Is this ARM Cortex A53 core and unhandled exception is asynchronous one
> > with syndrome 0xbf000002?
> 
> It's a Cortex A72 and yes the exception looks like this:
> 
> SError Interrupt on CPU1, code 0x00000000bf000002 -- SError

This is core specific exception and for A53 it means AXI Slave Error.
I guess that on A72 it could mean too. But generally for Aarch64 they
are not defined.

> I was wondering why we're not checking in the exception handler whether
> the accessed address is in ECAM space, and just return from the handler
> since such exceptions could be handled by returning "all ones" in
> software from the PCI core.

We are not checking them because it is asynchronous exception. You do
not know who, when and why caused this exception. Exception may come
also after executing lot of other instructions. It is non-recoverable
fatal exception and it means something like internal core error. The
only reasonable thing to do is to reset CPU.

CPU should not receive AXI Slave Error under non-fatal condition and it
is basically bug in that PCIe controller that it sends such thing to
the CPU core. PCIe controller should for ECAM load/store operations
always returns AXI Slave OK response and on error it should set all-ones
in data part.

The proper way is to find out if PCIe controller does not have some
hidden or debug register which allows to disable sending these errors.
IIRC DesignWare has it, so there is big chance that Broadcom has it too.
But for example Cadence does not have it (yet).

> Then again, perhaps there's a method to stop the controller from
> raising an exception on ECAM access to an inaccessible device.
> If such a method exists (e.g. some register bit), that would
> obviously be preferred.

Other way is map ECAM address memory space in strong ordering mode.
It would cause CPU core to wait during executing of load / store
operation after they completely finish and then AXI Slave Error is
reported as synchronous exception as Data Abort, which is possible.
On ARMv7 it was possible by marking mapping as Device. On ARMv8 it
should be possible too, but on A53 it is unimplemented. Maybe possible
on A72? But it is kind like a hack to workaround total mess.

I will send separate email to Broadcom people who already helped with
one of their PCIe controller in the past. Maybe there is hidden register
debug bit which can turn it off.

> 
> > > The switch works fine 100% when plugged into a TI Sitara AM64 board
> > > (contains a DesignWare-derived PCIe host controller).
> > 
> > It is really DesignWare? I had an impression that TI uses PCIe IPs from
> > Cadence, not from DesignWare. And Cadence controllers behave in some
> > cases different from Designware controllers.
> 
> You're right, I was mistaken, it's indeed a Cadence.
> 
> 
> > > Next step is to hook up
> > > a Teledyne T28 analyzer to see what's going on.
> > 
> > Can you use Teledyne T28 for debugging this issue? Because this is
> > something which can finally show what is happing there.
> 
> Yes it should be possible to debug this, the analyzer is capable of
> logging the link training sequence and present it in a Wireshark-esque
> interface.
> 
> Thanks,
> 
> Lukas

The main issue is that nobody had analyzer for doing it, it is not a
cheap device.

Maciej W. Rozycki Feb. 22, 2023, 11:54 a.m. UTC | #6

On Sun, 19 Feb 2023, Lukas Wunner wrote:

> > >  This is v6 of the change to work around a PCIe link training phenomenon 
> > > where a pair of devices both capable of operating at a link speed above 
> > > 2.5GT/s seems unable to negotiate the link speed and continues training 
> > > indefinitely with the Link Training bit switching on and off repeatedly 
> > > and the data link layer never reaching the active state.
> > 
> >  Ping for: 
> > <https://lore.kernel.org/linux-pci/alpine.DEB.2.21.2302022022230.45310@angie.orcam.me.uk/>.
> 
> Philipp is witnessing similar issues with a Pericom PI7C9X2G404EL
> switch below a Broadcom STB host controller:  On some rare occasions,
> when booting the system the link trains correctly at 5 GT/s and the
> switch is accessible without any issues.  But most of the time,
> the switch is inaccessible on boot.  The Broadcom STB host controller
> claims not to support Link Active Reporting, but in reality has a
> link status indicator in a custom register.  It indicates that the
> link is up, the link trains to 2.5 GT/s but the switch is inaccessible.

 Thank you for chiming in.

 Note that the U-Boot variant of this workaround referred to by the link 
in the change description does not rely on Link Active Reporting, but 
busy-loops polling on Link Training status instead and verifies training 
remains stable off long enough.  We can't do this in Linux of course, but 
I guess a variant using link status reported in a vendor-specific register 
could be made.

> The switch is the same as yours, only with 4 instead of 3 ports.
> Perhaps the issue you're seeing isn't specific to the ASMedia switch,
> but is due to an oddity of the Pericom switch, that happens to be
> triggered by the Broadcom STB host controller as well?

 I have seen two reports from people claiming their devices to be absent 
from `lspci' output, an Intel and a Realtek NIC respectively, when plugged 
into a slot behing the ASMedia switch, while working just fine in another 
system.  Neither replied to my request for further information, so it's 
not clear to me if it's the same issue, but it may or may not be limited 
to Pericom hardware.

 As I mentioned in previous iterations of the change there is an option 
card available with the ASM2824 switch onboard and dual M.2 slots behind, 
which could be used for experimenting.  Sold under the StarTech brand as 
PEX8M2E2 and under the Ableconn brand as PEXM2-130.  And M.2 M-Key to PCIe 
slot adapters are widely available.  I just can't be persuaded to buy such 
an option card, especially as ASMedia have been unhelpful and ignored my 
enquiry.

> I've cooked up a modified version of patch 7 in your series which
> performs the link retraining in the pci-brcmstb.c driver before
> performing the first access to the switch.  Unfortunately it
> didn't result in any kind of improvement.  Next step is to hook up
> a Teledyne T28 analyzer to see what's going on.

 That would be most helpful, although given your lack of success with my
workaround it might be a different issue here after all.

 Let me know if I could help anyhow.  I have a few of these Pericom-based 
riser card adapters as spares, though I'm short on high-speed (5GT/s+) 
PCIe slots to try them with.  Delock has claimed, back in Jul 2021, I'm 
the only one to report them the device not to work.

  Maciej

[v6,0/7] pci: Work around ASMedia ASM2824 PCIe link training failures

Message

Comments