diff mbox series

[1/2] hw/phb4: Assert Link Disable bit after ETU init

Message ID 20190530043642.31421-1-oohall@gmail.com
State Accepted
Headers show
Series [1/2] hw/phb4: Assert Link Disable bit after ETU init | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (76f7316bc8fc8a18fdbfcbc0e1fe1bb992d2a7d7)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot fail Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Oliver O'Halloran May 30, 2019, 4:36 a.m. UTC
The cursed RAID card in ozrom1 has a bug where it ignores PERST being
asserted. The PCIe Base spec is a little vague about what happens
while PERST is asserted, but it does clearly specify that when
PERST is de-asserted the Link Training and Status State Machine
(LTSSM) of a device should return to the initial state (Detect)
defined in the spec and the link training process should restart.

This bug was worked around in 9078f8268922 ("phb4: Delay training till
after PERST is deasserted") by setting the link disable bit at the
start of the FRESET process and clearing it after PERST was
de-asserted. Although this fixed the bug, the patch offered no
explaination of why the fix worked.

In b8b4c79d4419 ("hw/phb4: Factor out PERST control") the link disable
workaround was moved into phb4_assert_perst(). This is called
always in the CRESET case, but a following patch resulted in
assert_perst() not being called if phb4_freset() was entered following a
CRESET since p->skip_perst was set in the CRESET handler. This is bad
since a side-effect of the CRESET is that the Link Disable bit is
cleared.

This, combined with the RAID card ignoring PERST results in the PCIe
link being trained by the PHB while we're waiting out the 100ms
ETU reset time. If we hack skiboot to print a DLP trace after returning
from phb4_hw_init() we get:

 PHB#0001[0:1]: Initialization complete
 PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000001101000000 23ms          GEN1:x16:detect
 PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000183101000000 29ms training GEN1:x16:config
 PHB#0001[0:1]: TRACE:0x00001c5881000000 30ms training GEN1:x08:recovery
 PHB#0001[0:1]: TRACE:0x00001c5883000000 30ms training GEN3:x08:recovery
 PHB#0001[0:1]: TRACE:0x0000144883000000 33ms presence GEN3:x08:L0
 PHB#0001[0:1]: TRACE:0x0000154883000000 33ms trained  GEN3:x08:L0
 PHB#0001[0:1]: CRESET: wait_time = 100
 PHB#0001[0:1]: FRESET: Starts
 PHB#0001[0:1]: FRESET: Prepare for link down
 PHB#0001[0:1]: FRESET: Assert skipped
 PHB#0001[0:1]: FRESET: Deassert
 PHB#0001[0:1]: TRACE:0x0000154883000000  0ms trained  GEN3:x08:L0
 PHB#0001[0:1]: TRACE: Reached target state
 PHB#0001[0:1]: LINK: Start polling
 PHB#0001[0:1]: LINK: Electrical link detected
 PHB#0001[0:1]: LINK: Link is up
 PHB#0001[0:1]: LINK: Went down waiting for stabilty
 PHB#0001[0:1]: LINK: DLP train control: 0x0000105101000000
 PHB#0001[0:1]: CRESET: Starts

What has happened here is that the link is trained to 8x Gen3 33ms after
we return from phb4_init_hw(), and before we've waitined to 100ms
that we normally wait after re-initialising the ETU. When we "deassert"
PERST later on in the FRESET handler the link in L0 (normal) state. At
this point we try to read from the Vendor/Device ID register to verify
that the link is stable and immediately get a PHB fence due to a PCIe
Completion Timeout. Skiboot attempts to recover by doing another CRESET,
but this will encounter the same issue.

This patch fixes the problem by setting the Link Disable bit (by calling
phb4_assert_perst()) immediately after we return from phb4_init_hw().
This prevents the link from being trained while PERST is asserted which
seems to avoid the Completion Timeout. With the patch applied we get:

 PHB#0001[0:1]: Initialization complete
 PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000001101000000 23ms          GEN1:x16:detect
 PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000909101000000 29ms presence GEN1:x16:disabled
 PHB#0001[0:1]: CRESET: wait_time = 100
 PHB#0001[0:1]: FRESET: Starts
 PHB#0001[0:1]: FRESET: Prepare for link down
 PHB#0001[0:1]: FRESET: Assert skipped
 PHB#0001[0:1]: FRESET: Deassert
 PHB#0001[0:1]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
 PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000001101000000 24ms          GEN1:x16:detect
 PHB#0001[0:1]: TRACE:0x0000102101000000 36ms presence GEN1:x16:polling
 PHB#0001[0:1]: TRACE:0x0000183101000000 97ms training GEN1:x16:config
 PHB#0001[0:1]: TRACE:0x00001c5881000000 97ms training GEN1:x08:recovery
 PHB#0001[0:1]: TRACE:0x00001c5883000000 97ms training GEN3:x08:recovery
 PHB#0001[0:1]: TRACE:0x0000144883000000 99ms presence GEN3:x08:L0
 PHB#0001[0:1]: TRACE: Reached target state
 PHB#0001[0:1]: LINK: Start polling
 PHB#0001[0:1]: LINK: Electrical link detected
 PHB#0001[0:1]: LINK: Link is up
 PHB#0001[0:1]: LINK: Link is stable
 PHB#0001[0:1]: LINK: Card [9005:028c] Optimal Retry:disabled
 PHB#0001[0:1]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3
 PHB#0001[0:1]: LINK: Width Train:x08 PHB:x08 DEV:x08
 PHB#0001[0:1]: LINK: RX Errors Now:0 Max:8 Lane:0x0000

Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 hw/phb4.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Stewart Smith June 3, 2019, 2:10 a.m. UTC | #1
"Oliver O'Halloran" <oohall@gmail.com> writes:
> The cursed RAID card in ozrom1 has a bug where it ignores PERST being
> asserted. The PCIe Base spec is a little vague about what happens
> while PERST is asserted, but it does clearly specify that when
> PERST is de-asserted the Link Training and Status State Machine
> (LTSSM) of a device should return to the initial state (Detect)
> defined in the spec and the link training process should restart.
>
> This bug was worked around in 9078f8268922 ("phb4: Delay training till
> after PERST is deasserted") by setting the link disable bit at the
> start of the FRESET process and clearing it after PERST was
> de-asserted. Although this fixed the bug, the patch offered no
> explaination of why the fix worked.
>
> In b8b4c79d4419 ("hw/phb4: Factor out PERST control") the link disable
> workaround was moved into phb4_assert_perst(). This is called
> always in the CRESET case, but a following patch resulted in
> assert_perst() not being called if phb4_freset() was entered following a
> CRESET since p->skip_perst was set in the CRESET handler. This is bad
> since a side-effect of the CRESET is that the Link Disable bit is
> cleared.
>
> This, combined with the RAID card ignoring PERST results in the PCIe
> link being trained by the PHB while we're waiting out the 100ms
> ETU reset time. If we hack skiboot to print a DLP trace after returning
> from phb4_hw_init() we get:
>
>  PHB#0001[0:1]: Initialization complete
>  PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000001101000000 23ms          GEN1:x16:detect
>  PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000183101000000 29ms training GEN1:x16:config
>  PHB#0001[0:1]: TRACE:0x00001c5881000000 30ms training GEN1:x08:recovery
>  PHB#0001[0:1]: TRACE:0x00001c5883000000 30ms training GEN3:x08:recovery
>  PHB#0001[0:1]: TRACE:0x0000144883000000 33ms presence GEN3:x08:L0
>  PHB#0001[0:1]: TRACE:0x0000154883000000 33ms trained  GEN3:x08:L0
>  PHB#0001[0:1]: CRESET: wait_time = 100
>  PHB#0001[0:1]: FRESET: Starts
>  PHB#0001[0:1]: FRESET: Prepare for link down
>  PHB#0001[0:1]: FRESET: Assert skipped
>  PHB#0001[0:1]: FRESET: Deassert
>  PHB#0001[0:1]: TRACE:0x0000154883000000  0ms trained  GEN3:x08:L0
>  PHB#0001[0:1]: TRACE: Reached target state
>  PHB#0001[0:1]: LINK: Start polling
>  PHB#0001[0:1]: LINK: Electrical link detected
>  PHB#0001[0:1]: LINK: Link is up
>  PHB#0001[0:1]: LINK: Went down waiting for stabilty
>  PHB#0001[0:1]: LINK: DLP train control: 0x0000105101000000
>  PHB#0001[0:1]: CRESET: Starts
>
> What has happened here is that the link is trained to 8x Gen3 33ms after
> we return from phb4_init_hw(), and before we've waitined to 100ms
> that we normally wait after re-initialising the ETU. When we "deassert"
> PERST later on in the FRESET handler the link in L0 (normal) state. At
> this point we try to read from the Vendor/Device ID register to verify
> that the link is stable and immediately get a PHB fence due to a PCIe
> Completion Timeout. Skiboot attempts to recover by doing another CRESET,
> but this will encounter the same issue.
>
> This patch fixes the problem by setting the Link Disable bit (by calling
> phb4_assert_perst()) immediately after we return from phb4_init_hw().
> This prevents the link from being trained while PERST is asserted which
> seems to avoid the Completion Timeout. With the patch applied we get:
>
>  PHB#0001[0:1]: Initialization complete
>  PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000001101000000 23ms          GEN1:x16:detect
>  PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000909101000000 29ms presence GEN1:x16:disabled
>  PHB#0001[0:1]: CRESET: wait_time = 100
>  PHB#0001[0:1]: FRESET: Starts
>  PHB#0001[0:1]: FRESET: Prepare for link down
>  PHB#0001[0:1]: FRESET: Assert skipped
>  PHB#0001[0:1]: FRESET: Deassert
>  PHB#0001[0:1]: TRACE:0x0000001101000000  0ms          GEN1:x16:detect
>  PHB#0001[0:1]: TRACE:0x0000102101000000  0ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000001101000000 24ms          GEN1:x16:detect
>  PHB#0001[0:1]: TRACE:0x0000102101000000 36ms presence GEN1:x16:polling
>  PHB#0001[0:1]: TRACE:0x0000183101000000 97ms training GEN1:x16:config
>  PHB#0001[0:1]: TRACE:0x00001c5881000000 97ms training GEN1:x08:recovery
>  PHB#0001[0:1]: TRACE:0x00001c5883000000 97ms training GEN3:x08:recovery
>  PHB#0001[0:1]: TRACE:0x0000144883000000 99ms presence GEN3:x08:L0
>  PHB#0001[0:1]: TRACE: Reached target state
>  PHB#0001[0:1]: LINK: Start polling
>  PHB#0001[0:1]: LINK: Electrical link detected
>  PHB#0001[0:1]: LINK: Link is up
>  PHB#0001[0:1]: LINK: Link is stable
>  PHB#0001[0:1]: LINK: Card [9005:028c] Optimal Retry:disabled
>  PHB#0001[0:1]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3
>  PHB#0001[0:1]: LINK: Width Train:x08 PHB:x08 DEV:x08
>  PHB#0001[0:1]: LINK: RX Errors Now:0 Max:8 Lane:0x0000
>
> Cc: Michael Neuling <mikey@neuling.org>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
>  hw/phb4.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>

Merged to master as of 02a683bf09d94757a72eec00e602c5609aa8d754.
diff mbox series

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index 01a371dcb2e4..5ebb28ad8b0c 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3368,6 +3368,12 @@  static int64_t phb4_creset(struct pci_slot *slot)
 		phb4_init_hw(p);
 		pci_slot_set_state(slot, PHB4_SLOT_CRESET_FRESET);
 
+		/*
+		 * The PERST is sticky across resets, but LINK_DIS isn't.
+		 * Re-assert it here now that we've reset the PHB.
+		 */
+		phb4_assert_perst(slot, true);
+
 		/*
 		 * wait either 100ms (for the ETU logic) or until we've had
 		 * PERST asserted for 250ms.