diff mbox series

[v2] PCI: aardvark: Use LTSSM state to build link training flag

Message ID 20190316161243.29517-1-repk@triplefau.lt
State Superseded
Delegated to: Lorenzo Pieralisi
Headers show
Series [v2] PCI: aardvark: Use LTSSM state to build link training flag | expand

Commit Message

Remi Pommarel March 16, 2019, 4:12 p.m. UTC
The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
config register does not reflect the actual link training state and is
always cleared. The Link Training and Status State Machine (LTSSM) flag
in LMI config register could be used as a link training indicator.
Indeed if the LTSSM is in L0 or upper state then link training has
completed (see [1]).

Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
instantly imply a LTSSM state change (e.g. L0s to recovery state
transition takes some time), LTSSM can be in L0 but link training has
not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
state sequence has to be seen to be sure that link training has been
done.

Because one may not call a pcie conf register read on LNKSTA after
doing a retrain link or may miss the link down state due to timing, a
20ms timeout is used. Passing this timeout link is considered retrained.

This fixes boot hang or kernel panic with the following callstack due to
ASPM setup doing a link re-train and polling for PCI_EXP_LNKSTA_LT flag
to be cleared before using it.

-------------------- 8< -------------------
	[    0.915389]  dump_backtrace+0x0/0x140
	[    0.915391]  show_stack+0x14/0x20
	[    0.915393]  dump_stack+0x90/0xb4
	[    0.915394]  panic+0x134/0x2c0
	[    0.915396]  nmi_panic+0x6c/0x70
	[    0.915398]  arm64_serror_panic+0x74/0x80
	[    0.915400]  is_valid_bugaddr+0x0/0x8
	[    0.915402]  el1_error+0x7c/0xe4
	[    0.915404]  advk_pcie_rd_conf+0x4c/0x250
	[    0.915406]  pci_bus_read_config_word+0x7c/0xd0
	[    0.915408]  pcie_capability_read_word+0x90/0xc8
	[    0.915410]  pcie_get_aspm_reg+0x68/0x118
	[    0.915412]  pcie_aspm_init_link_state+0x460/0xa98
	[    0.915414]  pci_scan_slot+0xe8/0x100
	[    0.915416]  pci_scan_child_bus_extend+0x50/0x288
	[    0.915418]  pci_scan_bridge_extend+0x348/0x4f0
	[    0.915420]  pci_scan_child_bus_extend+0x1dc/0x288
	[    0.915423]  pci_scan_root_bus_bridge+0xc4/0xe0
	[    0.915424]  pci_host_probe+0x14/0xa8
	[    0.915426]  advk_pcie_probe+0x838/0x910
	[...]
-------------------- 8< -------------------

[1] "PCI Express Base Specification", REV. 2.1
    PCI Express, March 4 2009, Table 4-7

Signed-off-by: Remi Pommarel <repk@triplefau.lt>
---
Changes since v1:
  - Rename retraining flag field
  - Fix DEVCTL register writing
---
 drivers/pci/controller/pci-aardvark.c | 33 ++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

Comments

Lorenzo Pieralisi April 25, 2019, 11:08 a.m. UTC | #1
On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> config register does not reflect the actual link training state and is
> always cleared. The Link Training and Status State Machine (LTSSM) flag
> in LMI config register could be used as a link training indicator.
> Indeed if the LTSSM is in L0 or upper state then link training has
> completed (see [1]).
> 
> Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> instantly imply a LTSSM state change (e.g. L0s to recovery state
> transition takes some time), LTSSM can be in L0 but link training has
> not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> state sequence has to be seen to be sure that link training has been
> done.

Hi Remi,

I am a bit confused, so you are saying that the LTSSM flag in the
LMI config register can't be used to detect when training is completed ?

Certainly it can't be used by ASPM core that relies on:

PCI_EXP_LNKSTA_LT flag

in the PCI_EXP_LNKSTA register, and that's what you are setting through
this timeout mechanism IIUC.

Please elaborate on that.

I am picking Bjorn's brain on this patch since what you are doing
seems quite arbitrary and honestly it is a bit of a hack.

I wonder whether it is not better to avoid advertising ASPM support.

Lorenzo

> Because one may not call a pcie conf register read on LNKSTA after
> doing a retrain link or may miss the link down state due to timing, a
> 20ms timeout is used. Passing this timeout link is considered retrained.
> 
> This fixes boot hang or kernel panic with the following callstack due to
> ASPM setup doing a link re-train and polling for PCI_EXP_LNKSTA_LT flag
> to be cleared before using it.
> 
> -------------------- 8< -------------------
> 	[    0.915389]  dump_backtrace+0x0/0x140
> 	[    0.915391]  show_stack+0x14/0x20
> 	[    0.915393]  dump_stack+0x90/0xb4
> 	[    0.915394]  panic+0x134/0x2c0
> 	[    0.915396]  nmi_panic+0x6c/0x70
> 	[    0.915398]  arm64_serror_panic+0x74/0x80
> 	[    0.915400]  is_valid_bugaddr+0x0/0x8
> 	[    0.915402]  el1_error+0x7c/0xe4
> 	[    0.915404]  advk_pcie_rd_conf+0x4c/0x250
> 	[    0.915406]  pci_bus_read_config_word+0x7c/0xd0
> 	[    0.915408]  pcie_capability_read_word+0x90/0xc8
> 	[    0.915410]  pcie_get_aspm_reg+0x68/0x118
> 	[    0.915412]  pcie_aspm_init_link_state+0x460/0xa98
> 	[    0.915414]  pci_scan_slot+0xe8/0x100
> 	[    0.915416]  pci_scan_child_bus_extend+0x50/0x288
> 	[    0.915418]  pci_scan_bridge_extend+0x348/0x4f0
> 	[    0.915420]  pci_scan_child_bus_extend+0x1dc/0x288
> 	[    0.915423]  pci_scan_root_bus_bridge+0xc4/0xe0
> 	[    0.915424]  pci_host_probe+0x14/0xa8
> 	[    0.915426]  advk_pcie_probe+0x838/0x910
> 	[...]
> -------------------- 8< -------------------
> 
> [1] "PCI Express Base Specification", REV. 2.1
>     PCI Express, March 4 2009, Table 4-7
> 
> Signed-off-by: Remi Pommarel <repk@triplefau.lt>
> ---
> Changes since v1:
>   - Rename retraining flag field
>   - Fix DEVCTL register writing
> ---
>  drivers/pci/controller/pci-aardvark.c | 33 ++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
> index eb58dfdaba1b..47b707b5fc2c 100644
> --- a/drivers/pci/controller/pci-aardvark.c
> +++ b/drivers/pci/controller/pci-aardvark.c
> @@ -180,6 +180,7 @@
>  #define LINK_WAIT_MAX_RETRIES		10
>  #define LINK_WAIT_USLEEP_MIN		90000
>  #define LINK_WAIT_USLEEP_MAX		100000
> +#define LINK_RETRAIN_DELAY_MAX		(20 * HZ / 1000) /* 20 ms */
>  
>  #define MSI_IRQ_NUM			32
>  
> @@ -199,6 +200,8 @@ struct advk_pcie {
>  	u16 msi_msg;
>  	int root_bus_nr;
>  	struct pci_bridge_emul bridge;
> +	unsigned long rl_deadline; /* Retrain link jiffies deadline */
> +	u8 rl_asked; /* Retraining has been asked and is in transition */
>  };
>  
>  static inline void advk_writel(struct advk_pcie *pcie, u32 val, u64 reg)
> @@ -400,6 +403,19 @@ static int advk_pcie_wait_pio(struct advk_pcie *pcie)
>  	return -ETIMEDOUT;
>  }
>  
> +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> +{
> +	if (!advk_pcie_link_up(pcie)) {
> +		pcie->rl_asked = 0;
> +		return 1;
> +	}
> +
> +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> +		return 1;
> +
> +	pcie->rl_asked = 0;
> +	return 0;
> +}
>  
>  static pci_bridge_emul_read_status_t
>  advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> @@ -426,11 +442,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	}
>  
> +	case PCI_EXP_LNKCTL: {
> +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> +			~(PCI_EXP_LNKSTA_LT << 16);
> +		if (advk_pcie_link_retraining(pcie))
> +			val |= (PCI_EXP_LNKSTA_LT << 16);
> +		*value = val;
> +		return PCI_BRIDGE_EMUL_HANDLED;
> +	}
> +
>  	case PCI_CAP_LIST_ID:
>  	case PCI_EXP_DEVCAP:
>  	case PCI_EXP_DEVCTL:
>  	case PCI_EXP_LNKCAP:
> -	case PCI_EXP_LNKCTL:
>  		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	default:
> @@ -447,8 +471,15 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
>  
>  	switch (reg) {
>  	case PCI_EXP_DEVCTL:
> +		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		break;
> +
>  	case PCI_EXP_LNKCTL:
>  		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		if (new & PCI_EXP_LNKCTL_RL) {
> +			pcie->rl_asked = 1;
> +			pcie->rl_deadline = jiffies + LINK_RETRAIN_DELAY_MAX;
> +		}
>  		break;
>  
>  	case PCI_EXP_RTCTL:
> -- 
> 2.20.1
>
Remi Pommarel April 25, 2019, 2:23 p.m. UTC | #2
Hi Lorenzo,

On Thu, Apr 25, 2019 at 12:08:30PM +0100, Lorenzo Pieralisi wrote:
> On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > config register does not reflect the actual link training state and is
> > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > in LMI config register could be used as a link training indicator.
> > Indeed if the LTSSM is in L0 or upper state then link training has
> > completed (see [1]).
> > 
> > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > transition takes some time), LTSSM can be in L0 but link training has
> > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > state sequence has to be seen to be sure that link training has been
> > done.
> 
> Hi Remi,
> 
> I am a bit confused, so you are saying that the LTSSM flag in the
> LMI config register can't be used to detect when training is completed ?

Not exactly, I am saying that PCI_EXP_LNKSTA_LT from PCI_EXP_LNKSTA
register can't be used with this hardware, but can be emulated with
LTSSM flag.

> 
> Certainly it can't be used by ASPM core that relies on:
> 
> PCI_EXP_LNKSTA_LT flag
> 
> in the PCI_EXP_LNKSTA register, and that's what you are setting through
> this timeout mechanism IIUC.
> 
> Please elaborate on that.

The problem here is that the hardware does not change PCI_EXP_LNKSTA_LT
at all. So in order to support link re-training feature we need to
emulate this flag. To do so LTSSM flag can be used.

Indeed we can set the emulated PCI_EXP_LNKSTA_LT as soon as re-training
is asked and wait for LTSSM flag to be back to a configured state
(e.g. L0, L0s) before clearing it.

The problem with that is that LTSSM flag does not change instantly after
link re-training has been asked, and will stay in configured state for a
small amount of time. So the idea is to poll the LTSSM flag and wait for
it to enter a recovery state then waiting for it to be back in
configured state.

The timeout is only here as a fallback in the unlikely event that we
missed the LTSSM flag entering recovery state.

> 
> I am picking Bjorn's brain on this patch since what you are doing
> seems quite arbitrary and honestly it is a bit of a hack.

Yes, sorry, it is a bit of a hack because I try to workaround a
hardware issue.

Please note that vendor has been contacted about this in the meantime
and answered the following:

"FW can poll LTSSM state equals any of the following values: 0xB or 0xD
or 0xC or 0xE. After that, polls for LTSSM equals 0x10. For your
information, LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
........... -> 0x10".

It is basically what this patch does, I've just added a timeout fallback
to not poll LTSSM state forever if its transition to 0xB, 0xD, 0xC or
0xE has been missed.
Lorenzo Pieralisi April 25, 2019, 3:06 p.m. UTC | #3
On Thu, Apr 25, 2019 at 04:23:53PM +0200, Remi Pommarel wrote:
> Hi Lorenzo,
> 
> On Thu, Apr 25, 2019 at 12:08:30PM +0100, Lorenzo Pieralisi wrote:
> > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > > config register does not reflect the actual link training state and is
> > > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > > in LMI config register could be used as a link training indicator.
> > > Indeed if the LTSSM is in L0 or upper state then link training has
> > > completed (see [1]).
> > > 
> > > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> > > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > > transition takes some time), LTSSM can be in L0 but link training has
> > > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > > state sequence has to be seen to be sure that link training has been
> > > done.
> > 
> > Hi Remi,
> > 
> > I am a bit confused, so you are saying that the LTSSM flag in the
> > LMI config register can't be used to detect when training is completed ?
> 
> Not exactly, I am saying that PCI_EXP_LNKSTA_LT from PCI_EXP_LNKSTA
> register can't be used with this hardware, but can be emulated with
> LTSSM flag.
> 
> > 
> > Certainly it can't be used by ASPM core that relies on:
> > 
> > PCI_EXP_LNKSTA_LT flag
> > 
> > in the PCI_EXP_LNKSTA register, and that's what you are setting through
> > this timeout mechanism IIUC.
> > 
> > Please elaborate on that.
> 
> The problem here is that the hardware does not change PCI_EXP_LNKSTA_LT
> at all. So in order to support link re-training feature we need to
> emulate this flag. To do so LTSSM flag can be used.

Understood.

> Indeed we can set the emulated PCI_EXP_LNKSTA_LT as soon as re-training
> is asked and wait for LTSSM flag to be back to a configured state
> (e.g. L0, L0s) before clearing it.

The check for the LTSSM is carried out through advk_pcie_link_up()
(ie register CFG_REG), correct ?

> The problem with that is that LTSSM flag does not change instantly after
> link re-training has been asked, and will stay in configured state for a
> small amount of time. So the idea is to poll the LTSSM flag and wait for
> it to enter a recovery state then waiting for it to be back in
> configured state.

When you say "poll" you mean checking advk_pcie_link_up() ?

More below on the code.

> The timeout is only here as a fallback in the unlikely event that we
> missed the LTSSM flag entering recovery state.
> 
> > 
> > I am picking Bjorn's brain on this patch since what you are doing
> > seems quite arbitrary and honestly it is a bit of a hack.
> 
> Yes, sorry, it is a bit of a hack because I try to workaround a
> hardware issue.

No problems, it is not your fault.
> 
> Please note that vendor has been contacted about this in the meantime
> and answered the following:
> 
> "FW can poll LTSSM state equals any of the following values: 0xB or 0xD
> or 0xC or 0xE. After that, polls for LTSSM equals 0x10. For your
> information, LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> ........... -> 0x10".
> 
> It is basically what this patch does, I've just added a timeout fallback
> to not poll LTSSM state forever if its transition to 0xB, 0xD, 0xC or
> 0xE has been missed.

When you say "missed" you mean advk_pcie_link_up() returning true, right ?

[...]

> > > +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> > > +{
> > > +	if (!advk_pcie_link_up(pcie)) {

That's the bit I find confusing. Is this check here to detect if the
link went through the sequence below ? Should not it be carried
out only if (pcie->rl_asked == 1) ?

"... LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
 ........... -> 0x10".

> > > +		pcie->rl_asked = 0;

Why ?

> > > +		return 1;
> > > +	}
> > > +
> > > +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> > > +		return 1;

This ensures that if the LTSSM >= 0x10 we still wait for a delay before
considering the link up (because I suppose, after asking a retraining
it takes a while for the LTSSM state to become < 0x10), correct ?

You have to comment this code since it is not easy to grasp.

Lorenzo

> > > +
> > > +	pcie->rl_asked = 0;
> > > +	return 0;
> > > +}
> > >  
> > >  static pci_bridge_emul_read_status_t
> > >  advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> > > @@ -426,11 +442,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> > >  		return PCI_BRIDGE_EMUL_HANDLED;
> > >  	}
> > >  
> > > +	case PCI_EXP_LNKCTL: {
> > > +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> > > +			~(PCI_EXP_LNKSTA_LT << 16);
> > > +		if (advk_pcie_link_retraining(pcie))
> > > +			val |= (PCI_EXP_LNKSTA_LT << 16);
> > > +		*value = val;
> > > +		return PCI_BRIDGE_EMUL_HANDLED;
> > > +	}
> > > +
> > >  	case PCI_CAP_LIST_ID:
> > >  	case PCI_EXP_DEVCAP:
> > >  	case PCI_EXP_DEVCTL:
> > >  	case PCI_EXP_LNKCAP:
> > > -	case PCI_EXP_LNKCTL:
> > >  		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
> > >  		return PCI_BRIDGE_EMUL_HANDLED;
> > >  	default:
> > > @@ -447,8 +471,15 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
> > >  
> > >  	switch (reg) {
> > >  	case PCI_EXP_DEVCTL:
> > > +		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> > > +		break;
> > > +
> > >  	case PCI_EXP_LNKCTL:
> > >  		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> > > +		if (new & PCI_EXP_LNKCTL_RL) {
> > > +			pcie->rl_asked = 1;
> > > +			pcie->rl_deadline = jiffies + LINK_RETRAIN_DELAY_MAX;
> > > +		}
> > >  		break;
> > >  
> > >  	case PCI_EXP_RTCTL:
> > > -- 
> > > 2.20.1
> > >
Bjorn Helgaas April 25, 2019, 9:04 p.m. UTC | #4
Hi Remi,

On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> config register does not reflect the actual link training state and is
> always cleared. The Link Training and Status State Machine (LTSSM) flag
> in LMI config register could be used as a link training indicator.

LMI?  I assume this is an Aardvark-specific register?  Maybe "Aardvark
LMI register", since the other things here are generic PCIe registers?

Is this a hardware erratum?  I know advk does some software emulation,
but it looks like the Aardvark PCIE_CORE_PCIEXP_CAP + PCI_EXP_LNKSTA
register is awfully close to being exactly the PCIe-defined
PCI_EXP_LNKSTA, so the difference seems like a mistake.

> Indeed if the LTSSM is in L0 or upper state then link training has
> completed (see [1]).
> 
> Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not

s/PCI_EXP_LINCTL_RL/PCI_EXP_LNKCTL_RL/

> instantly imply a LTSSM state change (e.g. L0s to recovery state
> transition takes some time), LTSSM can be in L0 but link training has
> not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> state sequence has to be seen to be sure that link training has been
> done.
> 
> Because one may not call a pcie conf register read on LNKSTA after
> doing a retrain link or may miss the link down state due to timing, a
> 20ms timeout is used. Passing this timeout link is considered retrained.

It sounds like reading and/or writing some registers during a retrain
causes some sort of EL1 error?  Is this a separate erratum?  Is there
a list of the registers and operations (read/write) that are affected?
The backtrace below suggests that it's actually a read of LNKCAP or
LNKCTL (not LNKSTA) that caused the error.

It sounds like there are really two problems:

  1) Reading PCI_EXP_LNKSTA (or the Aardvark equivalent) doesn't give
     valid data for PCI_EXP_LNKSTA_LT.

  2) Sometimes config reads cause EL1 errors.

If that's the case and if it's possible, can you split this into a
patch for each issue?

> This fixes boot hang or kernel panic with the following callstack due to
> ASPM setup doing a link re-train and polling for PCI_EXP_LNKSTA_LT flag
> to be cleared before using it.
> 
> -------------------- 8< -------------------
> 	[    0.915389]  dump_backtrace+0x0/0x140
> 	[    0.915391]  show_stack+0x14/0x20
> 	[    0.915393]  dump_stack+0x90/0xb4
> 	[    0.915394]  panic+0x134/0x2c0
> 	[    0.915396]  nmi_panic+0x6c/0x70
> 	[    0.915398]  arm64_serror_panic+0x74/0x80
> 	[    0.915400]  is_valid_bugaddr+0x0/0x8
> 	[    0.915402]  el1_error+0x7c/0xe4
> 	[    0.915404]  advk_pcie_rd_conf+0x4c/0x250
> 	[    0.915406]  pci_bus_read_config_word+0x7c/0xd0
> 	[    0.915408]  pcie_capability_read_word+0x90/0xc8
> 	[    0.915410]  pcie_get_aspm_reg+0x68/0x118
> 	[    0.915412]  pcie_aspm_init_link_state+0x460/0xa98

This backtrace doesn't make sense to me as being related to this
issue.  You said above that the bug was that PCI_EXP_LNKSTA_LT is not
updated.  But apparently even *reading* a register at the wrong time
causes this EL1 error.  And pcie_get_aspm_reg() doesn't even read
LNKSTA; it only reads LNKCAP and LNKCTL.

BTW, if you're including a backtrace in a commit log, you can strip
out the timestamps and the "cut" lines because they don't contribute
information that's relevant in this context.

> 	[    0.915414]  pci_scan_slot+0xe8/0x100
> 	[    0.915416]  pci_scan_child_bus_extend+0x50/0x288
> 	[    0.915418]  pci_scan_bridge_extend+0x348/0x4f0
> 	[    0.915420]  pci_scan_child_bus_extend+0x1dc/0x288
> 	[    0.915423]  pci_scan_root_bus_bridge+0xc4/0xe0
> 	[    0.915424]  pci_host_probe+0x14/0xa8
> 	[    0.915426]  advk_pcie_probe+0x838/0x910
> 	[...]
> -------------------- 8< -------------------
> 
> [1] "PCI Express Base Specification", REV. 2.1
>     PCI Express, March 4 2009, Table 4-7
> 
> Signed-off-by: Remi Pommarel <repk@triplefau.lt>
> ---
> Changes since v1:
>   - Rename retraining flag field
>   - Fix DEVCTL register writing
> ---
>  drivers/pci/controller/pci-aardvark.c | 33 ++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
> index eb58dfdaba1b..47b707b5fc2c 100644
> --- a/drivers/pci/controller/pci-aardvark.c
> +++ b/drivers/pci/controller/pci-aardvark.c
> @@ -180,6 +180,7 @@
>  #define LINK_WAIT_MAX_RETRIES		10
>  #define LINK_WAIT_USLEEP_MIN		90000
>  #define LINK_WAIT_USLEEP_MAX		100000
> +#define LINK_RETRAIN_DELAY_MAX		(20 * HZ / 1000) /* 20 ms */
>  
>  #define MSI_IRQ_NUM			32
>  
> @@ -199,6 +200,8 @@ struct advk_pcie {
>  	u16 msi_msg;
>  	int root_bus_nr;
>  	struct pci_bridge_emul bridge;
> +	unsigned long rl_deadline; /* Retrain link jiffies deadline */
> +	u8 rl_asked; /* Retraining has been asked and is in transition */
>  };
>  
>  static inline void advk_writel(struct advk_pcie *pcie, u32 val, u64 reg)
> @@ -400,6 +403,19 @@ static int advk_pcie_wait_pio(struct advk_pcie *pcie)
>  	return -ETIMEDOUT;
>  }
>  
> +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> +{
> +	if (!advk_pcie_link_up(pcie)) {
> +		pcie->rl_asked = 0;
> +		return 1;
> +	}
> +
> +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> +		return 1;
> +
> +	pcie->rl_asked = 0;
> +	return 0;
> +}
>  
>  static pci_bridge_emul_read_status_t
>  advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> @@ -426,11 +442,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	}
>  
> +	case PCI_EXP_LNKCTL: {

Don't you mean PCI_EXP_LNKSTA here?

> +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> +			~(PCI_EXP_LNKSTA_LT << 16);
> +		if (advk_pcie_link_retraining(pcie))
> +			val |= (PCI_EXP_LNKSTA_LT << 16);
> +		*value = val;
> +		return PCI_BRIDGE_EMUL_HANDLED;
> +	}
> +
>  	case PCI_CAP_LIST_ID:
>  	case PCI_EXP_DEVCAP:
>  	case PCI_EXP_DEVCTL:
>  	case PCI_EXP_LNKCAP:
> -	case PCI_EXP_LNKCTL:

If you did mean PCI_EXP_LNKSTA above, I suppose you would leave
PCI_EXP_LNKCTL here?

>  		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	default:
> @@ -447,8 +471,15 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
>  
>  	switch (reg) {
>  	case PCI_EXP_DEVCTL:
> +		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		break;

What's the purpose of this DEVCTL change?  Could it be a separate patch?
I can't tell that it's related to the PCI_EXP_LNKSTA_LT issue.

>  	case PCI_EXP_LNKCTL:
>  		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		if (new & PCI_EXP_LNKCTL_RL) {
> +			pcie->rl_asked = 1;
> +			pcie->rl_deadline = jiffies + LINK_RETRAIN_DELAY_MAX;
> +		}
>  		break;
>  
>  	case PCI_EXP_RTCTL:
> -- 
> 2.20.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Remi Pommarel April 25, 2019, 10:27 p.m. UTC | #5
Hi Bjorn,

On Thu, Apr 25, 2019 at 04:04:39PM -0500, Bjorn Helgaas wrote:
> Hi Remi,
> 
> On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > config register does not reflect the actual link training state and is
> > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > in LMI config register could be used as a link training indicator.
> 
> LMI?  I assume this is an Aardvark-specific register?  Maybe "Aardvark
> LMI register", since the other things here are generic PCIe registers?

Yes this is an Aardvark specific register, I would have said that LMI
stands for LTSSM management interface, but this is only guessing as I
don't have access to a datasheet.

> Is this a hardware erratum?  I know advk does some software emulation,
> but it looks like the Aardvark PCIE_CORE_PCIEXP_CAP + PCI_EXP_LNKSTA
> register is awfully close to being exactly the PCIe-defined
> PCI_EXP_LNKSTA, so the difference seems like a mistake.

Vendor says that PCI_EXP_LNKSTA_LT bit in PCI_EXP_LNKSTA is marked as
"reserved" and is not implemented. It does look like a mistake to me.

> 
> > Indeed if the LTSSM is in L0 or upper state then link training has
> > completed (see [1]).
> > 
> > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> 
> s/PCI_EXP_LINCTL_RL/PCI_EXP_LNKCTL_RL/
> 
> > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > transition takes some time), LTSSM can be in L0 but link training has
> > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > state sequence has to be seen to be sure that link training has been
> > done.
> > 
> > Because one may not call a pcie conf register read on LNKSTA after
> > doing a retrain link or may miss the link down state due to timing, a
> > 20ms timeout is used. Passing this timeout link is considered retrained.
> 
> It sounds like reading and/or writing some registers during a retrain
> causes some sort of EL1 error?  Is this a separate erratum?  Is there
> a list of the registers and operations (read/write) that are affected?
> The backtrace below suggests that it's actually a read of LNKCAP or
> LNKCTL (not LNKSTA) that caused the error.

IIUC, the backtrace below produces an EL1 error when doing a PIO
transfer while the link is still retraining. See my comment below for
more about that. But accessing any root complex's register seems fine.
> 
> It sounds like there are really two problems:
> 
>   1) Reading PCI_EXP_LNKSTA (or the Aardvark equivalent) doesn't give
>      valid data for PCI_EXP_LNKSTA_LT.

The 1) is correct.

> 
>   2) Sometimes config reads cause EL1 errors.

Actually EL1 error happens when we try to access device's register with
a PIO transfer, which is when we try to use the link while it is being
retrained.

IMHO, 1) and 2) are linked. ASPM core tries to use the link too early
because it has read invalid data for PCI_EXP_LNKSTA_LT.

> 
> If that's the case and if it's possible, can you split this into a
> patch for each issue?
> 
> > This fixes boot hang or kernel panic with the following callstack due to
> > ASPM setup doing a link re-train and polling for PCI_EXP_LNKSTA_LT flag
> > to be cleared before using it.
> > 
> > -------------------- 8< -------------------
> > 	[    0.915389]  dump_backtrace+0x0/0x140
> > 	[    0.915391]  show_stack+0x14/0x20
> > 	[    0.915393]  dump_stack+0x90/0xb4
> > 	[    0.915394]  panic+0x134/0x2c0
> > 	[    0.915396]  nmi_panic+0x6c/0x70
> > 	[    0.915398]  arm64_serror_panic+0x74/0x80
> > 	[    0.915400]  is_valid_bugaddr+0x0/0x8
> > 	[    0.915402]  el1_error+0x7c/0xe4
> > 	[    0.915404]  advk_pcie_rd_conf+0x4c/0x250
> > 	[    0.915406]  pci_bus_read_config_word+0x7c/0xd0
> > 	[    0.915408]  pcie_capability_read_word+0x90/0xc8
> > 	[    0.915410]  pcie_get_aspm_reg+0x68/0x118
> > 	[    0.915412]  pcie_aspm_init_link_state+0x460/0xa98
> 
> This backtrace doesn't make sense to me as being related to this
> issue.  You said above that the bug was that PCI_EXP_LNKSTA_LT is not
> updated.  But apparently even *reading* a register at the wrong time
> causes this EL1 error.  And pcie_get_aspm_reg() doesn't even read
> LNKSTA; it only reads LNKCAP and LNKCTL.

In fact here, advk_pcie_rd_conf+0x4c seems to be this line:
	advk_writel(pcie, 0, PIO_START);

So EL1 error seems not to happen while accessing root complex registery
but while doing a PIO transfer using the link.

> 
> BTW, if you're including a backtrace in a commit log, you can strip
> out the timestamps and the "cut" lines because they don't contribute
> information that's relevant in this context.
> 
> > 	[    0.915414]  pci_scan_slot+0xe8/0x100
> > 	[    0.915416]  pci_scan_child_bus_extend+0x50/0x288
> > 	[    0.915418]  pci_scan_bridge_extend+0x348/0x4f0
> > 	[    0.915420]  pci_scan_child_bus_extend+0x1dc/0x288
> > 	[    0.915423]  pci_scan_root_bus_bridge+0xc4/0xe0
> > 	[    0.915424]  pci_host_probe+0x14/0xa8
> > 	[    0.915426]  advk_pcie_probe+0x838/0x910
> > 	[...]
> > -------------------- 8< -------------------
> > 
> > [1] "PCI Express Base Specification", REV. 2.1
> >     PCI Express, March 4 2009, Table 4-7
> > 
> > Signed-off-by: Remi Pommarel <repk@triplefau.lt>
> > ---
> > Changes since v1:
> >   - Rename retraining flag field
> >   - Fix DEVCTL register writing
> > ---
> >  drivers/pci/controller/pci-aardvark.c | 33 ++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
> > index eb58dfdaba1b..47b707b5fc2c 100644
> > --- a/drivers/pci/controller/pci-aardvark.c
> > +++ b/drivers/pci/controller/pci-aardvark.c
> > @@ -180,6 +180,7 @@
> >  #define LINK_WAIT_MAX_RETRIES		10
> >  #define LINK_WAIT_USLEEP_MIN		90000
> >  #define LINK_WAIT_USLEEP_MAX		100000
> > +#define LINK_RETRAIN_DELAY_MAX		(20 * HZ / 1000) /* 20 ms */
> >  
> >  #define MSI_IRQ_NUM			32
> >  
> > @@ -199,6 +200,8 @@ struct advk_pcie {
> >  	u16 msi_msg;
> >  	int root_bus_nr;
> >  	struct pci_bridge_emul bridge;
> > +	unsigned long rl_deadline; /* Retrain link jiffies deadline */
> > +	u8 rl_asked; /* Retraining has been asked and is in transition */
> >  };
> >  
> >  static inline void advk_writel(struct advk_pcie *pcie, u32 val, u64 reg)
> > @@ -400,6 +403,19 @@ static int advk_pcie_wait_pio(struct advk_pcie *pcie)
> >  	return -ETIMEDOUT;
> >  }
> >  
> > +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> > +{
> > +	if (!advk_pcie_link_up(pcie)) {
> > +		pcie->rl_asked = 0;
> > +		return 1;
> > +	}
> > +
> > +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> > +		return 1;
> > +
> > +	pcie->rl_asked = 0;
> > +	return 0;
> > +}
> >  
> >  static pci_bridge_emul_read_status_t
> >  advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> > @@ -426,11 +442,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> >  		return PCI_BRIDGE_EMUL_HANDLED;
> >  	}
> >  
> > +	case PCI_EXP_LNKCTL: {
> 
> Don't you mean PCI_EXP_LNKSTA here?

No, PCI_EXP_LNKSTA and PCI_EXP_LNKCTL are consecutive 16bit registers
but bridge emulation accesses those registers by 32bit chunk. So when
one wants to read PCI_EXP_LNKSTA register, pci bridge reads 32bit data
from PCI_EXP_LNKCTL and 16 bit shift the result to the right.

This is why I use (PCI_EXP_LNKSTA_LT << 16) below.

> 
> > +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> > +			~(PCI_EXP_LNKSTA_LT << 16);
> > +		if (advk_pcie_link_retraining(pcie))
> > +			val |= (PCI_EXP_LNKSTA_LT << 16);
> > +		*value = val;
> > +		return PCI_BRIDGE_EMUL_HANDLED;
> > +	}
> > +
> >  	case PCI_CAP_LIST_ID:
> >  	case PCI_EXP_DEVCAP:
> >  	case PCI_EXP_DEVCTL:
> >  	case PCI_EXP_LNKCAP:
> > -	case PCI_EXP_LNKCTL:
> 
> If you did mean PCI_EXP_LNKSTA above, I suppose you would leave
> PCI_EXP_LNKCTL here?
> 
> >  		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
> >  		return PCI_BRIDGE_EMUL_HANDLED;
> >  	default:
> > @@ -447,8 +471,15 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
> >  
> >  	switch (reg) {
> >  	case PCI_EXP_DEVCTL:
> > +		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> > +		break;
> 
> What's the purpose of this DEVCTL change?  Could it be a separate patch?
> I can't tell that it's related to the PCI_EXP_LNKSTA_LT issue.

In fact that is the computed diff that is a bit weird here. DEVCTL does
not really change, the advk_writel() stays the same as it was before
the change.

I have actually just modified the PCI_EXP_LNKCTL case.

> 
> >  	case PCI_EXP_LNKCTL:
> >  		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> > +		if (new & PCI_EXP_LNKCTL_RL) {
> > +			pcie->rl_asked = 1;
> > +			pcie->rl_deadline = jiffies + LINK_RETRAIN_DELAY_MAX;
> > +		}
> >  		break;
> >  
> >  	case PCI_EXP_RTCTL:
> > -- 
> > 2.20.1
> > 
> > 
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Remi Pommarel April 29, 2019, 3:32 p.m. UTC | #6
Hi Lorenzo,

Sorry for duplicates I forgot to include everyone.

On Thu, Apr 25, 2019 at 04:06:40PM +0100, Lorenzo Pieralisi wrote:
> On Thu, Apr 25, 2019 at 04:23:53PM +0200, Remi Pommarel wrote:
> > Hi Lorenzo,
> > 
> > On Thu, Apr 25, 2019 at 12:08:30PM +0100, Lorenzo Pieralisi wrote:
> > > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > > > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > > > config register does not reflect the actual link training state and is
> > > > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > > > in LMI config register could be used as a link training indicator.
> > > > Indeed if the LTSSM is in L0 or upper state then link training has
> > > > completed (see [1]).
> > > > 
> > > > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> > > > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > > > transition takes some time), LTSSM can be in L0 but link training has
> > > > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > > > state sequence has to be seen to be sure that link training has been
> > > > done.
> > > 
> > > Hi Remi,
> > > 
> > > I am a bit confused, so you are saying that the LTSSM flag in the
> > > LMI config register can't be used to detect when training is completed ?
> > 
> > Not exactly, I am saying that PCI_EXP_LNKSTA_LT from PCI_EXP_LNKSTA
> > register can't be used with this hardware, but can be emulated with
> > LTSSM flag.
> > 
> > > 
> > > Certainly it can't be used by ASPM core that relies on:
> > > 
> > > PCI_EXP_LNKSTA_LT flag
> > > 
> > > in the PCI_EXP_LNKSTA register, and that's what you are setting through
> > > this timeout mechanism IIUC.
> > > 
> > > Please elaborate on that.
> > 
> > The problem here is that the hardware does not change PCI_EXP_LNKSTA_LT
> > at all. So in order to support link re-training feature we need to
> > emulate this flag. To do so LTSSM flag can be used.
> 
> Understood.
> 
> > Indeed we can set the emulated PCI_EXP_LNKSTA_LT as soon as re-training
> > is asked and wait for LTSSM flag to be back to a configured state
> > (e.g. L0, L0s) before clearing it.
> 
> The check for the LTSSM is carried out through advk_pcie_link_up()
> (ie register CFG_REG), correct ?
> 

Yes that is correct.

> > The problem with that is that LTSSM flag does not change instantly after
> > link re-training has been asked, and will stay in configured state for a
> > small amount of time. So the idea is to poll the LTSSM flag and wait for
> > it to enter a recovery state then waiting for it to be back in
> > configured state.
> 
> When you say "poll" you mean checking advk_pcie_link_up() ?
> 

I mean checking advk_pcie_link_up() in a loop. This loop is done by the
user (e.g. ASPM core). ASPM core waits for PCI_EXP_LNKSTA_LT to be
cleared in pcie_aspm_configure_common_clock() just after it has set
PCI_EXP_LNKCTL_RL.

So the idea was to check advk_pcie_link_up() each time ASPM core checks
the PCI_EXP_LNKSTA_LT flag. Please see below patch for an alternative
to that.

> More below on the code.
> 
> > The timeout is only here as a fallback in the unlikely event that we
> > missed the LTSSM flag entering recovery state.
> > 
> > > 
> > > I am picking Bjorn's brain on this patch since what you are doing
> > > seems quite arbitrary and honestly it is a bit of a hack.
> > 
> > Yes, sorry, it is a bit of a hack because I try to workaround a
> > hardware issue.
> 
> No problems, it is not your fault.
> > 
> > Please note that vendor has been contacted about this in the meantime
> > and answered the following:
> > 
> > "FW can poll LTSSM state equals any of the following values: 0xB or 0xD
> > or 0xC or 0xE. After that, polls for LTSSM equals 0x10. For your
> > information, LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> > ........... -> 0x10".
> > 
> > It is basically what this patch does, I've just added a timeout fallback
> > to not poll LTSSM state forever if its transition to 0xB, 0xD, 0xC or
> > 0xE has been missed.
> 
> When you say "missed" you mean advk_pcie_link_up() returning true, right ?
> 

Not exactly, I mean that LTSSM had the time to go down and back up
between advk_pcie_link_up() because, for example, ASPM core loop took
too much time between two PCI_EXP_LNKSTA_LT flag checks.

> [...]
> 
> > > > +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> > > > +{
> > > > +	if (!advk_pcie_link_up(pcie)) {
> 
> That's the bit I find confusing. Is this check here to detect if the
> link went through the sequence below ? Should not it be carried
> out only if (pcie->rl_asked == 1) ?
> 
> "... LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
>  ........... -> 0x10".

Yes it is the check to detect the sequence. advk_pcie_link_up() returns
false if LTSSM <= 0x10.

This cannot be done only if (pcie->rl_asked == 1) because I still
want this function to return 1 if link is still down.

> 
> > > > +		pcie->rl_asked = 0;
> 
> Why ?
> 

rl_asked is not a good name, I could have called it
pcie->wait_for_link_down instead. So if advk_pcie_link_up() returns
false that means that we don't need to wait for link being down any more
and just wait for (LTSSM >= 0x10). In this case the delay is not needed.

> > > > +		return 1;
> > > > +	}
> > > > +
> > > > +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> > > > +		return 1;
> 
> This ensures that if the LTSSM >= 0x10 we still wait for a delay before
> considering the link up (because I suppose, after asking a retraining
> it takes a while for the LTSSM state to become < 0x10), correct ?

Yes it takes a while to become < 0x10 after retraining hence the delay.
But here we don't need to always wait for a delay. Indeed if we've
already seen the link being < 0x10 (i.e if "pcie->rl_asked == 0") and
if after that link is >= 0x10 then we know that retraining process has
finished.

Anyway I did it this way because I wanted to keep
advk_pci_bridge_emul_pcie_conf_write() from polling. But this is
obviously a bad reason as it makes the code way too complex and relies
on user (ASPM core) to do the poll instead.

So if you find the following better I'll send a v3 with that:

---
diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
index eb58dfdaba1b..67e8ae4e313e 100644
--- a/drivers/pci/controller/pci-aardvark.c
+++ b/drivers/pci/controller/pci-aardvark.c
@@ -180,6 +180,9 @@
 #define LINK_WAIT_MAX_RETRIES		10
 #define LINK_WAIT_USLEEP_MIN		90000
 #define LINK_WAIT_USLEEP_MAX		100000
+#define RETRAIN_WAIT_MAX_RETRIES	20
+#define RETRAIN_WAIT_USLEEP_MIN		2000
+#define RETRAIN_WAIT_USLEEP_MAX		5000
 
 #define MSI_IRQ_NUM			32
 
@@ -239,6 +242,17 @@ static int advk_pcie_wait_for_link(struct advk_pcie *pcie)
 	return -ETIMEDOUT;
 }
 
+static void advk_pcie_wait_for_retrain(struct advk_pcie *pcie)
+{
+	size_t retries;
+
+	for (retries = 0; retries < RETRAIN_WAIT_MAX_RETRIES; ++retries) {
+		if (!advk_pcie_link_up(pcie))
+			break;
+		usleep_range(RETRAIN_WAIT_USLEEP_MIN, RETRAIN_WAIT_USLEEP_MAX);
+	}
+}
+
 static void advk_pcie_setup_hw(struct advk_pcie *pcie)
 {
 	u32 reg;
@@ -426,11 +440,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
 		return PCI_BRIDGE_EMUL_HANDLED;
 	}
 
+	case PCI_EXP_LNKCTL: {
+		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
+			~(PCI_EXP_LNKSTA_LT << 16);
+		if (!advk_pcie_link_up(pcie))
+			val |= (PCI_EXP_LNKSTA_LT << 16);
+		*value = val;
+		return PCI_BRIDGE_EMUL_HANDLED;
+	}
+
 	case PCI_CAP_LIST_ID:
 	case PCI_EXP_DEVCAP:
 	case PCI_EXP_DEVCTL:
 	case PCI_EXP_LNKCAP:
-	case PCI_EXP_LNKCTL:
 		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
 		return PCI_BRIDGE_EMUL_HANDLED;
 	default:
@@ -447,8 +469,13 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
 
 	switch (reg) {
 	case PCI_EXP_DEVCTL:
+		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
+		break;
+
 	case PCI_EXP_LNKCTL:
 		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
+		if (new & PCI_EXP_LNKCTL_RL)
+			advk_pcie_wait_for_retrain(pcie);
 		break;
 
 	case PCI_EXP_RTCTL:
Bjorn Helgaas April 29, 2019, 7:45 p.m. UTC | #7
On Fri, Apr 26, 2019 at 12:27:57AM +0200, Remi Pommarel wrote:
> On Thu, Apr 25, 2019 at 04:04:39PM -0500, Bjorn Helgaas wrote:
> > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:

> > It sounds like reading and/or writing some registers during a retrain
> > causes some sort of EL1 error?  Is this a separate erratum?  Is there
> > a list of the registers and operations (read/write) that are affected?
> > The backtrace below suggests that it's actually a read of LNKCAP or
> > LNKCTL (not LNKSTA) that caused the error.
> 
> IIUC, the backtrace below produces an EL1 error when doing a PIO
> transfer while the link is still retraining. See my comment below for
> more about that. But accessing any root complex's register seems fine.
> > 
> > It sounds like there are really two problems:
> > 
> >   1) Reading PCI_EXP_LNKSTA (or the Aardvark equivalent) doesn't give
> >      valid data for PCI_EXP_LNKSTA_LT.
> 
> The 1) is correct.
> 
> >   2) Sometimes config reads cause EL1 errors.
> 
> Actually EL1 error happens when we try to access device's register with
> a PIO transfer, which is when we try to use the link while it is being
> retrained.
> 
> IMHO, 1) and 2) are linked. ASPM core tries to use the link too early
> because it has read invalid data for PCI_EXP_LNKSTA_LT.

From the software point of view, there is no such thing as "using the
link too early".  The pattern of:

  - Verify that link is up
  - Access device on other end of link

is always racy because the link can go down at any time due to hotplug
or other issues.  In particular, the link can go down after we verify
that the link is up, but before we access the device.

Software must be able to deal with that gracefully.  I don't know
whether that means catching and recovering from that EL1 error, or
masking it, or what.  This is architecture-specific stuff that's
outside the scope of PCIe itself.

But a link going down should never directly cause a kernel panic.

> > > This fixes boot hang or kernel panic with the following callstack due to
> > > ASPM setup doing a link re-train and polling for PCI_EXP_LNKSTA_LT flag
> > > to be cleared before using it.
> > > 
> > > -------------------- 8< -------------------
> > > 	[    0.915389]  dump_backtrace+0x0/0x140
> > > 	[    0.915391]  show_stack+0x14/0x20
> > > 	[    0.915393]  dump_stack+0x90/0xb4
> > > 	[    0.915394]  panic+0x134/0x2c0
> > > 	[    0.915396]  nmi_panic+0x6c/0x70
> > > 	[    0.915398]  arm64_serror_panic+0x74/0x80
> > > 	[    0.915400]  is_valid_bugaddr+0x0/0x8
> > > 	[    0.915402]  el1_error+0x7c/0xe4
> > > 	[    0.915404]  advk_pcie_rd_conf+0x4c/0x250
> > > 	[    0.915406]  pci_bus_read_config_word+0x7c/0xd0
> > > 	[    0.915408]  pcie_capability_read_word+0x90/0xc8
> > > 	[    0.915410]  pcie_get_aspm_reg+0x68/0x118
> > > 	[    0.915412]  pcie_aspm_init_link_state+0x460/0xa98

> > > +	case PCI_EXP_LNKCTL: {
> > 
> > Don't you mean PCI_EXP_LNKSTA here?
> 
> No, PCI_EXP_LNKSTA and PCI_EXP_LNKCTL are consecutive 16bit registers
> but bridge emulation accesses those registers by 32bit chunk. So when
> one wants to read PCI_EXP_LNKSTA register, pci bridge reads 32bit data
> from PCI_EXP_LNKCTL and 16 bit shift the result to the right.
> 
> This is why I use (PCI_EXP_LNKSTA_LT << 16) below.

Ah, that makes sense.  A comment along the lines of "u32 contains both
PCI_EXP_LNKCTL and PCI_EXP_LNKSTA" would be a nice hint.

Bjorn
Remi Pommarel April 30, 2019, 8:40 a.m. UTC | #8
On Mon, Apr 29, 2019 at 02:45:32PM -0500, Bjorn Helgaas wrote:
> On Fri, Apr 26, 2019 at 12:27:57AM +0200, Remi Pommarel wrote:
> > On Thu, Apr 25, 2019 at 04:04:39PM -0500, Bjorn Helgaas wrote:
> > > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> 
> > > It sounds like reading and/or writing some registers during a retrain
> > > causes some sort of EL1 error?  Is this a separate erratum?  Is there
> > > a list of the registers and operations (read/write) that are affected?
> > > The backtrace below suggests that it's actually a read of LNKCAP or
> > > LNKCTL (not LNKSTA) that caused the error.
> > 
> > IIUC, the backtrace below produces an EL1 error when doing a PIO
> > transfer while the link is still retraining. See my comment below for
> > more about that. But accessing any root complex's register seems fine.
> > > 
> > > It sounds like there are really two problems:
> > > 
> > >   1) Reading PCI_EXP_LNKSTA (or the Aardvark equivalent) doesn't give
> > >      valid data for PCI_EXP_LNKSTA_LT.
> > 
> > The 1) is correct.
> > 
> > >   2) Sometimes config reads cause EL1 errors.
> > 
> > Actually EL1 error happens when we try to access device's register with
> > a PIO transfer, which is when we try to use the link while it is being
> > retrained.
> > 
> > IMHO, 1) and 2) are linked. ASPM core tries to use the link too early
> > because it has read invalid data for PCI_EXP_LNKSTA_LT.
> 
> From the software point of view, there is no such thing as "using the
> link too early".  The pattern of:
> 
>   - Verify that link is up
>   - Access device on other end of link
> 
> is always racy because the link can go down at any time due to hotplug
> or other issues.  In particular, the link can go down after we verify
> that the link is up, but before we access the device.
> 
> Software must be able to deal with that gracefully.  I don't know
> whether that means catching and recovering from that EL1 error, or
> masking it, or what.  This is architecture-specific stuff that's
> outside the scope of PCIe itself.
> 
> But a link going down should never directly cause a kernel panic.

Ah, yes, you are right. There is "worse" than the EL1 error though, boot
can also hang while accessing those registers when link is not in a
ready state.

So, yes, I do agree that there are two issues here. The
PCI_EXP_LNKSTA_LT register one and the EL1 error or hang one. On the
other hand I don't think I can split it in two because this patch only
fixes the former which happens to not trigger the latter (ASPM core is
kind enough to wait for the link to be ready after retraining).

Thus the second issue remains and hot plugging for example would
likely trigger it. I'll try to see with Thomas if we could reach the
vendor about that.

By the way, I have replied to Lorenzo with, what I think, is a more
legible patch. I could send a v3 with it if you prefer this one.
Lorenzo Pieralisi April 30, 2019, 11:34 a.m. UTC | #9
On Mon, Apr 29, 2019 at 05:32:35PM +0200, Remi Pommarel wrote:
> Hi Lorenzo,
> 
> Sorry for duplicates I forgot to include everyone.
> 
> On Thu, Apr 25, 2019 at 04:06:40PM +0100, Lorenzo Pieralisi wrote:
> > On Thu, Apr 25, 2019 at 04:23:53PM +0200, Remi Pommarel wrote:
> > > Hi Lorenzo,
> > > 
> > > On Thu, Apr 25, 2019 at 12:08:30PM +0100, Lorenzo Pieralisi wrote:
> > > > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > > > > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > > > > config register does not reflect the actual link training state and is
> > > > > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > > > > in LMI config register could be used as a link training indicator.
> > > > > Indeed if the LTSSM is in L0 or upper state then link training has
> > > > > completed (see [1]).
> > > > > 
> > > > > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> > > > > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > > > > transition takes some time), LTSSM can be in L0 but link training has
> > > > > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > > > > state sequence has to be seen to be sure that link training has been
> > > > > done.
> > > > 
> > > > Hi Remi,
> > > > 
> > > > I am a bit confused, so you are saying that the LTSSM flag in the
> > > > LMI config register can't be used to detect when training is completed ?
> > > 
> > > Not exactly, I am saying that PCI_EXP_LNKSTA_LT from PCI_EXP_LNKSTA
> > > register can't be used with this hardware, but can be emulated with
> > > LTSSM flag.
> > > 
> > > > 
> > > > Certainly it can't be used by ASPM core that relies on:
> > > > 
> > > > PCI_EXP_LNKSTA_LT flag
> > > > 
> > > > in the PCI_EXP_LNKSTA register, and that's what you are setting through
> > > > this timeout mechanism IIUC.
> > > > 
> > > > Please elaborate on that.
> > > 
> > > The problem here is that the hardware does not change PCI_EXP_LNKSTA_LT
> > > at all. So in order to support link re-training feature we need to
> > > emulate this flag. To do so LTSSM flag can be used.
> > 
> > Understood.
> > 
> > > Indeed we can set the emulated PCI_EXP_LNKSTA_LT as soon as re-training
> > > is asked and wait for LTSSM flag to be back to a configured state
> > > (e.g. L0, L0s) before clearing it.
> > 
> > The check for the LTSSM is carried out through advk_pcie_link_up()
> > (ie register CFG_REG), correct ?
> > 
> 
> Yes that is correct.
> 
> > > The problem with that is that LTSSM flag does not change instantly after
> > > link re-training has been asked, and will stay in configured state for a
> > > small amount of time. So the idea is to poll the LTSSM flag and wait for
> > > it to enter a recovery state then waiting for it to be back in
> > > configured state.
> > 
> > When you say "poll" you mean checking advk_pcie_link_up() ?
> > 
> 
> I mean checking advk_pcie_link_up() in a loop. This loop is done by the
> user (e.g. ASPM core). ASPM core waits for PCI_EXP_LNKSTA_LT to be
> cleared in pcie_aspm_configure_common_clock() just after it has set
> PCI_EXP_LNKCTL_RL.
> 
> So the idea was to check advk_pcie_link_up() each time ASPM core checks
> the PCI_EXP_LNKSTA_LT flag. Please see below patch for an alternative
> to that.
> 
> > More below on the code.
> > 
> > > The timeout is only here as a fallback in the unlikely event that we
> > > missed the LTSSM flag entering recovery state.
> > > 
> > > > 
> > > > I am picking Bjorn's brain on this patch since what you are doing
> > > > seems quite arbitrary and honestly it is a bit of a hack.
> > > 
> > > Yes, sorry, it is a bit of a hack because I try to workaround a
> > > hardware issue.
> > 
> > No problems, it is not your fault.
> > > 
> > > Please note that vendor has been contacted about this in the meantime
> > > and answered the following:
> > > 
> > > "FW can poll LTSSM state equals any of the following values: 0xB or 0xD
> > > or 0xC or 0xE. After that, polls for LTSSM equals 0x10. For your
> > > information, LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> > > ........... -> 0x10".
> > > 
> > > It is basically what this patch does, I've just added a timeout fallback
> > > to not poll LTSSM state forever if its transition to 0xB, 0xD, 0xC or
> > > 0xE has been missed.
> > 
> > When you say "missed" you mean advk_pcie_link_up() returning true, right ?
> > 
> 
> Not exactly, I mean that LTSSM had the time to go down and back up
> between advk_pcie_link_up() because, for example, ASPM core loop took
> too much time between two PCI_EXP_LNKSTA_LT flag checks.
> 
> > [...]
> > 
> > > > > +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> > > > > +{
> > > > > +	if (!advk_pcie_link_up(pcie)) {
> > 
> > That's the bit I find confusing. Is this check here to detect if the
> > link went through the sequence below ? Should not it be carried
> > out only if (pcie->rl_asked == 1) ?
> > 
> > "... LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> >  ........... -> 0x10".
> 
> Yes it is the check to detect the sequence. advk_pcie_link_up() returns
> false if LTSSM <= 0x10.
> 
> This cannot be done only if (pcie->rl_asked == 1) because I still
> want this function to return 1 if link is still down.
> 
> > 
> > > > > +		pcie->rl_asked = 0;
> > 
> > Why ?
> > 
> 
> rl_asked is not a good name, I could have called it
> pcie->wait_for_link_down instead. So if advk_pcie_link_up() returns
> false that means that we don't need to wait for link being down any more
> and just wait for (LTSSM >= 0x10). In this case the delay is not needed.
> 
> > > > > +		return 1;
> > > > > +	}
> > > > > +
> > > > > +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> > > > > +		return 1;
> > 
> > This ensures that if the LTSSM >= 0x10 we still wait for a delay before
> > considering the link up (because I suppose, after asking a retraining
> > it takes a while for the LTSSM state to become < 0x10), correct ?
> 
> Yes it takes a while to become < 0x10 after retraining hence the delay.
> But here we don't need to always wait for a delay. Indeed if we've
> already seen the link being < 0x10 (i.e if "pcie->rl_asked == 0") and
> if after that link is >= 0x10 then we know that retraining process has
> finished.
> 
> Anyway I did it this way because I wanted to keep
> advk_pci_bridge_emul_pcie_conf_write() from polling. But this is
> obviously a bad reason as it makes the code way too complex and relies
> on user (ASPM core) to do the poll instead.
> 
> So if you find the following better I'll send a v3 with that:
> 
> ---
> diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
> index eb58dfdaba1b..67e8ae4e313e 100644
> --- a/drivers/pci/controller/pci-aardvark.c
> +++ b/drivers/pci/controller/pci-aardvark.c
> @@ -180,6 +180,9 @@
>  #define LINK_WAIT_MAX_RETRIES		10
>  #define LINK_WAIT_USLEEP_MIN		90000
>  #define LINK_WAIT_USLEEP_MAX		100000
> +#define RETRAIN_WAIT_MAX_RETRIES	20
> +#define RETRAIN_WAIT_USLEEP_MIN		2000
> +#define RETRAIN_WAIT_USLEEP_MAX		5000
>  
>  #define MSI_IRQ_NUM			32
>  
> @@ -239,6 +242,17 @@ static int advk_pcie_wait_for_link(struct advk_pcie *pcie)
>  	return -ETIMEDOUT;
>  }
>  
> +static void advk_pcie_wait_for_retrain(struct advk_pcie *pcie)
> +{
> +	size_t retries;
> +
> +	for (retries = 0; retries < RETRAIN_WAIT_MAX_RETRIES; ++retries) {
> +		if (!advk_pcie_link_up(pcie))
> +			break;
> +		usleep_range(RETRAIN_WAIT_USLEEP_MIN, RETRAIN_WAIT_USLEEP_MAX);
> +	}
> +}
> +
>  static void advk_pcie_setup_hw(struct advk_pcie *pcie)
>  {
>  	u32 reg;
> @@ -426,11 +440,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	}
>  
> +	case PCI_EXP_LNKCTL: {
> +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> +			~(PCI_EXP_LNKSTA_LT << 16);
> +		if (!advk_pcie_link_up(pcie))

Is this correct ?

"PCI Express Base Specification Rev4.0 Version 1.0" page 758

"Link Training: this read-only bit indicates that
the physical layer LTSSM is in the Configuration or
Recovery state or that 1b was written to the Retrain
Link..."

Isn't that a subset of states for which !advk_pcie_link_up()
return true ?

Lorenzo

> +			val |= (PCI_EXP_LNKSTA_LT << 16);
> +		*value = val;
> +		return PCI_BRIDGE_EMUL_HANDLED;
> +	}
> +
>  	case PCI_CAP_LIST_ID:
>  	case PCI_EXP_DEVCAP:
>  	case PCI_EXP_DEVCTL:
>  	case PCI_EXP_LNKCAP:
> -	case PCI_EXP_LNKCTL:
>  		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
>  		return PCI_BRIDGE_EMUL_HANDLED;
>  	default:
> @@ -447,8 +469,13 @@ advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
>  
>  	switch (reg) {
>  	case PCI_EXP_DEVCTL:
> +		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		break;
> +
>  	case PCI_EXP_LNKCTL:
>  		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
> +		if (new & PCI_EXP_LNKCTL_RL)
> +			advk_pcie_wait_for_retrain(pcie);
>  		break;
>  
>  	case PCI_EXP_RTCTL:
Remi Pommarel May 21, 2019, 3:34 p.m. UTC | #10
Hi Lorenzo,

On Tue, Apr 30, 2019 at 12:34:27PM +0100, Lorenzo Pieralisi wrote:
> On Mon, Apr 29, 2019 at 05:32:35PM +0200, Remi Pommarel wrote:
> > Hi Lorenzo,
> > 
> > Sorry for duplicates I forgot to include everyone.
> > 
> > On Thu, Apr 25, 2019 at 04:06:40PM +0100, Lorenzo Pieralisi wrote:
> > > On Thu, Apr 25, 2019 at 04:23:53PM +0200, Remi Pommarel wrote:
> > > > Hi Lorenzo,
> > > > 
> > > > On Thu, Apr 25, 2019 at 12:08:30PM +0100, Lorenzo Pieralisi wrote:
> > > > > On Sat, Mar 16, 2019 at 05:12:43PM +0100, Remi Pommarel wrote:
> > > > > > The PCI_EXP_LNKSTA_LT flag in the emulated root device's PCI_EXP_LNKSTA
> > > > > > config register does not reflect the actual link training state and is
> > > > > > always cleared. The Link Training and Status State Machine (LTSSM) flag
> > > > > > in LMI config register could be used as a link training indicator.
> > > > > > Indeed if the LTSSM is in L0 or upper state then link training has
> > > > > > completed (see [1]).
> > > > > > 
> > > > > > Unfortunately because setting the PCI_EXP_LINCTL_RL flag does not
> > > > > > instantly imply a LTSSM state change (e.g. L0s to recovery state
> > > > > > transition takes some time), LTSSM can be in L0 but link training has
> > > > > > not finished yet. Thus a lower L0 LTSSM state followed by a L0 or upper
> > > > > > state sequence has to be seen to be sure that link training has been
> > > > > > done.
> > > > > 
> > > > > Hi Remi,
> > > > > 
> > > > > I am a bit confused, so you are saying that the LTSSM flag in the
> > > > > LMI config register can't be used to detect when training is completed ?
> > > > 
> > > > Not exactly, I am saying that PCI_EXP_LNKSTA_LT from PCI_EXP_LNKSTA
> > > > register can't be used with this hardware, but can be emulated with
> > > > LTSSM flag.
> > > > 
> > > > > 
> > > > > Certainly it can't be used by ASPM core that relies on:
> > > > > 
> > > > > PCI_EXP_LNKSTA_LT flag
> > > > > 
> > > > > in the PCI_EXP_LNKSTA register, and that's what you are setting through
> > > > > this timeout mechanism IIUC.
> > > > > 
> > > > > Please elaborate on that.
> > > > 
> > > > The problem here is that the hardware does not change PCI_EXP_LNKSTA_LT
> > > > at all. So in order to support link re-training feature we need to
> > > > emulate this flag. To do so LTSSM flag can be used.
> > > 
> > > Understood.
> > > 
> > > > Indeed we can set the emulated PCI_EXP_LNKSTA_LT as soon as re-training
> > > > is asked and wait for LTSSM flag to be back to a configured state
> > > > (e.g. L0, L0s) before clearing it.
> > > 
> > > The check for the LTSSM is carried out through advk_pcie_link_up()
> > > (ie register CFG_REG), correct ?
> > > 
> > 
> > Yes that is correct.
> > 
> > > > The problem with that is that LTSSM flag does not change instantly after
> > > > link re-training has been asked, and will stay in configured state for a
> > > > small amount of time. So the idea is to poll the LTSSM flag and wait for
> > > > it to enter a recovery state then waiting for it to be back in
> > > > configured state.
> > > 
> > > When you say "poll" you mean checking advk_pcie_link_up() ?
> > > 
> > 
> > I mean checking advk_pcie_link_up() in a loop. This loop is done by the
> > user (e.g. ASPM core). ASPM core waits for PCI_EXP_LNKSTA_LT to be
> > cleared in pcie_aspm_configure_common_clock() just after it has set
> > PCI_EXP_LNKCTL_RL.
> > 
> > So the idea was to check advk_pcie_link_up() each time ASPM core checks
> > the PCI_EXP_LNKSTA_LT flag. Please see below patch for an alternative
> > to that.
> > 
> > > More below on the code.
> > > 
> > > > The timeout is only here as a fallback in the unlikely event that we
> > > > missed the LTSSM flag entering recovery state.
> > > > 
> > > > > 
> > > > > I am picking Bjorn's brain on this patch since what you are doing
> > > > > seems quite arbitrary and honestly it is a bit of a hack.
> > > > 
> > > > Yes, sorry, it is a bit of a hack because I try to workaround a
> > > > hardware issue.
> > > 
> > > No problems, it is not your fault.
> > > > 
> > > > Please note that vendor has been contacted about this in the meantime
> > > > and answered the following:
> > > > 
> > > > "FW can poll LTSSM state equals any of the following values: 0xB or 0xD
> > > > or 0xC or 0xE. After that, polls for LTSSM equals 0x10. For your
> > > > information, LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> > > > ........... -> 0x10".
> > > > 
> > > > It is basically what this patch does, I've just added a timeout fallback
> > > > to not poll LTSSM state forever if its transition to 0xB, 0xD, 0xC or
> > > > 0xE has been missed.
> > > 
> > > When you say "missed" you mean advk_pcie_link_up() returning true, right ?
> > > 
> > 
> > Not exactly, I mean that LTSSM had the time to go down and back up
> > between advk_pcie_link_up() because, for example, ASPM core loop took
> > too much time between two PCI_EXP_LNKSTA_LT flag checks.
> > 
> > > [...]
> > > 
> > > > > > +static int advk_pcie_link_retraining(struct advk_pcie *pcie)
> > > > > > +{
> > > > > > +	if (!advk_pcie_link_up(pcie)) {
> > > 
> > > That's the bit I find confusing. Is this check here to detect if the
> > > link went through the sequence below ? Should not it be carried
> > > out only if (pcie->rl_asked == 1) ?
> > > 
> > > "... LTSSM will transit from 0x10 -> 0xB -> 0xD -> 0xC or 0xE
> > >  ........... -> 0x10".
> > 
> > Yes it is the check to detect the sequence. advk_pcie_link_up() returns
> > false if LTSSM <= 0x10.
> > 
> > This cannot be done only if (pcie->rl_asked == 1) because I still
> > want this function to return 1 if link is still down.
> > 
> > > 
> > > > > > +		pcie->rl_asked = 0;
> > > 
> > > Why ?
> > > 
> > 
> > rl_asked is not a good name, I could have called it
> > pcie->wait_for_link_down instead. So if advk_pcie_link_up() returns
> > false that means that we don't need to wait for link being down any more
> > and just wait for (LTSSM >= 0x10). In this case the delay is not needed.
> > 
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
> > > > > > +		return 1;
> > > 
> > > This ensures that if the LTSSM >= 0x10 we still wait for a delay before
> > > considering the link up (because I suppose, after asking a retraining
> > > it takes a while for the LTSSM state to become < 0x10), correct ?
> > 
> > Yes it takes a while to become < 0x10 after retraining hence the delay.
> > But here we don't need to always wait for a delay. Indeed if we've
> > already seen the link being < 0x10 (i.e if "pcie->rl_asked == 0") and
> > if after that link is >= 0x10 then we know that retraining process has
> > finished.
> > 
> > Anyway I did it this way because I wanted to keep
> > advk_pci_bridge_emul_pcie_conf_write() from polling. But this is
> > obviously a bad reason as it makes the code way too complex and relies
> > on user (ASPM core) to do the poll instead.
> > 
> > So if you find the following better I'll send a v3 with that:
> > 
> > ---
> > diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
> > index eb58dfdaba1b..67e8ae4e313e 100644
> > --- a/drivers/pci/controller/pci-aardvark.c
> > +++ b/drivers/pci/controller/pci-aardvark.c
> > @@ -180,6 +180,9 @@
> >  #define LINK_WAIT_MAX_RETRIES		10
> >  #define LINK_WAIT_USLEEP_MIN		90000
> >  #define LINK_WAIT_USLEEP_MAX		100000
> > +#define RETRAIN_WAIT_MAX_RETRIES	20
> > +#define RETRAIN_WAIT_USLEEP_MIN		2000
> > +#define RETRAIN_WAIT_USLEEP_MAX		5000
> >  
> >  #define MSI_IRQ_NUM			32
> >  
> > @@ -239,6 +242,17 @@ static int advk_pcie_wait_for_link(struct advk_pcie *pcie)
> >  	return -ETIMEDOUT;
> >  }
> >  
> > +static void advk_pcie_wait_for_retrain(struct advk_pcie *pcie)
> > +{
> > +	size_t retries;
> > +
> > +	for (retries = 0; retries < RETRAIN_WAIT_MAX_RETRIES; ++retries) {
> > +		if (!advk_pcie_link_up(pcie))
> > +			break;
> > +		usleep_range(RETRAIN_WAIT_USLEEP_MIN, RETRAIN_WAIT_USLEEP_MAX);
> > +	}
> > +}
> > +
> >  static void advk_pcie_setup_hw(struct advk_pcie *pcie)
> >  {
> >  	u32 reg;
> > @@ -426,11 +440,19 @@ advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
> >  		return PCI_BRIDGE_EMUL_HANDLED;
> >  	}
> >  
> > +	case PCI_EXP_LNKCTL: {
> > +		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
> > +			~(PCI_EXP_LNKSTA_LT << 16);
> > +		if (!advk_pcie_link_up(pcie))
> 
> Is this correct ?
> 
> "PCI Express Base Specification Rev4.0 Version 1.0" page 758
> 
> "Link Training: this read-only bit indicates that
> the physical layer LTSSM is in the Configuration or
> Recovery state or that 1b was written to the Retrain
> Link..."
> 
> Isn't that a subset of states for which !advk_pcie_link_up()
> return true ?

Yes you are right, unfortunately I don't know exactly what the LTSSM
value for Configuration or Recovery states. All I can observe is that
LTSSM goes to 0xb which is likely either Recovery or Configuration
state.

Sorry for long response delay, I was waiting for Marvell answer on that
specific subject but I don't think it is going to come anytime soon. So
in the meantime I suggest we could either use !advk_pcie_link_up() or
check for LTSSM != 0xb. Would you be ok with that ?

Thanks,
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c
index eb58dfdaba1b..47b707b5fc2c 100644
--- a/drivers/pci/controller/pci-aardvark.c
+++ b/drivers/pci/controller/pci-aardvark.c
@@ -180,6 +180,7 @@ 
 #define LINK_WAIT_MAX_RETRIES		10
 #define LINK_WAIT_USLEEP_MIN		90000
 #define LINK_WAIT_USLEEP_MAX		100000
+#define LINK_RETRAIN_DELAY_MAX		(20 * HZ / 1000) /* 20 ms */
 
 #define MSI_IRQ_NUM			32
 
@@ -199,6 +200,8 @@  struct advk_pcie {
 	u16 msi_msg;
 	int root_bus_nr;
 	struct pci_bridge_emul bridge;
+	unsigned long rl_deadline; /* Retrain link jiffies deadline */
+	u8 rl_asked; /* Retraining has been asked and is in transition */
 };
 
 static inline void advk_writel(struct advk_pcie *pcie, u32 val, u64 reg)
@@ -400,6 +403,19 @@  static int advk_pcie_wait_pio(struct advk_pcie *pcie)
 	return -ETIMEDOUT;
 }
 
+static int advk_pcie_link_retraining(struct advk_pcie *pcie)
+{
+	if (!advk_pcie_link_up(pcie)) {
+		pcie->rl_asked = 0;
+		return 1;
+	}
+
+	if (pcie->rl_asked && time_before(jiffies, pcie->rl_deadline))
+		return 1;
+
+	pcie->rl_asked = 0;
+	return 0;
+}
 
 static pci_bridge_emul_read_status_t
 advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
@@ -426,11 +442,19 @@  advk_pci_bridge_emul_pcie_conf_read(struct pci_bridge_emul *bridge,
 		return PCI_BRIDGE_EMUL_HANDLED;
 	}
 
+	case PCI_EXP_LNKCTL: {
+		u32 val = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg) &
+			~(PCI_EXP_LNKSTA_LT << 16);
+		if (advk_pcie_link_retraining(pcie))
+			val |= (PCI_EXP_LNKSTA_LT << 16);
+		*value = val;
+		return PCI_BRIDGE_EMUL_HANDLED;
+	}
+
 	case PCI_CAP_LIST_ID:
 	case PCI_EXP_DEVCAP:
 	case PCI_EXP_DEVCTL:
 	case PCI_EXP_LNKCAP:
-	case PCI_EXP_LNKCTL:
 		*value = advk_readl(pcie, PCIE_CORE_PCIEXP_CAP + reg);
 		return PCI_BRIDGE_EMUL_HANDLED;
 	default:
@@ -447,8 +471,15 @@  advk_pci_bridge_emul_pcie_conf_write(struct pci_bridge_emul *bridge,
 
 	switch (reg) {
 	case PCI_EXP_DEVCTL:
+		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
+		break;
+
 	case PCI_EXP_LNKCTL:
 		advk_writel(pcie, new, PCIE_CORE_PCIEXP_CAP + reg);
+		if (new & PCI_EXP_LNKCTL_RL) {
+			pcie->rl_asked = 1;
+			pcie->rl_deadline = jiffies + LINK_RETRAIN_DELAY_MAX;
+		}
 		break;
 
 	case PCI_EXP_RTCTL: