PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

Message ID 20181206041951.22413-1-david@gibson.dropbear.id.au
State Not Applicable
Series
  • PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

Checks

Context Check Description
snowpatch_ozlabs/checkpatch warning total: 0 errors, 2 warnings, 0 checks, 40 lines checked
snowpatch_ozlabs/build-pmac32 success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64le success build succeeded & removed 0 sparse warning(s)
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied

Commit Message

David Gibson Dec. 6, 2018, 4:19 a.m.
Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
unbound from their regular driver and attached to vfio-pci in order to pass
them through to a guest.

This goes away if the disable_idle_d3 option is used, so it looks like a
problem with the hardware handling D3 state.  To fix that more permanently,
use a device quirk to disable D3 state for these devices.

We do this by renaming the existing quirk_no_ata_d3() more generally and
attaching it to the ConnectX-[45] devices (0x15b3:0x1013).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 drivers/pci/quirks.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)
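The mechanism the quirk relies on can be illustrated with a small self-contained sketch (plain user-space C; the types, the flag's bit position, and the check are simplified stand-ins modeled on the PCI core's power-state path, not the kernel's actual definitions):

```c
#include <assert.h>
#include <errno.h>

/*
 * Simplified user-space stand-ins for the kernel types involved --
 * illustrative only, not the real definitions from <linux/pci.h>.
 */
#define PCI_D0    0
#define PCI_D3hot 3
#define PCI_DEV_FLAGS_NO_D3 (1u << 1)	/* assumed bit position */

struct pci_dev {
	unsigned int dev_flags;
	int current_state;
};

/* What the renamed quirk does: set a flag the PM code checks later. */
static void quirk_no_d3(struct pci_dev *pdev)
{
	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
}

/* Sketch of the check the PCI core makes before entering D3. */
static int pci_set_power_state_sketch(struct pci_dev *pdev, int state)
{
	if (state == PCI_D3hot && (pdev->dev_flags & PCI_DEV_FLAGS_NO_D3))
		return -EIO;	/* quirked devices are never put in D3 */
	pdev->current_state = state;
	return 0;
}
```

With the quirk applied, a later attempt by vfio-pci (or anything else) to put the device into D3hot fails cleanly instead of touching hardware that mishandles the transition.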

Comments

Leon Romanovsky Dec. 6, 2018, 6:45 a.m. | #1
On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> unbound from their regular driver and attached to vfio-pci in order to pass
> them through to a guest.
>
> This goes away if the disable_idle_d3 option is used, so it looks like a
> problem with the hardware handling D3 state.  To fix that more permanently,
> use a device quirk to disable D3 state for these devices.
>
> We do this by renaming the existing quirk_no_ata_d3() more generally and
> attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  drivers/pci/quirks.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
>

Hi David,

Thanks for your patch.

I would like to reproduce the call trace before moving forward,
but I'm having trouble reproducing the original issue.

I'm working with vfio-pci and CX-4/5 cards on a daily basis;
I tried manually entering D3 state just now, and it worked for me.

Can you please post your full call trace, and the "lspci -s PCI_ID -vv"
output?

Thanks
David Gibson Dec. 11, 2018, 2:31 a.m. | #2
On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > unbound from their regular driver and attached to vfio-pci in order to pass
> > them through to a guest.
> >
> > This goes away if the disable_idle_d3 option is used, so it looks like a
> > problem with the hardware handling D3 state.  To fix that more permanently,
> > use a device quirk to disable D3 state for these devices.
> >
> > We do this by renaming the existing quirk_no_ata_d3() more generally and
> > attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  drivers/pci/quirks.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> >
> 
> Hi David,
> 
> Thank for your patch,
> 
> I would like to reproduce the calltrace before moving forward,
> but have trouble to reproduce the original issue.
> 
> I'm working with vfio-pci and CX-4/5 cards on daily basis,
> tried manually enter into D3 state now, and it worked for me.
> 
> Can you please post your full calltrace, and "lspci -s PCI_ID -vv"
> output?

Sorry, I may have jumped the gun on this.  Using disable_idle_d3 seems
to do _something_ for these cards, but there are some other things
going wrong which are confusing the issue.  This is on POWER, which
might affect the situation.  I'll get back to you once I have some
more information.
Bjorn Helgaas Dec. 11, 2018, 2:01 p.m. | #3
Hi David,

I see you're still working on this, but if you do end up going this
direction eventually, would you mind splitting this into two patches:
1) rename the quirk to make it more generic (but not changing any
behavior), and 2) add the ConnectX devices to the quirk.  That way
the ConnectX change is smaller and more easily understood/reverted/etc.

On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> unbound from their regular driver and attached to vfio-pci in order to pass
> them through to a guest.
> 
> This goes away if the disable_idle_d3 option is used, so it looks like a
> problem with the hardware handling D3 state.  To fix that more permanently,
> use a device quirk to disable D3 state for these devices.
> 
> We do this by renaming the existing quirk_no_ata_d3() more generally and
> attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  drivers/pci/quirks.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..add3f516ca12 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1315,23 +1315,24 @@ static void quirk_ide_samemode(struct pci_dev *pdev)
>  }
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
>  
> -/* Some ATA devices break if put into D3 */
> -static void quirk_no_ata_d3(struct pci_dev *pdev)
> +/* Some devices (including a number of ATA cards) break if put into D3 */
> +static void quirk_no_d3(struct pci_dev *pdev)
>  {
>  	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
>  }
> +
>  /* Quirk the legacy ATA devices only. The AHCI ones are ok */
>  DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SERVERWORKS, PCI_ANY_ID,
> -				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
> +				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
>  DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
> -				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
> +				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
>  /* ALi loses some register settings that we cannot then restore */
>  DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AL, PCI_ANY_ID,
> -				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
> +				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
>  /* VIA comes back fine but we need to keep it alive or ACPI GTM failures
>     occur when mode detecting */
>  DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
> -				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
> +				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
>  
>  /*
>   * This was originally an Alpha-specific thing, but it really fits here.
> @@ -3367,6 +3368,10 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
>  			mellanox_check_broken_intx_masking);
>  
> +/* Mellanox MT27800 (ConnectX-5) IB card seems to break with D3
> + * In particular this shows up when the device is bound to the vfio-pci driver */

Follow usual multiline comment style, i.e.,

  /*
   * text ...
   * more text ...
   */
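Applied to the hunk above, that style would give something like the following (the trailing semicolon matches the file's other DECLARE_PCI_FIXUP uses; the exact wording is illustrative):

```c
/*
 * Mellanox MT27800 (ConnectX-5) IB cards seem to break if put into D3.
 * In particular this shows up when the device is bound to vfio-pci.
 */
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX,
			PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3);
```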

> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3)
> +
>  static void quirk_no_bus_reset(struct pci_dev *dev)
>  {
>  	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> -- 
> 2.19.2
>
David Gibson Dec. 12, 2018, 12:22 a.m. | #4
On Tue, Dec 11, 2018 at 08:01:43AM -0600, Bjorn Helgaas wrote:
> Hi David,
> 
> I see you're still working on this, but if you do end up going this
> direction eventually, would you mind splitting this into two patches:
> 1) rename the quirk to make it more generic (but not changing any
> behavior), and 2) add the ConnectX devices to the quirk.  That way
> the ConnectX change is smaller and more easily
> understood/reverted/etc.

Sure.  Would it make sense to send (1) as an independent cleanup,
while I'm still working out exactly what (if anything) we need for
(2)?
Bjorn Helgaas Dec. 12, 2018, 3:04 a.m. | #5
On Tue, Dec 11, 2018 at 6:38 PM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Tue, Dec 11, 2018 at 08:01:43AM -0600, Bjorn Helgaas wrote:
> > Hi David,
> >
> > I see you're still working on this, but if you do end up going this
> > direction eventually, would you mind splitting this into two patches:
> > 1) rename the quirk to make it more generic (but not changing any
> > behavior), and 2) add the ConnectX devices to the quirk.  That way
> > the ConnectX change is smaller and more easily
> > understood/reverted/etc.
>
> Sure.  Would it make sense to send (1) as an independent cleanup,
> while I'm still working out exactly what (if anything) we need for
> (2)?

You could, but I don't think there's really much benefit in doing the
first without the second, and I think there is some value in handling
both patches at the same time.
David Gibson Jan. 4, 2019, 3:44 a.m. | #6
On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > unbound from their regular driver and attached to vfio-pci in order to pass
> > them through to a guest.
> >
> > This goes away if the disable_idle_d3 option is used, so it looks like a
> > problem with the hardware handling D3 state.  To fix that more permanently,
> > use a device quirk to disable D3 state for these devices.
> >
> > We do this by renaming the existing quirk_no_ata_d3() more generally and
> > attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  drivers/pci/quirks.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> >
> 
> Hi David,
> 
> Thank for your patch,
> 
> I would like to reproduce the calltrace before moving forward,
> but have trouble to reproduce the original issue.
> 
> I'm working with vfio-pci and CX-4/5 cards on daily basis,
> tried manually enter into D3 state now, and it worked for me.

Interesting.  I've investigated this further, though I don't have as
many new clues as I'd like.  The problem occurs reliably, at least on
one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
I don't yet know if it occurs with other machines, I'm having trouble
getting access to other machines with a suitable card.  I didn't
manage to reproduce it on a different POWER8 machine with a
ConnectX-5, but I don't know if it's the difference in machine or
difference in card revision that's important.

So possibilities that occur to me:
  * It's something specific about how the vfio-pci driver uses D3
    state - have you tried rebinding your device to vfio-pci?
  * It's something specific about POWER, either the kernel or the PCI
    bridge hardware
  * It's something specific about this particular type of machine

> Can you please post your full calltrace, and "lspci -s PCI_ID -vv"
> output?

[root@ibm-p8-garrison-01 ~]# lspci -vv -s 0008:01:00
0008:01:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: IBM Device 04f1
	Physical Slot: Slot1
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 473
	NUMA node: 1
	Region 0: Memory at 240000000000 (64-bit, prefetchable) [size=512M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: 2-port 100Gb EDR IB PCIe x16 Adapter   
		Read-only fields:
			[PN] Part number: 00WT039
			[EC] Engineering changes: P40057
			[FN] Unknown: 30 30 57 54 30 37 35
			[SN] Serial number: YA50YF58P080
			[FC] Unknown: 45 43 33 46
			[CC] Unknown: 32 43 45 41
			[VK] Vendor specific: ipzSeries
			[MN] Manufacture ID: 532X4590060204 
			[Z0] Unknown: 49 42 4d 32 31 39 30 31 31 30 30 33 32
			[RV] Reserved: checksum good, 0 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Device Serial Number ba-da-ce-55-de-ad-ca-fe
	Capabilities: [110 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [170 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Kernel driver in use: vfio-pci
	Kernel modules: mlx5_core

0008:01:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
	Subsystem: IBM Device 04f1
	Physical Slot: Slot1
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 473
	NUMA node: 1
	Region 0: Memory at 240020000000 (64-bit, prefetchable) [size=512M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: 2-port 100Gb EDR IB PCIe x16 Adapter   
		Read-only fields:
			[PN] Part number: 00WT039
			[EC] Engineering changes: P40057
			[FN] Unknown: 30 30 57 54 30 37 35
			[SN] Serial number: YA50YF58P080
			[FC] Unknown: 45 43 33 46
			[CC] Unknown: 32 43 45 41
			[VK] Vendor specific: ipzSeries
			[MN] Manufacture ID: 532X4590060204 
			[Z0] Unknown: 49 42 4d 32 31 39 30 31 31 30 30 33 32
			[RV] Reserved: checksum good, 0 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Device Serial Number ba-da-ce-55-de-ad-ca-fe
	Capabilities: [110 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [170 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Kernel driver in use: vfio-pci
	Kernel modules: mlx5_core


The problem is manifesting as an EEH failure (a POWER specific error
reporting system similar in intent to AER but entirely different in
implementation).  That's in turn causing the device to be reset and
the call trace from there.  There are bugs in the EEH recovery that
we're pursuing elsewhere, but the problem at issue here is why we're
tripping a hardware reported failure in the first place.

Given that, the trace probably isn't very meaningful (it's from the
recovery path, not the mlx or vfio driver), but fwiw:

[  132.573829] EEH: PHB#8 failure detected, location: N/A
[  132.573944] CPU: 64 PID: 397 Comm: kworker/64:0 Kdump: loaded Not tainted 4.18.0-57.el8.ppc64le #1
[  132.574052] Workqueue: events work_for_cpu_fn
[  132.574083] Call Trace:
[  132.574100] [c0000037f54d38c0] [c000000000c9ceec] dump_stack+0xb0/0xf4 (unreliable)
[  132.574147] [c0000037f54d3900] [c000000000042664] eeh_dev_check_failure+0x524/0x5f0
[  132.574300] [c0000037f54d39a0] [c0000000000bf108] pnv_pci_read_config+0x148/0x180
[  132.574348] [c0000037f54d39e0] [c000000000731694] pci_read_config_word+0xa4/0x130
[  132.574393] [c0000037f54d3a40] [c00000000073aa18] pci_raw_set_power_state+0xf8/0x300
[  132.574438] [c0000037f54d3ad0] [c000000000743450] pci_set_power_state+0x60/0x250
[  132.574486] [c0000037f54d3b10] [d000000013561e4c] vfio_pci_probe+0x184/0x270 [vfio_pci]
[  132.574531] [c0000037f54d3bb0] [c00000000074bb3c] local_pci_probe+0x6c/0x140
[  132.574577] [c0000037f54d3c40] [c00000000015aa18] work_for_cpu_fn+0x38/0x60
[  132.574615] [c0000037f54d3c70] [c00000000015fb84] process_one_work+0x2f4/0x5b0
[  132.574660] [c0000037f54d3d10] [c000000000161190] worker_thread+0x330/0x760
[  132.574803] [c0000037f54d3dc0] [c00000000016a4fc] kthread+0x1ac/0x1c0
[  132.574842] [c0000037f54d3e30] [c00000000000b75c] ret_from_kernel_thread+0x5c/0x80
[  132.574894] EEH: Detected error on PHB#8
[  132.574926] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  132.574981] EEH: Notify device drivers to shutdown
[  132.575011] EEH: Beginning: 'error_detected(IO frozen)'
[  132.575040] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  132.575193] EEH: PE#0 (PCI 0008:01:00.0): Invoking vfio-pci->error_detected(IO frozen)
[  132.575253] EEH: PE#0 (PCI 0008:01:00.0): vfio-pci driver reports: 'can recover'
[  132.575514] EEH: PE#0 (PCI 0008:01:00.1): Invoking vfio-pci->error_detected(IO frozen)
[  132.575592] EEH: PE#0 (PCI 0008:01:00.1): vfio-pci driver reports: 'can recover'
[  132.575634] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
[  132.575684] EEH: Collect temporary log
[  132.575706] PHB3 PHB#8 Diag-data (Version: 1)
[  132.575734] brdgCtl:     0000ffff
[  132.575756] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
[  132.575790] RootErrSts:  ffffffff ffffffff ffffffff
[  132.575933] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
[  132.575973] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[  132.576014] nFir:        0000808000000000 0030006e00000000 0000800000000000
[  132.576048] PhbSts:      0000001800000000 0000001800000000
[  132.576076] Lem:         0000020000080000 42498e367f502eae 0000000000080000
[  132.576111] OutErr:      0000002000000000 0000002000000000 0000000000000000 0000000000000000
[  132.576159] InAErr:      0000000020000000 0000000020000000 8080000000000000 0000000000000000
[  132.576327] EEH: Reset without hotplug activity
[  132.606003] vfio-pci 0008:01:00.0: Refused to change power state, currently in D3
[  132.606062] iommu: Removing device 0008:01:00.0 from group 0
[  132.636000] vfio-pci 0008:01:00.1: Refused to change power state, currently in D3
[  132.636057] iommu: Removing device 0008:01:00.1 from group 0
[  137.196696] EEH: Sleep 5s ahead of partial hotplug
[  142.236046] pci 0008:01:00.0: [15b3:1013] type 00 class 0x020700
[  142.236156] pci 0008:01:00.0: reg 0x10: [mem 0x240000000000-0x24001fffffff 64bit pref]
[  142.236932] pci 0008:01:00.1: [15b3:1013] type 00 class 0x020700
[  142.237030] pci 0008:01:00.1: reg 0x10: [mem 0x240020000000-0x24003fffffff 64bit pref]
[  142.238763] pci 0008:00:00.0: BAR 14: assigned [mem 0x3fe200000000-0x3fe23fffffff]
[  142.238940] pci 0008:01:00.0: BAR 0: assigned [mem 0x240000000000-0x24001fffffff 64bit pref]
[  142.239021] pci 0008:01:00.1: BAR 0: assigned [mem 0x240020000000-0x24003fffffff 64bit pref]
[  142.239112] pci 0008:01:00.0: Can't enable device memory
[  142.239417] mlx5_core 0008:01:00.0: Cannot enable PCI device, aborting
[  142.239476] mlx5_core 0008:01:00.0: mlx5_pci_init failed with error code -22
[  142.239539] mlx5_core: probe of 0008:01:00.0 failed with error -22
[  142.239590] vfio-pci: probe of 0008:01:00.0 failed with error -22
[  142.239631] pci 0008:01:00.1: Can't enable device memory
[  142.241612] mlx5_core 0008:01:00.1: Cannot enable PCI device, aborting
[  142.241654] mlx5_core 0008:01:00.1: mlx5_pci_init failed with error code -22
[  142.241716] mlx5_core: probe of 0008:01:00.1 failed with error -22
[  142.241762] vfio-pci: probe of 0008:01:00.1 failed with error -22
[  142.241800] EEH: Notify device drivers the completion of reset
[  142.241835] EEH: Beginning: 'slot_reset'
[  142.241856] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  142.241884] EEH: Finished:'slot_reset' with aggregate recovery state:'none'
[  142.241918] EEH: Notify device driver to resume
[  142.241947] EEH: Beginning: 'resume'
[  142.241968] EEH: PE#fe (PCI 0008:00:00.0): no driver
[  142.241996] EEH: Finished:'resume'
[  142.241996] EEH: Recovery successful.
Jason Gunthorpe Jan. 5, 2019, 5:51 p.m. | #7
On Fri, Jan 04, 2019 at 02:44:01PM +1100, David Gibson wrote:
> On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> > On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > > unbound from their regular driver and attached to vfio-pci in order to pass
> > > them through to a guest.
> > >
> > > This goes away if the disable_idle_d3 option is used, so it looks like a
> > > problem with the hardware handling D3 state.  To fix that more permanently,
> > > use a device quirk to disable D3 state for these devices.
> > >
> > > We do this by renaming the existing quirk_no_ata_d3() more generally and
> > > attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > >  drivers/pci/quirks.c | 17 +++++++++++------
> > >  1 file changed, 11 insertions(+), 6 deletions(-)
> > >
> > 
> > Hi David,
> > 
> > Thank for your patch,
> > 
> > I would like to reproduce the calltrace before moving forward,
> > but have trouble to reproduce the original issue.
> > 
> > I'm working with vfio-pci and CX-4/5 cards on daily basis,
> > tried manually enter into D3 state now, and it worked for me.
> 
> Interesting.  I've investigated this further, though I don't have as
> many new clues as I'd like.  The problem occurs reliably, at least on
> one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> I don't yet know if it occurs with other machines, I'm having trouble
> getting access to other machines with a suitable card.  I didn't
> manage to reproduce it on a different POWER8 machine with a
> ConnectX-5, but I don't know if it's the difference in machine or
> difference in card revision that's important.

Making sure the card has the latest firmware is always good advice..

> So possibilities that occur to me:
>   * It's something specific about how the vfio-pci driver uses D3
>     state - have you tried rebinding your device to vfio-pci?
>   * It's something specific about POWER, either the kernel or the PCI
>     bridge hardware
>   * It's something specific about this particular type of machine

Does the EEH indicate what happened to actually trigger it?

Jason
Benjamin Herrenschmidt Jan. 5, 2019, 10:43 p.m. | #8
On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> 
> > Interesting.  I've investigated this further, though I don't have as
> > many new clues as I'd like.  The problem occurs reliably, at least on
> > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > I don't yet know if it occurs with other machines, I'm having trouble
> > getting access to other machines with a suitable card.  I didn't
> > manage to reproduce it on a different POWER8 machine with a
> > ConnectX-5, but I don't know if it's the difference in machine or
> > difference in card revision that's important.
> 
> Make sure the card has the latest firmware is always good advice..
> 
> > So possibilities that occur to me:
> >   * It's something specific about how the vfio-pci driver uses D3
> >     state - have you tried rebinding your device to vfio-pci?
> >   * It's something specific about POWER, either the kernel or the PCI
> >     bridge hardware
> >   * It's something specific about this particular type of machine
> 
> Does the EEH indicate what happend to actually trigger it?

In a very cryptic way that requires manual parsing using non-public
docs sadly but yes. From the look of it, it's a completion timeout.

Looks to me like we don't get a response to a config space access
during the change of D state. I don't know if it's the write of the D3
state itself or the read back though (it's probably detected on the
read back or a subsequent read, but that doesn't tell me which specific
one failed).

Some extra logging in OPAL might help pin that down by checking the InA
error state in the config accessor after the config write (and polling
on it for a while, since from a CPU perspective I don't know if the
write is synchronous; probably not).
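The write-then-readback sequence being described can be sketched with a small self-contained model (plain C; the register layout and config accessors are simplified assumptions, standing in for the PM control register write and readback the PCI core performs when changing D-state). A config read that gets no completion returns all-ones, which is also why the diag data above is full of `ffffffff`:

```c
#include <assert.h>
#include <stdint.h>

#define PM_CTRL_STATE_MASK 0x3u
#define FROZEN_READ 0xffffu	/* a frozen PE returns all-ones */

/* Fake 16-bit PM control/status register plus a "PE frozen" flag. */
struct fake_dev {
	uint16_t pmcsr;
	int frozen;	/* set when the bridge freezes the PE (EEH) */
};

static uint16_t cfg_read(const struct fake_dev *d)
{
	return d->frozen ? FROZEN_READ : d->pmcsr;
}

static void cfg_write(struct fake_dev *d, uint16_t v)
{
	if (!d->frozen)
		d->pmcsr = v;
}

/*
 * Sketch of a D-state change: write the requested state into the PM
 * control register, then read it back to see where the device landed.
 * The readback is where a completion timeout / frozen PE shows up.
 */
static unsigned int set_d_state(struct fake_dev *d, uint16_t state)
{
	uint16_t v = (uint16_t)((cfg_read(d) & ~PM_CTRL_STATE_MASK) | state);

	cfg_write(d, v);
	return cfg_read(d) & PM_CTRL_STATE_MASK;
}
```

Note that an all-ones readback decodes as power state 3, which would also explain the "Refused to change power state, currently in D3" messages in the log above: a frozen device merely looks like it is stuck in D3.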

Cheers,
Ben.
Jason Gunthorpe Jan. 8, 2019, 4:01 a.m. | #9
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > 
> > > Interesting.  I've investigated this further, though I don't have as
> > > many new clues as I'd like.  The problem occurs reliably, at least on
> > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > I don't yet know if it occurs with other machines, I'm having trouble
> > > getting access to other machines with a suitable card.  I didn't
> > > manage to reproduce it on a different POWER8 machine with a
> > > ConnectX-5, but I don't know if it's the difference in machine or
> > > difference in card revision that's important.
> > 
> > Make sure the card has the latest firmware is always good advice..
> > 
> > > So possibilities that occur to me:
> > >   * It's something specific about how the vfio-pci driver uses D3
> > >     state - have you tried rebinding your device to vfio-pci?
> > >   * It's something specific about POWER, either the kernel or the PCI
> > >     bridge hardware
> > >   * It's something specific about this particular type of machine
> > 
> > Does the EEH indicate what happend to actually trigger it?
> 
> In a very cryptic way that requires manual parsing using non-public
> docs sadly but yes. From the look of it, it's a completion timeout.
> 
> Looks to me like we don't get a response to a config space access
> during the change of D state. I don't know if it's the write of the D3
> state itself or the read back though (it's probably detected on the
> read back or a subsequent read, but that doesn't tell me which specific
> one failed).

If it is just one card doing it (again, check that you have the latest
firmware), I wonder if a sketchy PCI-E electrical link is causing a
long re-training cycle. Can you tell if the PCI-E link is
permanently gone or does it eventually return?

Does the card work in Gen 3 when it starts? Is there any indication of
PCI-E link errors?

Every time, or only sometimes?

Is the POWER8 firmware known-good? If the link does eventually come back,
is the POWER8's D3 resumption timeout long enough?

If this doesn't lead to an obvious conclusion you'll probably need to
connect to IBM's Mellanox support team to get more information from
the card side.

Jason
Leon Romanovsky Jan. 8, 2019, 6:07 a.m. | #10
On Mon, Jan 07, 2019 at 09:01:29PM -0700, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > >
> > > > Interesting.  I've investigated this further, though I don't have as
> > > > many new clues as I'd like.  The problem occurs reliably, at least on
> > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > > I don't yet know if it occurs with other machines, I'm having trouble
> > > > getting access to other machines with a suitable card.  I didn't
> > > > manage to reproduce it on a different POWER8 machine with a
> > > > ConnectX-5, but I don't know if it's the difference in machine or
> > > > difference in card revision that's important.
> > >
> > > Make sure the card has the latest firmware is always good advice..
> > >
> > > > So possibilities that occur to me:
> > > >   * It's something specific about how the vfio-pci driver uses D3
> > > >     state - have you tried rebinding your device to vfio-pci?
> > > >   * It's something specific about POWER, either the kernel or the PCI
> > > >     bridge hardware
> > > >   * It's something specific about this particular type of machine
> > >
> > > Does the EEH indicate what happend to actually trigger it?
> >
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?
>
> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?
>
> Everytime or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

+1, I tried to find any Mellanox-internal bugs related to your issue
and didn't find anything concrete.

Thanks

>
> Jason
Alexey Kardashevskiy Jan. 9, 2019, 4:53 a.m. | #11
On 06/01/2019 09:43, Benjamin Herrenschmidt wrote:
> On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
>>
>>> Interesting.  I've investigated this further, though I don't have as
>>> many new clues as I'd like.  The problem occurs reliably, at least on
>>> one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
>>> I don't yet know if it occurs with other machines, I'm having trouble
>>> getting access to other machines with a suitable card.  I didn't
>>> manage to reproduce it on a different POWER8 machine with a
>>> ConnectX-5, but I don't know if it's the difference in machine or
>>> difference in card revision that's important.
>>
>> Make sure the card has the latest firmware is always good advice..
>>
>>> So possibilities that occur to me:
>>>   * It's something specific about how the vfio-pci driver uses D3
>>>     state - have you tried rebinding your device to vfio-pci?
>>>   * It's something specific about POWER, either the kernel or the PCI
>>>     bridge hardware
>>>   * It's something specific about this particular type of machine
>>
>> Does the EEH indicate what happend to actually trigger it?
> 
> In a very cryptic way that requires manual parsing using non-public
> docs sadly but yes. From the look of it, it's a completion timeout.
> 
> Looks to me like we don't get a response to a config space access
> during the change of D state. I don't know if it's the write of the D3
> state itself or the read back though (it's probably detected on the
> read back or a subsequent read, but that doesn't tell me which specific
> one failed).

It is the write:

pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);


> 
> Some extra logging in OPAL might help pin that down by checking the InA
> error state in the config accessor after the config write (and polling
> on it for a while as from a CPU perspective I don't knw if the write is
> synchronous, probably not).


Extra logging gives these straight after that write:

nFir:        0000808000000000 0030006e00000000 0000800000000000

PhbSts:      0000001800000000 0000001800000000

Lem:         0000020000088000 42498e367f502eae 0000000000080000

OutErr:      0000002000000000 0000002000000000 0000000000000000
0000000000000000
InAErr:      0000000030000000 0000000020000000 8080000000000000
0000000000000000


Decoded (my fancy script):

nFir:        0000808000000000 0030006e00000000 0000800000000000

 |- PCI Nest Fault Isolation Register(FIR) NestBase+0x00 _BE_ =
0000808000000000h:
 |   [0..63] 00000000 00000000 10000000 10000000 00000000 00000000
00000000 00000000
 |   #16 set: The PHB had a severe error and has fenced the AIB
 |   #24 set: The internal SCOM to ASB bridge has an error
 |   #29..30: Error bit from SCOM FIR engine = 0h
 |- PCI Nest FIR Mask NestBase+0x03 _BE_ = 0030006e00000000h:
 |   [0..63] 00000000 00110000 00000000 01101110 00000000 00000000
00000000 00000000
 |   #10 set: Any PowerBus data hang poll error(Only checked for CI Stores)
 |   #11 set: Any PowerBus command hang error (domestic address range)
 |   #25 set: A command received ack_dead, foreign data hang, or
Link_chk_abort from the foreign interface
 |   #26 set: Any PowerBus command hang error (foreign address range)
 |   #28 set: Error bit from BARS SCOM engines, Nest domain
 |   #29..30: Error bit from SCOM FIR engine = 3h/[0..1] 11
 |- PCI Nest FIR WOF (“Who's on First”) NestBase+0x08 _BE_ =
0000800000000000h:
 |   [0..63] 00000000 00000000 10000000 00000000 00000000 00000000
00000000 00000000
 |   #16 set: The PHB had a severe error and has fenced the AIB
 |   #29..30: Error bit from SCOM FIR engine = 0h
 |
PhbSts:      0000001800000000 0000001800000000

 |- 0x0120 Processor Load/Store Status Register _BE_ = 0000001800000000h:
 |   [0..63] 00000000 00000000 00000000 00011000 00000000 00000000
00000000 00000000
 |   #27 set: One of the PHB3’s error status register bits is set
 |   #28 set: One of the PHB3’s first error status register bits is set
 |- 0x0110 DMA Channel Status Register _BE_ = 0000001800000000h:
 |   [0..63] 00000000 00000000 00000000 00011000 00000000 00000000
00000000 00000000
 |   #27 set: One of the PHB3’s error status register bits is set
 |   #28 set: One of the PHB3’s first error status register bits is set
 |
Lem:         0000020000088000 42498e367f502eae 0000000000080000

 |- 0xC00 LEM FIR Accumulator Register _BE_ = 0000020000088000h:
 |   [0..63] 00000000 00000000 00000010 00000000 00000000 00001000
10000000 00000000
 |   #22 set: CFG Access Error
 |   #44 set: PCT Timeout Error
 |   #48 set: PCT Unexpected Completion
 |- 0xC18 LEM Error Mask Register = 42498e367f502eaeh
 |- 0xC40 LEM WOF Register _BE_ = 0000000000080000h:
 |   [0..63] 00000000 00000000 00000000 00000000 00000000 00001000
00000000 00000000
 |   #44 set: PCT Timeout Error
 |
OutErr:      0000002000000000 0000002000000000 0000000000000000
0000000000000000
 |- 0xD00 Outbound Error Status Register _BE_ = 0000002000000000h:
 |   [0..63] 00000000 00000000 00000000 00100000 00000000 00000000
00000000 00000000
 |   #26 set: CFG Address/Enable Error
 |- 0xD08 Outbound First Error Status Register _BE_ = 0000002000000000h:
 |   [0..63] 00000000 00000000 00000000 00100000 00000000 00000000
00000000 00000000
 |   #26 set: CFG Address/Enable Error
 |
InAErr:      0000000030000000 0000000020000000 8080000000000000
0000000000000000
 |- 0xD80 InboundA Error Status Register _BE_ = 0000000030000000h:
 |   [0..63] 00000000 00000000 00000000 00000000 00110000 00000000
00000000 00000000
 |   #34 set: PCT Timeout
 |   #35 set: PCT Unexpected Completion
 |- 0xD88 InboundA First Error Status Register _BE_ = 0000000020000000h:
 |   [0..63] 00000000 00000000 00000000 00000000 00100000 00000000
00000000 00000000
 |   #34 set: PCT Timeout
 |- 0xDC0 InboundA Error Log Register 0 = 8080000000000000h


"A PCI completion timeout occurred for an outstanding PCI-E transaction"
it is.

This is how I bind the device to vfio:

echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind


and I noticed that EEH only happens with the last command. The order
(.0,.1 or .1,.0) does not matter; it seems that putting one function
into D3 is fine, but putting the other one in when the first is already
in D3 produces the EEH. And I do not recall ever seeing this on the
firestone machine. Weird.
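
[Editor's note: for anyone reproducing this, the workaround named in the
commit message slots in before the bind sequence above. A sketch — the
module parameter is real, but whether the sysfs path exists and is
writable depends on how vfio-pci was built on the test machine, so the
commands are printed rather than executed here:]

```shell
# Apply the disable_idle_d3 workaround before binding the functions,
# so vfio-pci never puts the idle device into D3.
cat <<'EOF'
modprobe vfio-pci disable_idle_d3=1
# or, if vfio-pci is already loaded / built in:
echo 1 > /sys/module/vfio_pci/parameters/disable_idle_d3
EOF
```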
Benjamin Herrenschmidt Jan. 9, 2019, 5:09 a.m. | #12
On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> 
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> > 
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
> 
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?

No, it's 100% reproducible on systems with that specific card model,
not card instance, and maybe different systems/cards as well; I'll let
David & Alexey comment further on that.

> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?

Nope.

> Everytime or sometimes?
> 
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
> 
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

We are IBM :-) So far, it seems to be that the card is doing something
not quite right, but we don't know what. We might need to engage
Mellanox themselves.

Cheers,
Ben.
David Gibson Jan. 9, 2019, 5:30 a.m. | #13
On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> > 
> > > In a very cryptic way that requires manual parsing using non-public
> > > docs sadly but yes. From the look of it, it's a completion timeout.
> > > 
> > > Looks to me like we don't get a response to a config space access
> > > during the change of D state. I don't know if it's the write of the D3
> > > state itself or the read back though (it's probably detected on the
> > > read back or a subsequent read, but that doesn't tell me which specific
> > > one failed).
> > 
> > If it is just one card doing it (again, check you have latest
> > firmware) I wonder if it is a sketchy PCI-E electrical link that is
> > causing a long re-training cycle? Can you tell if the PCI-E link is
> > permanently gone or does it eventually return?
> 
> No, it's 100% reproducable on systems with that specific card model,
> not card instance, and maybe different systems/cards as well, I'll let
> David & Alexey comment further on that.

Well, it's 100% reproducible on a particular model of system
(garrison) with a particular model of card.  I've had some suggestions
that it fails with some other system and card models, but nothing
confirmed - the one other system model I've been able to try, which
also had a newer card model, didn't reproduce the problem.

> > Does the card work in Gen 3 when it starts? Is there any indication of
> > PCI-E link errors?
> 
> Nope.
> 
> > Everytime or sometimes?
> > 
> > POWER 8 firmware is good? If the link does eventually come back, is
> > the POWER8's D3 resumption timeout long enough?
> > 
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
> 
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Possibly.  On the other hand, I've had it reported that this is a
software regression at least with downstream Red Hat kernels.  I
haven't yet been able to eliminate factors that might be confusing
that, or try to find a working version upstream.
Alexey Kardashevskiy Jan. 9, 2019, 6:32 a.m. | #14
On 09/01/2019 16:30, David Gibson wrote:
> On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
>>>
>>>> In a very cryptic way that requires manual parsing using non-public
>>>> docs sadly but yes. From the look of it, it's a completion timeout.
>>>>
>>>> Looks to me like we don't get a response to a config space access
>>>> during the change of D state. I don't know if it's the write of the D3
>>>> state itself or the read back though (it's probably detected on the
>>>> read back or a subsequent read, but that doesn't tell me which specific
>>>> one failed).
>>>
>>> If it is just one card doing it (again, check you have latest
>>> firmware) I wonder if it is a sketchy PCI-E electrical link that is
>>> causing a long re-training cycle? Can you tell if the PCI-E link is
>>> permanently gone or does it eventually return?
>>
>> No, it's 100% reproducable on systems with that specific card model,
>> not card instance, and maybe different systems/cards as well, I'll let
>> David & Alexey comment further on that.
> 
> Well, it's 100% reproducable on a particular model of system
> (garrison) with a particular model of card.  I've had some suggestions
> that it fails with some other systems card card models, but nothing
> confirmed - the one other system model I've been able to try, which
> also had a newer card model didn't reproduce the problem.

I have just moved the "Mellanox Technologies MT27700 Family
[ConnectX-4]" from garrison to firestone machine and there it does not
produce an EEH, with the same kernel and skiboot (both upstream + my
debug). Hm. I cannot really blame the card but I cannot see what could
cause the difference in skiboot either. I even tried disabling NPU so
garrison would look like firestone, still EEH'ing.



>>> Does the card work in Gen 3 when it starts? Is there any indication of
>>> PCI-E link errors?
>>
>> Nope.
>>
>>> Everytime or sometimes?
>>>
>>> POWER 8 firmware is good? If the link does eventually come back, is
>>> the POWER8's D3 resumption timeout long enough?
>>>
>>> If this doesn't lead to an obvious conclusion you'll probably need to
>>> connect to IBM's Mellanox support team to get more information from
>>> the card side.
>>
>> We are IBM :-) So far, it seems to be that the card is doing something
>> not quite right, but we don't know what. We might need to engage
>> Mellanox themselves.
> 
> Possibly.  On the other hand, I've had it reported that this is a
> software regression at least with downstream red hat kernels.  I
> haven't yet been able to eliminate factors that might be confusing
> that, or try to find a working version upstream.

Do you have tarballs handy? I'd diff...
Benjamin Herrenschmidt Jan. 9, 2019, 7:24 a.m. | #15
On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
> "A PCI completion timeout occurred for an outstanding PCI-E transaction"
> it is.
> 
> This is how I bind the device to vfio:
> 
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
> echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
> echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
> echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
> echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind
> 
> 
> and I noticed that EEH only happens with the last command. The order
> (.0,.1  or .1,.0) does not matter, it seems that putting one function to
> D3 is fine but putting another one when the first one is already in D3 -
> produces EEH. And I do not recall ever seeing this on the firestone
> machine. Weird.

Putting all functions into D3 is what allows the device to actually go
into D3.

Does it work with other devices ? We do have that bug on early P9
revisions where the attempt of bringing the link to L1 as part of the
D3 process fails in horrible ways, I thought P8 would be ok but maybe
not ...

Otherwise, it might be that our timeouts are too low (you may want to
talk to our PCIe guys internally)

Cheers,
Ben.
Benjamin Herrenschmidt Jan. 9, 2019, 7:25 a.m. | #16
On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
> I have just moved the "Mellanox Technologies MT27700 Family
> [ConnectX-4]" from garrison to firestone machine and there it does not
> produce an EEH, with the same kernel and skiboot (both upstream + my
> debug). Hm. I cannot really blame the card but I cannot see what could
> cause the difference in skiboot either. I even tried disabling NPU so
> garrison would look like firestone, still EEH'ing.

The systems have a different chip though, firestone is P8 and garrison
is P8', which is a slightly different PHB revision. Worth checking if we
have anything significantly different in our inits and poke at the HW
guys.

BTW. Are the cards behind a switch in either case ?

Cheers,
Ben.
Alexey Kardashevskiy Jan. 9, 2019, 8:14 a.m. | #17
On 09/01/2019 18:25, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
>> I have just moved the "Mellanox Technologies MT27700 Family
>> [ConnectX-4]" from garrison to firestone machine and there it does not
>> produce an EEH, with the same kernel and skiboot (both upstream + my
>> debug). Hm. I cannot really blame the card but I cannot see what could
>> cause the difference in skiboot either. I even tried disabling NPU so
>> garrison would look like firestone, still EEH'ing.
> 
> The systems have a different chip though, firestone is P8 and garrison
> is P8', which a slightly different PHB revision. Worth checking if we
> have anything significantly different in our inits and poke at the HW
> guys.

Nope, we do not have anything different for these machines. Asking HW
guys never worked for me :-/

I think the easiest is just doing what we did for PHB4 and ignoring
these D3 requests on garrisons.


> BTW. Are the cards behind a switch in either case ?


No, directly connected to the root on both:

garrison:

0000:00:00.0 PCI bridge: IBM Device 03dc (rev ff)
0000:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4] (rev ff)
0000:01:00.1 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4] (rev ff)

firestone (phb #0 is taken by nvidia gpu):

0001:00:00.0 PCI bridge: IBM POWER8 Host Bridge (PHB3)
0001:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4]
0001:01:00.1 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4]
Alexey Kardashevskiy Jan. 9, 2019, 8:20 a.m. | #18
On 09/01/2019 18:24, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
>> "A PCI completion timeout occurred for an outstanding PCI-E transaction"
>> it is.
>>
>> This is how I bind the device to vfio:
>>
>> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
>> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
>> echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
>> echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
>> echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
>> echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind
>>
>>
>> and I noticed that EEH only happens with the last command. The order
>> (.0,.1  or .1,.0) does not matter, it seems that putting one function to
>> D3 is fine but putting another one when the first one is already in D3 -
>> produces EEH. And I do not recall ever seeing this on the firestone
>> machine. Weird.
> 
> Putting all functions into D3 is what allows the device to actually go
> into D3.
> 
> Does it work with other devices ?

Works fine on the very same garrison:

0009:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719
Gigabit Ethernet PCIe (rev 01)
0009:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719
Gigabit Ethernet PCIe (rev 01)

Bizarre.

> We do have that bug on early P9
> revisions where the attempt of bringing the link to L1 as part of the
> D3 process fails in horrible ways, I thought P8 would be ok but maybe
> not ...

> Otherwise, it might be that our timeouts are too low (you may want to
> talk to our PCIe guys internally)

This increases "Outbound non-posted transactions timeout configuration"
from 16ms to 1s and does not help anyway:


diff --git a/hw/phb3.c b/hw/phb3.c
index 38b8f46..cb14909 100644
--- a/hw/phb3.c
+++ b/hw/phb3.c
@@ -4065,7 +4065,7 @@ static void phb3_init_utl(struct phb3 *p)
        /* Init_82: PCI Express port control
         * SW283991: Set Outbound Non-Posted request timeout to 16ms (RTOS).
         */
-       out_be64(p->regs + UTL_PCIE_PORT_CONTROL, 0x8588007000000000);
+       out_be64(p->regs + UTL_PCIE_PORT_CONTROL, 0x858800d000000000);
Jason Gunthorpe Jan. 9, 2019, 3:27 p.m. | #19
On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:

> > POWER 8 firmware is good? If the link does eventually come back, is
> > the POWER8's D3 resumption timeout long enough?
> > 
> > If this doesn't lead to an obvious conclusion you'll probably need to
> > connect to IBM's Mellanox support team to get more information from
> > the card side.
> 
> We are IBM :-) So far, it seems to be that the card is doing something
> not quite right, but we don't know what. We might need to engage
> Mellanox themselves.

Sorry, it was unclear; I meant the support team for IBM inside
Mellanox.

There might be internal debugging available that can show if the card
is detecting the beacon, how far it gets in renegotiation, etc.

From all the mails it really has the feel of a PCI-E interop problem between
these two specific chips..

Jason

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4700d24e5d55..add3f516ca12 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1315,23 +1315,24 @@  static void quirk_ide_samemode(struct pci_dev *pdev)
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_10, quirk_ide_samemode);
 
-/* Some ATA devices break if put into D3 */
-static void quirk_no_ata_d3(struct pci_dev *pdev)
+/* Some devices (including a number of ATA cards) break if put into D3 */
+static void quirk_no_d3(struct pci_dev *pdev)
 {
 	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
 }
+
 /* Quirk the legacy ATA devices only. The AHCI ones are ok */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SERVERWORKS, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 /* ALi loses some register settings that we cannot then restore */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AL, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 /* VIA comes back fine but we need to keep it alive or ACPI GTM failures
    occur when mode detecting */
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
-				PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
+				PCI_CLASS_STORAGE_IDE, 8, quirk_no_d3);
 
 /*
  * This was originally an Alpha-specific thing, but it really fits here.
@@ -3367,6 +3368,10 @@  static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
 			mellanox_check_broken_intx_masking);
 
+/* Mellanox MT27800 (ConnectX-5) IB card seems to break with D3.
+ * In particular this shows up when the device is bound to the vfio-pci driver. */
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4, quirk_no_d3);
+
 static void quirk_no_bus_reset(struct pci_dev *dev)
 {
 	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;