PCI: pciehp: Assume NoCompl+ for Thunderbolt ports

Message ID d0ac15a21de9e47157588a96b7d2f407295e358b.1515825290.git.lukas@wunner.de
State New
Headers show
Series
  • PCI: pciehp: Assume NoCompl+ for Thunderbolt ports
Related show

Commit Message

Lukas Wunner Jan. 13, 2018, 6:38 a.m.
Certain Thunderbolt 1 controllers claim to support Command Completed
events (value of 0b in the No Command Completed Support field of the
Slot Capabilities register) but in reality neither set the Command
Completed bit in the Slot Status register nor signal a Command
Completed interrupt.

As a result, every call to pcie_write_cmd() takes 2 seconds (1 second
for each invocation of pcie_wait_cmd()).  pcie_write_cmd() is called on
->remove via pcie_disable_notification(), so after such a Thunderbolt
device is unplugged, 2 seconds elapse until pciehp is unbound:

[  337.942727] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x1038 (issued 21176 msec ago)
[  340.014735] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x0000 (issued 2072 msec ago)

That by itself has always been unpleasant, but the situation has become
worse with commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services
during shutdown"):  Now pciehp is unbound on ->shutdown.  Because
Thunderbolt controllers typically have 4 hotplug ports, every reboot and
shutdown is now delayed by 8 seconds, plus another 2 seconds for every
attached Thunderbolt 1 device.

By testing with my own equipment and by googling I was able to ascertain
that at least the following chips are affected by this erratum:
8086:1513  CV82524 Thunderbolt Controller [Light Ridge 4C  2010]
8086:1547  DSL3510 Thunderbolt Controller [Cactus Ridge 4C 2012]
8086:1549  DSL2210 Thunderbolt Controller [Port Ridge 1C   2011]

The following newer chips are not affected as they have NoCompl+ in the
Slot Capabilities register:
8086:156d  DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C   2013]
8086:1578  DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C   2015]

Thunderbolt hotplug slots are not physical slots that one inserts cards
into, but rather logical hotplug slots implemented in silicon.  Devices
appear beyond those logical slots once a PCI tunnel is established on
top of the Thunderbolt Converged I/O switch.  One would expect that
commands written to the Slot Control register are executed pretty much
immediately by the silicon, without requiring a delay, so the NoCompl+
on newer chips makes sense.  It is unclear whether the older chips
likewise do not require observing a delay or how long that delay needs
to be.  There is still no public spec or errata sheet for Thunderbolt
controllers.

I checked what macOS does:
https://opensource.apple.com/source/IOPCIFamily/IOPCIFamily-320.1.1/IOPCIBridge.cpp.auto.html

They do not enable Command Completed events:
	// data link change, hot plug, presence detect change
	kSlotControlEnables = ((1 << 12) | (1 << 5) | (1 << 3))

They also do not call IODelay() or check the Command Completed bit after
writing to the Slot Control register:
	fBridgeDevice->configWrite16( fBridgeDevice->reserved->expressCapability + 0x18, ...);

Do the same and assume NoCompl+ for any Thunderbolt hotplug port.

I've stress-tested this by alternatingly binding and unbinding pciehp to
a Port Ridge hotplug bridge in an endless loop for a few minutes (by
enabling/disabling slot power via sysfs), which causes rapid writes to
the Slot Control register.  pciehp never missed a beat.

Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
Cc: Sinan Kaya <okaya@codeaurora.org>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Yehezkel Bernat <yehezkel.bernat@intel.com>
Cc: Michael Jamet <michael.jamet@intel.com>
Cc: Andreas Noever <andreas.noever@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Mika Westerberg Jan. 15, 2018, 11:09 a.m. | #1
On Sat, Jan 13, 2018 at 07:38:39AM +0100, Lukas Wunner wrote:
> Certain Thunderbolt 1 controllers claim to support Command Completed
> events (value of 0b in the No Command Completed Support field of the
> Slot Capabilities register) but in reality neither set the Command
> Completed bit in the Slot Status register nor signal a Command
> Completed interrupt.
> 
> As a result, every call to pcie_write_cmd() takes 2 seconds (1 second
> for each invocation of pcie_wait_cmd()).  pcie_write_cmd() is called on
> ->remove via pcie_disable_notification(), so after such a Thunderbolt
> device is unplugged, 2 seconds elapse until pciehp is unbound:
> 
> [  337.942727] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x1038 (issued 21176 msec ago)
> [  340.014735] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x0000 (issued 2072 msec ago)
> 
> That by itself has always been unpleasant, but the situation has become
> worse with commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services
> during shutdown"):  Now pciehp is unbound on ->shutdown.  Because
> Thunderbolt controllers typically have 4 hotplug ports, every reboot and
> shutdown is now delayed by 8 seconds, plus another 2 seconds for every
> attached Thunderbolt 1 device.
> 
> By testing with my own equipment and by googling I was able to ascertain
> that at least the following chips are affected by this erratum:
> 8086:1513  CV82524 Thunderbolt Controller [Light Ridge 4C  2010]
> 8086:1547  DSL3510 Thunderbolt Controller [Cactus Ridge 4C 2012]
> 8086:1549  DSL2210 Thunderbolt Controller [Port Ridge 1C   2011]
> 
> The following newer chips are not affected as they have NoCompl+ in the
> Slot Capabilities register:
> 8086:156d  DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C   2013]
> 8086:1578  DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C   2015]
> 
> Thunderbolt hotplug slots are not physical slots that one inserts cards
> into, but rather logical hotplug slots implemented in silicon.  Devices
> appear beyond those logical slots once a PCI tunnel is established on
> top of the Thunderbolt Converged I/O switch.  One would expect that
> commands written to the Slot Control register are executed pretty much
> immediately by the silicon, without requiring a delay, so the NoCompl+
> on newer chips makes sense.  It is unclear whether the older chips
> likewise do not require observing a delay or how long that delay needs
> to be.  There is still no public spec or errata sheet for Thunderbolt
> controllers.
> 
> I checked what macOS does:
> https://opensource.apple.com/source/IOPCIFamily/IOPCIFamily-320.1.1/IOPCIBridge.cpp.auto.html
> 
> They do not enable Command Completed events:
> 	// data link change, hot plug, presence detect change
> 	kSlotControlEnables = ((1 << 12) | (1 << 5) | (1 << 3))
> 
> They also do not call IODelay() or check the Command Completed bit after
> writing to the Slot Control register:
> 	fBridgeDevice->configWrite16( fBridgeDevice->reserved->expressCapability + 0x18, ...);
> 
> Do the same and assume NoCompl+ for any Thunderbolt hotplug port.
> 
> I've stress-tested this by alternatingly binding and unbinding pciehp to
> a Port Ridge hotplug bridge in an endless loop for a few minutes (by
> enabling/disabling slot power via sysfs), which causes rapid writes to
> the Slot Control register.  pciehp never missed a beat.
> 
> Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
> Cc: Sinan Kaya <okaya@codeaurora.org>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>

We asked from the hardware people and indeed in the legacy controllers
before RR this bit was incorrectly set to 0.

I also tested on Macbook with Cactus Ridge (1st generation controller)
and without the patch I see command timeouts and reboot takes that ~8
seconds. With this patch applied reboot works immediately.

Feel free to add my,

Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Thanks!
Bjorn Helgaas Jan. 15, 2018, 3:11 p.m. | #2
On Sat, Jan 13, 2018 at 07:38:39AM +0100, Lukas Wunner wrote:
> Certain Thunderbolt 1 controllers claim to support Command Completed
> events (value of 0b in the No Command Completed Support field of the
> Slot Capabilities register) but in reality neither set the Command
> Completed bit in the Slot Status register nor signal a Command
> Completed interrupt.
> 
> As a result, every call to pcie_write_cmd() takes 2 seconds (1 second
> for each invocation of pcie_wait_cmd()).  pcie_write_cmd() is called on
> ->remove via pcie_disable_notification(), so after such a Thunderbolt
> device is unplugged, 2 seconds elapse until pciehp is unbound:
> 
> [  337.942727] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x1038 (issued 21176 msec ago)
> [  340.014735] pciehp 0000:0a:00.0:pcie204: Timeout on hotplug command 0x0000 (issued 2072 msec ago)
> 
> That by itself has always been unpleasant, but the situation has become
> worse with commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services
> during shutdown"):  Now pciehp is unbound on ->shutdown.  Because
> Thunderbolt controllers typically have 4 hotplug ports, every reboot and
> shutdown is now delayed by 8 seconds, plus another 2 seconds for every
> attached Thunderbolt 1 device.
> 
> By testing with my own equipment and by googling I was able to ascertain
> that at least the following chips are affected by this erratum:
> 8086:1513  CV82524 Thunderbolt Controller [Light Ridge 4C  2010]
> 8086:1547  DSL3510 Thunderbolt Controller [Cactus Ridge 4C 2012]
> 8086:1549  DSL2210 Thunderbolt Controller [Port Ridge 1C   2011]
> 
> The following newer chips are not affected as they have NoCompl+ in the
> Slot Capabilities register:
> 8086:156d  DSL5520 Thunderbolt 2 Bridge [Falcon Ridge 4C   2013]
> 8086:1578  DSL6540 Thunderbolt 3 Bridge [Alpine Ridge 4C   2015]
> 
> Thunderbolt hotplug slots are not physical slots that one inserts cards
> into, but rather logical hotplug slots implemented in silicon.  Devices
> appear beyond those logical slots once a PCI tunnel is established on
> top of the Thunderbolt Converged I/O switch.  One would expect that
> commands written to the Slot Control register are executed pretty much
> immediately by the silicon, without requiring a delay, so the NoCompl+
> on newer chips makes sense.  It is unclear whether the older chips
> likewise do not require observing a delay or how long that delay needs
> to be.  There is still no public spec or errata sheet for Thunderbolt
> controllers.
> 
> I checked what macOS does:
> https://opensource.apple.com/source/IOPCIFamily/IOPCIFamily-320.1.1/IOPCIBridge.cpp.auto.html
> 
> They do not enable Command Completed events:
> 	// data link change, hot plug, presence detect change
> 	kSlotControlEnables = ((1 << 12) | (1 << 5) | (1 << 3))
> 
> They also do not call IODelay() or check the Command Completed bit after
> writing to the Slot Control register:
> 	fBridgeDevice->configWrite16( fBridgeDevice->reserved->expressCapability + 0x18, ...);
> 
> Do the same and assume NoCompl+ for any Thunderbolt hotplug port.
> 
> I've stress-tested this by alternatingly binding and unbinding pciehp to
> a Port Ridge hotplug bridge in an endless loop for a few minutes (by
> enabling/disabling slot power via sysfs), which causes rapid writes to
> the Slot Control register.  pciehp never missed a beat.
> 
> Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown")
> Cc: Sinan Kaya <okaya@codeaurora.org>
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> Cc: Yehezkel Bernat <yehezkel.bernat@intel.com>
> Cc: Michael Jamet <michael.jamet@intel.com>
> Cc: Andreas Noever <andreas.noever@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/pci/hotplug/pciehp_hpc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 7bab0606f1a9..1904f46a09e4 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -848,6 +848,10 @@ struct controller *pcie_init(struct pcie_device *dev)
>  	if (pdev->hotplug_user_indicators)
>  		slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
>  
> +	/* Thunderbolt 1 ports falsely advertise Command Completed support */
> +	if (pdev->is_thunderbolt)
> +		slot_cap |= PCI_EXP_SLTCAP_NCCS;

This doesn't look right because the comment says "Thunderbolt 1" but
the code says "all Thunderbolt".  I see from the changelog why you did
that, but that's more digging that we should require of code readers.

Normally this sort of thing would be a quirk that identifies the
specific devices with the erratum.

If that's more complicated than this is worth, we should at least expand
the comment to say something like "We assume no Thunderbolt controllers
support Command Complete events, but some controllers falsely claim they
do."

>  	ctrl->slot_cap = slot_cap;
>  	mutex_init(&ctrl->ctrl_lock);
>  	init_waitqueue_head(&ctrl->queue);
> -- 
> 2.15.1
>

Patch

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 7bab0606f1a9..1904f46a09e4 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -848,6 +848,10 @@  struct controller *pcie_init(struct pcie_device *dev)
 	if (pdev->hotplug_user_indicators)
 		slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
 
+	/* Thunderbolt 1 ports falsely advertise Command Completed support */
+	if (pdev->is_thunderbolt)
+		slot_cap |= PCI_EXP_SLTCAP_NCCS;
+
 	ctrl->slot_cap = slot_cap;
 	mutex_init(&ctrl->ctrl_lock);
 	init_waitqueue_head(&ctrl->queue);