diff mbox series

[03/32] PCI: pciehp: Fix deadlock on unplug

Message ID 4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series Rework pciehp event handling & add runtime PM | expand

Commit Message

Lukas Wunner June 16, 2018, 7:25 p.m. UTC
When pciehp was conceived it apparently wasn't considered that one
hotplug port may be the parent of another, but with Thunderbolt this is
par for the course.

If the sysfs interface is used to initiate a simultaneous card removal
of two hotplug ports where one is a parent of the other, or if the two
ports simultaneously signal interrupts, e.g. because a link or presence
change was detected on both, a deadlock may occur if:

- The parent acquires pci_lock_rescan_remove(), starts removal of the
  child and waits for it to be unbound.
- The child waits to acquire pci_lock_rescan_remove() in order to
  service the pending sysfs command or interrupt before it can unbind.

Fix by using pci_dev_is_disconnected() as an indicator whether a parent
is removing the child, and avoid acquiring the lock in the child if so.
Should the child happen to acquire the lock first, there's no deadlock.

Testing or setting the disconnected flag needs to happen atomically with
acquisition of the lock, so introduce a mutex to protect that critical
section.

Previously the disconnected flag was only set if the slot was no longer
occupied.  That doesn't seem to make sense because the (logical) child
devices are going to be removed regardless of occupancy, so set the flag
unconditionally.  (Why is occupancy checked at all?  Because if the slot
was disabled via sysfs or a button press, the device remains physically
present for the time being, so the Command register is modified to
quiesce the device.)

The deadlock is probably rarely encountered in practice, but I can
easily reproduce it in poll mode:

    INFO: task pciehp_poll-4-2:102 blocked for more than 120 seconds.
    __schedule+0x291/0x880
    schedule+0x28/0x80
    schedule_timeout+0x1e3/0x370
    wait_for_completion+0x123/0x190
    kthread_stop+0x42/0xf0
    pciehp_release_ctrl+0xaa/0xb0
    pcie_port_remove_service+0x2f/0x40
    device_release_driver_internal+0x157/0x220
    bus_remove_device+0xe2/0x150
    device_del+0x124/0x340
    device_unregister+0x16/0x60
    remove_iter+0x1a/0x20
    device_for_each_child+0x4b/0x90
    pcie_port_device_remove+0x1e/0x30
    pci_device_remove+0x36/0xb0
    device_release_driver_internal+0x157/0x220
    pci_stop_bus_device+0x7d/0xa0
    pci_stop_bus_device+0x2b/0xa0
    pci_stop_and_remove_bus_device+0xe/0x20
    pciehp_unconfigure_device+0xb8/0x160
    pciehp_disable_slot+0x84/0x130
    pciehp_handle_card_not_present+0xd6/0x120
    pciehp_ist+0x111/0x150
    pciehp_poll+0x37/0x90
    kthread+0x111/0x130

    INFO: task pciehp_poll-9:104 blocked for more than 120 seconds.
    schedule+0x28/0x80
    schedule_preempt_disabled+0xa/0x10
    __mutex_lock.isra.1+0x1a0/0x4e0
    pciehp_configure_device+0x24/0x130
    pciehp_enable_slot+0x236/0x390
    pciehp_handle_card_present+0xe2/0x160
    pciehp_ist+0x11e/0x150
    pciehp_poll+0x37/0x90
    kthread+0x111/0x130

Cc: stable@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_pci.c | 38 ++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 12 deletions(-)

Comments

Mika Westerberg Sept. 6, 2018, 4:01 p.m. UTC | #1
Hi Lukas,

On Sat, Jun 16, 2018 at 09:25:00PM +0200, Lukas Wunner wrote:
> When pciehp was conceived it apparently wasn't considered that one
> hotplug port may be the parent of another, but with Thunderbolt this is
> par for the course.
> 
> If the sysfs interface is used to initiate a simultaneous card removal
> of two hotplug ports where one is a parent of the other, or if the two
> ports simultaneously signal interrupts, e.g. because a link or presence
> change was detected on both, a deadlock may occur if:
> 
> - The parent acquires pci_lock_rescan_remove(), starts removal of the
>   child and waits for it to be unbound.
> - The child waits to acquire pci_lock_rescan_remove() in order to
>   service the pending sysfs command or interrupt before it can unbind.
> 
> Fix by using pci_dev_is_disconnected() as an indicator whether a parent
> is removing the child, and avoid acquiring the lock in the child if so.
> Should the child happen to acquire the lock first, there's no deadlock.

This patch got dropped from your pciehp series but I wonder if you have
any plans to re-send it or rework it?

I'm asking because this is pretty real issue if you have longer chains
of devices and I've been using this patch to make the issue go away. So
even if this is not solving all possible problems, I think it may be
good idea to get included.
Lukas Wunner Sept. 6, 2018, 4:26 p.m. UTC | #2
On Thu, Sep 06, 2018 at 07:01:00PM +0300, Mika Westerberg wrote:
> On Sat, Jun 16, 2018 at 09:25:00PM +0200, Lukas Wunner wrote:
> > When pciehp was conceived it apparently wasn't considered that one
> > hotplug port may be the parent of another, but with Thunderbolt this is
> > par for the course.
> > 
> > If the sysfs interface is used to initiate a simultaneous card removal
> > of two hotplug ports where one is a parent of the other, or if the two
> > ports simultaneously signal interrupts, e.g. because a link or presence
> > change was detected on both, a deadlock may occur if:
> > 
> > - The parent acquires pci_lock_rescan_remove(), starts removal of the
> >   child and waits for it to be unbound.
> > - The child waits to acquire pci_lock_rescan_remove() in order to
> >   service the pending sysfs command or interrupt before it can unbind.
> > 
> > Fix by using pci_dev_is_disconnected() as an indicator whether a parent
> > is removing the child, and avoid acquiring the lock in the child if so.
> > Should the child happen to acquire the lock first, there's no deadlock.
> 
> This patch got dropped from your pciehp series but I wonder if you have
> any plans to re-send it or rework it?
> 
> I'm asking because this is pretty real issue if you have longer chains
> of devices and I've been using this patch to make the issue go away. So
> even if this is not solving all possible problems, I think it may be
> good idea to get included.

I asked for it to be dropped because while it fixes the problem,
it's not a good solution, it would just add technical debt to the
code base and come back to haunt us later.

Xiongfeng Wang found a similar race a few months ago that wouldn't
be fixed by my patch:
https://patchwork.ozlabs.org/patch/877835/

The PCI core nicely separates unbinding of drivers from destruction
of the pci_dev (pci_stop_bus_device() + pci_remove_bus_device()).
Problem is we currently protect both steps with pci_lock_rescan_remove().
We should only be protecting the second step.  The first step (unbinding)
needs to run lockless.

As a start, I've moved the call to pcie_aspm_exit_link_state() out
of pci_stop_dev().  This also fixes an ASPM bug.  Bjorn merged it
the day before yesterday:
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/aspm&id=1f3934b1d5e5

The call to device_release_driver() already uses locking internally
and can be run lockless.  The calls to pci_proc_detach_device() and
pci_remove_sysfs_dev_files() need to be amended with locking, I've
got some preliminary patches for this on my development branch that
I'll have to rework.  This is very hairy, historically grown code
that requires great care to avoid breakage.  It'll take a little
more time I'm afraid.

Thanks,

Lukas
Mika Westerberg Sept. 6, 2018, 6:08 p.m. UTC | #3
On Thu, Sep 06, 2018 at 06:26:34PM +0200, Lukas Wunner wrote:
> I asked for it to be dropped because while it fixes the problem,
> it's not a good solution, it would just add technical debt to the
> code base and come back to haunt us later.
> 
> Xiongfeng Wang found a similar race a few months ago that wouldn't
> be fixed by my patch:
> https://patchwork.ozlabs.org/patch/877835/
> 
> The PCI core nicely separates unbinding of drivers from destruction
> of the pci_dev (pci_stop_bus_device() + pci_remove_bus_device()).
> Problem is we currently protect both steps with pci_lock_rescan_remove().
> We should only be protecting the second step.  The first step (unbinding)
> needs to run lockless.
> 
> As a start, I've moved the call to pcie_aspm_exit_link_state() out
> of pci_stop_dev().  This also fixes an ASPM bug.  Bjorn merged it
> the day before yesterday:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/aspm&id=1f3934b1d5e5
> 
> The call to device_release_driver() already uses locking internally
> and can be run lockless.  The calls to pci_proc_detach_device() and
> pci_remove_sysfs_dev_files() need to be amended with locking, I've
> got some preliminary patches for this on my development branch that
> I'll have to rework.  This is very hairy, historically grown code
> that requires great care to avoid breakage.  It'll take a little
> more time I'm afraid.

OK, thanks for the explanation. I see you have a good plan going forward
with this.

I can reproduce the issue easily so can assist in testing if needed.
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index 3f518dea856d..242395389293 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -20,15 +20,27 @@ 
 #include "../pci.h"
 #include "pciehp.h"
 
+static DEFINE_MUTEX(acquire_unless_disconnected);
+
 int pciehp_configure_device(struct slot *p_slot)
 {
 	struct pci_dev *dev;
-	struct pci_dev *bridge = p_slot->ctrl->pcie->port;
+	struct controller *ctrl = p_slot->ctrl;
+	struct pci_dev *bridge = ctrl->pcie->port;
 	struct pci_bus *parent = bridge->subordinate;
 	int num, ret = 0;
-	struct controller *ctrl = p_slot->ctrl;
 
+	/*
+	 * Avoid deadlock if an upstream hotplug port has already acquired
+	 * pci_lock_rescan_remove() in order to remove this hotplug port.
+	 */
+	mutex_lock(&acquire_unless_disconnected);
+	if (pci_dev_is_disconnected(bridge)) {
+		mutex_unlock(&acquire_unless_disconnected);
+		return -ENODEV;
+	}
 	pci_lock_rescan_remove();
+	mutex_unlock(&acquire_unless_disconnected);
 
 	dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
 	if (dev) {
@@ -67,16 +79,24 @@  int pciehp_unconfigure_device(struct slot *p_slot)
 	int rc = 0;
 	u8 presence = 0;
 	struct pci_dev *dev, *temp;
-	struct pci_bus *parent = p_slot->ctrl->pcie->port->subordinate;
-	u16 command;
 	struct controller *ctrl = p_slot->ctrl;
+	struct pci_dev *bridge = ctrl->pcie->port;
+	struct pci_bus *parent = bridge->subordinate;
+	u16 command;
+
+	mutex_lock(&acquire_unless_disconnected);
+	if (pci_dev_is_disconnected(bridge)) {
+		mutex_unlock(&acquire_unless_disconnected);
+		return -ENODEV;
+	}
+	pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+	pci_lock_rescan_remove();
+	mutex_unlock(&acquire_unless_disconnected);
 
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
 	pciehp_get_adapter_status(p_slot, &presence);
 
-	pci_lock_rescan_remove();
-
 	/*
 	 * Stopping an SR-IOV PF device removes all the associated VFs,
 	 * which will update the bus->devices list and confuse the
@@ -86,12 +106,6 @@  int pciehp_unconfigure_device(struct slot *p_slot)
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
 					 bus_list) {
 		pci_dev_get(dev);
-		if (!presence) {
-			pci_dev_set_disconnected(dev, NULL);
-			if (pci_has_subordinate(dev))
-				pci_walk_bus(dev->subordinate,
-					     pci_dev_set_disconnected, NULL);
-		}
 		pci_stop_and_remove_bus_device(dev);
 		/*
 		 * Ensure that no new Requests will be generated from