PCI: Device removal rescan deadlock

Message ID 20180921205752.3191-1-keith.busch@intel.com
State New
Delegated to: Bjorn Helgaas
Headers show
Series
  • PCI: Device removal rescan deadlock
Related show

Commit Message

Keith Busch Sept. 21, 2018, 8:57 p.m.
The pci_rescan_remove_lock provides single-threaded enumeration. This
can deadlock if a task requests this lock to enumerate below a device
while another task wants to remove that device.

A simple example to create this dead lock from sysfs looks something
like the following:

  # echo 1 > /sys/bus/pci/devices/<parent>/remove & \
    echo 1 > /sys/bus/pci/devices/<child>/rescan &

If the parent removal locks the pci_rescan_remove_lock first, the child
thread is blocked from forward progress, but the parent can't release
the lock until the child thread releases the device.

Similar deadlocks are possible in the surprise removal path due to
deadlocked interrupt handlers.

This patch provides an alternate locking method that exits in failure
if the caller wants to scan below a device being removed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp_pci.c |  8 ++++----
 drivers/pci/pci-sysfs.c          |  3 ++-
 drivers/pci/pci.h                |  4 ++++
 drivers/pci/probe.c              | 13 +++++++++++++
 drivers/pci/remove.c             |  7 ++++---
 include/linux/pci.h              |  1 +
 6 files changed, 28 insertions(+), 8 deletions(-)

Comments

Lukas Wunner Sept. 22, 2018, 8:11 a.m. | #1
On Fri, Sep 21, 2018 at 02:57:52PM -0600, Keith Busch wrote:
> The pci_rescan_remove_lock provides single-threaded enumeration. This
> can deadlock if a task requests this lock to enumerate below a device
> while another task wants to remove that device.
> 
> A simple example to create this dead lock from sysfs looks something
> like the following:
> 
>   # echo 1 > /sys/bus/pci/devices/<parent>/remove & \
>     echo 1 > /sys/bus/pci/devices/<child>/rescan &
> 
> If the parent removal locks the pci_rescan_remove_lock first, the child
> thread is blocked from forward progress, but the parent can't release
> the lock until the child thread releases the device.
> 
> Similar deadlocks are possible in the surprise removal path due to
> deadlocked interrupt handlers.
> 
> This patch provides an alternate locking method that exits in failure
> if the caller wants to scan below a device being removed.

For context, alternative patches to tackle this problem were submitted
earlier by Xiongfeng Wang and myself:

https://patchwork.ozlabs.org/patch/877835/
https://patchwork.ozlabs.org/patch/930403/

The solution you've found lools good in principle, this might work.

However.  I had withdrawn my solution because I came to the conclusion
that pci_lock_rescan_remove() is basically the equivalent to the
Big Kernel Lock in the PCI subsystem.  It's way too coarse-grained and
the proper solution, it seems, is to move to more fine-grained locking.

Specifically, device removal comprises two phases, unbinding of the
driver (pci_stop_bus_device()) and destruction of the pci_dev
(pci_remove_bus_device()).  Currently pci_lock_rescan_remove()
protects both phases.  If we make it so that it only protects the
second phase, such that the first phase runs lockless, those deadlocks
automatically go away.  I've explained this in a little more detail
in this reply to Mika:
https://www.spinics.net/lists/linux-pci/msg75960.html

Making unbinding of the driver lockless requires careful examination
of core PCI enumeration code.  E.g., you've moved the call to
pci_dev_assign_added() up in pci_stop_dev().  I don't know if that's
safe.  Changing the "added" flag has consequences which I haven't
fully understood yet, e.g. it's used to determine newly discovered
devices during rescan, but I recall noticing that it also has other
meanings.  This is historically grown code which needs to be untangled
to make it run lockless, but that takes time.

The problem is real, so the conundrum is whether to apply a fix now
which essentially amounts to duct tape, but helps users, or fixing
the root cause, which takes more time and would leave users exposed
to the problem for now.  Applying duct tape reduces pressure to
address the root cause and makes untangling more difficult
in the long run.

Thanks,

Lukas
Keith Busch Sept. 24, 2018, 3:03 p.m. | #2
On Sat, Sep 22, 2018 at 10:11:17AM +0200, Lukas Wunner wrote:
> On Fri, Sep 21, 2018 at 02:57:52PM -0600, Keith Busch wrote:
> > The pci_rescan_remove_lock provides single-threaded enumeration. This
> > can deadlock if a task requests this lock to enumerate below a device
> > while another task wants to remove that device.
> > 
> > A simple example to create this dead lock from sysfs looks something
> > like the following:
> > 
> >   # echo 1 > /sys/bus/pci/devices/<parent>/remove & \
> >     echo 1 > /sys/bus/pci/devices/<child>/rescan &
> > 
> > If the parent removal locks the pci_rescan_remove_lock first, the child
> > thread is blocked from forward progress, but the parent can't release
> > the lock until the child thread releases the device.
> > 
> > Similar deadlocks are possible in the surprise removal path due to
> > deadlocked interrupt handlers.
> > 
> > This patch provides an alternate locking method that exits in failure
> > if the caller wants to scan below a device being removed.
> 
> For context, alternative patches to tackle this problem were submitted
> earlier by Xiongfeng Wang and myself:
> 
> https://patchwork.ozlabs.org/patch/877835/
> https://patchwork.ozlabs.org/patch/930403/
> 
> The solution you've found lools good in principle, this might work.
> 
> However.  I had withdrawn my solution because I came to the conclusion
> that pci_lock_rescan_remove() is basically the equivalent to the
> Big Kernel Lock in the PCI subsystem.  It's way too coarse-grained and
> the proper solution, it seems, is to move to more fine-grained locking.
>
> Specifically, device removal comprises two phases, unbinding of the
> driver (pci_stop_bus_device()) and destruction of the pci_dev
> (pci_remove_bus_device()).  Currently pci_lock_rescan_remove()
> protects both phases.  If we make it so that it only protects the
> second phase, such that the first phase runs lockless, those deadlocks
> automatically go away.  I've explained this in a little more detail
> in this reply to Mika:
> https://www.spinics.net/lists/linux-pci/msg75960.html
> 
> Making unbinding of the driver lockless requires careful examination
> of core PCI enumeration code.  E.g., you've moved the call to
> pci_dev_assign_added() up in pci_stop_dev().  I don't know if that's
> safe.  Changing the "added" flag has consequences which I haven't
> fully understood yet, e.g. it's used to determine newly discovered
> devices during rescan, but I recall noticing that it also has other
> meanings.  This is historically grown code which needs to be untangled
> to make it run lockless, but that takes time.
> 
> The problem is real, so the conundrum is whether to apply a fix now
> which essentially amounts to duct tape, but helps users, or fixing
> the root cause, which takes more time and would leave users exposed
> to the problem for now.  Applying duct tape reduces pressure to
> address the root cause and makes untangling more difficult
> in the long run.

Yep, I agree with all the points you've made.

The coarse locking is not the only issue either. The pci driver internals
are a bit too exposed: many archs and other drivers directly access what
should be private pointers to other parts of the hierarchy without
proper locking, and then uses them without reference accounting. Many
desirable changes may not necessarily be isolated to just pci core.

I originally set out to try to remove the Big PCI Lock, but it started
touching too many subsystems I wouldn't be able to test, so reached for
the duct tape instead. :(

Patch

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index b9c1396db6fe..7fd60c27ae81 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -34,8 +34,8 @@  int pciehp_configure_device(struct controller *ctrl)
 	struct pci_bus *parent = bridge->subordinate;
 	int num, ret = 0;
 
-	pci_lock_rescan_remove();
-
+	if (!pci_lock_rescan_remove_or_die(bridge))
+		return 0;
 	dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
 	if (dev) {
 		/*
@@ -90,8 +90,8 @@  void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 
 	if (!presence)
 		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
-
-	pci_lock_rescan_remove();
+	if (!pci_lock_rescan_remove_or_die(ctrl->pcie->port))
+		return;
 
 	/*
 	 * Stopping an SR-IOV PF device removes all the associated VFs,
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 9ecfe13157c0..26aa084179f0 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -456,7 +456,8 @@  static ssize_t dev_rescan_store(struct device *dev,
 		return -EINVAL;
 
 	if (val) {
-		pci_lock_rescan_remove();
+		if (!pci_lock_rescan_remove_or_die(pdev))
+			return count;
 		pci_rescan_bus(pdev->bus);
 		pci_unlock_rescan_remove();
 	}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb3125decffe..dc8aa66dca92 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -6,9 +6,13 @@ 
 
 #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
 
+#include <linux/wait.h>
+
 extern const unsigned char pcie_link_speed[];
 extern bool pci_early_dump;
 
+extern wait_queue_head_t bus_event;
+
 bool pcie_cap_has_lnkctl(const struct pci_dev *dev);
 
 /* Functions internal to the PCI core code */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index bb2999d1b199..a7475c47c46b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3166,6 +3166,8 @@  EXPORT_SYMBOL_GPL(pci_rescan_bus);
  */
 static DEFINE_MUTEX(pci_rescan_remove_lock);
 
+DECLARE_WAIT_QUEUE_HEAD(bus_event);
+
 void pci_lock_rescan_remove(void)
 {
 	mutex_lock(&pci_rescan_remove_lock);
@@ -3175,9 +3177,20 @@  EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);
 void pci_unlock_rescan_remove(void)
 {
 	mutex_unlock(&pci_rescan_remove_lock);
+	wake_up_all(&bus_event);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);
 
+bool pci_lock_rescan_remove_or_die(struct pci_dev *dev)
+{
+	int locked;
+
+	wait_event_interruptible(bus_event,
+		(locked = mutex_trylock(&pci_rescan_remove_lock)) != 0 ||
+			!pci_dev_is_added(dev));
+	return locked;
+}
+
 static int __init pci_sort_bf_cmp(const struct device *d_a,
 				  const struct device *d_b)
 {
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 461e7fd2756f..81f2e4f58730 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -19,11 +19,11 @@  static void pci_stop_dev(struct pci_dev *dev)
 	pci_pme_active(dev, false);
 
 	if (pci_dev_is_added(dev)) {
+		pci_dev_assign_added(dev, false);
+		wake_up_all(&bus_event);
 		device_release_driver(&dev->dev);
 		pci_proc_detach_device(dev);
 		pci_remove_sysfs_dev_files(dev);
-
-		pci_dev_assign_added(dev, false);
 	}
 
 	if (dev->bus->self)
@@ -122,7 +122,8 @@  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
 void pci_stop_and_remove_bus_device_locked(struct pci_dev *dev)
 {
-	pci_lock_rescan_remove();
+	if (!pci_lock_rescan_remove_or_die(dev))
+		return;
 	pci_stop_and_remove_bus_device(dev);
 	pci_unlock_rescan_remove();
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 896b42032ec5..01ef5e31cc16 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1189,6 +1189,7 @@  unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
 void pci_lock_rescan_remove(void);
 void pci_unlock_rescan_remove(void);
+bool pci_lock_rescan_remove_or_die(struct pci_dev *dev);
 
 /* Vital Product Data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);