diff mbox series

[v2] pciehp: fix a race between pciehp and removing operations by sysfs

Message ID 1565008378-4733-1-git-send-email-wangxiongfeng2@huawei.com
State New
Delegated to: Bjorn Helgaas
Headers show
Series [v2] pciehp: fix a race between pciehp and removing operations by sysfs | expand

Commit Message

Xiongfeng Wang Aug. 5, 2019, 12:32 p.m. UTC
When I unplug a pcie slot and remove the pcie device by 'sysfs' at the
same time, I got the following calltrace.

 INFO: task irq/746-pciehp:41551 blocked for more than 120 seconds.
       Tainted: P        W  OE     4.19.25-vhulk1901.1.0.h111.aarch64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 irq/746-pciehp  D    0 41551      2 0x00000228
 Call trace:
  __switch_to+0x94/0xe8
  __schedule+0x270/0x8b0
  schedule+0x2c/0x88
  schedule_preempt_disabled+0x14/0x20
  __mutex_lock.isra.1+0x1fc/0x540
  __mutex_lock_slowpath+0x24/0x30
  mutex_lock+0x80/0xa8
  pci_lock_rescan_remove+0x20/0x28
  pciehp_configure_device+0x30/0x140
  pciehp_handle_presence_or_link_change+0x35c/0x4b0
  pciehp_ist+0x1cc/0x1d0
  irq_thread_fn+0x30/0x80
  irq_thread+0x128/0x200
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18
 INFO: task bash:6424 blocked for more than 120 seconds.
       Tainted: P        W  OE     4.19.25-vhulk1901.1.0.h111.aarch64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 bash            D    0  6424   2231 0x00000200
 Call trace:
  __switch_to+0x94/0xe8
  __schedule+0x270/0x8b0
  schedule+0x2c/0x88
  schedule_timeout+0x224/0x448
  wait_for_common+0x198/0x2a0
  wait_for_completion+0x28/0x38
  kthread_stop+0x60/0x190
  __free_irq+0x1c0/0x348
  free_irq+0x40/0x88
  pcie_shutdown_notification+0x54/0x80
  pciehp_remove+0x30/0x50
  pcie_port_remove_service+0x3c/0x58
  device_release_driver_internal+0x1b4/0x250
  device_release_driver+0x28/0x38
  bus_remove_device+0xd4/0x160
  device_del+0x128/0x348
  device_unregister+0x24/0x78
  remove_iter+0x48/0x58
  device_for_each_child+0x6c/0xb8
  pcie_port_device_remove+0x2c/0x48
  pcie_portdrv_remove+0x5c/0x68
  pci_device_remove+0x48/0xd8
  device_release_driver_internal+0x1b4/0x250
  device_release_driver+0x28/0x38
  pci_stop_bus_device+0x84/0xb8
  pci_stop_and_remove_bus_device_locked+0x24/0x40
  remove_store+0xa4/0xb8
  dev_attr_store+0x44/0x60
  sysfs_kf_write+0x58/0x80
  kernfs_fop_write+0xe8/0x1f0
  __vfs_write+0x60/0x190
  vfs_write+0xac/0x1c0
  ksys_write+0x6c/0xd8
  __arm64_sys_write+0x24/0x30
  el0_svc_common+0xa0/0x180
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc

When we remove a slot by sysfs.
'pci_stop_and_remove_bus_device_locked()' will be called. This function
will get the global mutex lock 'pci_rescan_remove_lock', and remove the
slot. If the irq thread 'pciehp_ist' is still running, we will wait
until it exits.

If a pciehp interrupt happens immediately after we remove the slot by
sysfs, but before we free the pciehp irq in
'pci_stop_and_remove_bus_device_locked()'. 'pciehp_ist' will hung
because the global mutex lock 'pci_rescan_remove_lock' is held by the
sysfs operation. But the sysfs operation is waiting for the pciehp irq
thread 'pciehp_ist' ends. Then a hung task occurs.

So this two kinds of operation, removing the slot triggered by pciehp
interrupt and removing through 'sysfs', should not be excuted at the
same time. This patch add a global variable to mark that one of these
operations is under processing. When this variable is set,  if another
operation is requested, it will be rejected.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
---
 drivers/pci/hotplug/pciehp.h      |  3 +++
 drivers/pci/hotplug/pciehp_ctrl.c | 13 ++++++++++
 drivers/pci/hotplug/pciehp_hpc.c  | 52 +++++++++++++++++++++++++++++++++++----
 drivers/pci/pci-sysfs.c           |  7 ++++++
 drivers/pci/remove.c              |  5 ++++
 include/linux/pci.h               |  3 +++
 6 files changed, 78 insertions(+), 5 deletions(-)

Comments

Lukas Wunner Aug. 11, 2019, 8:07 a.m. UTC | #1
On Mon, Aug 05, 2019 at 08:32:58PM +0800, Xiongfeng Wang wrote:
> When we remove a slot by sysfs.
> 'pci_stop_and_remove_bus_device_locked()' will be called. This function
> will get the global mutex lock 'pci_rescan_remove_lock', and remove the
> slot. If the irq thread 'pciehp_ist' is still running, we will wait
> until it exits.
> 
> If a pciehp interrupt happens immediately after we remove the slot by
> sysfs, but before we free the pciehp irq in
> 'pci_stop_and_remove_bus_device_locked()'. 'pciehp_ist' will hung
> because the global mutex lock 'pci_rescan_remove_lock' is held by the
> sysfs operation. But the sysfs operation is waiting for the pciehp irq
> thread 'pciehp_ist' ends. Then a hung task occurs.
> 
> So this two kinds of operation, removing the slot triggered by pciehp
> interrupt and removing through 'sysfs', should not be excuted at the
> same time. This patch add a global variable to mark that one of these
> operations is under processing. When this variable is set,  if another
> operation is requested, it will be rejected.

It seems this patch involves an ABI change wherein "remove" as documented
in Documentation/ABI/testing/sysfs-bus-pci may now fail and need a retry,
possibly breaking existing scripts which write to this file.  ABI changes
are fairly problematic.

The return value -EWOULDBLOCK (which is identical to -EAGAIN) might be
more appropriate than -EINVAL.

Another problem is that this patch only addresses deadlocks occurring
because of a "remove" via sysfs and a simultaneous surprise removal
(or button press removal).  However the same kind of deadlock may
occur because of two simultaneous surprise removals if one of the
two devices is a parent of the other.  It would be better to have
a solution which addresses all types of deadlocks caused by the
pci_rescan_remove_lock().  I provided you with a suggestion in this
e-mail:

https://lore.kernel.org/linux-pci/20190805114053.srbngho3wbziy2uy@wunner.de/

   "What you can do is add a flag to struct pci_dev (or the priv_flags
    embedded therein) to indicate that a device is about to be removed.
    Set this flag on all children of the device being removed before
    acquiring pci_lock_rescan_remove() and avoid taking that lock in
    pciehp_unconfigure_device() if the flag is set on the hotplug port.

    But again, that approach is just a band-aid and the real fix is to
    unbind devices without holding the lock."

Thanks,

Lukas
Xiongfeng Wang Aug. 15, 2019, 9:32 a.m. UTC | #2
On 2019/8/11 16:07, Lukas Wunner wrote:
> On Mon, Aug 05, 2019 at 08:32:58PM +0800, Xiongfeng Wang wrote:
>> When we remove a slot by sysfs.
>> 'pci_stop_and_remove_bus_device_locked()' will be called. This function
>> will get the global mutex lock 'pci_rescan_remove_lock', and remove the
>> slot. If the irq thread 'pciehp_ist' is still running, we will wait
>> until it exits.
>>
>> If a pciehp interrupt happens immediately after we remove the slot by
>> sysfs, but before we free the pciehp irq in
>> 'pci_stop_and_remove_bus_device_locked()'. 'pciehp_ist' will hung
>> because the global mutex lock 'pci_rescan_remove_lock' is held by the
>> sysfs operation. But the sysfs operation is waiting for the pciehp irq
>> thread 'pciehp_ist' ends. Then a hung task occurs.
>>
>> So this two kinds of operation, removing the slot triggered by pciehp
>> interrupt and removing through 'sysfs', should not be excuted at the
>> same time. This patch add a global variable to mark that one of these
>> operations is under processing. When this variable is set,  if another
>> operation is requested, it will be rejected.
> 
> It seems this patch involves an ABI change wherein "remove" as documented
> in Documentation/ABI/testing/sysfs-bus-pci may now fail and need a retry,
> possibly breaking existing scripts which write to this file.  ABI changes
> are fairly problematic.
> 
> The return value -EWOULDBLOCK (which is identical to -EAGAIN) might be
> more appropriate than -EINVAL.
> 
> Another problem is that this patch only addresses deadlocks occurring
> because of a "remove" via sysfs and a simultaneous surprise removal
> (or button press removal).  However the same kind of deadlock may
> occur because of two simultaneous surprise removals if one of the
> two devices is a parent of the other.  It would be better to have
> a solution which addresses all types of deadlocks caused by the
> pci_rescan_remove_lock().  I provided you with a suggestion in this
> e-mail:
> 
> https://lore.kernel.org/linux-pci/20190805114053.srbngho3wbziy2uy@wunner.de/
> 
>    "What you can do is add a flag to struct pci_dev (or the priv_flags
>     embedded therein) to indicate that a device is about to be removed.
>     Set this flag on all children of the device being removed before
>     acquiring pci_lock_rescan_remove() and avoid taking that lock in
>     pciehp_unconfigure_device() if the flag is set on the hotplug port.
> 
>     But again, that approach is just a band-aid and the real fix is to
>     unbind devices without holding the lock."

This similar dead lock also exists when AER events and remove via sysfs
happens at the same time. The calltrace is as follows.

[ 2541.494130] INFO: task kworker/86:1:603 blocked for more than 120 seconds.
[ 2541.507844]       Tainted: G        W  OE     4.19.30-vhulk1903.5.1.h163.eulerosv3r1.aarch64 #2
[ 2541.525188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2541.540799] kworker/86:1    D    0   603      2 0x00000028
[ 2541.551742] Workqueue: events aer_isr
[ 2541.559040] Call trace:
[ 2541.563914]  __switch_to+0x94/0xe8
[ 2541.570691]  __schedule+0x430/0xda0
[ 2541.577640]  schedule+0x2c/0x88
[ 2541.583897]  schedule_preempt_disabled+0x14/0x20
[ 2541.593098]  __mutex_lock.isra.1+0x1fc/0x540
[ 2541.601607]  __mutex_lock_slowpath+0x24/0x30
[ 2541.610116]  mutex_lock+0x80/0xa8
[ 2541.616720]  pci_lock_rescan_remove+0x20/0x28
[ 2541.625403]  pcie_do_fatal_recovery+0x48/0x230
[ 2541.634258]  handle_error_source+0xb4/0xc8
[ 2541.642419]  aer_isr+0x208/0x3b8
[ 2541.648849]  process_one_work+0x1b4/0x3f8
[ 2541.656837]  worker_thread+0x54/0x470
[ 2541.664134]  kthread+0x134/0x138
[ 2541.670564]  ret_from_fork+0x10/0x18
[ 2541.677711] INFO: task bash:42546 blocked for more than 120 seconds.
[ 2541.690379]       Tainted: G        W  OE     4.19.30-vhulk1903.5.1.h163.eulerosv3r1.aarch64 #2
[ 2541.707721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2541.723331] bash            D    0 42546  42544 0x00000200
[ 2541.734267] Call trace:
[ 2541.739138]  __switch_to+0x94/0xe8
[ 2541.745914]  __schedule+0x430/0xda0
[ 2541.752862]  schedule+0x2c/0x88
[ 2541.759120]  schedule_timeout+0x224/0x448
[ 2541.767108]  wait_for_common+0x198/0x2a0
[ 2541.774923]  wait_for_completion+0x28/0x38
[ 2541.783085]  __flush_work+0x128/0x280
[ 2541.790381]  flush_work+0x24/0x30
[ 2541.796985]  aer_remove+0x54/0x118
[ 2541.803761]  pcie_port_remove_service+0x3c/0x58
[ 2541.812790]  device_release_driver_internal+0x1b4/0x250
[ 2541.823204]  device_release_driver+0x28/0x38
[ 2541.831714]  bus_remove_device+0xd4/0x160
[ 2541.839703]  device_del+0x128/0x348
[ 2541.846652]  device_unregister+0x24/0x78
[ 2541.854468]  remove_iter+0x48/0x58
[ 2541.861243]  device_for_each_child+0x6c/0xb8
[ 2541.869752]  pcie_port_device_remove+0x2c/0x48
[ 2541.878607]  pcie_portdrv_remove+0x5c/0x68
[ 2541.886769]  pci_device_remove+0x48/0xd8
[ 2541.894585]  device_release_driver_internal+0x1b4/0x250
[ 2541.904998]  device_release_driver+0x28/0x38
[ 2541.913508]  pci_stop_bus_device+0x84/0xb8
[ 2541.921669]  pci_stop_and_remove_bus_device_locked+0x24/0x40
[ 2541.932949]  remove_store+0xa4/0xb8
[ 2541.939899]  dev_attr_store+0x44/0x60
[ 2541.947196]  sysfs_kf_write+0x58/0x80
[ 2541.954493]  kernfs_fop_write+0xe8/0x1f0
[ 2541.962308]  __vfs_write+0x60/0x190
[ 2541.969257]  vfs_write+0xac/0x1c0
[ 2541.975860]  ksys_write+0x6c/0xd8
[ 2541.982463]  __arm64_sys_write+0x24/0x30
[ 2541.990279]  el0_svc_common+0xa0/0x180
[ 2541.997749]  el0_svc_handler+0x38/0x78
[ 2542.005219]  el0_svc+0x8/0xc

I am not sure if this method can fix this issue. When the AER events occurs,
if we find the device is being removed, we can abort the AER recovery process.
But if the AER event occurs first and the remove via sysfs happens later,
after the remove via sysfs returns, the pci device is rescaned.
Or we should reject the remove via sysfs.
I think we need a way to differentiate remove because of AER and remove via sysfs.

Thanks,
Xiongfeng

> 
> Thanks,
> 
> Lukas
> 
> .
>
diff mbox series

Patch

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 8c51a04..b8983c1 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -58,6 +58,8 @@ 
  *	used for synchronous writes to the Slot Control register
  * @pending_events: used by the IRQ handler to save events retrieved from the
  *	Slot Status register for later consumption by the IRQ thread
+ * @delayed_events: when we delay an event handling because the slot is being
+ *	removed or rescanned, we use this member to pass the delayed event
  * @notification_enabled: whether the IRQ was requested successfully
  * @power_fault_detected: whether a power fault was detected by the hardware
  *	that has not yet been cleared by the user
@@ -91,6 +93,7 @@  struct controller {
 	wait_queue_head_t queue;
 
 	atomic_t pending_events;		/* event handling */
+	int delayed_events;
 	unsigned int notification_enabled:1;
 	unsigned int power_fault_detected;
 	struct task_struct *poll_thread;
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index 631ced0..f8b1212 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -141,6 +141,7 @@  void pciehp_queue_pushbutton_work(struct work_struct *work)
 {
 	struct controller *ctrl = container_of(work, struct controller,
 					       button_work.work);
+	int events = ctrl->delayed_events;
 
 	mutex_lock(&ctrl->state_lock);
 	switch (ctrl->state) {
@@ -151,6 +152,12 @@  void pciehp_queue_pushbutton_work(struct work_struct *work)
 		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
 		break;
 	default:
+		if (events) {
+			atomic_or(events, &ctrl->pending_events);
+			if (!pciehp_poll_mode)
+				irq_wake_thread(ctrl->pcie->irq, ctrl);
+		} else
+			slot_being_removed_rescanned = 0;
 		break;
 	}
 	mutex_unlock(&ctrl->state_lock);
@@ -174,6 +181,7 @@  void pciehp_handle_button_press(struct controller *ctrl)
 		/* blink green LED and turn off amber */
 		pciehp_green_led_blink(ctrl);
 		pciehp_set_attention_status(ctrl, 0);
+		ctrl->delayed_events = 0;
 		schedule_delayed_work(&ctrl->button_work, 5 * HZ);
 		break;
 	case BLINKINGOFF_STATE:
@@ -195,10 +203,12 @@  void pciehp_handle_button_press(struct controller *ctrl)
 		pciehp_set_attention_status(ctrl, 0);
 		ctrl_info(ctrl, "Slot(%s): Action canceled due to button press\n",
 			  slot_name(ctrl));
+		slot_being_removed_rescanned = 0;
 		break;
 	default:
 		ctrl_err(ctrl, "Slot(%s): Ignoring invalid state %#x\n",
 			 slot_name(ctrl), ctrl->state);
+		slot_being_removed_rescanned = 0;
 		break;
 	}
 	mutex_unlock(&ctrl->state_lock);
@@ -217,6 +227,7 @@  void pciehp_handle_disable_request(struct controller *ctrl)
 	mutex_unlock(&ctrl->state_lock);
 
 	ctrl->request_result = pciehp_disable_slot(ctrl, SAFE_REMOVAL);
+	slot_being_removed_rescanned = 0;
 }
 
 void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
@@ -254,6 +265,7 @@  void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
 	link_active = pciehp_check_link_active(ctrl);
 	if (!present && !link_active) {
 		mutex_unlock(&ctrl->state_lock);
+		slot_being_removed_rescanned = 0;
 		return;
 	}
 
@@ -276,6 +288,7 @@  void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events)
 		mutex_unlock(&ctrl->state_lock);
 		break;
 	}
+	slot_being_removed_rescanned = 0;
 }
 
 static int __pciehp_enable_slot(struct controller *ctrl)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index bd990e3..2db4731 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -631,7 +631,16 @@  static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	if (events & PCI_EXP_SLTSTA_ABP) {
 		ctrl_info(ctrl, "Slot(%s): Attention button pressed\n",
 			  slot_name(ctrl));
-		pciehp_handle_button_press(ctrl);
+		if (!test_and_set_bit(0, &slot_being_removed_rescanned))
+			pciehp_handle_button_press(ctrl);
+		else {
+			if (ctrl->state == BLINKINGOFF_STATE ||
+			    ctrl->state == BLINKINGON_STATE)
+				pciehp_handle_button_press(ctrl);
+			else
+				ctrl_info(ctrl, "Slot(%s): Slot operation failed because a remove or rescan operation is under processing, please try later!\n",
+					  slot_name(ctrl));
+		}
 	}
 
 	/* Check Power Fault Detected */
@@ -647,10 +656,43 @@  static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	 * or Data Link Layer State Changed events.
 	 */
 	down_read(&ctrl->reset_lock);
-	if (events & DISABLE_SLOT)
-		pciehp_handle_disable_request(ctrl);
-	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
-		pciehp_handle_presence_or_link_change(ctrl, events);
+	if (events & DISABLE_SLOT) {
+		if (!test_and_set_bit(0, &slot_being_removed_rescanned))
+			pciehp_handle_disable_request(ctrl);
+		else {
+			if (ctrl->state == BLINKINGOFF_STATE ||
+			    ctrl->state == BLINKINGON_STATE)
+				pciehp_handle_disable_request(ctrl);
+			/*
+			 * If 'button_work.timer' is pending, delay the work
+			 * will cause BUG_ON() in add_timer().
+			 */
+			else if (!timer_pending(&ctrl->button_work.timer)) {
+
+				ctrl->delayed_events = DISABLE_SLOT;
+				schedule_delayed_work(&ctrl->button_work,
+						      3 * HZ);
+			}
+		}
+	} else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+		if (!test_and_set_bit(0, &slot_being_removed_rescanned))
+			pciehp_handle_presence_or_link_change(ctrl, events);
+		else {
+			if (ctrl->state == BLINKINGOFF_STATE ||
+			    ctrl->state == BLINKINGON_STATE)
+				pciehp_handle_presence_or_link_change(ctrl,
+						events);
+			else if (!timer_pending(&ctrl->button_work.timer)) {
+				ctrl_info(ctrl, "Slot(%s): Surprise link down/up in remove or rescan process!\n",
+					  slot_name(ctrl));
+				ctrl->delayed_events = events &
+							(PCI_EXP_SLTSTA_PDC |
+							PCI_EXP_SLTSTA_DLLSC);
+				schedule_delayed_work(&ctrl->button_work,
+						      3 * HZ);
+			}
+		}
+	}
 	up_read(&ctrl->reset_lock);
 
 	pci_config_pm_runtime_put(pdev);
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 965c721..8ddfd0d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -473,11 +473,18 @@  static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 {
 	unsigned long val;
 
+	if (test_and_set_bit(0, &slot_being_removed_rescanned)) {
+		pr_info("Slot is being removed or rescanned, please try later!\n");
+		return -EINVAL;
+	}
+
 	if (kstrtoul(buf, 0, &val) < 0)
 		return -EINVAL;
 
 	if (val && device_remove_file_self(dev, attr))
 		pci_stop_and_remove_bus_device_locked(to_pci_dev(dev));
+
+	slot_being_removed_rescanned = 0;
 	return count;
 }
 static struct device_attribute dev_remove_attr = __ATTR_IGNORE_LOCKDEP(remove,
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index e9c6b12..6d15ad0 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -3,6 +3,11 @@ 
 #include <linux/module.h>
 #include "pci.h"
 
+/*
+ * When a slot is being removed/rescanned, this flag is set.
+ */
+unsigned long slot_being_removed_rescanned;
+
 static void pci_free_resources(struct pci_dev *dev)
 {
 	int i;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9e700d9..92575ee 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -952,6 +952,9 @@  enum pcie_bus_config_types {
 /* Do NOT directly access these two variables, unless you are arch-specific PCI
  * code, or PCI core code. */
 extern struct list_head pci_root_buses;	/* List of all known PCI buses */
+
+extern unsigned long slot_being_removed_rescanned;
+
 /* Some device drivers need know if PCI is initiated */
 int no_pci_devices(void);