
[v3,39/41] cxlflash: Synchronize reset and remove ops

Message ID 1522082127-58900-1-git-send-email-ukrishn@linux.vnet.ibm.com (mailing list archive)
State Not Applicable
Series cxlflash: OCXL transport support and miscellaneous fixes

Commit Message

Uma Krishnan March 26, 2018, 4:35 p.m. UTC
The following Oops can be encountered if a device removal or system
shutdown is initiated while an EEH recovery is in process:

[c000000ff2f479c0] c008000015256f18 cxlflash_pci_slot_reset+0xa0/0x100
                                      [cxlflash]
[c000000ff2f47a30] c00800000dae22e0 cxl_pci_slot_reset+0x168/0x290 [cxl]
[c000000ff2f47ae0] c00000000003ef1c eeh_report_reset+0xec/0x170
[c000000ff2f47b20] c00000000003d0b8 eeh_pe_dev_traverse+0x98/0x170
[c000000ff2f47bb0] c00000000003f80c eeh_handle_normal_event+0x56c/0x580
[c000000ff2f47c60] c00000000003fba4 eeh_handle_event+0x2a4/0x338
[c000000ff2f47d10] c0000000000400b8 eeh_event_handler+0x1f8/0x200
[c000000ff2f47dc0] c00000000013da48 kthread+0x1a8/0x1b0
[c000000ff2f47e30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4

The remove handler frees AFU memory while the EEH recovery is in progress,
leading to a race condition. This can result in a crash if the recovery
thread tries to access this memory.

To resolve this issue, the cxlflash remove handler will evaluate the
device state and yield to any active reset or probing threads.

Signed-off-by: Uma Krishnan <ukrishn@linux.vnet.ibm.com>
---
 drivers/scsi/cxlflash/main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
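
The ordering that the commit message describes (the remove path waits until no reset or probe is active before it frees AFU memory) can be sketched in user space. This is only an illustration under stated assumptions, not the driver's code: a pthread mutex and condition variable stand in for the kernel's wait_event() on cfg->reset_waitq, and struct dev_cfg, recovery_thread(), dev_remove() and run_demo() are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's device states. */
enum dev_state { STATE_PROBING, STATE_NORMAL, STATE_RESET };

/* Hypothetical, heavily reduced model of the per-device config. */
struct dev_cfg {
	pthread_mutex_t lock;
	pthread_cond_t reset_waitq;	/* models cfg->reset_waitq */
	enum dev_state state;
	int *afu;			/* models AFU memory freed on remove */
	int seen;			/* value the recovery thread read back */
};

/* Models the recovery thread: it uses AFU memory, then leaves STATE_RESET. */
static void *recovery_thread(void *arg)
{
	struct dev_cfg *cfg = arg;

	/* Would be a use-after-free if remove had already freed cfg->afu. */
	*cfg->afu = 42;

	pthread_mutex_lock(&cfg->lock);
	cfg->seen = *cfg->afu;
	cfg->state = STATE_NORMAL;
	pthread_cond_broadcast(&cfg->reset_waitq);	/* wake any waiter */
	pthread_mutex_unlock(&cfg->lock);
	return NULL;
}

/* Models the remove handler: yield to reset/probe before freeing. */
static void dev_remove(struct dev_cfg *cfg)
{
	pthread_mutex_lock(&cfg->lock);
	while (cfg->state == STATE_RESET || cfg->state == STATE_PROBING)
		pthread_cond_wait(&cfg->reset_waitq, &cfg->lock);
	assert(cfg->state == STATE_NORMAL);
	pthread_mutex_unlock(&cfg->lock);

	free(cfg->afu);
	cfg->afu = NULL;
}

/* Returns 0 if recovery finished with valid memory before remove freed it. */
int run_demo(void)
{
	struct dev_cfg cfg = { .state = STATE_RESET, .seen = 0 };
	pthread_t tid;

	pthread_mutex_init(&cfg.lock, NULL);
	pthread_cond_init(&cfg.reset_waitq, NULL);
	cfg.afu = calloc(1, sizeof(*cfg.afu));
	if (!cfg.afu)
		return -1;

	/* The device is already in STATE_RESET when remove is requested. */
	if (pthread_create(&tid, NULL, recovery_thread, &cfg) != 0) {
		free(cfg.afu);
		return -1;
	}
	dev_remove(&cfg);
	pthread_join(tid, NULL);

	pthread_cond_destroy(&cfg.reset_waitq);
	pthread_mutex_destroy(&cfg.lock);
	return (cfg.seen == 42 && cfg.afu == NULL) ? 0 : -1;
}
```

Calling run_demo() from a small main() and building with -pthread is enough to try it. Dropping the wait loop from dev_remove() reintroduces the race in the sketch: the free can then run before the recovery thread's write.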

Comments

Matthew R. Ochs March 28, 2018, 2:43 p.m. UTC | #1
On Mon, Mar 26, 2018 at 11:35:27AM -0500, Uma Krishnan wrote:
> The following Oops can be encountered if a device removal or system
> shutdown is initiated while an EEH recovery is in process:
> 
> [c000000ff2f479c0] c008000015256f18 cxlflash_pci_slot_reset+0xa0/0x100
>                                       [cxlflash]
> [c000000ff2f47a30] c00800000dae22e0 cxl_pci_slot_reset+0x168/0x290 [cxl]
> [c000000ff2f47ae0] c00000000003ef1c eeh_report_reset+0xec/0x170
> [c000000ff2f47b20] c00000000003d0b8 eeh_pe_dev_traverse+0x98/0x170
> [c000000ff2f47bb0] c00000000003f80c eeh_handle_normal_event+0x56c/0x580
> [c000000ff2f47c60] c00000000003fba4 eeh_handle_event+0x2a4/0x338
> [c000000ff2f47d10] c0000000000400b8 eeh_event_handler+0x1f8/0x200
> [c000000ff2f47dc0] c00000000013da48 kthread+0x1a8/0x1b0
> [c000000ff2f47e30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4
> 
> The remove handler frees AFU memory while the EEH recovery is in progress,
> leading to a race condition. This can result in a crash if the recovery
> thread tries to access this memory.
> 
> To resolve this issue, the cxlflash remove handler will evaluate the
> device state and yield to any active reset or probing threads.
> 
> Signed-off-by: Uma Krishnan <ukrishn@linux.vnet.ibm.com>

Looks good!

Acked-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>

Patch

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 42a95b7..dfe7648 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -946,9 +946,9 @@  static void cxlflash_remove(struct pci_dev *pdev)
 		return;
 	}
 
-	/* If a Task Management Function is active, wait for it to complete
-	 * before continuing with remove.
-	 */
+	/* Yield to running recovery threads before continuing with remove */
+	wait_event(cfg->reset_waitq, cfg->state != STATE_RESET &&
+				     cfg->state != STATE_PROBING);
 	spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
 	if (cfg->tmf_active)
 		wait_event_interruptible_lock_irq(cfg->tmf_waitq,