diff mbox

[RFC,1/2] mtd: nand: schedule() after releasing the device

Message ID 1448302147-19272-2-git-send-email-bigeasy@linutronix.de
State Superseded
Headers show

Commit Message

Sebastian Andrzej Siewior Nov. 23, 2015, 6:09 p.m. UTC
I have here a live lock in UBI doing
  ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb()
  MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling()

on the same PEB over and over again. The reason for MOVE_RETRY is that
the LEB-Lock owner is stucked in nand_get_device() and does not get the
device lock. The PEB-lock owner is only scheduled on the CPU while the UBI
thread is idle during erase or read while (again) owning the device-lock
so the LEB-lock owner makes no progress.

To fix this live lock the patch adds a schedule() invocation if the wait
queue for the nand-device lock is not empty so the waiter can grab the
lock and make progress.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/mtd/nand/nand_base.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Peter Zijlstra Nov. 23, 2015, 6:18 p.m. UTC | #1
On Mon, Nov 23, 2015 at 07:09:06PM +0100, Sebastian Andrzej Siewior wrote:
>  	/* Release the controller and the chip */
>  	spin_lock(&chip->controller->lock);
>  	chip->controller->active = NULL;
>  	chip->state = FL_READY;
> +	/*
> +	 * Check if we have a waiter. If so we will schedule() right away so the
> +	 * waiter can grab the device while it is released and not after _this_
> +	 * caller gained the device (again) without leaving the CPU in between.
> +	 */
> +	if (waitqueue_active(&chip->controller->wq))
> +		do_sched = true;
>  	wake_up(&chip->controller->wq);
>  	spin_unlock(&chip->controller->lock);
> +	if (do_sched)
> +		schedule();

I've not looked at the code, but this _cannot_ be a correct fix.

schedule() can be a no-op, that is, current can be the most eligible
task to run and be selected again.

(with FIFO and this the highest prio task in the system that is a
guarantee)

I'll try and have an actual look at the problem description later.
diff mbox

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index ece544efccc3..3dc2dff01802 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -133,13 +133,23 @@  static int check_offs_len(struct mtd_info *mtd,
 static void nand_release_device(struct mtd_info *mtd)
 {
 	struct nand_chip *chip = mtd->priv;
+	bool do_sched = false;
 
 	/* Release the controller and the chip */
 	spin_lock(&chip->controller->lock);
 	chip->controller->active = NULL;
 	chip->state = FL_READY;
+	/*
+	 * Check if we have a waiter. If so we will schedule() right away so the
+	 * waiter can grab the device while it is released and not after _this_
+	 * caller gained the device (again) without leaving the CPU in between.
+	 */
+	if (waitqueue_active(&chip->controller->wq))
+		do_sched = true;
 	wake_up(&chip->controller->wq);
 	spin_unlock(&chip->controller->lock);
+	if (do_sched)
+		schedule();
 }
 
 /**