Message ID | 1448302147-19272-2-git-send-email-bigeasy@linutronix.de |
---|---|
State | Superseded |
Headers | show |
On Mon, Nov 23, 2015 at 07:09:06PM +0100, Sebastian Andrzej Siewior wrote: > /* Release the controller and the chip */ > spin_lock(&chip->controller->lock); > chip->controller->active = NULL; > chip->state = FL_READY; > + /* > + * Check if we have a waiter. If so we will schedule() right away so the > + * waiter can grab the device while it is released and not after _this_ > + * caller gained the device (again) without leaving the CPU in between. > + */ > + if (waitqueue_active(&chip->controller->wq)) > + do_sched = true; > wake_up(&chip->controller->wq); > spin_unlock(&chip->controller->lock); > + if (do_sched) > + schedule(); I've not looked at the code, but this _cannot_ be a correct fix. schedule() can be a no-op, that is, current can be the most eligible task to run and be selected again. (with FIFO and this the highest prio task in the system that is a guarantee) I'll try and have an actual look at the problem description later.
diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c index ece544efccc3..3dc2dff01802 100644 --- a/drivers/mtd/nand/nand_base.c +++ b/drivers/mtd/nand/nand_base.c @@ -133,13 +133,23 @@ static int check_offs_len(struct mtd_info *mtd, static void nand_release_device(struct mtd_info *mtd) { struct nand_chip *chip = mtd->priv; + bool do_sched = false; /* Release the controller and the chip */ spin_lock(&chip->controller->lock); chip->controller->active = NULL; chip->state = FL_READY; + /* + * Check if we have a waiter. If so we will schedule() right away so the + * waiter can grab the device while it is released and not after _this_ + * caller gained the device (again) without leaving the CPU in between. + */ + if (waitqueue_active(&chip->controller->wq)) + do_sched = true; wake_up(&chip->controller->wq); spin_unlock(&chip->controller->lock); + if (do_sched) + schedule(); } /**
I have here a live lock in UBI doing ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb() MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling() on the same PEB over and over again. The reason for MOVE_RETRY is that the LEB-Lock owner is stucked in nand_get_device() and does not get the device lock. The PEB-lock owner is only scheduled on the CPU while the UBI thread is idle during erase or read while (again) owning the device-lock so the LEB-lock owner makes no progress. To fix this live lock the patch adds a schedule() invocation if the wait queue for the nand-device lock is not empty so the waiter can grab the lock and make progress. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> --- drivers/mtd/nand/nand_base.c | 10 ++++++++++ 1 file changed, 10 insertions(+)