diff mbox

Hang on reboot in nand_get_device()

Message ID 20151106180052.GE12143@google.com
State Accepted
Commit 9ca641b0f02a3a1eedbc8c296e695326da9bbaf9
Headers show

Commit Message

Brian Norris Nov. 6, 2015, 6 p.m. UTC
+ others

Hi Andrew,

Sorry for the delay here. I overlooked this.

On Thu, Jul 02, 2015 at 03:21:48PM -0400, Andrew E. Mileski wrote:
> I'm experiencing a hang on reboot with a Freescale P1022 PowerPC system, with a
> dual chip-select NAND part (specified in the device tree as two
> separate devices), and kernel v4.0.6.

Which driver?

> It appears to be a hang in mtd/nand/nand_base.c:nand_get_device() waiting on
> chip->controller->active.
> 
> Shouldn't nand_shutdown(), or perhaps a special case in nand_get_device() for
> FL_SHUTDOWN, set chip->controller->active = NULL before returning?

I don't think that's exactly the right solution, but you're in the right
ballpark I expect.

> This seems to fix the problem for me, but I don't know all the code well enough
> to know whether doing so is appropriate, or sufficient.

I'm going to guess you're seeing two reboot handlers trying to 'get' the
same controller structure. We could probably confirm that if we see your
driver.

But I see this could be a problem with a wide class of drivers.
Basically, any NAND driver that has multiple NAND chips attached will
see multiple reboot handlers that point at the same controller lock.
This will obviously deadlock, since only one of the chips will make it
through the nand_shutdown() function successfully.

And now that I begin describing the problem and grepping through the
source logs... I see that this problem was already resolved back in
2009, except for the FL_PM_SUSPENDED mode:

commit 6b0d9a84124937f048bcb8b21313152b23063978
Author: Li Yang <leoli@freescale.com>
Date:   Tue Nov 17 14:45:49 2009 -0800

    mtd: nand: fix multi-chip suspend problem

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6b0d9a84124937f048bcb8b21313152b23063978
    
I actually don't see why we can't just use
nand_get_device(FL_PM_SUSPENDED) for the shutdown/reboot case, like
this:



It's also possible that this could be better solved in a proper
refactor/rewrite of the NAND subsystem using a better controller/chip
split, so there's only one reboot handler per NAND controller. Boris has
been looking at that.

Anyway, if the above looks OK, I can send a proper patch and get your
'Tested-by's.

Brian

Comments

Andrew E. Mileski Nov. 9, 2015, 6:43 p.m. UTC | #1
On 2015-11-06 13:00, Brian Norris wrote:
>
> On Thu, Jul 02, 2015 at 03:21:48PM -0400, Andrew E. Mileski wrote:
>> I'm experiencing a hang on reboot with a Freescale P1022 PowerPC system, with a
>> dual chip-select NAND part (specified in the device tree as two
>> separate devices), and kernel v4.0.6.
>
> Which driver?

Sorry, I wasn't specific enough:  fsl_elbc_nand.c

Currently using the 4.2.0 Linux kernel version, and the problem persists.

I've actually made a small hack to it to enable software ECC, but it isn't fit 
for distribution.

The stock driver is rather ugly in my opinion, and makes some overly broad 
assumptions about the hardware; the driver really is only intended to support 
hardware ECC.  I've taken a few stabs at cleaning-up the stock driver, but they 
all end-up being complete re-writes (for multi-die support), which is a bit more 
than I can chew.

The driver otherwise is usable as-is, albeit with only hardware ECC (that is 
insufficient for many modern NAND devices), and exhibits the same problem.

>> It appears to be a hang in mtd/nand/nand_base.c:nand_get_device() waiting on
>> chip->controller->active.
>
> And now that I begin describing the problem and grepping through the
> source logs... I see that this problem was already resolved back in
> 2009, except for the FL_PM_SUSPENDED mode:
>
> commit 6b0d9a84124937f048bcb8b21313152b23063978
> Author: Li Yang <leoli@freescale.com>
> Date:   Tue Nov 17 14:45:49 2009 -0800
>
>      mtd: nand: fix multi-chip suspend problem
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6b0d9a84124937f048bcb8b21313152b23063978
>
> I actually don't see why we can't just use
> nand_get_device(FL_PM_SUSPENDED) for the shutdown/reboot case, like
> this:
>
> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> index cc74142938b0..ece544efccc3 100644
> --- a/drivers/mtd/nand/nand_base.c
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -3110,7 +3110,7 @@ static void nand_resume(struct mtd_info *mtd)
>    */
>   static void nand_shutdown(struct mtd_info *mtd)
>   {
> -	nand_get_device(mtd, FL_SHUTDOWN);
> +	nand_get_device(mtd, FL_PM_SUSPENDED);
>   }
>
>   /* Set default functions */
>
>
> It's also possible that this could be better solved in a proper
> refactor/rewrite of the NAND subsystem using a better controller/chip
> split, so there's only one reboot handler per NAND controller. Boris has
> been looking at that.
>
> Anyway, if the above looks OK, I can send a proper patch and get your
> 'Tested-by's.
>
> Brian

That seems to work too!  Thanks!

~~Andrew E. Mileski
Brian Norris Nov. 9, 2015, 7:16 p.m. UTC | #2
On Mon, Nov 09, 2015 at 01:43:41PM -0500, Andrew E. Mileski wrote:
> On 2015-11-06 13:00, Brian Norris wrote:
> >On Thu, Jul 02, 2015 at 03:21:48PM -0400, Andrew E. Mileski wrote:
> >>I'm experiencing a hang on reboot with a Freescale P1022 PowerPC system, with a
> >>dual chip-select NAND part (specified in the device tree as two
> >>separate devices), and kernel v4.0.6.
> >
> >Which driver?
> 
> Sorry, I wasn't specific enough:  fsl_elbc_nand.c
> 
> Currently using the 4.2.0 Linux kernel version, and the problem persists.
> 
> I've actually made a small hack to it to enable software ECC, but it
> isn't fit for distribution.
> 
> The stock driver is rather ugly in my opinion, and makes some overly
> broad assumptions about the hardware; the driver really is only
> intended to support hardware ECC.  I've taken a few stabs at
> cleaning-up the stock driver, but they all end-up being complete
> re-writes (for multi-die support), which is a bit more than I can
> chew.
> 
> The driver otherwise is usable as-is, albeit with only hardware ECC
> (that is insufficient for many modern NAND devices), and exhibits
> the same problem.

Yikes, I wish I hadn't actually taken a closer look at that driver :(
It could really use some rewrites for proper multi-chip/multi-die
support.

...

> >Anyway, if the above looks OK, I can send a proper patch and get your
> >'Tested-by's.
> >
> >Brian
> 
> That seems to work too!  Thanks!

Great! I'll review Boris's suggestions to see if there's a better
option.

Brian
diff mbox

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index cc74142938b0..ece544efccc3 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -3110,7 +3110,7 @@  static void nand_resume(struct mtd_info *mtd)
  */
 static void nand_shutdown(struct mtd_info *mtd)
 {
-	nand_get_device(mtd, FL_SHUTDOWN);
+	nand_get_device(mtd, FL_PM_SUSPENDED);
 }
 
 /* Set default functions */