[RFC] mtd: rawnand: Cure MICRON NAND partial erase issue

Message ID alpine.DEB.2.21.1811292207570.1657@nanos.tec.linutronix.de
State RFC
Delegated to: Boris Brezillon

Commit Message

Thomas Gleixner Nov. 29, 2018, 9:12 p.m. UTC
On some Micron NAND chips block erase fails occasionally despite the chip
claiming that it succeeded. The flash block seems to be not completely
erased and subsequent usage of the block results in hard to decode and very
subtle failures or corruption.

The exact reason is unknown, but experimentation has shown that it is only
happening when erasing an erase block which is partially written. Partially
written erase blocks are not uncommon with UBI/UBIFS. Note that this does
not always happen. It's a rare and random, but eventually fatal, failure.

For now, just blindly write 6 pages to 0. Again experimentation has shown
that it's not sufficient to write pages at the beginning of the erase
block. There need to be pages written in the second half of the erase block
as well. So write 3 pages before and 3 pages past the middle of the block.
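
As an illustration of which pages that window covers, here is a minimal
userspace sketch of the offset arithmetic used in the patch. The helper name
quirk_first_page() and the example geometry (128 KiB erase block, 2 KiB page)
are assumptions made for this sketch only; they are not part of the patch.

```c
#include <assert.h>

#define QUIRK_PAGES 6	/* same count as NAND_ERASE_QUIRK_PAGES in the patch */

/*
 * Hypothetical helper: return the first page, relative to the erase block
 * base, of the window that straddles the middle of the block.
 * Mirrors: offs = (pages_per_block >> 1) - (NAND_ERASE_QUIRK_PAGES / 2)
 */
static unsigned int quirk_first_page(unsigned int erase_shift,
				     unsigned int page_shift)
{
	unsigned int pages_per_block;

	assert(erase_shift > page_shift);
	pages_per_block = 1u << (erase_shift - page_shift);
	/* The window must fit inside the block. */
	assert(pages_per_block >= QUIRK_PAGES);

	return (pages_per_block >> 1) - (QUIRK_PAGES / 2);
}
```

With an assumed 64 pages per block (erase shift 17, page shift 11) the window
starts at page 29 and covers pages 29 to 34, i.e. three pages on either side
of the midpoint at page 32.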

Fewer than 6 pages might be sufficient, but it might even be necessary to
write more pages to make sure that it's completely cured. Two pages still
failed, but the 6 held up in a stress test scenario.

This should be optimized by keeping track of writes, but that needs proper
information about the issue.
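
Purely as a sketch of what "keeping track of writes" could look like, the
following userspace code keeps one bit per page written since the last erase
and decides whether a block looks like the problematic case. Everything here
is hypothetical: struct block_writes, the function names and the 64 page
geometry are invented for illustration, and the needs_fill() policy is only
one possible reading of the observations above, since the root cause is
unknown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 64u	/* assumed geometry, not taken from the patch */

/* Hypothetical per-block record: one bit per page written since last erase. */
struct block_writes {
	uint64_t written;
};

/* Record that a page of this block has been programmed. */
static void track_write(struct block_writes *b, unsigned int page)
{
	assert(page < PAGES_PER_BLOCK);
	b->written |= UINT64_C(1) << page;
}

/* Forget all recorded writes once the block has been erased. */
static void track_erase(struct block_writes *b)
{
	b->written = 0;
}

/* True if at least one page in the second half of the block was written. */
static bool second_half_written(const struct block_writes *b)
{
	return (b->written >> (PAGES_PER_BLOCK / 2)) != 0;
}

/*
 * Assumed policy: a block needs the extra writes before erase only if it
 * has been written at all, but nothing landed in its second half.
 */
static bool needs_fill(const struct block_writes *b)
{
	return b->written != 0 && !second_half_written(b);
}
```

Such in-memory state is lost across a reboot, so a real implementation would
still need a fallback such as reading the pages back.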

As it's just observation and experimentation based, it's probably wise to
hold off on this until there is proper clarification about the root cause
of the problem. The patch is for reference so others can avoid having to
decode this again, but there is no guarantee that it actually fixes the issue
completely.

Therefore:

Not-yet-signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Cc: Boris Brezillon <boris.brezillon@bootlin.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>

---

P.S.: This was debugged on an older kernel version (sigh) and ported
      forward without actual testing on mainline. My MTD foo is a bit
      rusty, so I won't be surprised if there are better ways to do that.

---
 drivers/mtd/nand/raw/nand_base.c   |   89 +++++++++++++++++++++++++++++++++++++
 drivers/mtd/nand/raw/nand_micron.c |    6 ++
 include/linux/mtd/rawnand.h        |    3 +
 3 files changed, 98 insertions(+)

Comments

Boris Brezillon Dec. 2, 2018, 7:29 a.m. UTC | #1
+Bean,

Hi Thomas,

First of all, I'd like to thank you for sharing this patch. I'm
pretty sure this will save a lot of people days of painful debug
sessions.

On Thu, 29 Nov 2018 22:12:50 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On some Micron NAND chips block erase fails occasionally despite the chip
> claiming that it succeeded. The flash block seems to be not completely
> erased and subsequent usage of the block results in hard to decode and very
> subtle failures or corruption.
> 
> The exact reason is unknown, but experimentation has shown that it is only
> happening when erasing an erase block which is partially written. Partially
> written erase blocks are not uncommon with UBI/UBIFS. Note that this does
> not always happen. It's a rare and random, but eventually fatal, failure.
> 
> For now, just blindly write 6 pages to 0. Again experimentation has shown
> that it's not sufficient to write pages at the beginning of the erase
> block. There need to be pages written in the second half of the erase block
> as well. So write 3 pages before and 3 pages past the middle of the block.
> 
> Fewer than 6 pages might be sufficient, but it might even be necessary to
> write more pages to make sure that it's completely cured. Two pages still
> failed, but the 6 held up in a stress test scenario.
> 
> This should be optimized by keeping track of writes, but that needs proper
> information about the issue.
> 
> As it's just observation and experimentation based, it's probably wise to
> hold off on this until there is proper clarification about the root cause
> of the problem. The patch is for reference so others can avoid having to
> decode this again, but there is no guarantee that it actually fixes the issue
> completely.

I agree. I Cc-ed Bean from Micron. Maybe he can provide more
information on this issue.

> 
> Therefore:
> 
> Not-yet-signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Cc: Boris Brezillon <boris.brezillon@bootlin.com>
> Cc: Miquel Raynal <miquel.raynal@bootlin.com>
> Cc: Richard Weinberger <richard@nod.at>
> 
> ---
> 
> P.S.: This was debugged on an older kernel version (sigh) and ported
>       forward without actual testing on mainline. My MTD foo is a bit
>       rusty, so I won't be surprised if there are better ways to do that.

Let's first wait for Bean's feedback before discussing implementation
details. BTW, do you remember the part number(s) of the flash(es)
impacted by this problem in your case?

Thanks,

Boris

Thomas Gleixner Dec. 2, 2018, 2:22 p.m. UTC | #2
On Sun, 2 Dec 2018, Boris Brezillon wrote:
> First of all, I'd like to thank you for sharing this patch. I'm
> pretty sure this will save a lot of people days of painful debug
> sessions.

Yeah. It's painful because it's a sporadic failure.

> On Thu, 29 Nov 2018 22:12:50 +0100 (CET)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> > P.S.: This was debugged on an older kernel version (sigh) and ported
> >       forward without actual testing on mainline. My MTD foo is a bit
> >       rusty, so I won't be surprised if there are better ways to do that.
> 
> Let's first wait for Bean's feedback before discussing implementation
> details. BTW, do you remember the part number(s) of the flash(es)
> impacted by this problem in your case?

MT29F8G08 is one of them. The other one I can't tell right now due to
traveling.

Thanks,

	Thomas
Bean Huo Dec. 7, 2018, 1:12 p.m. UTC | #3
>Let's first wait for Bean's feedback before discussing implementation details.
>BTW, do you remember the part number(s) of the flash(es) impacted by this
>problem in your case?
>
Thanks for letting me know about this issue. I will look into it.

>Thanks,
>
>Boris
>
Miquel Raynal Dec. 10, 2018, 3:40 p.m. UTC | #4
Hi Bean,

"Bean Huo (beanhuo)" <beanhuo@micron.com> wrote on Fri, 7 Dec 2018
13:12:56 +0000:

> >Let's first wait for Bean's feedback before discussing implementation details.
> >BTW, do you remember the part number(s) of the flash(es) impacted by this
> >problem in your case?
> >  
> Thanks for letting me know about this issue. I will look into it.

I think it's time for you to comment on the situation.


Thanks,
Miquèl
Bean Huo Dec. 19, 2018, 11:04 a.m. UTC | #5
Hi, all
Micron developed a patch related to the issue mentioned here; it is based on v4.2.
I just submitted it here: http://lists.infradead.org/pipermail/linux-mtd/2018-December/086446.html
Please review it. I will update it later based on your comments.

Patch

--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -4122,6 +4122,91 @@  static int nand_erase(struct mtd_info *m
 	return nand_erase_nand(mtd_to_nand(mtd), instr, 0);
 }
 
+static bool page_empty(char *buf, int len)
+{
+	unsigned int *p = (unsigned int *) buf;
+	int i;
+
+	for (i = 0; i < len >> 2; i++, p++) {
+		if (*p != UINT_MAX)
+			return false;
+	}
+	return true;
+}
+
+#define NAND_ERASE_QUIRK_PAGES		6
+
+/**
+ * nand_erase_quirk - [INTERN] Work around partial erase issues
+ * @mtd:	MTD device structure
+ * @page:	Eraseblock base page number
+ *
+ * On some Micron NAND chips block erase fails occasionally despite the chip
+ * claiming that it succeeded. The flash block seems to be not completely
+ * erased and subsequent usage of the block results in hard to decode and
+ * very subtle failures or corruption.
+ *
+ * The exact reason is unknown, but experimentation has shown that it is
+ * only happening when erasing an erase block which is only partially
+ * written. Partially written erase blocks are not uncommon with UBI/UBIFS.
+ * Note that this does not always happen. It's a rare and random, but
+ * eventually fatal, failure.
+ *
+ * For now, just blindly write 6 pages to 0. Again experimentation has
+ * shown that it's not sufficient to write pages at the beginning of the
+ * erase block. There need to be pages written in the second half of the
+ * erase block as well. So write 3 pages before and 3 pages past the middle
+ * of the block.
+ *
+ * Fewer than 6 pages might be sufficient, but it might even be necessary to
+ * write more pages to make sure that it's completely cured. 2 pages still
+ * failed, but the 6 held up in a stress test scenario.
+ *
+ * FIXME: This should be optimized by keeping track of writes, but that
+ * needs proper information about the issue.
+ */
+static int nand_erase_quirk(struct mtd_info *mtd, int page)
+{
+	struct nand_chip *chip = mtd_to_nand(mtd);
+	unsigned int i, offs;
+	u8 *buf;
+
+	if (!(chip->options & NAND_ERASE_QUIRK))
+		return 0;
+
+	buf = kmalloc(mtd->writesize, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	/* Start at (pages_per_block / 2) - 3 */
+	offs = 1 << (chip->phys_erase_shift - chip->page_shift);
+	offs = (offs >> 1) - (NAND_ERASE_QUIRK_PAGES / 2);
+	page = page + offs;
+
+	for (i = 0; i < NAND_ERASE_QUIRK_PAGES; i++, page++) {
+		struct mtd_oob_ops ops = {
+			.datbuf	= buf,
+			.len	= mtd->writesize,
+		};
+
+		/*
+		 * Read the page back and check whether it is completely
+		 * empty.
+		 */
+		nand_do_read_ops(mtd, (loff_t)page << chip->page_shift, &ops);
+		if (page_empty(buf, mtd->writesize))
+			continue;
+		memset(buf, 0, mtd->writesize);
+		/*
+		 * Fill page with zeros. Ignore write failure as there
+		 * is no way to recover here.
+		 */
+		nand_do_write_ops(mtd, (loff_t)page << chip->page_shift, &ops);
+	}
+	kfree(buf);
+	return 0;
+}
+
 /**
  * nand_erase_nand - [INTERN] erase block(s)
  * @chip: NAND chip object
@@ -4186,6 +4271,10 @@  int nand_erase_nand(struct nand_chip *ch
 		    (page + pages_per_block))
 			chip->pagebuf = -1;
 
+		ret = nand_erase_quirk(mtd, page);
+		if (ret)
+			goto erase_exit;
+
 		if (chip->legacy.erase)
 			status = chip->legacy.erase(chip,
 						    page & chip->pagemask);
--- a/drivers/mtd/nand/raw/nand_micron.c
+++ b/drivers/mtd/nand/raw/nand_micron.c
@@ -447,6 +447,12 @@  static int micron_nand_init(struct nand_
 	if (ret)
 		goto err_free_manuf_data;
 
+	/*
+	 * FIXME: Mark all Micron flash with the ERASE QUIRK bit for now as
+	 * it is unclear which flash types are affected.
+	 */
+	chip->options |= NAND_ERASE_QUIRK;
+
 	if (mtd->writesize == 2048)
 		chip->bbt_options |= NAND_BBT_SCAN2NDPAGE;
 
--- a/include/linux/mtd/rawnand.h
+++ b/include/linux/mtd/rawnand.h
@@ -163,6 +163,9 @@  enum nand_ecc_algo {
 /* Device needs 3rd row address cycle */
 #define NAND_ROW_ADDR_3		0x00004000
 
+/* Device requires erase quirk */
+#define NAND_ERASE_QUIRK	0x00008000
+
 /* Options valid for Samsung large page devices */
 #define NAND_SAMSUNG_LP_OPTIONS NAND_CACHEPRG
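
A note on the page_empty() helper in the patch: it compares 32-bit words, so
it would skip trailing bytes if the length were not a multiple of 4. NAND page
sizes are multiples of 4 in practice, so this does not matter here. A
byte-wise variant avoids the cast and the length assumption. The following is
a userspace sketch only, and the name page_is_erased() is made up for this
example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical byte-wise erased-page check: returns true only if every
 * byte of the buffer reads back as 0xff, for any length.
 */
static bool page_is_erased(const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] != 0xff)
			return false;
	}
	return true;
}
```

Inside the kernel the same check can be written as
memchr_inv(buf, 0xff, len) == NULL.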