Patchwork bbt and bitflip

login
register
mail settings
Submitter Matthieu CASTET
Date April 21, 2011, 5:17 p.m.
Message ID <4DB06693.5000100@parrot.com>
Download mbox | patch
Permalink /patch/92430/
State New
Headers show

Comments

Matthieu CASTET - April 21, 2011, 5:17 p.m.
Matthieu CASTET a écrit :
> Hi,
> 
> the current bad block table implementation doesn't seem robust against bit flip.
> 
> at boot we call :
> - search_read_bbts which scan for bbt using oob pattern.
> - check_create
> -- read_abs_bbt
> --- read_bbt which ignore ecc bit flip/error
> 
> So if bit flip happen in BBT, we never scrub it.
> And if bit flip accumulate and we can't correct it anymore, the code will parse
> the corrupted data and our bad block info will be wrong (valid block can be
> marked as bad and we lose bad, bad block can be see as valid).
> 
> 
> Also the pattern and version in oob isn't protected by ecc. They can be corrupted.
> 
> Are bbt safe to use ?
> 
> Are there any plan to make the bbt more robust ?
> 
Here a quick and dirty patch to make them more robust.
Any comment are welcomed.


Matthieu
Artem Bityutskiy - April 22, 2011, 8:14 a.m.
On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote:
> 			}
> +			if (res != -EUCLEAN) {
> +				printk(KERN_INFO "nand_bbt: Error reading bad block table2\n");
> +				return res;
> +			}

Just a side note:

I tend to think that we need to veto all mtd patches which touch/add
prints and force people to first turn all the ugly MTD printing mess
into dev_dbg/dev_err/etc messages... But I cannot actually propose it
before I clean up UBI/UBIFS in this respect :-)))
Artem Bityutskiy - April 22, 2011, 8:15 a.m.
On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote:
> Here a quick and dirty patch to make them more robust.
> Any comment are welcomed.

Would be great if you could test it and submit a nice patch with proper
commit message and Signed-off-by.
Brian Norris - June 23, 2011, 4:36 p.m.
Hello,

On Fri, Apr 22, 2011 at 1:15 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote:
>> Here a quick and dirty patch to make them more robust.
>> Any comment are welcomed.
>
> Would be great if you could test it and submit a nice patch with proper
> commit message and Signed-off-by.

I am interested in Artem's comments on the robustness of flash-based
BBT (here, and more recently on
http://lists.infradead.org/pipermail/linux-mtd/2011-June/036557.html).
I recently have moved to using flash-based BBT (in-band, actually),
and it seemed like several NAND drivers use flash-based BBT as well.
Is it really that un-trustworthy?

So I guess I'm looking for areas of improvement. I see a few suggestions here:

"Also the pattern and version in oob isn't protected by ecc. They can
be corrupted."

I noticed this one recently. My hardware ECC actually *can* do ECC for
what little OOB is actually free, but I realized that the nand_base
code doesn't even try to check the ECC stats (in nand_do_read_oob())
whereas some similar code for nand_do_read_ops *does* check the ECC
stats. Is it fair to adapt the code from nand_do_read_ops to
nand_do_read_oob so that both check for ECC errors, just in case the
hardware supports it? Would this cause any API problems, if users
didn't expect OOB reads to return ECC error statuses? For now, this
would only have any effect if a driver replaces the chip->ecc.read_oob
function with something that performs ECC operations independently and
increments the ECC counters.

And speaking of BBT in OOB:
Anyone know why the flash-based ident pattern and version is
"traditionally" stored in OOB? It was quite recently that Sebastian
Andrzej Siewior added the NAND_USE_FLASH_BBT_NO_OOB flag (which is
slated to be renamed/replaced, FYI). It seems like ...NO_OOB is a
generally good (or at least better) idea. Then we wouldn't even have
to worry much about ECC in OOB.

"read_bbt which ignore ecc bit flip/error"

If I understand right, read_bbt just prints warning message on ECC
flips/errors and otherwise ignores them? Would Matthieu's "quick and
dirty" patch be an OK start for fixing this? (where in the absence of
a valid backup tableb, an ECC error causes a flash rescan and an ECC
bitflip causes an erase/rewrite "scrub"?)

"The bbt should be protected with CRC and if it gets corrupted we
should re-scan the flash and re-create it."

Wouldn't CRC just be a lesser replacement for proper ECC protection?
Or am I missing something?

Brian
Artem Bityutskiy - June 24, 2011, 7:55 p.m.
On Thu, 2011-06-23 at 09:36 -0700, Brian Norris wrote:
> I am interested in Artem's comments on the robustness of flash-based
> BBT (here, and more recently on
> http://lists.infradead.org/pipermail/linux-mtd/2011-June/036557.html).
> I recently have moved to using flash-based BBT (in-band, actually),
> and it seemed like several NAND drivers use flash-based BBT as well.
> Is it really that un-trustworthy?

If you confirm that everything is great and robust when the on-flash BBT
gets corrupted - the NAND core for sure notices any corruptions and
falls-back to the traditional scanning method and restores the on-flash
BBT - then I apologize for saying that I do not trust it. Also, I do not
really know the details of this, so I may be completely wrong.

> "The bbt should be protected with CRC and if it gets corrupted we
> should re-scan the flash and re-create it."
> 
> Wouldn't CRC just be a lesser replacement for proper ECC protection?
> Or am I missing something?

I'd say ECC and CRC play different roles. ECC is about handling NAND
PITAs like read/write/erase disturb, and CRC is about noticing any
corruption and recover, instead of reporting inaccurate information up -
e.g., reporting a good block as bad.
Matthew L. Creech - June 24, 2011, 8:36 p.m.
On Fri, Jun 24, 2011 at 3:55 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> If you confirm that everything is great and robust when the on-flash BBT
> gets corrupted - the NAND core for sure notices any corruptions and
> falls-back to the traditional scanning method and restores the on-flash
> BBT - then I apologize for saying that I do not trust it. Also, I do not
> really know the details of this, so I may be completely wrong.
>

I'm still trying to pinpoint the cause of the BBT corruptions that I
encountered, but _if_ they were caused by bit-flips, then it
definitely appears that your initial statement was accurate: the
devices were in many cases unbootable because the corrupted BBT was
treated as legit.

This also seems to be the case when looking at the code in nand_bbt.c:
if an ECC error is encountered in read_bbt(), it prints a warning but
then continues to process the BBT data.  Maybe there's an additional
assumption that I'm overlooking, but at a glance it seems like that
would allow bit errors to accumulate and wrongly flag good blocks as
bad.

Patch

diff --git a/drivers/mtd/nand/nand_bbt.c b/drivers/mtd/nand/nand_bbt.c
index 5fedf4a..7b85b54 100644
--- a/drivers/mtd/nand/nand_bbt.c
+++ b/drivers/mtd/nand/nand_bbt.c
@@ -171,7 +171,7 @@  static int check_short_pattern(uint8_t *buf, struct nand_bbt_descr *td)
 static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num,
 		    int bits, int offs, int reserved_block_code)
 {
-	int res, i, j, act = 0;
+	int res, i, j, act = 0, ret = 0;
 	struct nand_chip *this = mtd->priv;
 	size_t retlen, len, totlen;
 	loff_t from;
@@ -188,6 +188,12 @@  static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num,
 				printk(KERN_INFO "nand_bbt: Error reading bad block table\n");
 				return res;
 			}
+			if (res != -EUCLEAN) {
+				printk(KERN_INFO "nand_bbt: Error reading bad block table2\n");
+				return res;
+			}
+			/* inform caller that there is bit flips */
+			ret |= res;
 			printk(KERN_WARNING "nand_bbt: ECC error while reading bad block table\n");
 		}
 
@@ -220,7 +226,7 @@  static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num,
 		totlen -= len;
 		from += len;
 	}
-	return 0;
+	return ret;
 }
 
 /**
@@ -900,7 +906,20 @@  static int check_create(struct mtd_info *mtd, uint8_t *buf, struct nand_bbt_desc
 	writecheck:
 		/* read back first ? */
 		if (rd)
-			read_abs_bbt(mtd, buf, rd, chipsel);
+			res = read_abs_bbt(mtd, buf, rd, chipsel);
+		if (!rd2) {
+			if (res == -EBADMSG) {
+				/* bad recreate it */
+				rd = NULL;
+				writeops = 0x03;
+				goto create;
+			}
+			else if (!rd2 && res == -EUCLEAN) {
+				/* rewrite it */
+				writeops = 0x03;
+			}
+		}
+
 		/* If they weren't versioned, read both. */
 		if (rd2)
 			read_abs_bbt(mtd, buf, rd2, chipsel);