Message ID | 4DB06693.5000100@parrot.com |
---|---|
State | New, archived |
Headers | show |
On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote: > } > + if (res != -EUCLEAN) { > + printk(KERN_INFO "nand_bbt: Error reading bad block table2\n"); > + return res; > + } Just a side note: I tend to think that we need to veto all mtd patches which touch/add prints and force people to first turn all the ugly MTD printing mess into dev_dbg/dev_err/etc messages... But I cannot actually propose it before I clean up UBI/UBIFS in this respect :-)))
On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote: > Here a quick and dirty patch to make them more robust. > Any comment are welcomed. Would be great if you could test it and submit a nice patch with proper commit message and Signed-off-by.
Hello, On Fri, Apr 22, 2011 at 1:15 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > On Thu, 2011-04-21 at 19:17 +0200, Matthieu CASTET wrote: >> Here a quick and dirty patch to make them more robust. >> Any comment are welcomed. > > Would be great if you could test it and submit a nice patch with proper > commit message and Signed-off-by. I am interested in Artem's comments on the robustness of flash-based BBT (here, and more recently on http://lists.infradead.org/pipermail/linux-mtd/2011-June/036557.html). I recently have moved to using flash-based BBT (in-band, actually), and it seemed like several NAND drivers use flash-based BBT as well. Is it really that un-trustworthy? So I guess I'm looking for areas of improvement. I see a few suggestions here: "Also the pattern and version in oob isn't protected by ecc. They can be corrupted." I noticed this one recently. My hardware ECC actually *can* do ECC for what little OOB is actually free, but I realized that the nand_base code doesn't even try to check the ECC stats (in nand_do_read_oob()) whereas some similar code for nand_do_read_ops *does* check the ECC stats. Is it fair to adapt the code from nand_do_read_ops to nand_do_read_oob so that both check for ECC errors, just in case the hardware supports it? Would this cause any API problems, if users didn't expect OOB reads to return ECC error statuses? For now, this would only have any effect if a driver replaces the chip->ecc.read_oob function with something that performs ECC operations independently and increments the ECC counters. And speaking of BBT in OOB: Anyone know why the flash-based ident pattern and version is "traditionally" stored in OOB? It was quite recently that Sebastian Andrzej Siewior added the NAND_USE_FLASH_BBT_NO_OOB flag (which is slated to be renamed/replaced, FYI). It seems like ...NO_OOB is a generally good (or at least better) idea. Then we wouldn't even have to worry much about ECC in OOB. "read_bbt which ignore ecc bit flip/error" If I understand right, read_bbt just prints warning message on ECC flips/errors and otherwise ignores them? Would Matthieu's "quick and dirty" patch be an OK start for fixing this? (where in the absence of a valid backup tableb, an ECC error causes a flash rescan and an ECC bitflip causes an erase/rewrite "scrub"?) "The bbt should be protected with CRC and if it gets corrupted we should re-scan the flash and re-create it." Wouldn't CRC just be a lesser replacement for proper ECC protection? Or am I missing something? Brian
On Thu, 2011-06-23 at 09:36 -0700, Brian Norris wrote: > I am interested in Artem's comments on the robustness of flash-based > BBT (here, and more recently on > http://lists.infradead.org/pipermail/linux-mtd/2011-June/036557.html). > I recently have moved to using flash-based BBT (in-band, actually), > and it seemed like several NAND drivers use flash-based BBT as well. > Is it really that un-trustworthy? If you confirm that everything is great and robust when the on-flash BBT gets corrupted - the NAND core for sure notices any corruptions and falls-back to the traditional scanning method and restores the on-flash BBT - then I apologize for saying that I do not trust it. Also, I do not really know the details of this, so I may be completely wrong. > "The bbt should be protected with CRC and if it gets corrupted we > should re-scan the flash and re-create it." > > Wouldn't CRC just be a lesser replacement for proper ECC protection? > Or am I missing something? I'd say ECC and CRC play different roles. ECC is about handling NAND PITAs like read/write/erase disturb, and CRC is about noticing any corruption and recover, instead of reporting inaccurate information up - e.g., reporting a good block as bad.
On Fri, Jun 24, 2011 at 3:55 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > If you confirm that everything is great and robust when the on-flash BBT > gets corrupted - the NAND core for sure notices any corruptions and > falls-back to the traditional scanning method and restores the on-flash > BBT - then I apologize for saying that I do not trust it. Also, I do not > really know the details of this, so I may be completely wrong. > I'm still trying to pinpoint the cause of the BBT corruptions that I encountered, but _if_ they were caused by bit-flips, then it definitely appears that your initial statement was accurate: the devices were in many cases unbootable because the corrupted BBT was treated as legit. This also seems to be the case when looking at the code in nand_bbt.c: if an ECC error is encountered in read_bbt(), it prints a warning but then continues to process the BBT data. Maybe there's an additional assumption that I'm overlooking, but at a glance it seems like that would allow bit errors to accumulate and wrongly flag good blocks as bad.
diff --git a/drivers/mtd/nand/nand_bbt.c b/drivers/mtd/nand/nand_bbt.c index 5fedf4a..7b85b54 100644 --- a/drivers/mtd/nand/nand_bbt.c +++ b/drivers/mtd/nand/nand_bbt.c @@ -171,7 +171,7 @@ static int check_short_pattern(uint8_t *buf, struct nand_bbt_descr *td) static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num, int bits, int offs, int reserved_block_code) { - int res, i, j, act = 0; + int res, i, j, act = 0, ret = 0; struct nand_chip *this = mtd->priv; size_t retlen, len, totlen; loff_t from; @@ -188,6 +188,12 @@ static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num, printk(KERN_INFO "nand_bbt: Error reading bad block table\n"); return res; } + if (res != -EUCLEAN) { + printk(KERN_INFO "nand_bbt: Error reading bad block table2\n"); + return res; + } + /* inform caller that there is bit flips */ + ret |= res; printk(KERN_WARNING "nand_bbt: ECC error while reading bad block table\n"); } @@ -220,7 +226,7 @@ static int read_bbt(struct mtd_info *mtd, uint8_t *buf, int page, int num, totlen -= len; from += len; } - return 0; + return ret; } /** @@ -900,7 +906,20 @@ static int check_create(struct mtd_info *mtd, uint8_t *buf, struct nand_bbt_desc writecheck: /* read back first ? */ if (rd) - read_abs_bbt(mtd, buf, rd, chipsel); + res = read_abs_bbt(mtd, buf, rd, chipsel); + if (!rd2) { + if (res == -EBADMSG) { + /* bad recreate it */ + rd = NULL; + writeops = 0x03; + goto create; + } + else if (!rd2 && res == -EUCLEAN) { + /* rewrite it */ + writeops = 0x03; + } + } + /* If they weren't versioned, read both. */ if (rd2) read_abs_bbt(mtd, buf, rd2, chipsel);