Patchwork state of support for "external ECC hardware"

Submitter Christopher Harvey
Date Nov. 8, 2012, 3:21 p.m.
Message ID <20121108152125.GR2389@harvey-pc.matrox.com>
Permalink /patch/197850/
State New

Comments

Christopher Harvey - Nov. 8, 2012, 3:21 p.m.
On Thu, Nov 08, 2012 at 12:02:27PM +0100, Gerlando Falauto wrote:
> Hi Chris,
> 
> good to hear we're not alone in this thinking... :-)
> We're now facing the exact same issue as some Micron NAND chips (most 
> likely the same one you're dealing with) can no longer live with the 
> default, simple 1-bit ECC implementation used by default 
> (NAND_ECC_SOFT), I guess because chances of having multiple bitflips 
> within the same page are no longer negligible. So some 4-bit ECC 
> mechanism must be used at the very least.

We had BCH8 code running, but it wasn't enough. The main reason we
switched away from host side ECC was because we were getting bitflips
within the ECC codeword data itself. Yes, it would have been possible
to add a 1-byte Hamming code to protect the main ECC data, but it was
just easier to say, "hey, Micron knows their hardware, so we'll trust
their algorithms", and enable the Micron ECC hardware. Although it
didn't require too much work to enable, it's all a total hack. I took
the code that runs the "ECC disabled mode", and sprinkled in some
extra init code and error checking code. Would be nice to add an
"external ecc mode" to support these chips explicitly.

> Support for a software-based, multiple-bit-resilient ECC mechanism (BCH) 
> was posted (http://lwn.net/Articles/426856/) by Ivan Djelic (whom I 
> took the liberty to Cc:) and merged in March last year.
> I haven't been able to track how the situation evolved, but apparently 
> you need to enable it not only in the kernel configuration but 
> also within your flash controller setup.
> Micron gives an example of how to enable it on a sample NAND host 
> controller S3C6410 in this TN (rest of the code, mainly from the above 
> patch, would be already present in recent kernels):
> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2971_software_bch_ecc_on_linux.pdf 

I haven't looked into current software ECC algorithms in the
kernel. Do they protect against corrupted ECC data? As in, corruptions
in the out-of-band (OOB) area?

> As for hardware-based (or on-die) ECC support, one of the application 
> notes from Micron (TN-29-56 Enabling On-Die ECC for OMAP3 on 
> Linux/Android OS, 
> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2956_ondie_ecc_omap3_linux.pdf) 
> shows how to enable that (rather, it shows how to disable software ECC 
> altogether after enabling it on the chip). However, I haven't been able 
> to find a code section where the information returned by the chip 
> ("Rewrite recommended") is actually used to solicit scrubbing... Neither 
> on the TN, nor on the upstream linux kernel... My next step would be to 
> give it a go and see what happens.

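I got that working. If you're running in "ECC disabled mode", try something like this:

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index a796dd7..68af8b0 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -1069,7 +1069,16 @@ static int nand_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
                               uint8_t *buf, int page)
 {
        chip->read_buf(mtd, buf, mtd->writesize);
-       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize);
+       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); /* (this data used in sw ecc) */
+
+       /* TODO: only do this CMD_STATUS if we have Micron NAND */
+       chip->cmdfunc(mtd, NAND_CMD_STATUS, -1, -1);
+       if (chip->read_byte(mtd) & NAND_STATUS_REWRITE_RECOMMENDED) {
+               /* Micron NAND is telling us that this block may be going bad,
+                  tell Linux to move it */
+               mtd->ecc_stats.corrected++; /* (we don't actually know if it's just one correction, could be up to 4) */
+       }
+
        return 0;
 }
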
Ran a trace on some manually inserted bitflips, and the block was moved.

Hope that helps.

> I'd love to hear some feedback, if anyone has had experience with this.
> I know it's not been a long time since your post, but perhaps you've 
> heard something in the meantime?
> 
> I have one additional question though. Looking at the code I got the 
> impression that decisions upon ECC seem to be based on the flash 
> controller rather than on the flash chip itself.
> I mean, I would think of having a default 1-bit NAND_ECC_SOFT 
> implementation; only when it is detected that the flash part either 
> supports HW ECC or requires multiple-bit ECC, should the ECC mode get 
> switched to NAND_ECC_NONE or NAND_ECC_SOFT_BCH respectively.
> No matter what the flash controller, I would say.
> 
> Ivan, do you think that makes any sense?
> 
> Thank you so much!
> Gerlando
> 
> On 10/29/2012 09:42 PM, Christopher Harvey wrote:
> > I know of at least one Micron NAND chip that has the ability to handle
> > ECC completely on the NAND chip itself. All the host has to do is send
> > data and the OOB section is updated automatically. The automatic ECC
> > hardware can be enabled and disabled with the "Set Feature" command,
> > (0xEF) and bit flips are reported via get status after page reads. I
> > don't see support for this in 2.6.37, and a quick check in the logs
> > doesn't show anything new for these chips in the latest version of the
> > kernel. Any idea floating around on this list? Are these chips going
> > to be the future for NAND and does Linux care about them?
> >
> > thanks,
> > Chris
> >
Gerlando Falauto - Nov. 8, 2012, 4:32 p.m.
Hi Chris,

first of all thanks for answering this quick!

On 11/08/2012 04:21 PM, Christopher Harvey wrote:
> On Thu, Nov 08, 2012 at 12:02:27PM +0100, Gerlando Falauto wrote:
>> Hi Chris,
>>
>> good to hear we're not alone in this thinking... :-)
>> We're now facing the exact same issue as some Micron NAND chips (most
>> likely the same one you're dealing with) can no longer live with the
>> default, simple 1-bit ECC implementation used by default
>> (NAND_ECC_SOFT), I guess because chances of having multiple bitflips
>> within the same page are no longer negligible. So some 4-bit ECC
>> mechanism must be used at the very least.
>
> We had BCH8 code running, but it wasn't enough. The main reason we
> switched away from host side ECC was because we were getting bitflips
> within the ECC codeword data itself.

Wow... I mean, I figured it wouldn't be that easy to (purposely) get 
bitflips in any area. I wonder what kind of test you managed to come up 
with in order to get bitflips within the ECC area itself.
In my case it takes several hours (of continuous reads) to get a single 
bitflip within a 1Gb (128MB) flash.

> Yes, it would have been possible
> to add a 1 byte hamming code to protect the main ECC data,

I'd have thought the algorithm would take care of that itself. Adding a 
further level of ECC seems a bit unnatural, at least to me.

> but it was
> just easier to say, "hey, Micron knows their hardware, so we'll trust
> their algorithms", and enable the Micron ECC hardware. Although it
> didn't require too much work to enable it's all a total hack. I took
> the code that runs the "ECC disabled mode", and sprinkled in some
> extra init code and error checking code. Would be nice to add an
> "external ecc mode" to support these chips explicitly.

That was sort of my point below. Would be nice to know whether there is 
any ongoing work on this.

>
>> Support for software-based multiple-bit-resilient ECC mechanism (BCH)
>> was posted (http://lwn.net/Articles/426856/) by Ivan Djelic (which I
>> took liberty to Cc:) and merged in March last year.
>> I haven't been able to track how the situation evolved, but apparently
>> you need to enable it (in addition to within the kernel configuration),
>> also within your flash controller setup.
>> Micron gives an example of how to enable it on a sample NAND host
>> controller S3C6410 in this TN (rest of the code, mainly from the above
>> patch, would be already present in recent kernels):
>> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2971_software_bch_ecc_on_linux.pdf
>
> I haven't looked into current software ECC algorithms in the
> kernel. Do they protect against corrupted ECC data? As in, corruptions
> in the out-of-band (OOB) area?

I sort of assumed that BCH would take care of that, but I understand you 
are stating the opposite.

>> As for hardware-based (or on-die) ECC support, one of the application
>> notes from Micron (TN-29-56 Enabling On-Die ECC for OMAP3 on
>> Linux/Android OS,
>> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2956_ondie_ecc_omap3_linux.pdf)
>> shows how to enable that (rather, it shows how to disable software ECC
>> altogether after enabling it on the chip). However, I haven't been able
>> to find a code section where the information returned by the chip
>> ("Rewrite recommended") is actually used to solicit scrubbing... Neither
>> on the TN, nor on the upstream linux kernel... My next step would be to
>> give it a go and see what happens.
>
> I got that working. If you're running in "ECC disabled mode", try something like this:
>
> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> index a796dd7..68af8b0 100644
> --- a/drivers/mtd/nand/nand_base.c
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -1069,7 +1069,16 @@ static int nand_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
>                                uint8_t *buf, int page)
>   {
>          chip->read_buf(mtd, buf, mtd->writesize);
> -       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize);
> +       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); /* (this data used in sw ecc) */
> +
> +       /* TODO: only do this CMD_STATUS if we have Micron NAND */
> +       chip->cmdfunc(mtd, NAND_CMD_STATUS, -1, -1);
> +       if (chip->read_byte(mtd) & NAND_STATUS_REWRITE_RECOMMENDED) {
> +               /* Micron NAND is telling us that this block may be going bad,
> +                  tell Linux to move it */
> +               mtd->ecc_stats.corrected++; /* (we don't actually know if it's just one correction, could be up to 4) */

Right, datasheets and TNs don't even mention what the threshold actually 
is. They just say "Rewrite recommended". Perhaps you could get some 
feeling while running your tests? I mean, if you could get bitflips by 
using host-software ECC (within a reasonable time), and after enabling 
on-die ECC you couldn't anymore, it probably means HW ECC won't tell you 
about bitflips until they reach a number higher than 1. Am I right?

[Did you ever ask Micron by any chance?]

> +       }
> +
>          return 0;
>   }
>

I was going down the way pointed out by Micron in their TN, that is 
hacking into nand_read_page_hwecc(). But I like your approach more.

> Ran a trace on some manually inserted bitflips, and the block was moved.

Could you give some pointers on how to manually insert bitflips?
nanddump/nandwrite from mtd-utils perhaps?

> Hope that helps.

Yep, it does help a great deal! Thanks a bunch!

Gerlando

Gerlando Falauto - Nov. 8, 2012, 4:37 p.m.
Hi Chris,
On 11/08/2012 05:32 PM, Gerlando Falauto wrote:
>
>> Ran a trace on some manually inserted bitflips, and the block was moved.
>
> Could you give some pointers on how to manually insert bitflips?
> nanddump/nandwrite from mtd-utils perhaps?

And BTW, wouldn't you also need to explicitly disable on-die ECC in 
order to force that, anyway?

Thanks again!
Gerlando
Christopher Harvey - Nov. 8, 2012, 5:02 p.m.
On Thu, Nov 08, 2012 at 05:32:27PM +0100, Gerlando Falauto wrote:
> Hi Chris,
> 
> first of all thanks for answering this quick!
> 
> On 11/08/2012 04:21 PM, Christopher Harvey wrote:
> > On Thu, Nov 08, 2012 at 12:02:27PM +0100, Gerlando Falauto wrote:
> >> Hi Chris,
> >>
> >> good to hear we're not alone in this thinking... :-)
> >> We're now facing the exact same issue as some Micron NAND chips (most
> >> likely the same one you're dealing with) can no longer live with the
> >> default, simple 1-bit ECC implementation used by default
> >> (NAND_ECC_SOFT), I guess because chances of having multiple bitflips
> >> within the same page are no longer negligible. So some 4-bit ECC
> >> mechanism must be used at the very least.
> >
> > We had BCH8 code running, but it wasn't enough. The main reason we
> > switched away from host side ECC was because we were getting bitflips
> > within the ECC codeword data itself.
> 
> > Wow... I mean, I figured it wouldn't be that easy to (purposely) get 
> bitflips in any area, I wonder what kind of test you managed to come up 
> with in order to get bitflips within the ECC area itself.
> In my case it takes several hours (of continuous reads) to get a single 
> bitflip within a 1Gb (128MB) flash.

I was surprised too. I was seeing about 30 bitflips per 512MB. Running
at about 1/3 of max bus speed. No error codes on write.

Micron never said that was abnormal for our chip.

> > Yes, it would have been possible
> > to add a 1 byte hamming code to protect the main ECC data,
> 
> I'd have thought the algorithm would take care of that itself. Adding a 
> further level of ECC seems a bit unnatural, at least to me.

I don't know the details of BCH, but apparently not. I asked Micron if
the OOB area was safer to write to, and they said no. Can somebody on
this list confirm this?

> > but it was
> > just easier to say, "hey, Micron knows their hardware, so we'll trust
> > their algorithms", and enable the Micron ECC hardware. Although it
> > didn't require too much work to enable it's all a total hack. I took
> > the code that runs the "ECC disabled mode", and sprinkled in some
> > extra init code and error checking code. Would be nice to add an
> > "external ecc mode" to support these chips explicitly.
> 
> That was sort of my point below. Would be nice to know whether there is 
> some ongoing work for that matter.
> 
> >
> >> Support for software-based multiple-bit-resilient ECC mechanism (BCH)
> >> was posted (http://lwn.net/Articles/426856/) by Ivan Djelic (which I
> >> took liberty to Cc:) and merged in March last year.
> >> I haven't been able to track how the situation evolved, but apparently
> >> you need to enable it (in addition to within the kernel configuration),
> >> also within your flash controller setup.
> >> Micron gives an example of how to enable it on a sample NAND host
> >> controller S3C6410 in this TN (rest of the code, mainly from the above
> >> patch, would be already present in recent kernels):
> >> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2971_software_bch_ecc_on_linux.pdf
> >
> > I haven't looked into current software ECC algorithms in the
> > > kernel. Do they protect against corrupted ECC data? As in, corruptions
> > > in the out-of-band (OOB) area?
> 
> I sort of assumed that BCH would take care of that, but I understand you 
> are stating the opposite.
> 
> >> As for hardware-based (or on-die) ECC support, one of the application
> >> notes from Micron (TN-29-56 Enabling On-Die ECC for OMAP3 on
> >> Linux/Android OS,
> >> http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2956_ondie_ecc_omap3_linux.pdf)
> >> shows how to enable that (rather, it shows how to disable software ECC
> >> altogether after enabling it on the chip). However, I haven't been able
> >> to find a code section where the information returned by the chip
> >> ("Rewrite recommended") is actually used to solicit scrubbing... Neither
> >> on the TN, nor on the upstream linux kernel... My next step would be to
> >> give it a go and see what happens.
> >
> > > I got that working. If you're running in "ECC disabled mode", try something like this:
> >
> > diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> > index a796dd7..68af8b0 100644
> > --- a/drivers/mtd/nand/nand_base.c
> > +++ b/drivers/mtd/nand/nand_base.c
> > @@ -1069,7 +1069,16 @@ static int nand_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
> >                                uint8_t *buf, int page)
> >   {
> >          chip->read_buf(mtd, buf, mtd->writesize);
> > -       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize);
> > +       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); /* (this data used in sw ecc) */
> > +
> > +       /* TODO: only do this CMD_STATUS if we have Micron NAND */
> > +       chip->cmdfunc(mtd, NAND_CMD_STATUS, -1, -1);
> > > +       if (chip->read_byte(mtd) & NAND_STATUS_REWRITE_RECOMMENDED) {
> > +               /* Micron NAND is telling us that this block may be going bad,
> > +                  tell Linux to move it */
> > +               mtd->ecc_stats.corrected++; /* (we don't actually know if it's just one correction, could be up to 4) */
> 
> Right, datasheets and TNs don't even mention what the threshold actually 
> is. They just say "Rewrite recommended". Perhaps you could get some 
> feeling while running your tests? I mean, if you could get bitflips by 
> using host-software ECC (within a reasonable time), and after enabling 
> on-die ECC you couldn't anymore, it probably means HW ECC won't tell you 
> about bitflips until they reach a number higher than 1. Am I right?
> 
> [Did you ever ask Micron by any chance?]

Yeah, I asked but I don't remember the answer. I tested with 3 bit
flips in a block and didn't see the rewrite recommended bit. 4 did the
trick. I didn't go any higher.

> > +       }
> > +
> >          return 0;
> >   }
> >
> 
> I was going down the way pointed out by Micron in their TN, that is 
> hacking into nand_read_page_hwecc(). But I like your approach more.
> 
> > Ran a trace on some manually inserted bitflips, and the block was moved.
> 
> Could you give some pointers on how to manually insert bitflips?
> nanddump/nandwrite from mtd-utils perhaps?

I had 2 kernels in NAND, one that enabled Micron ECC and one that
didn't. I booted the board over NFS, then used nanddump and a hex
editor to put 4 bit flips in a file of AAAAAAAAAA's (on UBIFS). After
doing a nandwrite and making sure Micron didn't update its ECC data, I
rebooted and enabled Micron ECC. When doing 'cat the_aaaa_file', I
was able to watch the UBIFS debug statements say it moved one PEB to
another. After dumping the new PEB, I was able to see the original
AAAA's and the new ECC data. Also, reading the ECC stats said there
was one bitflip detected. Be sure to completely power cycle your NAND
(not just a reset signal), because the Micron ECC enabled bit is
persistent.
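
The hex-editor step above amounts to flipping a few chosen bits in a dumped page image and then checking how many bits differ from the original. A minimal sketch of that idea in plain C, assuming in-memory buffers rather than real nanddump/nandwrite files; the helper names flip_bit and count_bitflips are hypothetical and not part of mtd-utils:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flip one bit in a page image; bit_index counts from bit 0 of byte 0. */
static void flip_bit(uint8_t *buf, size_t len, size_t bit_index)
{
    assert(bit_index / 8 < len);
    buf[bit_index / 8] ^= (uint8_t)(1u << (bit_index % 8));
}

/* Count the bits that differ between two images of the same length. */
static unsigned count_bitflips(const uint8_t *a, const uint8_t *b, size_t len)
{
    unsigned flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = a[i] ^ b[i];

        /* Clear the lowest set bit until none remain. */
        while (diff) {
            diff &= (uint8_t)(diff - 1);
            flips++;
        }
    }
    return flips;
}
```

Filling a buffer with 'A' via memset, copying it, flipping four distinct bits in the copy and comparing the two should report four bitflips, which matches the count used in the test described above.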

> > Hope that helps.
> 
> Yep, it does help a great deal! Thanks a bunch!
> 
> Gerlando
> 
> 
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
Christopher Harvey - Nov. 8, 2012, 5:03 p.m.
On Thu, Nov 08, 2012 at 05:37:17PM +0100, Gerlando Falauto wrote:
> Hi Chris,
> On 11/08/2012 05:32 PM, Gerlando Falauto wrote:
> >
> >> Ran a trace on some manually inserted bitflips, and the block was moved.
> >
> > Could you give some pointers on how to manually insert bitflips?
> > nanddump/nandwrite from mtd-utils perhaps?
> 
> And BTW, wouldn't you also need to explicitly disable on-die ECC in 
> order to force that, anyway?

Depends on the version I think. IIRC, some are "enabled by default",
others are "disabled by default".
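
Since the power-on default seems to differ between parts, one option is to set the mode explicitly at init time with the Set Features command (0xEF) mentioned earlier in this thread. A minimal sketch, assuming a hypothetical mock bus that only records the bytes the host would send; the feature address 0x90 and the enable bit 0x08 are assumptions to be checked against the datasheet of the actual part, not values taken from this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set Features opcode, as mentioned in this thread. */
#define CMD_SET_FEATURES        0xEF
/* ASSUMPTION: feature address and enable bit; verify against the datasheet. */
#define FEAT_ADDR_ARRAY_MODE    0x90
#define FEAT_ON_DIE_ECC_ENABLE  0x08

/* Hypothetical bus that only records what the host would send. */
struct mock_bus {
    uint8_t bytes[8];
    size_t  len;
};

static void bus_put(struct mock_bus *bus, uint8_t byte)
{
    assert(bus->len < sizeof(bus->bytes));
    bus->bytes[bus->len++] = byte;
}

/* Queue the sequence: opcode, feature address, four parameter bytes. */
static void ondie_ecc_set(struct mock_bus *bus, int enable)
{
    bus_put(bus, CMD_SET_FEATURES);
    bus_put(bus, FEAT_ADDR_ARRAY_MODE);
    bus_put(bus, enable ? FEAT_ON_DIE_ECC_ENABLE : 0x00);
    bus_put(bus, 0x00);
    bus_put(bus, 0x00);
    bus_put(bus, 0x00);
}
```

In a real driver the recorded bytes would go out through the controller's command, address and data cycles, followed by whatever busy wait the part requires.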

-C
Ivan Djelic - Nov. 8, 2012, 6:59 p.m.
On Thu, Nov 08, 2012 at 03:21:25PM +0000, Christopher Harvey wrote:
(...) 
> We had BCH8 code running, but it wasn't enough. The main reason we
> switched away from host side ECC was because we were getting bitflips
> within the ECC codeword data itself.

But the ECC bytes are part of the BCH codeword, therefore I don't understand
what the issue could be ? Are you sure bitflips were not in some unprotected
OOB area ?

> Yes, it would have been possible
> to add a 1 byte hamming code to protect the main ECC data, but it was
> just easier to say, "hey, Micron knows their hardware, so we'll trust
> their algorithms", and enable the Micron ECC hardware. Although it
> didn't require too much work to enable it's all a total hack. I took
> the code that runs the "ECC disabled mode", and sprinkled in some
> extra init code and error checking code. Would be nice to add an
> "external ecc mode" to support these chips explicitly.
> 
> > Support for software-based multiple-bit-resilient ECC mechanism (BCH) 
> > was posted (http://lwn.net/Articles/426856/) by Ivan Djelic (which I 
> > took liberty to Cc:) and merged in March last year.
> > I haven't been able to track how the situation evolved, but apparently 
> > you need to enable it (in addition to within the kernel configuration), 
> > also within your flash controller setup.
> > Micron gives an example of how to enable it on a sample NAND host 
> > controller S3C6410 in this TN (rest of the code, mainly from the above 
> > patch, would be already present in recent kernels):
> > http://www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2971_software_bch_ecc_on_linux.pdf 
> 
> I haven't looked into current software ECC algorithms in the
> kernel. Do they protect against corrupted ECC data? As in, corruptions
> in the out-of-band (OOB) area?

Yes, BCH ECC works by generating a codeword containing data+ecc bytes.
Errors can be detected and corrected in any location of the codeword (data and ecc).
Note that in practice, we are interested in actually fixing errors in data only (not ecc).
When an error is detected in ECC bytes, it must simply be reported to trigger block scrubbing.

The current software BCH implementation in MTD protects the page data area (and ecc bytes).
It does not protect additional bytes in the OOB area (like the Micron on-die ECC does),
but since the BCH library is not limited to any particular size, a simple patch could achieve this.
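
The point that parity bits are just more positions in the codeword can be illustrated with a much smaller code. The sketch below uses Hamming(7,4) as a stand-in: it is not BCH and not the code in lib/bch.c, and all function names are hypothetical. The syndrome gives the position of a single flipped bit whether that position holds data or parity:

```c
#include <assert.h>
#include <stdint.h>

/* Codeword bits live at positions 1..7 (bit 0 unused).
 * Parity sits at positions 1, 2, 4; data at 3, 5, 6, 7. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d3 = data & 1, d5 = (data >> 1) & 1;
    uint8_t d6 = (data >> 2) & 1, d7 = (data >> 3) & 1;
    uint8_t p1 = d3 ^ d5 ^ d7;   /* covers positions 1, 3, 5, 7 */
    uint8_t p2 = d3 ^ d6 ^ d7;   /* covers positions 2, 3, 6, 7 */
    uint8_t p4 = d5 ^ d6 ^ d7;   /* covers positions 4, 5, 6, 7 */

    return (uint8_t)(p1 << 1 | p2 << 2 | d3 << 3 | p4 << 4 |
                     d5 << 5 | d6 << 6 | d7 << 7);
}

/* XOR of the positions of all set bits: 0 for a valid codeword,
 * otherwise the position of a single flipped bit. */
static unsigned hamming74_syndrome(uint8_t cw)
{
    unsigned s = 0;

    for (unsigned pos = 1; pos <= 7; pos++)
        if (cw & (1u << pos))
            s ^= pos;
    return s;
}

/* Correct a single-bit error anywhere in the codeword and report where it was. */
static uint8_t hamming74_correct(uint8_t cw, unsigned *error_pos)
{
    unsigned s = hamming74_syndrome(cw);

    if (error_pos)
        *error_pos = s;
    if (s)
        cw ^= (uint8_t)(1u << s);
    return cw;
}

static int is_parity_position(unsigned pos)
{
    return pos == 1 || pos == 2 || pos == 4;
}
```

A caller that only cares about the data can ignore a correction whose position is a parity position, but should still report it so that the block gets scrubbed, as described above.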
 
BR,
Ivan Djelic - Nov. 8, 2012, 7:07 p.m.
On Thu, Nov 08, 2012 at 04:32:27PM +0000, Gerlando Falauto wrote:
(...)
> Right, datasheets and TNs don't even mention what the threshold actually 
> is. They just say "Rewrite recommended". Perhaps you could get some 
> feeling while running your tests? I mean, if you could get bitflips by 
> using host-software ECC (within a reasonable time), and after enabling 
> on-die ECC you couldn't anymore, it probably means HW ECC won't tell you 
> about bitflips until they reach a number higher than 1. Am I right?
> 
> [Did you ever ask Micron by any chance?]

IIRC, Micron on-die ECC reports a "rewrite recommended" status when the
number of bitflips has reached the internal error correction capability
(4 in my case). In other words, a "rewrite recommended" means you should
rewrite the block ASAP before an additional bitflip triggers an ECC failure.
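
If "rewrite recommended" really means the correction capability has been reached, a read path could account for the worst case instead of a single corrected bit. A minimal sketch under that assumption; the status bit positions, the strength of 4 and the struct and function names are all hypothetical and need to be checked against the datasheet of the actual part:

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: status bit positions; confirm against the datasheet. */
#define STATUS_FAIL                 0x01
#define STATUS_REWRITE_RECOMMENDED  0x08
/* Correction capability reported above for one particular part. */
#define ONDIE_ECC_STRENGTH          4

/* Stand-in for the two counters of interest in the MTD ECC statistics. */
struct ecc_stats {
    unsigned corrected;
    unsigned failed;
};

/* Returns the worst-case number of corrected bitflips for this read,
 * or -1 if the chip reported an uncorrectable error. */
static int account_read_status(uint8_t status, struct ecc_stats *stats)
{
    if (status & STATUS_FAIL) {
        stats->failed++;
        return -1;
    }
    if (status & STATUS_REWRITE_RECOMMENDED) {
        stats->corrected += ONDIE_ECC_STRENGTH;
        return ONDIE_ECC_STRENGTH;
    }
    return 0;
}
```

Returning the worst-case count lets the caller compare it against a scrubbing threshold rather than relying on the counter changing by one.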

--
Ivan
Christopher Harvey - Nov. 8, 2012, 7:22 p.m.
On Thu, Nov 08, 2012 at 07:59:42PM +0100, Ivan Djelic wrote:
> On Thu, Nov 08, 2012 at 03:21:25PM +0000, Christopher Harvey wrote:
> (...) 
> > We had BCH8 code running, but it wasn't enough. The main reason we
> > switched away from host side ECC was because we were getting bitflips
> > within the ECC codeword data itself.
> 
> But the ECC bytes are part of the BCH codeword, therefore I don't understand
> what the issue could be ? Are you sure bitflips were not in some unprotected
> OOB area ?

Ok, the ECC bytes I had were stored in the OOB area and were
unprotected. Any bit flip in the OOB area was a disaster. This was
coming from a heavily modified forked kernel that had BCH8 bugs in the
past. For example, I had to fix this one before the patch came out:
http://arago-project.org/git/projects/linux-omap3.git?p=projects/linux-omap3.git;a=commitdiff;h=adc46d691d745604da1197d154fe712e10ec468d;hp=9e78267ed6302537474489e88bd59827315db15b
I can't explain why this implementation fails on ECC byte corruption.

-Chris
Ivan Djelic - Nov. 8, 2012, 7:33 p.m.
On Thu, Nov 08, 2012 at 07:22:50PM +0000, Christopher Harvey wrote:
> On Thu, Nov 08, 2012 at 07:59:42PM +0100, Ivan Djelic wrote:
> > On Thu, Nov 08, 2012 at 03:21:25PM +0000, Christopher Harvey wrote:
> > (...) 
> > > We had BCH8 code running, but it wasn't enough. The main reason we
> > > switched away from host side ECC was because we were getting bitflips
> > > within the ECC codeword data itself.
> > 
> > But the ECC bytes are part of the BCH codeword, therefore I don't understand
> > what the issue could be ? Are you sure bitflips were not in some unprotected
> > OOB area ?
> 
> Ok, the ECC bytes I had were stored in the OOB area and were
> unprotected. Any bit flip in the OOB area was a disaster. This was
> coming from a heavily modified forked kernel that had BCH8 bugs in the
> past. For example, I had to fix this one before the patch came out:
> http://arago-project.org/git/projects/linux-omap3.git?p=projects/linux-omap3.git;a=commitdiff;h=adc46d691d745604da1197d154fe712e10ec468d;hp=9e78267ed6302537474489e88bd59827315db15b
> I can't explain why this implementation fails on ECC byte corruption.

Oooh, I think I understand now... I had very similar issues with some BCH8 code on an OMAP3630 board.
The error correction code was buggy, and would trip on errors located in ecc bytes.
Actually, this (and performance issues) is what pushed me into writing lib/bch.c :)
BR,
Ricard Wanderlof - Nov. 9, 2012, 8:46 a.m.
On Thu, 8 Nov 2012, Gerlando Falauto wrote:

>> We had BCH8 code running, but it wasn't enough. The main reason we
>> switched away from host side ECC was because we were getting bitflips
>> within the ECC codeword data itself.
>
> Wow... I mean, I figured it wouldn't be that easy to (purposely) get 
> bitflips in any area, I wonder what kind of test you managed to come up 
> with in order to get bitflips within the ECC area itself. In my case it 
> takes several hours (of continuous reads) to get a single bitflip within 
> a 1Gb (128MB) flash.

There are 1Gb flashes and 1Gb flashes. Depending on the technology used 
during manufacture (essentially the scale of the on-chip structures, 
usually specified as 'xxx nm technology') the bit error probabilities can 
vary.

"Traditional" 1Gb flashes where the manufacturer recommends 1-bit ECC in 
practice very rarely exhibit bit flips. I have seen bit flips in the OOB 
area as well as the main area (there was a bug in nand_ecc.c many years 
ago which didn't handle this correctly which is how I discovered what was 
going on); indeed there's nothing different about the OOB area in terms of 
bit flips, it's just another area of (the same type of) flash. The 
probability for the whole OOB area is of course less than for the rest as 
it is smaller, but it is the same per bit if I understand it correctly.

Some manufacturers (Micron for instance I believe) have started to deliver 
1 Gb chips using a higher density technology where they specify a 
requirement for 4-bit ECC. These naturally exhibit a much higher bitflip 
rate.

At any rate, the ECC algorithm itself should be able to take care of bit 
flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this 
by comparing the computed ECC with the actual ECC; if there's a difference 
of exactly one bit (rather than a more complex diff which after 
calculations points out the flipped bit in the main area), it is assumed 
that the bitflip is in the ECC area rather than the data. I don't know how 
BCH does this though.
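
The comparison described above can be sketched as a small classifier on the XOR of the stored and the recomputed ECC. This is a simplified illustration, not the code in nand_ecc.c: it assumes a hypothetical layout of 22 ECC bits arranged as 11 adjacent (even, odd) parity pairs, and relies on the property that a single flipped data bit changes exactly one bit of every pair:

```c
#include <assert.h>
#include <stdint.h>

enum ecc_verdict {
    ECC_OK,
    ECC_FLIP_IN_ECC,
    ECC_FLIP_IN_DATA,
    ECC_UNCORRECTABLE
};

/* ASSUMPTION: 22 ECC bits, pairs at bits (0,1), (2,3), ..., (20,21). */
#define ECC_BITS_MASK 0x3FFFFFu
#define PAIR_MASK     0x155555u   /* the even bit of each of the 11 pairs */

static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

static enum ecc_verdict classify(uint32_t stored_ecc, uint32_t computed_ecc)
{
    uint32_t diff = (stored_ecc ^ computed_ecc) & ECC_BITS_MASK;

    if (diff == 0)
        return ECC_OK;
    /* Exactly one differing bit: the flip is in the stored ECC itself. */
    if (popcount32(diff) == 1)
        return ECC_FLIP_IN_ECC;
    /* Exactly one bit set in every pair: a single data bit flipped. */
    if (((diff ^ (diff >> 1)) & PAIR_MASK) == PAIR_MASK)
        return ECC_FLIP_IN_DATA;
    return ECC_UNCORRECTABLE;
}
```

In the data-flip case the set bits of the difference also encode which bit to fix; that step is left out here to keep the sketch short.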

/Ricard
Gerlando Falauto - Nov. 12, 2012, 5:19 p.m.
Hi everyone,

first of all I am very grateful for your feedback. Thanks a lot to all 
of you!!

On 11/09/2012 09:46 AM, Ricard Wanderlof wrote:

> Some manufacturers (Micron for instance I believe) have started to deliver
> 1 Gb chips using a higher density technology where they specify a
> requirement for 4-bit ECC. These naturally exhibit a much higher bitflip
> rate.

Would there be any reason *NOT* to use 4-bit ECC with parts which do not 
require it? Apart from performance, of course.

I mean, we need to be as flexible as possible as far as hardware parts 
are concerned, as long as the basic requirements are met.
So we would like to have a single kernel which can run on different 
flash parts, past, present, and (as far as we can predict) future.
As pointed out within this thread, dynamic detection might be a bit 
tricky, so perhaps finding a common solution might be a good compromise.

> At any rate, the ECC algorithm itself should be able to take care of bit
> flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this
> by comparing the computed ECC with the actual ECC; if there's a difference
> of exactly one bit (rather than a more complex diff which after
> calculations points out the flipped bit in the main area), it is assumed
> that the bitflip is in the ECC area rather than the data. I don't know how
> BCH does this though.

Ivan, I came to understand (but I am not sure), that the implementation 
you provided (and currently mainlined) *DOES* handle this correctly. It 
was instead an old one which did not handle this properly. Is my 
understanding correct?

Thank you again,
Gerlando




Ivan Djelic - Nov. 12, 2012, 5:35 p.m.
On Mon, Nov 12, 2012 at 05:19:57PM +0000, Gerlando Falauto wrote:
(...)
> > At any rate, the ECC algorithm itself should be able to take care of bit
> > flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this
> > by comparing the computed ECC with the actual ECC; if there's a difference
> > of exactly one bit (rather than a more complex diff which after
> > calculations points out the flipped bit in the main area), it is assumed
> > that the bitflip is in the ECC area rather than the data. I don't know how
> > BCH does this though.
> 
> Ivan, I came to understand (but I am not sure), that the implementation 
> you provided (and currently mainlined) *DOES* handle this correctly. It 
> was instead an old one which did not handle this properly. Is my 
> understanding correct?

Yes, you are correct. In BCH ECC, there is no difference between data and ecc bytes; they are
all part of a larger codeword on which error correction is performed.
An old patch introducing BCH support in nand/omap2.c had a bug which was triggered when a bitflip
was detected in ecc bytes; but this has nothing to do with the way BCH algorithms work.
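
For contrast, the 1-bit check quoted above can be sketched roughly as follows. This is only an illustration, not the code from nand_ecc.c: classify_ecc_diff() and the enum names are made up here, and the full decode that locates a flipped data bit is left out.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum ecc_diff {
	ECC_NO_ERROR,		/* stored and computed ECC match */
	ECC_FLIP_IN_ECC,	/* exactly one bit differs: blame the ECC bytes */
	ECC_CHECK_DATA		/* anything else: run the full decode (not shown) */
};

/* Count set bits without relying on compiler builtins. */
static int count_bits(uint32_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Compare the ECC read from flash with the ECC computed over the data. */
static enum ecc_diff classify_ecc_diff(uint32_t stored, uint32_t computed)
{
	uint32_t diff = stored ^ computed;

	if (diff == 0)
		return ECC_NO_ERROR;
	if (count_bits(diff) == 1)
		return ECC_FLIP_IN_ECC;
	return ECC_CHECK_DATA;
}
```

With BCH no such special case is needed, since the ecc bytes are simply more bits of the same codeword.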
BR,
--
Ivan
Gerlando Falauto - Nov. 12, 2012, 5:39 p.m.
Hi Ivan,

wonderful, thanks a lot!
If you also happen to have an opinion on using it for chips only 
needing 1-bit correction, I'd love to hear that...

Thanks again!
Gerlando

On 11/12/2012 06:35 PM, Ivan Djelic wrote:
> On Mon, Nov 12, 2012 at 05:19:57PM +0000, Gerlando Falauto wrote:
> (...)
>>> At any rate, the ECC algorithm itself should be able to take care of bit
>>> flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this
>>> by comparing the computed ECC with the actual ECC; if there's a difference
>>> of exactly one bit (rather than a more complex diff which after
>>> calculations points out the flipped bit in the main area), it is assumed
>>> that the bitflip is in the ECC area rather than the data. I don't know how
>>> BCH does this though.
>>
>> Ivan, I came to understand (but I am not sure), that the implementation
>> you provided (and currently mainlined) *DOES* handle this correctly. It
>> was instead an old one which did not handle this properly. Is my
>> understanding correct?
>
> Yes, you are correct. In BCH ECC, there is no difference between data and ecc bytes; they are
> all part of a larger codeword on which error correction is performed.
> An old patch introducing BCH support in nand/omap2.c had a bug which was triggered when a bitflip
> was detected in ecc bytes; but this has nothing to do with the way BCH algorithms work.
> BR,
> --
> Ivan
Ivan Djelic - Nov. 12, 2012, 6:52 p.m.
On Mon, Nov 12, 2012 at 05:39:45PM +0000, Gerlando Falauto wrote:
> Hi Ivan,
> 
> wonderful, thanks a lot!
> If you also happen to have an opinion on using it for chips only 
> needing 1-bit correction, I'd love to hear that...

I would recommend using the strongest ECC your hardware can provide without
hurting performance too much. This is what I do on my hardware (e.g. 8-bit
correction on current 4-bit devices). I find it has 2 advantages:
- increased reliability
- seamless transition to newer devices with stronger ecc requirements
The latter is important, because changing ECC strength can be painful: it
means changing the OOB layout, impacting bootloader and kernel, thus breaking
compatibility, etc.
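
To put rough numbers on the OOB impact: a binary BCH code over GF(2^m) correcting t bits needs at most m*t parity bits, and m*t is the usual figure. The helper below is a sketch of that arithmetic only; bch_ecc_bytes() is a made-up name, and real controllers may pad or align the result differently.

```c
#include <assert.h>

/*
 * Hypothetical helper: parity bytes per ECC block for a binary BCH code
 * correcting t bit errors over data_bytes of data.
 * m is the smallest field order whose codeword length (2^m - 1 bits)
 * can hold the data bits plus the m*t parity bits.
 */
static unsigned int bch_ecc_bytes(unsigned int data_bytes, unsigned int t)
{
	unsigned int m;

	assert(data_bytes > 0 && t > 0);
	for (m = 2; m < 31; m++) {
		if ((1u << m) - 1 >= 8 * data_bytes + m * t)
			break;
	}
	return (m * t + 7) / 8;	/* round bits up to whole bytes */
}
```

With 512-byte blocks this gives 7 bytes for t = 4 and 13 bytes for t = 8, so a 2048-byte page needs 28 versus 52 OOB bytes for ECC, which is why the layout has to change.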
HTH,
--
Ivan
Gerlando Falauto - Nov. 14, 2012, 10:12 a.m.
Hi Ivan,

thanks once more.
Speaking of compatibility, I was wondering: doesn't a NAND flash have 
*any* spare storage space at all, where software could store some 
information about the current OOB layout and/or ECC mechanism?
Partition tables on hard drives for instance have a "partition type" 
byte which provides some hints about what to expect from the data within 
a partition.

This would be especially useful for *future* compatibility (i.e. old 
software reading a NAND "formatted" with unknown mechanism could simply 
stop working, or force read-only mode disabling ECC altogether).

Feasibility aside, would that make any sense?

Thank you,
Gerlando

On 11/12/2012 07:52 PM, Ivan Djelic wrote:
> On Mon, Nov 12, 2012 at 05:39:45PM +0000, Gerlando Falauto wrote:
>> Hi Ivan,
>>
>> wonderful, thanks a lot!
>> If you also happen to have an opinion on using it for chips only
>> needing 1-bit correction, I'd love to hear that...
>
> I would recommend using the strongest ECC your hardware can provide without
> hurting performance too much. This is what I do on my hardware (e.g. 8-bit
> correction on current 4-bit devices). I find it has 2 advantages:
> - increased reliability
> - seamless transition to newer devices with stronger ecc requirements
> The latter is important, because changing ECC strength can be painful: it
> means changing the OOB layout, impacting bootloader and kernel, thus breaking
> compatibility, etc.
> HTH,
> --
> Ivan
Angus CLARK - Nov. 14, 2012, 1:24 p.m.
Hi Gerlando,

On 11/14/2012 10:12 AM, Gerlando Falauto wrote:
> Hi Ivan,
> 
> thanks once more.
> Speaking of compatibility, I was wondering: doesn't a NAND flash have
> *any* spare storage space at all, where software could store some
> information about the current OOB layout and/or ECC mechanism?
> Partition tables on hard drives for instance have a "partition type"
> byte which provides some hints about what to expect from the data within
> a partition.
> 
> This would be especially useful for *future* compatibility (i.e. old
> software reading a NAND "formatted" with unknown mechanism could simply
> stop working, or force read-only mode disabling ECC altogether).
> 
> Feasibility aside, would that make any sense?
> 

In general I am in favour of anything that facilitates the automatic probing of
devices.  However, I can see a number of complications in trying to implement
what you suggest.  Storing static information in a fixed location is never a
good idea on NAND.  A further issue relates to the very information you are
trying to store.  The data itself would need to be protected by ECC, but for it
to be useful, you need to be able to retrieve it without knowing what ECC/layout
was used when storing it.  Perhaps, for this ECC/layout data, one could use a
special dedicated S/W ECC scheme, strong enough for any device.  Yet another
layer of complexity though.

With regards to "spare storage", I would probably suggest the ECC/layout data be
added to the BBT area, assuming Flash-Resident BBTs are being used.

My only doubt would be whether there is sufficient motivation to overcome some
of the complexities and implement such a scheme...

Cheers,

Angus

> Thank you,
> Gerlando
> 
> On 11/12/2012 07:52 PM, Ivan Djelic wrote:
>> On Mon, Nov 12, 2012 at 05:39:45PM +0000, Gerlando Falauto wrote:
>>> Hi Ivan,
>>>
>>> wonderful, thanks a lot!
>>> If you also happen to have an opinion on using it for chips only
>>> needing 1-bit correction, I'd love to hear that...
>>
>> I would recommend using the strongest ECC your hardware can provide
>> without
>> hurting performance too much. This is what I do on my hardware (e.g.
>> 8-bit
>> correction on current 4-bit devices). I find it has 2 advantages:
>> - increased reliability
>> - seamless transition to newer devices with stronger ecc requirements
>> The latter is important, because changing ECC strength can be painful: it
>> means changing the OOB layout, impacting bootloader and kernel, thus
>> breaking
>> compatibility, etc.
>> HTH,
>> -- 
>> Ivan
> 
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
>
Matthieu CASTET - Nov. 14, 2012, 2:48 p.m.
Angus CLARK wrote:
> Hi Gerlando,
> 
> On 11/14/2012 10:12 AM, Gerlando Falauto wrote:
>> Hi Ivan,
>>

>>
>> Feasibility aside, would that make any sense?
>>
> 
> In general I am in favour of anything that facilitates the automatic probing of
> devices.  However, I can see a number of complications in trying to implement
> what you suggest.  Storing static information in a fixed location is never a
> good idea on NAND.  A further issue relates to the very information you are
> trying to store.  The data itself would need to be protected by ECC, but for it
> to be useful, you need to be able to retrieve it without knowing what ECC/layout
> was used when storing it.  Perhaps, for this ECC/layout data, one could use a
> special dedicated S/W ECC scheme, strong enough for any device.  Yet another
> layer of complexity though.
You can use what ONFI flash uses for its parameter page data [1]:
- duplicate the data over n pages
- use a CRC to detect corruption



Matthieu

[1]
The host should issue the Read Parameter Page (ECh) command. This command
returns information that includes the capabilities, features, and operating
parameters of the device.
When the information is read from the device, the host shall check the CRC
to ensure that the data was received correctly and without error prior to
taking action on that data.
If the CRC of the first parameter page read is not valid (refer to section
5.7.1.24), the host should read redundant parameter page copies. The host
can determine whether a redundant parameter page is present or not by
checking if the first four bytes contain at least two bytes of the parameter
page signature. If the parameter page signature is present, then the host
should read the entirety of that redundant parameter page. The host should
then check the CRC of that redundant parameter page. If the CRC is correct,
the host may take action based on the contents of that redundant parameter
page. If the CRC is incorrect, then the host should attempt to read the next
redundant parameter page by the same procedure.
The host should continue reading redundant parameter pages until the host is
able to accurately reconstruct the parameter page contents. All parameter
pages returned by the Target may have invalid CRC values; however, bit-wise
majority or other ECC techniques may be used to recover the contents of the
parameter page. The host may use bit-wise majority or other ECC techniques
to recover the contents of the parameter page from the parameter page copies
present. When the host determines that a parameter page signature is not
present (refer to section 5.7.1.1), then all parameter pages have been read.
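
A minimal sketch of that recovery procedure, for illustration only: try each copy's CRC first, then fall back to a bit-wise majority vote. The CRC-16 below uses polynomial 0x8005 with initial value 0x4F4E, which is what ONFI specifies for the parameter page as far as I know. The function names, the fixed three-copy vote, and passing the expected CRC separately (rather than reading it from the page itself) are assumptions made here to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16, polynomial 0x8005, MSB first, caller-supplied initial value. */
static uint16_t crc16_8005(uint16_t crc, const uint8_t *p, size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(p[i] << 8);
		for (bit = 0; bit < 8; bit++)
			crc = (uint16_t)((crc << 1) ^
					 ((crc & 0x8000) ? 0x8005 : 0));
	}
	return crc;
}

/* Bit-wise majority of three copies: each output bit takes the value
 * that at least two of the copies agree on. */
static void majority3(const uint8_t *a, const uint8_t *b, const uint8_t *c,
		      uint8_t *out, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
}

/*
 * Use the first copy whose CRC matches `want`; if none does, vote.
 * Returns 0 on success, -1 if even the voted result fails the CRC.
 */
static int recover_params(const uint8_t *copies[3], size_t len,
			  uint16_t want, uint8_t *out)
{
	size_t i, k;

	assert(copies && out);
	for (k = 0; k < 3; k++) {
		if (crc16_8005(0x4F4E, copies[k], len) == want) {
			for (i = 0; i < len; i++)
				out[i] = copies[k][i];
			return 0;
		}
	}
	majority3(copies[0], copies[1], copies[2], out, len);
	return crc16_8005(0x4F4E, out, len) == want ? 0 : -1;
}
```

The vote only helps as long as no bit position is corrupted in two of the three copies, so more copies (or a real ECC) would be needed on very unreliable parts.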
Ivan Djelic - Nov. 14, 2012, 8:22 p.m.
On Wed, Nov 14, 2012 at 01:24:43PM +0000, Angus CLARK wrote:
> Hi Gerlando,
> 
> On 11/14/2012 10:12 AM, Gerlando Falauto wrote:
> > Hi Ivan,
> > 
> > thanks once more.
> > Speaking of compatibility, I was wondering: doesn't a NAND flash have
> > *any* spare storage space at all, where software could store some
> > information about the current OOB layout and/or ECC mechanism?
> > Partition tables on hard drives for instance have a "partition type"
> > byte which provides some hints about what to expect from the data within
> > a partition.
> > 
> > This would be especially useful for *future* compatibility (i.e. old
> > software reading a NAND "formatted" with unknown mechanism could simply
> > stop working, or force read-only mode disabling ECC altogether).
> > 
> > Feasibility aside, would that make any sense?
> > 
> 
> In general I am in favour of anything that facilitates the automatic probing of
> devices.  However, I can see a number of complications in trying to implement
> what you suggest.  Storing static information in a fixed location is never a
> good idea on NAND.  A further issue relates to the very information you are
> trying to store.  The data itself would need to be protected by ECC, but for it
> to be useful, you need to be able to retrieve it without knowing what ECC/layout
> was used when storing it.  Perhaps, for this ECC/layout data, one could use a
> special dedicated S/W ECC scheme, strong enough for any device.  Yet another
> layer of complexity though.

FWIW, I have once implemented a kind of primitive "formatting" similar to what you are describing (i.e. storage of NAND parameters inside the device itself).
For that, I used a dedicated SW BCH ECC that adds 3 redundant bytes to each useful byte (effectively multiplying the required storage by 4).
The resulting data can sustain up to 4 bitflips in each 32-bit word; it is also stored redundantly in multiple blocks.

BR,
--
Ivan
Calvin Johnson - Nov. 20, 2012, 11:13 a.m.
Hi,

I thought of sharing my recent experience with MLC NAND which requires
24-bit ECC.

On Fri, Nov 9, 2012 at 2:16 PM, Ricard Wanderlof
<ricard.wanderlof@axis.com> wrote:
>
> On Thu, 8 Nov 2012, Gerlando Falauto wrote:
>
>>> We had BCH8 code running, but it wasn't enough. The main reason we
>>> switched away from host side ECC was because we were getting bitflips
>>> within the ECC codeword data itself.
>>
>>
>> Wow... I mean, I figured it wouldn't be that easy to (purposely) get
>> bitflips in any area, I wonder what kind of test you managed to come up with
>> in order to get bitflips within the ECC area itself. In my case it takes
>> several hours (of continuous reads) to get a single bitflip within a 1Gb
>> (128MB) flash.
>
>
> There are 1Gb flashes and 1Gb flashes. Depending on the technology used
> during manufacture (essentially the scale of the on-chip structures, usually
> specified as 'xxx nm technology') the bit error probabilities can vary.
>
> "Traditional" 1Gb flashes where the manufacturer recommends 1-bit ECC in
> practice very rarely exhibit bit flips. I have seen bit flips in the OOB
> area as well as the main area (there was a bug in nand_ecc.c many years ago
> which didn't handle this correctly which is how I discovered what was going
> on); indeed there's nothing different about the OOB area in terms of bit
> flips, it's just another area of (the same type of) flash. The probability
> for the whole OOB area is of course less than for the rest as it is smaller,
> but it is the same per bit if I understand it correctly.
>
> Some manufacturers (Micron for instance I believe) have started to deliver 1
> Gb chips using a higher density technology where they specify a requirement
> for 4-bit ECC. These naturally exhibit a much higher bitflip rate.
>

I'm using Micron's MT29F16G08CBACA.
Minimum required ECC :-      24-bit ECC per 1080 bytes of data
The H/W ECC controller (external to the NAND flash) I'm using supports 24-bit ECC.
I had a tough time initially when I started working on this NAND flash.
Without being aware of the minimum required ECC, I was using
Hamming (1-bit) correction. This showed inconsistency at a level of
1/6, i.e. 1 boot out of 6 failed.

Since I switched to 24-bit ECC with UBIFS, everything has worked
properly without any issue so far.

With JFFS2, however, there are still many issues. I assume this is
due to bit flips in the OOB area, which is not covered by ECC.
Also, erased pages have no ECC protection: JFFS2 reads the
first 256 bytes of data and checks that they are all 0xFF to confirm
that the page is erased, in addition to checking the clean marker it
read from the OOB.
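
One way around the all-0xFF test failing on a lone bitflip is to count the zero bits and accept the region as erased when the count stays under a threshold. The sketch below only illustrates that idea and is not what JFFS2 does; count_zero_bits(), looks_erased() and the max_flips parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits that read back as 0 in a buffer expected to be all 0xFF. */
static unsigned int count_zero_bits(const uint8_t *buf, size_t len)
{
	unsigned int zeros = 0;
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < len; i++) {
		uint8_t v = (uint8_t)~buf[i];	/* set bits mark flipped cells */

		while (v) {
			v &= (uint8_t)(v - 1);	/* clear the lowest set bit */
			zeros++;
		}
	}
	return zeros;
}

/* Treat the region as erased if at most max_flips bits have dropped to 0. */
static int looks_erased(const uint8_t *buf, size_t len, unsigned int max_flips)
{
	return count_zero_bits(buf, len) <= max_flips;
}
```

The threshold would have to stay below the ECC strength of the chip, otherwise a page that was programmed with mostly-0xFF data could be mistaken for an erased one.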

From various articles on the internet, it seems that NAND flashes are
going to get denser and the bit flips are going to increase.
Hence H/W ECC controllers are going to be in greater demand. The S/W
BCH algorithm available in Linux consumes plenty of cycles, which
could be offloaded to a H/W ECC controller.

> At any rate, the ECC algorithm itself should be able to take care of bit
> flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this
> by comparing the computed ECC with the actual ECC; if there's a difference
> of exactly one bit (rather than a more complex diff which after calculations
> points out the flipped bit in the main area), it is assumed that the bitflip
> is in the ECC area rather than the data. I don't know how BCH does this
> though.
>
regards,
Calvin
Gerlando Falauto - Nov. 20, 2012, 11:35 a.m.
Hi Calvin,

thanks for sharing your experience.

On 11/20/2012 12:13 PM, Calvin Johnson wrote:
> Hi,
>
> I thought of sharing my recent experience with MLC NAND which requires
> 24-bit ECC.

When you say 24-bit, you mean ECC capable of correcting up to 24 
bitflips within the same block, right? I guess that should be the case 
since I hear MLC NANDs are even less reliable than SLC.

>
> On Fri, Nov 9, 2012 at 2:16 PM, Ricard Wanderlof
> <ricard.wanderlof@axis.com>  wrote:
>>
>> On Thu, 8 Nov 2012, Gerlando Falauto wrote:
>>
>>>> We had BCH8 code running, but it wasn't enough. The main reason we
>>>> switched away from host side ECC was because we were getting bitflips
>>>> within the ECC codeword data itself.
>>>
>>>
>>> Wow... I mean, I figured it wouldn't be that easy to (purposely) get
>>> bitflips in any area, I wonder what kind of test you managed to come up with
>>> in order to get bitflips within the ECC area itself. In my case it takes
>>> several hours (of continuous reads) to get a single bitflip within a 1Gb
>>> (128MB) flash.
>>
>>
>> There are 1Gb flashes and 1Gb flashes. Depending on the technology used
>> during manufacture (essentially the scale of the on-chip structures, usually
>> specified as 'xxx nm technology') the bit error probabilities can vary.
>>
>> "Traditional" 1Gb flashes where the manufacturer recommends 1-bit ECC in
>> practice very rarely exhibit bit flips. I have seen bit flips in the OOB
>> area as well as the main area (there was a bug in nand_ecc.c many years ago
>> which didn't handle this correctly which is how I discovered what was going
>> on); indeed there's nothing different about the OOB area in terms of bit
>> flips, it's just another area of (the same type of) flash. The probability
>> for the whole OOB area is of course less than for the rest as it is smaller,
>> but it is the same per bit if I understand it correctly.
>>
>> Some manufacturers (Micron for instance I believe) have started to deliver 1
>> Gb chips using a higher density technology where they specify a requirement
>> for 4-bit ECC. These naturally exhibit a much higher bitflip rate.
>>
>
> I'm using Micron's MT29F16G08CBACA.
> Minimum required ECC :-      24-bit ECC per 1080 bytes of data
> The H/W ECC controller(external to NAND flash) I'm using supports 24-bit ECC.

Could you please share, just for the record, what controller you are 
using? Do you also know what algorithm is being used?
Is that already supported in the kernel or did you have to write the 
code for it?

> Had a tough time initially when I started working on this NAND flash.
> Without being aware of the minimum required ECC, I was using
> Hamming (1-bit) correction. This showed inconsistency at a level of
> 1/6, i.e. 1 boot out of 6 failed.
>
> When I switched to 24-bit ECC with UBIFS, everything seems to work
> properly without any issue so far.
>
> But with JFFS2 still there are many issues. I assume that this can be
> due to the bit flips in the OOB area which are not covered by ECC.

I'm not that familiar with the whole thing, but I thought you could 
specify what portions of the OOB area were to be used by the filesystem 
(like in the case of the on-die HW ECC for Micron as specified in their 
TN's and discussed here).
Or perhaps JFFS2 is so demanding in terms of OOB data that you're also 
forced to use unprotected portions?

> Also for the erased pages, there is no ECC protection and JFFS2 reads
> first 256 bytes of data and checks for all 0xFF to confirm it is an
> erased page along with the checking of clean marker it read from the
> OOB.
>
> From various articles on the internet, it seems that NAND flashes are
> going to get denser and the bit flips are going to increase.
> Hence the H/W ECC controllers are going to have more demand. The S/W
> BCH algorithm available in Linux will consume plenty of cycles which
> can be offloaded to the H/W ECC controller.

Right, so... what is the current support then?

>> At any rate, the ECC algorithm itself should be able to take care of bit
>> flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does this
>> by comparing the computed ECC with the actual ECC; if there's a difference
>> of exactly one bit (rather than a more complex diff which after calculations
>> points out the flipped bit in the main area), it is assumed that the bitflip
>> is in the ECC area rather than the data. I don't know how BCH does this
>> though.
>>
> regards,
> Calvin

Thanks again,
Gerlando
Calvin Johnson - Nov. 20, 2012, 12:12 p.m.
Hi Gerlando,

On Tue, Nov 20, 2012 at 5:05 PM, Gerlando Falauto
<gerlando.falauto@keymile.com> wrote:
> Hi Calvin,
>
> thanks for sharing your experience.
>
>
> On 11/20/2012 12:13 PM, Calvin Johnson wrote:
>>
>> Hi,
>>
>> I thought of sharing my recent experience with MLC NAND which requires
>> 24-bit ECC.
>
>
> When you say 24-bit, you mean ECC capable of correcting up to 24 bitflips
> within the same block, right? I guess that should be the case since I hear
> MLC NANDs are even less reliable than SLC.

Yes, 24-bit ECC means that any number of bit flips up to 24 per ECC block
can be corrected. Generally the ECC block size is 512 bytes or
1K bytes, depending on the ECC H/W engine's buffer capacity.

>>
>> On Fri, Nov 9, 2012 at 2:16 PM, Ricard Wanderlof
>> <ricard.wanderlof@axis.com>  wrote:
>>>
>>>
>>> On Thu, 8 Nov 2012, Gerlando Falauto wrote:
>>>
>>>>> We had BCH8 code running, but it wasn't enough. The main reason we
>>>>> switched away from host side ECC was because we were getting bitflips
>>>>> within the ECC codeword data itself.
>>>>
>>>>
>>>>
>>>> Wow... I mean, I figured it wouldn't be that easy to (purposely) get
>>>> bitflips in any area, I wonder what kind of test you managed to come up
>>>> with
>>>> in order to get bitflips within the ECC area itself. In my case it takes
>>>> several hours (of continuous reads) to get a single bitflip within a 1Gb
>>>> (128MB) flash.
>>>
>>>
>>>
>>> There are 1Gb flashes and 1Gb flashes. Depending on the technology used
>>> during manufacture (essentially the scale of the on-chip structures,
>>> usually
>>> specified as 'xxx nm technology') the bit error probabilities can vary.
>>>
>>> "Traditional" 1Gb flashes where the manufacturer recommends 1-bit ECC in
>>> practice very rarely exhibit bit flips. I have seen bit flips in the OOB
>>> area as well as the main area (there was a bug in nand_ecc.c many years
>>> ago
>>> which didn't handle this correctly which is how I discovered what was
>>> going
>>> on); indeed there's nothing different about the OOB area in terms of bit
>>> flips, it's just another area of (the same type of) flash. The
>>> probability
>>> for the whole OOB area is of course less than for the rest as it is
>>> smaller,
>>> but it is the same per bit if I understand it correctly.
>>>
>>> Some manufacturers (Micron for instance I believe) have started to
>>> deliver 1
>>> Gb chips using a higher density technology where they specify a
>>> requirement
>>> for 4-bit ECC. These naturally exhibit a much higher bitflip rate.
>>>
>>
>> I'm using Micron's MT29F16G08CBACA.
>> Minimum required ECC :-      24-bit ECC per 1080 bytes of data
>> The H/W ECC controller(external to NAND flash) I'm using supports 24-bit
>> ECC.
>
>
> Could you please share, just for the record, what controller you are using?
> Do you also know what algorithm is being used?
> Is that already supported in the kernel or did you have to write the code
> for it?

The controller is inside the SoC. AFAIK, there are 2 popular error
correction algorithms: Hamming and BCH. Hamming is used for 2-bit
error detection and single-bit error correction. BCH can correct
more bit errors per ECC block. I used BCH. Although the
kernel has some H/W ECC support functions, I had to write the calculate
and correct functions myself.

>> Had a tough time initially when I started working on this NAND flash.
>> Without being aware of the minimum required ECC, I was using
>> Hamming (1-bit) correction. This showed inconsistency at a level of
>> 1/6, i.e. 1 boot out of 6 failed.
>>
>> When I switched to 24-bit ECC with UBIFS, everything seems to work
>> properly without any issue so far.
>>
>> But with JFFS2 still there are many issues. I assume that this can be
>> due to the bit flips in the OOB area which are not covered by ECC.
>
>
> I'm not that familiar with the whole thing, but I thought you could specify
> what portions of the OOB area were to be used by the filesystem (like in the
> case of the on-die HW ECC for Micron as specified in their TN's and
> discussed here).
> Or perhaps JFFS2 is so demanding in terms of OOB data that you're also
> forced to use unprotected portions?

JFFS2 places clean markers in the OOB area, and the bits which make up
this marker can flip at any time, resulting in inconsistent behaviour.

>
>> Also for the erased pages, there is no ECC protection and JFFS2 reads
>> first 256 bytes of data and checks for all 0xFF to confirm it is an
>> erased page along with the checking of clean marker it read from the
>> OOB.
>>
>> From various articles on the internet, it seems that NAND flashes are
>> going to get denser and the bit flips are going to increase.
>> Hence the H/W ECC controllers are going to have more demand. The S/W
>> BCH algorithm available in Linux will consume plenty of cycles which
>> can be offloaded to the H/W ECC controller.
>
>
> Right, so... what is the current support then?

Anyone concerned about freeing the processor from performing the
SW BCH calculation can get HW ECC controllers from the market. I
don't know who supplies them, though.

>>> At any rate, the ECC algorithm itself should be able to take care of bit
>>> flips in the ECC codes. For the 1-bit algorithm in nand_ecc.c it does
>>> this
>>> by comparing the computed ECC with the actual ECC; if there's a
>>> difference
>>> of exactly one bit (rather than a more complex diff which after
>>> calculations
>>> points out the flipped bit in the main area), it is assumed that the
>>> bitflip
>>> is in the ECC area rather than the data. I don't know how BCH does this
>>> though.

regards,
Calvin
Ricard Wanderlof - Nov. 20, 2012, 4:16 p.m.
On Tue, 20 Nov 2012, Calvin Johnson wrote:

> I thought of sharing my recent experience with MLC NAND which requires
> 24-bit ECC.
> ...

Thanks for sharing your experiences.

> From various articles on the internet, it seems that NAND flashes are
> going to get denser and the bit flips are going to increase.
> Hence the H/W ECC controllers are going to have more demand. The S/W
> BCH algorithm available in Linux will consume plenty of cycles which
> can be offloaded to the H/W ECC controller.

That is certainly the case for the newer and larger flashes. However, in 
the past year or so it seems that manufacturers have appeared which are 
offering "small" (i.e. 1 Gb and thereabouts) flashes with 1 bit ECC 
requirements, and are planning to do so for a number of years.

It seems that while the big manufacturers (Micron, Samsung, etc) are 
moving on to state-of-the-art higher densities, there is still a market 
interest for smaller, more reliable flashes, I would think mostly for code 
storage for embedded systems.

/Ricard

Patch

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index a796dd7..68af8b0 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -1069,7 +1069,16 @@  static int nand_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
                              uint8_t *buf, int page)
 {
        chip->read_buf(mtd, buf, mtd->writesize);
-       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize);
+       chip->read_buf(mtd, chip->oob_poi, mtd->oobsize); /* (this data used in sw ecc) */
+
+       /* TODO: only do this CMD_STATUS if we have Micron NAND */
+       chip->cmdfunc(mtd, NAND_CMD_STATUS, -1, -1);
+       if (chip->read_byte(mtd) & NAND_STATUS_REWRITE_RECOMMENDED) {
+               /* Micron NAND is telling us that this block may be going bad,
+                  tell Linux to move it */
+               mtd->ecc_stats.corrected++; /* (we don't actually know if it's just one correction, could be up to 4) */
+       }
+
        return 0;
 }
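
For reference, a standalone sketch of how the status byte could be decoded, also checking the FAIL bit, which the hack above does not look at. The bit positions (bit 0 for FAIL, bit 3 for rewrite recommended) are my reading of Micron's on-die ECC parts and should be checked against the datasheet; STATUS_FAIL, STATUS_REWRITE and decode_ondie_status() are names invented here, not kernel API.

```c
#include <stdint.h>

#define STATUS_FAIL	0x01	/* assumed: uncorrectable error */
#define STATUS_REWRITE	0x08	/* assumed: corrected, rewrite recommended */

enum ondie_result {
	ONDIE_CLEAN,		/* nothing reported */
	ONDIE_CORRECTED,	/* data is good but the block should be moved */
	ONDIE_UNCORRECTABLE	/* on-die ECC gave up */
};

/* FAIL takes priority: a failed read must not be counted as corrected. */
static enum ondie_result decode_ondie_status(uint8_t status)
{
	if (status & STATUS_FAIL)
		return ONDIE_UNCORRECTABLE;
	if (status & STATUS_REWRITE)
		return ONDIE_CORRECTED;
	return ONDIE_CLEAN;
}
```

In the read path, ONDIE_UNCORRECTABLE would presumably map to incrementing mtd->ecc_stats.failed and ONDIE_CORRECTED to mtd->ecc_stats.corrected, as the patch does for the latter.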