Message ID: c7baa4a77775b54f2ef8450d8d1f44d1dc011b41.1298527669.git.viresh.kumar@st.com
State: New, archived
On Thu, Feb 24, 2011 at 06:10:16AM +0000, Viresh Kumar wrote:
> From: Vipin Kumar <vipin.kumar@st.com>
>
> A newly erased page contains ff in the data as well as the spare area. While
> reading an erased page, the ecc read out of the spare area does not match the
> ecc generated by the fsmc ecc hardware accelerator. This is because the ecc of
> all-ff data is not all ff. This leads to errors when jffs2 erases and reads
> back pages to ensure consistency.
>
> This patch adds a software workaround to ensure that the ecc check is not
> performed for erased pages. An erased page is detected by checking that its
> data is all ff.

Hello Vipin,

Just a suggestion: maybe you could mention in your comments the fact that you
cannot work around the problem by using a mask to get a valid ECC on erased
pages, because your controller does not allow it?

If you plan to use your workaround on recent NAND devices with UBIFS, you may
still experience problems because of uncorrected bitflips on erased pages, and
get errors such as:

UBIFS error (pid 576): ubifs_scan: corrupt empty space at LEB 509:126586
UBIFS error (pid 576): ubifs_scanned_corruption: corruption at LEB 509:126586
UBIFS error (pid 576): ubifs_scan: LEB 509 scanning failed
UBIFS warning (pid 576): ubifs_ro_mode: switched to read-only mode, error -117

Regards,
Ivan
On 2/24/2011 3:08 PM, Ivan Djelic wrote:
> On Thu, Feb 24, 2011 at 06:10:16AM +0000, Viresh Kumar wrote:
(...)
> Just a suggestion: maybe you could mention in your comments the fact that you
> cannot work around the problem by using a mask to get a valid ECC on erased
> pages, because your controller does not allow it?
>
> If you plan to use your workaround on recent NAND devices with UBIFS, you may
> still experience problems because of uncorrected bitflips on erased pages, and
> get errors such as:
(...)

Hello Ivan,

Let me explain the problem again.

The problem is that the BCH algorithm (used by this controller to generate ecc
and correct bitflips) generates an ecc which is not all 0xff for 512 erased
bytes. Since erasing a page resets all of its data, including the spare area,
to 0xff, the ecc stored in the spare area of an erased page is incorrect. This
ecc is not useful for correcting bitflips.

One way to solve this problem is to write the correct ecc into the spare area
of erased pages. The other is to detect that the page is erased and not run the
correction algorithm. We are using the second option, but there would not be
any unwanted bitflips in either case.

Let me know if this answers your questions.

Regards
Vipin
On Thu, Feb 24, 2011 at 10:20:59AM +0000, Vipin Kumar wrote:
> On 2/24/2011 3:08 PM, Ivan Djelic wrote:
(...)
> One way to solve this problem is to write the correct ecc into the spare area
> of erased pages. The other is to detect that the page is erased and not run
> the correction algorithm.

There is a third option: add (before writing oob / after reading oob) a fixed
polynomial to your HW-generated BCH ECC codeword. This polynomial is chosen
such that your ECC code on an erased page now becomes a sequence of 0xff bytes.
That way, erased pages can be read with ECC checking enabled. That was my
point.

I assume you cannot alter oob contents as described above, because your
controller performs error detection "on-the-fly" as you transfer data to/from
the NAND device(?).

> We are using the second option, but there would not be any unwanted bitflips
> in either case.

On recent NAND devices, bitflips _do_ appear on erased pages, sometimes
immediately after a block erase.

Regards,
Ivan
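[Editor's note: Ivan's "third option" can be sketched as a symmetric XOR
transform applied to the ECC bytes on the write path and again on the read
path. The names and the 13-byte ECC_OF_ERASED constant below are hypothetical
placeholders (the real constant would be the controller's BCH ECC over 512
bytes of 0xFF); this is only a sketch of the idea, not the fsmc driver's code.]

```c
#include <stddef.h>
#include <stdint.h>

#define ECC_LEN 13

/*
 * Hypothetical constant: the HW BCH ECC computed over 512 bytes of 0xFF.
 * The real values would come from the controller; these are placeholders.
 */
static const uint8_t ECC_OF_ERASED[ECC_LEN] = {
	0x3a, 0x51, 0xc7, 0x09, 0xee, 0x24, 0x80,
	0x15, 0xb2, 0x6d, 0xf0, 0x47, 0x9c,
};

/*
 * Symmetric transform: XOR the ECC with (ECC_OF_ERASED ^ 0xFF).
 * Applied before writing the OOB and again after reading it, it maps the
 * erased-page ECC to all-0xFF bytes, and it is its own inverse.
 */
static void ecc_transform(uint8_t *ecc)
{
	for (size_t i = 0; i < ECC_LEN; i++)
		ecc[i] ^= ECC_OF_ERASED[i] ^ 0xff;
}
```

Because BCH codes are linear, adding a fixed polynomial to every codeword
preserves the correction capability as long as the same constant is removed
before decoding; this is also exactly why the trick fails on controllers that
check ECC on the fly and never expose the raw ECC bytes to software.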
On 2/24/2011 4:40 PM, Ivan Djelic wrote:
> On Thu, Feb 24, 2011 at 10:20:59AM +0000, Vipin Kumar wrote:
(...)
> There is a third option: add (before writing oob / after reading oob) a fixed
> polynomial to your HW-generated BCH ECC codeword. This polynomial is chosen
> such that your ECC code on an erased page now becomes a sequence of 0xff
> bytes. That way, erased pages can be read with ECC checking enabled. That was
> my point.
>
> I assume you cannot alter oob contents as described above, because your
> controller performs error detection "on-the-fly" as you transfer data to/from
> the NAND device(?).

This is true. The controller performs the error detection on the fly, i.e. as
the cpu reads the oob bytes. The expectation of the controller is:

1. reset the ecc logic
2. start reading the 512 data bytes + 13 bytes (the ecc, lying in the oob)
3. the controller then provides the information about the flipped bit offsets

This is the reason why the above option would not work.

> On recent NAND devices, bitflips _do_ appear on erased pages, sometimes
> immediately after a block erase.

Regards
Vipin
On 02/24/2011 11:40 AM, Viresh KUMAR wrote:
> From: Vipin Kumar <vipin.kumar@st.com>
(...)
> Signed-off-by: Vipin Kumar <vipin.kumar@st.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
> Acked-by: Linus Walleij <linus.walleij@stericsson.com>

Artem, David,

Have you pushed this patch for 39-rc1?
Hi,

Sorry for the late reply; I was too busy and also missed your patch somehow.

On Thu, 2011-02-24 at 11:40 +0530, Viresh Kumar wrote:
> From: Vipin Kumar <vipin.kumar@st.com>
(...)
> Signed-off-by: Vipin Kumar <vipin.kumar@st.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
> Acked-by: Linus Walleij <linus.walleij@stericsson.com>

...

> +	/*
> +	 * This is a temporary erase check. A newly erased page read
> +	 * would result in an ecc error because the oob data is also
> +	 * erased to FF and the calculated ecc for an FF data is not
> +	 * FF..FF.
> +	 * This is a workaround to skip performing correction in case
> +	 * data is FF..FF
> +	 */
> +	for (k = 0; k < eccsize; k++) {
> +		if (*(p + k) != 0xff)
> +			break;
> +	}
> +

So eccsize is 512, and you are going to compare 512 bytes on every read? Isn't
that a very bad and wasteful solution? Maybe you could invent something less
straightforward but more optimal?

Let's suppose the ECC corresponding to all 0xFFs is P. Maybe you could do this
wasteful check only if ECC == P, not all the time?

Also, Ivan pointed out the right thing - you might have bit-flips on an erased
eraseblock; if not on a freshly erased one, then on an eraseblock which was
erased and then not used for a long time. If this is not of concern to you,
then at least write this in the comments and say that you know about this issue
but for reasons X and Y you do not handle it.

Thanks!
On 3/31/2011 7:21 PM, Artem Bityutskiy wrote:
(...)
> So eccsize is 512, and you are going to compare 512 bytes on every read?
> Isn't that a very bad and wasteful solution? Maybe you could invent something
> less straightforward but more optimal?
>
> Let's suppose the ECC corresponding to all 0xFFs is P. Maybe you could do
> this wasteful check only if ECC == P, not all the time?

Hello Artem,

Basically, the hardware only reports the flipped bit offsets, not the actual
ecc, at the time of reading. It reports the ecc only at the time of writing.
So, comparing ECCs is not a possibility while reading.

The other solution could be to write P into the OOB every time a page is
erased; in that case we would put a hook into erase. But since the read
implementation was anyway specific to the fsmc bch ecc4 scheme, this
implementation was more convenient...

> Also, Ivan pointed out the right thing - you might have bit-flips on an
> erased eraseblock; if not on a freshly erased one, then on an eraseblock
> which was erased and then not used for a long time. If this is not of concern
> to you,

In that case an ecc error would be reported, since the ecc won't match the
stored ecc (i.e. FF..FF), and the driver would mark it as a normal corrupted
page.

> then at least write this in the comments and say that you know about this
> issue but for reasons X and Y you do not handle it.

OK. I will add this to the patch comments.

Regards
Vipin
On Fri, 2011-04-01 at 11:58 +0530, Vipin Kumar wrote:
> On 3/31/2011 7:21 PM, Artem Bityutskiy wrote:
(...)
> Basically, the hardware only reports the flipped bit offsets, not the actual
> ecc, at the time of reading. It reports the ecc only at the time of writing.
> So, comparing ECCs is not a possibility while reading.

Sorry, my suggestion was stupid.

But can you do it like this?

1. Read the ECC.
2. Compare it to all 0xFFs.
3. If all 0xFFs, then check the page contents against 0xFF.
4. If not all 0xFFs, then this is not an erased page for sure, because erased
   pages have to have all 0xFFs in the OOB.

> The other solution could be to write P into the OOB every time a page is
> erased; in that case we would put a hook into erase.

You cannot write anything to the ECC positions in the OOB after erasure,
because if you do, how are you going to store the real ECC on subsequent
writes?

> In that case an ecc error would be reported, since the ecc won't match the
> stored ecc (i.e. FF..FF), and the driver would mark it as a normal corrupted
> page.

I'm confused. So you erased eraseblock A. Everything there contains 0xFFs,
including the OOB area.

Now you have one of the modern flashes. You get bit-flips in the page. Say, a
couple of bits flip. You read this page. You compare the contents of the page
with 0xFF and find out that the contents are not all 0xFFs. What do you do
then?
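[Editor's note: Artem's four steps above could look roughly like this in C.
The names are hypothetical, not the actual fsmc_nand.c code; the point is that
the expensive 512-byte data scan only runs when the 13 stored ECC bytes are
themselves all 0xFF, since a programmed page always has non-0xFF ECC bytes in
the OOB.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Helper: check whether a buffer contains only 0xFF bytes. */
static bool buf_is_all_ff(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0xff)
			return false;
	return true;
}

/*
 * Return true if the page should be treated as erased (skip ECC
 * correction). Steps 1-2: look at the stored ECC first; step 4: bail
 * out early on a programmed page; step 3: only then scan the data.
 */
static bool page_is_erased(const uint8_t *data, size_t datalen,
			   const uint8_t *read_ecc, size_t ecclen)
{
	if (!buf_is_all_ff(read_ecc, ecclen))
		return false;	/* programmed for sure */
	return buf_is_all_ff(data, datalen);
}
```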
On 4/1/2011 12:21 PM, Artem Bityutskiy wrote:
(...)
> But can you do it like this?
>
> 1. Read the ECC.
> 2. Compare it to all 0xFFs.
> 3. If all 0xFFs, then check the page contents against 0xFF.
> 4. If not all 0xFFs, then this is not an erased page for sure, because erased
>    pages have to have all 0xFFs in the OOB.

Hello Artem,

Yes, OK.

> You cannot write anything to the ECC positions in the OOB after erasure,
> because if you do, how are you going to store the real ECC on subsequent
> writes?

Yes, you are right... this is not a feasible solution.

> I'm confused. So you erased eraseblock A. Everything there contains 0xFFs,
> including the OOB area.
>
> Now you have one of the modern flashes. You get bit-flips in the page. Say, a
> couple of bits flip. You read this page. You compare the contents of the page
> with 0xFF and find out that the contents are not all 0xFFs. What do you do
> then?

Then the normal driver takes over, and it reports an error because the number
of errors in the page is beyond 8 bits (the maximum the FSMC ecc logic can
correct). Effectively speaking, the page read returns an error indicating that
the page could not be read properly. Ideally, any filesystem would mark it as a
bad block.

Regards
Vipin
On Fri, 2011-04-01 at 14:03 +0530, Vipin Kumar wrote:
(...)
> Then the normal driver takes over, and it reports an error because the number
> of errors in the page is beyond 8 bits (the maximum the FSMC ecc logic can
> correct).

Why 8? It may be just a single bit-flip. Just one bit becomes 0 instead of 1.

> Effectively speaking, the page read returns an error indicating that the page
> could not be read properly.

But why? It can be read properly. If this is just 1 wrong bit, you should be
able to correct it. And as Ivan indicated, modern flashes are so crappy that 1
bit-flip on an erased eraseblock is just normal there.

> Ideally, any filesystem would mark it as a bad block.

That's the point - no. This is normal on modern flashes.

I think one solution could be to make your check more sophisticated. You check
for 0xFFs; if that is not true, you see whether this is "almost all 0xFFs" and
count the number of non-0xFF bits. If the count is, say, 2, you assume this
page contains all 0xFFs plus 2 bit-flips. But I'm not sure it would work.

Anyway, if you do not care about such bit-flips for your SoC - fine. I just
wanted you to understand and accept the issue and write about it in the
comment. And I also wanted you to _not_ do the expensive 0xFF comparison every
time - but it seems you accepted this :-)
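[Editor's note: the "almost all 0xFFs" heuristic Artem describes amounts to
counting the zero bits in a nominally erased region and accepting it as erased
when the count is small. A sketch with hypothetical names; the threshold is
chosen for illustration and would have to stay well under the BCH8 correction
strength.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of 0 bits in the buffer: each one is a bitflip if the region
 * is supposed to be erased (all 0xFF).
 */
static int count_zero_bits(const uint8_t *buf, size_t len)
{
	int zeros = 0;

	for (size_t i = 0; i < len; i++) {
		uint8_t v = (uint8_t)~buf[i];	/* invert: count cleared bits */

		while (v) {
			zeros += v & 1;
			v >>= 1;
		}
	}
	return zeros;
}

/*
 * Treat the page as erased-with-bitflips if it has at most max_bitflips
 * zero bits (e.g. 2, per Artem's example above).
 */
static bool looks_erased(const uint8_t *buf, size_t len, int max_bitflips)
{
	return count_zero_bits(buf, len) <= max_bitflips;
}
```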
On 4/1/2011 2:09 PM, Artem Bityutskiy wrote:
(...)
> Why 8? It may be just a single bit-flip. Just one bit becomes 0 instead of 1.

It is a maximum of 8. So the logic can correct any number of bitflips from 1 to
8 in 512 bytes of data.

> But why? It can be read properly. If this is just 1 wrong bit, you should be
> able to correct it. And as Ivan indicated, modern flashes are so crappy that
> 1 bit-flip on an erased eraseblock is just normal there.

That's the problem. Ideally, the ecc should have been programmed into the OOB,
and then the driver would be able to correct the flipped bits. The problem
happens only if we try to read erased pages.

> That's the point - no. This is normal on modern flashes.
>
> I think one solution could be to make your check more sophisticated.
(...)
> Anyway, if you do not care about such bit-flips for your SoC - fine. I just
> wanted you to understand and accept the issue and write about it in the
> comment. And I also wanted you to _not_ do the expensive 0xFF comparison
> every time - but it seems you accepted this :-)

Yes, I had to accept this :-)

The flip side is that the hardware itself should not report errors when it
reads all-ff data and an all-ff ecc. It should assume it is an erased page and
not report any errors.

Yes, I understand the issue and will write more details about it in the patch
comment. Moreover, the comparison is only expensive when the page contains all
FFs; most of the time the comparison will fail within the first few bytes.

Regards
Vipin
On Fri, 2011-04-01 at 14:36 +0530, Vipin Kumar wrote:
> Yes, I understand the issue and will write more details about it in the patch
> comment. Moreover, the comparison is only expensive when the page contains
> all FFs; most of the time the comparison will fail within the first few
> bytes.

OK, actually yes, probably the ECC check is not needed. Anyway, will you send
the updated patch please?
On Fri, Apr 01, 2011 at 10:06:51AM +0100, Vipin Kumar wrote:
> That's the problem. Ideally, the ecc should have been programmed into the
> OOB, and then the driver would be able to correct the flipped bits. The
> problem happens only if we try to read erased pages.
(...)
> Yes, I had to accept this :-)
> The flip side is that the hardware itself should not report errors when it
> reads all-ff data and an all-ff ecc. It should assume it is an erased page
> and not report any errors.

Hello Vipin,

Did you consider this idea: if you have an unused byte available in the oob,
program it to 0x00 when a page is programmed. That way, you just need to check
a single byte when you read a page in order to distinguish erased pages from
programmed pages. And by counting the number of 1s in the byte, you can be
robust to bitflips.

As a special refinement, you could also "clean up" pages detected as erased, in
order to iron out possible bitflips.

I think this method is used by Micron for their internal on-die ecc engine:
they add a parity byte (0x00 or 0x01) to their BCH code, which can be used to:
1) detect failures (using parity) when the max error count is reached
2) distinguish between erased and programmed pages

Best Regards,
Ivan
On Fri, 2011-04-01 at 14:14 +0200, Ivan Djelic wrote:
> Did you consider this idea: if you have an unused byte available in the oob,
> program it to 0x00 when a page is programmed.

I guess this depends on the controller, but probably this could mean a
substantial write overhead, no?

> That way, you just need to check a single byte when you read a page in order
> to distinguish erased pages from programmed pages. And by counting the number
> of 1s in the byte, you can be robust to bitflips.

Could you please explain some more what you mean? So you write the 0x00 byte.
Then, when you read, you count the number of "1" bits in the byte. What exactly
does this count give you, and to which bitflips do you become robust?

> As a special refinement, you could also "clean up" pages detected as erased,
> in order to iron out possible bitflips.

What do you mean by "cleaning up" a page?

Thanks!
On Fri, Apr 01, 2011 at 02:04:41PM +0100, Artem Bityutskiy wrote:
> I guess this depends on the controller, but probably this could mean a
> substantial write overhead, no?

Sorry, my explanation was probably not very clear.

When you program a page, you send data and oob contents to your NAND
controller. The controller may modify the oob contents to add ECC bytes, in a
more or less automatic way, then it will physically program the page. So it's
up to you to decide what to program into the _free_ oob bytes. And you can
decide to systematically put 0x00 in a specific byte, to indicate that the page
has been programmed. Think of it as a marker (a bit like the bad block marker).

Vipin just needs to change the oob layout of his driver a little, removing a
free byte (assuming he has some free bytes left) and clearing this byte in the
oob array before writing a page - no overhead. When the page oob is read back,
he will only need to check his marker in the oob array. If the marker is 0xff,
the page has not been programmed. If the marker is 0x00, the page has been
programmed.

> Could you please explain some more what you mean? So you write the 0x00 byte.
> Then, when you read, you count the number of "1" bits in the byte. What
> exactly does this count give you, and to which bitflips do you become robust?

I meant that you can be robust to bitflips occurring in the marker byte itself.
If you just compare the marker byte to 0x00 or 0xff, you will not be able to
handle values such as 0x01 or 0x7f (occurring because of bitflips). You can
count the number of 1s in the marker, and decide, for instance:

count < 4  => marker is 0x00
count >= 4 => marker is 0xff

That way, the marker is reliable up to 3 bitflips (which is very improbable in
the same byte).

> What do you mean by "cleaning up" a page?

Let us assume Vipin implements the marker idea. He reads a page, and the marker
test tells him the page has never been programmed, so he will avoid performing
ECC correction on that page. But that page could nevertheless contain bitflips
(i.e. not all bytes equal to 0xFF). He could memset the page data with 0xFFs to
blindly remove possible bitflips, as if the ECC engine had fixed them.

Best Regards,
Ivan
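[Editor's note: Ivan's threshold rule for the marker byte is a simple popcount.
A sketch, with a made-up function name, of how a driver might classify the
marker:]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The marker byte is written as 0x00 when a page is programmed and
 * reads back as 0xFF on an erased page. Counting the 1 bits tolerates
 * up to 3 bitflips in the marker itself:
 *   count < 4  => marker is "really" 0x00 (page programmed)
 *   count >= 4 => marker is "really" 0xFF (page erased)
 */
static bool marker_says_programmed(uint8_t marker)
{
	int ones = 0;

	for (int i = 0; i < 8; i++)
		ones += (marker >> i) & 1;
	return ones < 4;
}
```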
On Fri, 2011-04-01 at 16:04 +0200, Ivan Djelic wrote: > On Fri, Apr 01, 2011 at 02:04:41PM +0100, Artem Bityutskiy wrote: > > On Fri, 2011-04-01 at 14:14 +0200, Ivan Djelic wrote: > > > Did you consider this idea: if you have an unused byte available in oob, > > > program it to 0x00 when a page is programmed. > > > > I guess this depends on the the controller, but probably this could mean > > a substantial write overhead, not? > > Sorry, my explanation was probably not very clear. > > When you program a page, you send data and oob contents to your NAND controller. > The controller may modify oob contents to add ECC bytes, in a more or less > automatic way, then it will physically program the page. Right, I just assumed that some controllers allow you to send only data, and programming OOB would be a separate operation. But if this is not the case in ST's HW - then this sounds like a very good approach! > > > That way, you just need to check a single byte when you read a page in order > > > to distinguish erased pages from programmed pages. And by counting the number > > > of 1s in the byte, you can be robust to bitflips. > > > > Could you please explain some more what do you mean? So you write the > > 0x00 byte. Then when you read, you count the number of "1" bits in the > > byte. And what exactly this count gives you and to which bitflips you > > become robust? > > I meant that you can be robust to bitflips occurring in the marker byte itself. > If you just compare the marker byte to 0x00 or 0xff, you will not be able to > handle values such as 0x01 or 0x7f (occuring because of bitflips). > You can count the number of 1s in the marker, and decide for instance: > count < 4 => marker is 0x00 > count >= 4 => marker is 0xff > That way, the marker is reliable up to 3 bitflips (which is very improbable in > the same byte). I see. I'm curious, if the OOB area is so unreliable, how can we trust it to store our ECC? We need a mini-ECC for the ECC then? 
> Let us assume Vipin implements the marker idea. He reads a page, and the marker
> test tells him the page has never been programmed. So he will avoid performing
> ECC correction on that page. But that page could nevertheless contain bitflips
> (i.e. not all bytes equal to 0xFF). He could memset page data with 0xFFs to
> blindly remove possible bitflips, as if the ECC engine had fixed them.

Ah, I see. Although hiding bit-flips from upper layers might not be the
best idea sometimes.
On Fri, Apr 01, 2011 at 03:16:01PM +0100, Artem Bityutskiy wrote:
> > That way, the marker is reliable up to 3 bitflips (which is very improbable in
> > the same byte).
>
> I see.
>
> I'm curious, if the OOB area is so unreliable, how can we trust it to
> store our ECC? We need a mini-ECC for the ECC then?

The oob area is no different from the data area (same technology).
Fortunately, Hamming and BCH codes (and usual ECC codes in general) actually
have built-in robustness to corruption.

But your point is still valid: when people want to use oob for storing other
things than ECC bytes, say filesystem metadata, they do have a problem; and
they need to make sure that either:
- ECC covers data AND oob metadata (easy to do with BCH, difficult with Hamming)
or
- extra ECC bytes are added to protect oob metadata

YAFFS implements the latter solution.

> > Let us assume Vipin implements the marker idea. He reads a page, and the marker
> > test tells him the page has never been programmed. So he will avoid performing
> > ECC correction on that page. But that page could nevertheless contains bitflips
> > (i.e. not all bytes equal to 0xFF). He could memset page data with 0xFFs to
> > blindly remove possible bitflips, as if the ECC engine had fixed them.
>
> Ah, I see. Although hiding bit-flips from upper layers might be not the
> best idea sometimes.

I agree, this "blind" cleanup is not such a good idea after all :)

BR,
--
Ivan
On Fri, 1 Apr 2011, Ivan Djelic wrote:

>> I'm curious, if the OOB area is so unreliable, how can we trust it to
>> store our ECC? We need a mini-ECC for the ECC then?
>
> The oob area is no different from the data area (same technology).
> Fortunately, Hamming and BCH codes (and usual ECC codes in general) actually
> have built-in robustness to corruption.

At least with Hamming, when the ECC has been calculated, if it differs
from the ECC read from the oob, and the difference is just a single bit,
it is assumed that the ECC stored in the oob was subjected to a bitflip,
and the data is in fact ok. (A long time ago there was a bug in this in
mtd, so that a (single) bit flip in the ECC bytes was flagged as an ECC
failure).

I don't know how BCH (and other multiple-error-bit algorithms) deals with
this though. Does it assume that there will be at most one bitflip within
the ECC data (which is much smaller than the bytes covered by the ECC), or
does it accept multiple bit errors also in the ECC data? One would assume
that it would start to get difficult to distinguish multiple bit errors in
the ECC data with valid ECC indicating bit errors in the data, but perhaps
this isn't so thanks to the algorithm used?

/Ricard
On Fri, Apr 01, 2011 at 03:58:47PM +0100, Ricard Wanderlof wrote:
>
> On Fri, 1 Apr 2011, Ivan Djelic wrote:
>
> >> I'm curious, if the OOB area is so unreliable, how can we trust it to
> >> store our ECC? We need a mini-ECC for the ECC then?
> >
> > The oob area is no different from the data area (same technology).
> > Fortunately, Hamming and BCH codes (and usual ECC codes in general) actually
> > have built-in robustness to corruption.
>
> At least with Hamming, when the ECC has been calculated, if it differs
> from the ECC read from the oob, and the difference is just a single bit,
> it is assumed that the ECC stored in the oob was subjected to a bitflip,
> and the data is in fact ok. (A long time ago there was a bug in this in
> mtd, so that a (single) bit flip in the ECC bytes was flagged as an ECC
> failure).

Exactly, because a single bitflip in data results in a 12-bit flip in a
Hamming 24-bit code. Therefore you can easily distinguish a single bitflip
in data (12 bits change) from a single bitflip in the ECC code (1 bit
change).

> I don't know how BCH (and other multiple-error-bit algorithms) deals with
> this though. Does it assume that there will be at most one bitflip within
> the ECC data (which is much smaller than the bytes covered by the ECC), or
> does it accept multiple bit errors also in the ECC data? One would assume
> that it would start to get difficult to distinguish multiple bit errors in
> the ECC data with valid ECC indicating bit errors in the data, but perhaps
> this isn't so thanks to the algorithm used?

With BCH and Reed-Solomon codes, data bytes and ecc bytes form a single
codeword. This whole codeword can be corrected, not just the data part.

Think of codewords as points in a space of binary words. In this space of
binary words, distance is measured as the number of differing bits between
two words. A codeword has a remarkable property: it is distant from other
codewords by at least 2t+1 bits.
You program the codeword to flash, and later read it back as a (potentially)
corrupted binary word. If you assume that this binary word is distant by at
most t bits from the original codeword (i.e. no more than t bitflips occurred
in flash), then you can retrieve the original codeword (it is the closest to
your binary word).

The whole point of BCH (or RS) codes is to build such a codeword from the
input data, by adding the right ECC bits.

BR,
Ivan
On Fri, Apr 01, 2011 at 03:16:01PM +0100, Artem Bityutskiy wrote:
> On Fri, 2011-04-01 at 16:04 +0200, Ivan Djelic wrote:
> > On Fri, Apr 01, 2011 at 02:04:41PM +0100, Artem Bityutskiy wrote:
> > > On Fri, 2011-04-01 at 14:14 +0200, Ivan Djelic wrote:
> > > > Did you consider this idea: if you have an unused byte available in oob,
> > > > program it to 0x00 when a page is programmed.
> > >
> > > I guess this depends on the the controller, but probably this could mean
> > > a substantial write overhead, not?
> >
> > Sorry, my explanation was probably not very clear.
> >
> > When you program a page, you send data and oob contents to your NAND controller.
> > The controller may modify oob contents to add ECC bytes, in a more or less
> > automatic way, then it will physically program the page.
>
> Right, I just assumed that some controllers allow you to send only data,
> and programming OOB would be a separate operation. But if this is not
> the case in ST's HW - then this sounds like a very good approach!

I had a quick look at fsmc_nand.c, and I don't see anything in the controller
preventing this approach. The driver provides an IO_ADDR_W address for sending
data to the NAND device, and lets mtd upper layers do the job. By implementing
the page writing function in the driver, one could clear the marker in the oob
array before programming (and check it in the page reading function).

I would be very surprised if the controller did not allow control over oob
contents other than ecc bytes. By doing so, it would prevent things like
software bad block marking. But sometimes hardware can be very surprising :)

BR,
Ivan
On Fri, 2011-04-01 at 18:09 +0200, ext Ivan Djelic wrote:
> I had a quick look at fsmc_nand.c, and I don't see anything in the controller
> preventing this approach. The driver provides an IO_ADDR_W address for sending
> data to the NAND device, and lets mtd upper layers do the job. By implementing
> the page writing function in the driver, one could clear the marker in the oob
> array before programming (and check it in the page reading function).

OK, cool!

> I would be very surprised if the controller did not allow control over oob
> contents other than ecc bytes. By doing so, it would prevent things like
> software bad block marking. But sometimes hardware can be very surprising :)

Well, I actually meant that programming the OOB could be a separate operation
(not impossible): instead of "write data, oob data and ECC in one go",
something like "write data and oob, wait for completion, write additional oob
byte". So I meant that this could have a write speed impact. But this does not
matter, since this is not the case now.
diff --git a/drivers/mtd/nand/fsmc_nand.c b/drivers/mtd/nand/fsmc_nand.c
index 205b10b..7df270a 100644
--- a/drivers/mtd/nand/fsmc_nand.c
+++ b/drivers/mtd/nand/fsmc_nand.c
@@ -420,7 +420,7 @@ static int fsmc_read_page_hwecc(struct mtd_info *mtd, struct nand_chip *chip,
 	struct fsmc_nand_data *host = container_of(mtd,
 					struct fsmc_nand_data, mtd);
 	struct fsmc_eccplace *ecc_place = host->ecc_place;
-	int i, j, s, stat, eccsize = chip->ecc.size;
+	int i, j, k, s, stat, eccsize = chip->ecc.size;
 	int eccbytes = chip->ecc.bytes;
 	int eccsteps = chip->ecc.steps;
 	uint8_t *p = buf;
@@ -460,11 +460,27 @@ static int fsmc_read_page_hwecc(struct mtd_info *mtd, struct nand_chip *chip,
 		memcpy(&ecc_code[i], oob, 13);
 		chip->ecc.calculate(mtd, p, &ecc_calc[i]);
 
-		stat = chip->ecc.correct(mtd, p, &ecc_code[i], &ecc_calc[i]);
-		if (stat < 0)
-			mtd->ecc_stats.failed++;
-		else
-			mtd->ecc_stats.corrected += stat;
+		/*
+		 * This is a temporary erase check. A newly erased page read
+		 * would result in an ecc error because the oob data is also
+		 * erased to FF and the calculated ecc for an FF data is not
+		 * FF..FF.
+		 * This is a workaround to skip performing correction in case
+		 * data is FF..FF
+		 */
+		for (k = 0; k < eccsize; k++) {
+			if (*(p + k) != 0xff)
+				break;
+		}
+
+		if (k < eccsize) {
+			stat = chip->ecc.correct(mtd, p, &ecc_code[i],
+					&ecc_calc[i]);
+			if (stat < 0)
+				mtd->ecc_stats.failed++;
+			else
+				mtd->ecc_stats.corrected += stat;
+		}
 	}
 
 	return 0;