diff mbox

[RFC] UBIFS: remove check for all(0xff) for empty pages

Message ID 1394518992-26969-1-git-send-email-pekon@ti.com
State Rejected
Headers show

Commit Message

pekon gupta March 11, 2014, 6:23 a.m. UTC
UBIFS throws following error if an blank page does not contain perfect all(0xff)
data, during LEB scan, even though if number of bit-flips is within ecc.strength
	"UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 417:53168"

But newer technology devices (like MLC NAND) are prone to read-disturb bit-flips
and so having perfect all(0xff) data on empty-pages is practically not feasible
on these newer technology devices (like MLC NAND).
Also, such device spec suggest using stronger ECC schemes (like BCH16), and have
large OOB size to guarantee longer field life.

So this check of perfect all(0xff) data can be safely removed from UBIFS, as
UBI and MTD layers already these before passing the data to UBIFS:
 if (number-of-bitflips < mtd->bitflip_threshold)
        Driver's ECC scheme is strong enough to handle existing bit-flips in
        future, so both MTD and UBI ignore such bit-flips.
 if (number-of-bitflips >= mtd->bitflip_threshold)
        MTD flags -EUCLEAN.
        UBI schedules the block for scrubbing.
 if (number-of-bitflips > nand_chip->ecc.strength)
        MTD flags -EBADMSG.
        UBI tries to extract data relevant data from block based on sanity of
        its headers, and then schedule it for re-erase.

Signed-off-by: Pekon Gupta <pekon@ti.com>
---
 fs/ubifs/scan.c | 11 -----------
 1 file changed, 11 deletions(-)

Comments

Artem Bityutskiy March 11, 2014, 7:42 a.m. UTC | #1
On Tue, 2014-03-11 at 11:53 +0530, Pekon Gupta wrote:
> UBIFS throws following error if an blank page does not contain perfect all(0xff)
> data, during LEB scan, even though if number of bit-flips is within ecc.strength
> 	"UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 417:53168"
> 
> But newer technology devices (like MLC NAND) are prone to read-disturb bit-flips
> and so having perfect all(0xff) data on empty-pages is practically not feasible
> on these newer technology devices (like MLC NAND).
> Also, such device spec suggest using stronger ECC schemes (like BCH16), and have
> large OOB size to guarantee longer field life.
> 
> So this check of perfect all(0xff) data can be safely removed from UBIFS, as
> UBI and MTD layers already these before passing the data to UBIFS:
>  if (number-of-bitflips < mtd->bitflip_threshold)
>         Driver's ECC scheme is strong enough to handle existing bit-flips in
>         future, so both MTD and UBI ignore such bit-flips.
>  if (number-of-bitflips >= mtd->bitflip_threshold)
>         MTD flags -EUCLEAN.
>         UBI schedules the block for scrubbing.
>  if (number-of-bitflips > nand_chip->ecc.strength)
>         MTD flags -EBADMSG.
>         UBI tries to extract data relevant data from block based on sanity of
>         its headers, and then schedule it for re-erase.
> 
> Signed-off-by: Pekon Gupta <pekon@ti.com>
> ---
>  fs/ubifs/scan.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
> index 58aa05d..b6ab396 100644
> --- a/fs/ubifs/scan.c
> +++ b/fs/ubifs/scan.c
> @@ -332,17 +332,6 @@ struct ubifs_scan_leb *ubifs_scan(const struct ubifs_info *c, int lnum,
>  
>  	ubifs_end_scan(c, sleb, lnum, offs);
>  
> -	for (; len > 4; offs += 4, buf = buf + 4, len -= 4)
> -		if (*(uint32_t *)buf != 0xffffffff)
> -			break;
> -	for (; len; offs++, buf++, len--)
> -		if (*(uint8_t *)buf != 0xff) {
> -			if (!quiet)
> -				ubifs_err("corrupt empty space at LEB %d:%d",
> -					  lnum, offs);
> -			goto corrupted;
> -		}

Let me try to explain why this code is present in UBIFS. When we
designed and implemented UBIFS we assumed that empty space cannot have
bit-flips. And NANDs that we had never exposed this kind of issues.

The other assumption that we were kept in mind is that we cannot trust
the media. The flash may contain any garbage, the contents may be
inconsistent, there may be any random corruptions anywhere.

This is why we have all these CRC32 checksums everywhere, and all this
"validate this and that, check the sanity" functions.

We explicitly did not want to try handling random corruptions. Simply
because this is dangerous. E.g., you have one node corrupted, and it
contains important data, and UBIFS just kills it - you have no chance to
recover your important data any more.

We thought that random corruptions handling belongs to user-space.
User-space tools can ask questions, have many options, preserve data in
a separate file, etc.

However, there is one type of corruptions we _did_ want to gracefully
handle - corruptions caused by power cuts.

So in the kernel space we tried to be very very careful in
distinguishing between power cut corruptions and random corruptions.

Now, power cut corruptions have the following properties:

1. Can only happen in journal LEBs.
2. Happen only in one single write unit (max_write_size), because UBIFS
always writes one write unit (or less) at a time.
3. The next write unit (the one coming after the corrupted write unit)
should be empty, never written to. Just because we write unit after unit
sequentially.

And UBIFS simply checks if these conditions hold. If they are, UBIFS
assumes this is a power cut corruption, and simply wipes all the
corrupted nodes out. The corrupted nodes will all sit in the corrupted
write unit, and/or in nodes that _end_ in the corrupted write unit.

And what UBIFS recovery procedure does - it simply moves all the good
nodes from the original PEB to a different PEB, re-maps the LEB to the
new PEB, and erases the older PEB. And it clears up all the corrupted
garbage this way, so we end up with a partially-filled LEB. And we even
keep adding more data there afterwards.

Now what you do, you just remove the check for condition #3. Not good, I
think...

Removing these check means that if there is corruption in the middle of
a journal LEB, and this corruption is _not_ caused by a power cut, UBIFS
will kill all the nodes in the corrupted write unit and all the good
nodes after the corrupted write unit (I did not verify if this is true,
relying on my memory here).

Instead, what is needed is a way to distinguish between "empty" space
and "written-to space".

For some HW the driver could tell by looking at the OOB.

Or we could simply verify against 0xFFs but allow for a number of 0
bits, which depends on the HW ECC strength.

Or may be try the former, and if it is not supported by the driver,
fall-back to the latter.

But I do not think that just removing the checks is a good idea. They
really help to keep us sane. And they helped already to expose the
shortcomings of the design.

HTH.
pekon gupta March 11, 2014, 11:41 a.m. UTC | #2
Hi Artem,

Thanks for this detailed explanation. I have few more queries, after
going through the presentation [1]

>From: Artem Bityutskiy [mailto:dedekind1@gmail.com]
>>On Tue, 2014-03-11 at 11:53 +0530, Pekon Gupta wrote:
>> UBIFS throws following error if an blank page does not contain perfect all(0xff)
>> data, during LEB scan, even though if number of bit-flips is within ecc.strength
>> 	"UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 417:53168"
>>
>> But newer technology devices (like MLC NAND) are prone to read-disturb bit-flips
>> and so having perfect all(0xff) data on empty-pages is practically not feasible
>> on these newer technology devices (like MLC NAND).
>> Also, such device spec suggest using stronger ECC schemes (like BCH16), and have
>> large OOB size to guarantee longer field life.
>>
>> So this check of perfect all(0xff) data can be safely removed from UBIFS, as
>> UBI and MTD layers already these before passing the data to UBIFS:
>>  if (number-of-bitflips < mtd->bitflip_threshold)
>>         Driver's ECC scheme is strong enough to handle existing bit-flips in
>>         future, so both MTD and UBI ignore such bit-flips.
>>  if (number-of-bitflips >= mtd->bitflip_threshold)
>>         MTD flags -EUCLEAN.
>>         UBI schedules the block for scrubbing.
>>  if (number-of-bitflips > nand_chip->ecc.strength)
>>         MTD flags -EBADMSG.
>>         UBI tries to extract data relevant data from block based on sanity of
>>         its headers, and then schedule it for re-erase.
>>
>> Signed-off-by: Pekon Gupta <pekon@ti.com>
>> ---
[...]
>
>Let me try to explain why this code is present in UBIFS. When we
>designed and implemented UBIFS we assumed that empty space cannot have
>bit-flips. And NANDs that we had never exposed this kind of issues.
>
>The other assumption that we were kept in mind is that we cannot trust
>the media. The flash may contain any garbage, the contents may be
>inconsistent, there may be any random corruptions anywhere.
>
>This is why we have all these CRC32 checksums everywhere, and all this
>"validate this and that, check the sanity" functions.
>
>We explicitly did not want to try handling random corruptions. Simply
>because this is dangerous. E.g., you have one node corrupted, and it
>contains important data, and UBIFS just kills it - you have no chance to
>recover your important data any more.
>
>We thought that random corruptions handling belongs to user-space.
>User-space tools can ask questions, have many options, preserve data in
>a separate file, etc.
>
>However, there is one type of corruptions we _did_ want to gracefully
>handle - corruptions caused by power cuts.
>
>So in the kernel space we tried to be very very careful in
>distinguishing between power cut corruptions and random corruptions.
>
>Now, power cut corruptions have the following properties:
>
>1. Can only happen in journal LEBs.
>2. Happen only in one single write unit (max_write_size), because UBIFS
>always writes one write unit (or less) at a time.
>3. The next write unit (the one coming after the corrupted write unit)
>should be empty, never written to. Just because we write unit after unit
>sequentially.
>
>And UBIFS simply checks if these conditions hold. If they are, UBIFS
>assumes this is a power cut corruption, and simply wipes all the
>corrupted nodes out. The corrupted nodes will all sit in the corrupted
>write unit, and/or in nodes that _end_ in the corrupted write unit.
>
Is there a way to distinguish PEB(s) serving as part of moving 'Journal'
during device scan ?
- If yes, then is the above all(0xff) check limited to only to those LEB which
  are part of 'un-committed' Journal.
- If no, then this all(0xff) check will apply to all LEB.

But, I don't think UBIFS can distinguish between LEBs as part of uncommitted
Journal,  as these LEBs become part of index tree after commit.  Correct ?
In such case, this all(0xff) check will apply to all LEB during mount. Correct ?


>And what UBIFS recovery procedure does - it simply moves all the good
>nodes from the original PEB to a different PEB, re-maps the LEB to the
>new PEB, and erases the older PEB. And it clears up all the corrupted
>garbage this way, so we end up with a partially-filled LEB. And we even
>keep adding more data there afterwards.
>
How is this different from scrubbing, which is done by UBI layer ?


>Now what you do, you just remove the check for condition #3. Not good, I
>think...
>
>Removing these check means that if there is corruption in the middle of
>a journal LEB, and this corruption is _not_ caused by a power cut, UBIFS
>will kill all the nodes in the corrupted write unit and all the good
>nodes after the corrupted write unit (I did not verify if this is true,
>relying on my memory here).
>
I'm not against recovery, but the way how a corrupted node is detected.
Instead of comparing empty-space by all(0xff) itself, UBIFS should
depend on return-codes from UBI like:
- UBI_IO_FF: clean empty page
- UBI_IO_FF_BITFLIP: empty page with correctable bit-flips
- UBI_IO_BITFLIP: programmed page with correctable bit-flips
- UBI_IO_BADMSG: programmed page with un-correctable bit-flips


>Instead, what is needed is a way to distinguish between "empty" space
>and "written-to space".
>
This is a bigger change, because NAND driver works on units of pages,
while UBIFS works on units of LEBs. A return code of -EBADMSG by NAND
driver cannot be converted into a similar -EBADMSG for the whole LEB.
Thus we need to separate out 'bit-flips in empty region' v/s
'bit-flips in programmed-region'.
This points to my earlier proposal on same [2], can you please review
in current context ?


>For some HW the driver could tell by looking at the OOB.
>
>Or we could simply verify against 0xFFs but allow for a number of 0
>bits, which depends on the HW ECC strength.
>
>Or may be try the former, and if it is not supported by the driver,
>fall-back to the latter.
>
>But I do not think that just removing the checks is a good idea. They
>really help to keep us sane. And they helped already to expose the
>shortcomings of the design.
>

[1] http://www.linux-mtd.infradead.org/doc/ubifs.pdf
[2] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051560.html

with regards, pekon
Matthieu CASTET March 11, 2014, 12:21 p.m. UTC | #3
Le Tue, 11 Mar 2014 09:42:46 +0200,
Artem Bityutskiy <dedekind1@gmail.com> a écrit :

> On Tue, 2014-03-11 at 11:53 +0530, Pekon Gupta wrote:
> 
> However, there is one type of corruptions we _did_ want to gracefully
> handle - corruptions caused by power cuts.
> 
> So in the kernel space we tried to be very very careful in
> distinguishing between power cut corruptions and random corruptions.
> 
> Now, power cut corruptions have the following properties:
> 
> 1. Can only happen in journal LEBs.
> 2. Happen only in one single write unit (max_write_size), because UBIFS
> always writes one write unit (or less) at a time.
> 3. The next write unit (the one coming after the corrupted write unit)
> should be empty, never written to. Just because we write unit after unit
> sequentially.
> 
I think this is not true with MLC flash that Pekon want to handle.

There is the paired page problem [1] and the power cut corruption can
happen in more than one page.



[1]
[NEED WORK] There is another aspect of MLC flashes which may need closer attention: the "paired pages" problem (e.g., see this Power Point presentation). Namely, MLC NAND pages are coupled in a sense that if you cut power while writing to a page, you corrupt not only this page, but also one of the previous pages which is paired with the current one. For example, pages 0 and 3, 1 and 4, 2 and 5, 3 and 6 in and so on (in the same eraseblock) may be paired (page distance is 4, but there may be other distances). So if you write data to, say, page 3 and cut the power, you may end up with corrupted data in page 0. UBIFS is not ready to handle this problem at the moment and this needs some work.

UBIFS can handle this problem by avoiding using the rest of free space in LEBs after a sync or commit operation. E.g., if start writing to a new journal LEB, and then have a sync or commit, we should "waste" some amount of free space in this LEB to make sure that the previous paired page does not contain synced data. This way we guarantee that a power cut will not corrupt the synced or committed data. And the "wasted" free space can be re-used after that LEB has been garbage-collected. Similar to all the other LEBs we write to (LPT, log, orphan, etc). This would require some work and would make UBIFS slower, so this should probably be optional. The way to attack this issue is to improve UBIFS power cut emulation and implement "paired-pages" emulation, then use the integck test for testing. After all the issues are fixed, real power-cut tests could be carried out.
Matthieu CASTET March 11, 2014, 12:35 p.m. UTC | #4
Le Tue, 11 Mar 2014 11:53:12 +0530,
Pekon Gupta <pekon@ti.com> a écrit :

> UBIFS throws following error if an blank page does not contain perfect all(0xff)
> data, during LEB scan, even though if number of bit-flips is within ecc.strength
> 	"UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 417:53168"
> 
> But newer technology devices (like MLC NAND) are prone to read-disturb bit-flips
> and so having perfect all(0xff) data on empty-pages is practically not feasible
> on these newer technology devices (like MLC NAND).
> Also, such device spec suggest using stronger ECC schemes (like BCH16), and have
> large OOB size to guarantee longer field life.
> 
> So this check of perfect all(0xff) data can be safely removed from UBIFS, as
> UBI and MTD layers already these before passing the data to UBIFS:
>  if (number-of-bitflips < mtd->bitflip_threshold)
>         Driver's ECC scheme is strong enough to handle existing bit-flips in
>         future, so both MTD and UBI ignore such bit-flips.
>  if (number-of-bitflips >= mtd->bitflip_threshold)
>         MTD flags -EUCLEAN.
>         UBI schedules the block for scrubbing.
>  if (number-of-bitflips > nand_chip->ecc.strength)
>         MTD flags -EBADMSG.
>         UBI tries to extract data relevant data from block based on sanity of
>         its headers, and then schedule it for re-erase.

If the mtd layer check the number of bitflip on empty page why it
doesn't correct it and UBIFS see all 0xff ?

In the other hand if mtd layer is not able to check empty page this is
very dangerous. What happen if an empty page already have too much
bitflip ?
The data we write on this page will be corrupted.

In my opinion the change should be done on mtd/UBI layer not ubifs. 
The main problem today is that some driver don't read empty page with
ecc (because their ecc of empty page is not 0xff).
We can add metadata (in spare because ubifs want power of 2
write unit !) that can help to see it the page is empty or not.


Matthieu
Kent Li March 11, 2014, 12:51 p.m. UTC | #5
I agree with Matthieu, the empty check should be done in UBI, MTD or Flash driver layer.
When reading an erased page, the ECC data fail to recover correct the data area, because the OOB data are also all 0xff.
All flash controller should be able to detect this situation, 
but unfortunately, almost all flash driver ignore this, and pass the data to upper layer directly.

Kent Li

-----Original Message-----
From: Matthieu CASTET [mailto:matthieu.castet@parrot.com] 

Sent: 2014年3月11日 20:36
To: Pekon Gupta
Cc: Brian Norris; Artem Bityutskiy; Kent Li; HOUR Frderic; linux-mtd; Ezequiel Garcia; Stefan Roese
Subject: Re: [RFC PATCH] UBIFS: remove check for all(0xff) for empty pages

Le Tue, 11 Mar 2014 11:53:12 +0530,
Pekon Gupta <pekon@ti.com> a écrit :

> UBIFS throws following error if an blank page does not contain perfect 

> all(0xff) data, during LEB scan, even though if number of bit-flips is within ecc.strength

> 	"UBIFS error (pid 1): ubifs_scan: corrupt empty space at LEB 417:53168"

> 

> But newer technology devices (like MLC NAND) are prone to read-disturb 

> bit-flips and so having perfect all(0xff) data on empty-pages is 

> practically not feasible on these newer technology devices (like MLC NAND).

> Also, such device spec suggest using stronger ECC schemes (like 

> BCH16), and have large OOB size to guarantee longer field life.

> 

> So this check of perfect all(0xff) data can be safely removed from 

> UBIFS, as UBI and MTD layers already these before passing the data to UBIFS:

>  if (number-of-bitflips < mtd->bitflip_threshold)

>         Driver's ECC scheme is strong enough to handle existing bit-flips in

>         future, so both MTD and UBI ignore such bit-flips.

>  if (number-of-bitflips >= mtd->bitflip_threshold)

>         MTD flags -EUCLEAN.

>         UBI schedules the block for scrubbing.

>  if (number-of-bitflips > nand_chip->ecc.strength)

>         MTD flags -EBADMSG.

>         UBI tries to extract data relevant data from block based on sanity of

>         its headers, and then schedule it for re-erase.


If the mtd layer check the number of bitflip on empty page why it doesn't correct it and UBIFS see all 0xff ?

In the other hand if mtd layer is not able to check empty page this is very dangerous. What happen if an empty page already have too much bitflip ?
The data we write on this page will be corrupted.

In my opinion the change should be done on mtd/UBI layer not ubifs. 
The main problem today is that some driver don't read empty page with ecc (because their ecc of empty page is not 0xff).
We can add metadata (in spare because ubifs want power of 2 write unit !) that can help to see it the page is empty or not.


Matthieu
Artem Bityutskiy March 11, 2014, 1:06 p.m. UTC | #6
On Tue, 2014-03-11 at 13:21 +0100, Matthieu CASTET wrote:
> I think this is not true with MLC flash that Pekon want to handle.

Correct. But this is a different issue. Let's not confuse ourselves and
not mix them. Right now we are discussing bit-flips in the empty space.
We are not discussing MLC issues.

> There is the paired page problem [1] and the power cut corruption can
> happen in more than one page.

That's right. But this is another issue: UBIFS was not designed for MLC.
This can also be fixed, but this is a completely separate activity.
pekon gupta March 21, 2014, 7:57 a.m. UTC | #7
>From: Matthieu CASTET [mailto:matthieu.castet@parrot.com]
>>Pekon Gupta <pekon@ti.com> a écrit :
>
[...]
>
>If the mtd layer check the number of bitflip on empty page why it
>doesn't correct it and UBIFS see all 0xff ?
>
Most drivers already do that. But my question is bit different.
UBI returns UBI_IO_FF_BITFLIP, when it finds bit-flips in erased-pages
Then why can't UBIFS reuse that information, instead of repeating the
check by itself (comparing each byte with 0xff)


>In the other hand if mtd layer is not able to check empty page this is
>very dangerous. What happen if an empty page already have too much
>bitflip ?
>The data we write on this page will be corrupted.
>
It already does so, nand_base.c returns with -EBADMSG when there are
bitflip_count > ecc.strength.


>In my opinion the change should be done on mtd/UBI layer not ubifs.
>The main problem today is that some driver don't read empty page with
>ecc (because their ecc of empty page is not 0xff).
>We can add metadata (in spare because ubifs want power of 2	
>write unit !) that can help to see it the page is empty or not.
>
We cannot add any extra marker to OOB/spare area because:
(1) That breaks compatibility with ECC layout expected by ROM code.
  Thus, many legacy devices might fail to boot from NAND.

(2) extra marker in ECC layout is itself subjected to bit-flips, thereby
  making an erased-page getting recognized as programmed-page,
  so this is not a fool-proof solution.


Artem,
Can you please review [1], and if there is a way in which we can avoid
UBIFS giving "corrupt white-space" error during scan, then 
I'll try to get patch for that.

[1] http://lists.infradead.org/pipermail/linux-mtd/2014-March/052513.html


with regards, pekon
diff mbox

Patch

diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
index 58aa05d..b6ab396 100644
--- a/fs/ubifs/scan.c
+++ b/fs/ubifs/scan.c
@@ -332,17 +332,6 @@  struct ubifs_scan_leb *ubifs_scan(const struct ubifs_info *c, int lnum,
 
 	ubifs_end_scan(c, sleb, lnum, offs);
 
-	for (; len > 4; offs += 4, buf = buf + 4, len -= 4)
-		if (*(uint32_t *)buf != 0xffffffff)
-			break;
-	for (; len; offs++, buf++, len--)
-		if (*(uint8_t *)buf != 0xff) {
-			if (!quiet)
-				ubifs_err("corrupt empty space at LEB %d:%d",
-					  lnum, offs);
-			goto corrupted;
-		}
-
 	return sleb;
 
 corrupted: