diff mbox

UBIFS Corrupt during power failure

Message ID 1239378582.3390.66.camel@localhost.localdomain
State New, archived
Headers show

Commit Message

Artem Bityutskiy April 10, 2009, 3:49 p.m. UTC
On Fri, 2009-04-10 at 18:17 +0300, Artem Bityutskiy wrote:
> Hi,
> 
> On Fri, 2009-04-10 at 08:27 -0600, Eric Holmberg wrote:
> > Test setup:
> >  * Using U-Boot 1.3.0
> >  * Write buffering enabled
> >  * S29GL256F 256Mbit NOR flash w/ 32-word write buffer
> >  * Test software that performs read/erase/write operations
> >  * JTAG debugger that randomly resets the board
> > 
> > Reset during write (unexpected test pattern written after un-programmed
> > values):
> > 
> > 30352240  aa55aa0a aa55aa0a aa55aa0a aa55aa0a
> > 30352250  aa55aa0a aa55aa0a aa55aa0a aa55aa0a
> > 30352260  aa55aa0a aa55aa0a aa55aa0a aa55aa0a
> > 30352270  aa55aa0a aa55aa0a aa55aa0a aa55aa0a
> > 30352280  ffffffff ffffffff ffffffff ffffffff
> > 30352290  ffffffff ffffffff ffffffff ffffffff
> > 303522a0  ffffffff ffffffff ffffffff ffffffff
> > 303522b0  aa55aa0a aa55aa0a aa55aa0a aa55aa0a
> > 303522c0  ffffffff ffffffff ffffffff ffffffff
> > 303522d0  ffffffff ffffffff ffffffff ffffffff
> > 303522e0  ffffffff ffffffff ffffffff ffffffff
> 
> Yeah, I think the recovery assumes that if you cut power during
> writing than:
> 
> 1. The min. I/O unit which has been written to at the moment power
>    cut happened will contain garbage.
> 2. But the next min. I/O unit will contain 0xFFs.
> 
> We have been working only with NAND flash, and min. I/O unit
> for NAND is one NAND page (usually 2KiB). We have never worked
> with NOR flash. We only tested UBIFS several times on the mtdram
> NOR flash emulator.
> 
> In case of NOR, UBIFS assumes min. I/O unit size is 8 bytes. Well,
> it is actually 1 byte, but because UBIFS aligns all its on-flash
> data structures to 8-byte boundaries, we used 8 for NOR, because
> it was easier implementation-wise.
> 
> Thus, UBIFS will panic when it meets the above pattern. And UBIFS
> would need some changes to make it understand this type of
> corruptions. All the recovery logic is in recovery.c. It should
> not be very difficult to change this.
> 
> You may ask - if while scanning you meet a corrupted node - why do
> you keep checking the rest of the node, and want to see 0xFFs there?
> 
> The reason why we do this check is that if we meet a corrupted node,
> we want to figure out the nature of the corruption - is this a
> non-finished write or a physical corruption, e.g. due to radiation,
> worn-out flash, etc. UBIFS writes eraseblocks from the beginning,
> to the end - always. So if the corrupted node is the last, this
> is harmless corruption because of power-cut, and we recover. But
> if the corruption is in a middle, this is something serious and
> we panic.
> 
> So in your case, UBIFS decides that it met a corrupted node in
> the middle, and panics.

So you need to play with ubifs_recover_leb() function.

There is the following code:

        if (!empty_chkd && !is_empty(buf, len)) {
                if (is_last_write(c, buf, offs)) {
                        clean_buf(c, &buf, lnum, &offs, &len);
                        need_clean = 1;
                } else {
                        ubifs_err("corrupt empty space at LEB %d:%d",
                                  lnum, offs);
                        goto corrupted;
                }
        }

So in your case "is_last_write()" returns zero, and UBIFS prints
cryptic "corrupt empty space" and panics.

I would try to hack the code and remove that panic part, and see
what happens. UBIFS should probably successfully recover the LEB.
This is done in 'fix_unclean_leb()'. What this function will do
it will:

1. Read all _good_ nodes from this LEB (ubi_read())
2. Atomically change the corrupted LEB (ubi_leb_change())

Atomic LEB change is UBI operation, read here about it:
http://www.linux-mtd.infradead.org/doc/ubi.html#L_lebchange

In few words, on the physical flash level it will do:

1. Write the good nodes to a new, erased physical eraseblock
2. Erase the current physical eraseblock.

So, try the suggested hack out (inlined below). See what happens,
may be you discover other problems. After you played with recovery
code and have some success, we may push some nice solution to
UBIFS, e.g.

1. Introduce a mount option which tells UBIFS to assume that power-cuts
   during writing may disturb not only the current min_io_unit, but
   also the next ones.
2. Assume this if the flash type is NOR. May be there is some limit
   we may assume?

Comments

Eric Holmberg April 10, 2009, 5 p.m. UTC | #1
> diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c index 
> 1066297..9afa056 100644
> --- a/fs/ubifs/recovery.c
> +++ b/fs/ubifs/recovery.c
> @@ -675,9 +675,10 @@ struct ubifs_scan_leb 
> *ubifs_recover_leb(struct ubifs_info *c, int lnum,
>  			clean_buf(c, &buf, lnum, &offs, &len);
>  			need_clean = 1;
>  		} else {
> -			ubifs_err("corrupt empty space at LEB %d:%d",
> -				  lnum, offs);
> -			goto corrupted;
> +			ubifs_warn("ignore corrupt empty space 
> at LEB %d:%d",
> +				   lnum, offs);
> +			clean_buf(c, &buf, lnum, &offs, &len);
> +			need_clean = 1;
>  		}
>  	}
>  
> --
> Best regards,
> Artem Bityutskiy (Битюцкий Артём)

Artem,

Thank you very much for your help so far.  I am going to do two things:
 1. Turn off write buffering which converts the NOR minimum I/O size from 1 to effectively 32 16-bit words (64 bytes) and re-run all of the tests.
 2. While this is running, I'm going to start following the modifications and debugging path that you outlined.

I'll report back with findings and potential modifications -- as always, feel free to ping me if you don't hear anything!

Thanks again,

Eric
 

> -----Original Message-----
> From: Artem Bityutskiy [mailto:dedekind@infradead.org] 
> Sent: Friday, April 10, 2009 9:50 AM
> To: Eric Holmberg
> Cc: Urs Muff; linux-mtd@lists.infradead.org; Adrian Hunter
> Subject: RE: UBIFS Corrupt during power failure
> 
> On Fri, 2009-04-10 at 18:17 +0300, Artem Bityutskiy wrote:
> > Hi,
> > 
> > On Fri, 2009-04-10 at 08:27 -0600, Eric Holmberg wrote:
> > > Test setup:
> > >  * Using U-Boot 1.3.0
> > >  * Write buffering enabled
> > >  * S29GL256F 256Mbit NOR flash w/ 32-word write buffer
> > >  * Test software that performs read/erase/write operations
> > >  * JTAG debugger that randomly resets the board
> > > 
> > > Reset during write (unexpected test pattern written after 
> > > un-programmed
> > > values):
> > > 
> > > 30352240  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 30352250  aa55aa0a 
> > > aa55aa0a aa55aa0a aa55aa0a 30352260  aa55aa0a aa55aa0a aa55aa0a 
> > > aa55aa0a 30352270  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 30352280  
> > > ffffffff ffffffff ffffffff ffffffff 30352290  ffffffff ffffffff 
> > > ffffffff ffffffff 303522a0  ffffffff ffffffff ffffffff ffffffff 
> > > 303522b0  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 303522c0  ffffffff 
> > > ffffffff ffffffff ffffffff 303522d0  ffffffff ffffffff ffffffff 
> > > ffffffff 303522e0  ffffffff ffffffff ffffffff ffffffff
> > 
> > Yeah, I think the recovery assumes that if you cut power during 
> > writing than:
> > 
> > 1. The min. I/O unit which has been written to at the moment power
> >    cut happened will contain garbage.
> > 2. But the next min. I/O unit will contain 0xFFs.
> > 
> > We have been working only with NAND flash, and min. I/O 
> unit for NAND 
> > is one NAND page (usually 2KiB). We have never worked with 
> NOR flash. 
> > We only tested UBIFS several times on the mtdram NOR flash emulator.
> > 
> > In case of NOR, UBIFS assumes min. I/O unit size is 8 
> bytes. Well, it 
> > is actually 1 byte, but because UBIFS aligns all its on-flash data 
> > structures to 8-byte boundaries, we used 8 for NOR, because it was 
> > easier implementation-wise.
> > 
> > Thus, UBIFS will panic when it meets the above pattern. And UBIFS 
> > would need some changes to make it understand this type of 
> > corruptions. All the recovery logic is in recovery.c. It 
> should not be 
> > very difficult to change this.
> > 
> > You may ask - if while scanning you meet a corrupted node - 
> why do you 
> > keep checking the rest of the node, and want to see 0xFFs there?
> > 
> > The reason why we do this check is that if we meet a 
> corrupted node, 
> > we want to figure out the nature of the corruption - is this a 
> > non-finished write or a physical corruption, e.g. due to radiation, 
> > worn-out flash, etc. UBIFS writes eraseblocks from the 
> beginning, to 
> > the end - always. So if the corrupted node is the last, this is 
> > harmless corruption because of power-cut, and we recover. 
> But if the 
> > corruption is in a middle, this is something serious and we panic.
> > 
> > So in your case, UBIFS decides that it met a corrupted node in the 
> > middle, and panics.
> 
> So you need to play with ubifs_recover_leb() function.
> 
> There is the following code:
> 
>         if (!empty_chkd && !is_empty(buf, len)) {
>                 if (is_last_write(c, buf, offs)) {
>                         clean_buf(c, &buf, lnum, &offs, &len);
>                         need_clean = 1;
>                 } else {
>                         ubifs_err("corrupt empty space at LEB %d:%d",
>                                   lnum, offs);
>                         goto corrupted;
>                 }
>         }
> 
> So in your case "is_last_write()" returns zero, and UBIFS 
> prints cryptic "corrupt empty space" and panics.
> 
> I would try to hack the code and remove that panic part, and 
> see what happens. UBIFS should probably successfully recover the LEB.
> This is done in 'fix_unclean_leb()'. What this function will 
> do it will:
> 
> 1. Read all _good_ nodes from this LEB (ubi_read()) 2. 
> Atomically change the corrupted LEB (ubi_leb_change())
> 
> Atomic LEB change is UBI operation, read here about it:
> http://www.linux-mtd.infradead.org/doc/ubi.html#L_lebchange
> 
> In few words, on the physical flash level it will do:
> 
> 1. Write the good nodes to a new, erased physical eraseblock 
> 2. Erase the current physical eraseblock.
> 
> So, try the suggested hack out (inlined below). See what 
> happens, may be you discover other problems. After you played 
> with recovery code and have some success, we may push some 
> nice solution to UBIFS, e.g.
> 
> 1. Introduce a mount option which tells UBIFS to assume that 
> power-cuts
>    during writing may disturb not only the current min_io_unit, but
>    also the next ones.
> 2. Assume this if the flash type is NOR. May be there is some limit
>    we may assume?
> 
> diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c index 
> 1066297..9afa056 100644
> --- a/fs/ubifs/recovery.c
> +++ b/fs/ubifs/recovery.c
> @@ -675,9 +675,10 @@ struct ubifs_scan_leb 
> *ubifs_recover_leb(struct ubifs_info *c, int lnum,
>  			clean_buf(c, &buf, lnum, &offs, &len);
>  			need_clean = 1;
>  		} else {
> -			ubifs_err("corrupt empty space at LEB %d:%d",
> -				  lnum, offs);
> -			goto corrupted;
> +			ubifs_warn("ignore corrupt empty space 
> at LEB %d:%d",
> +				   lnum, offs);
> +			clean_buf(c, &buf, lnum, &offs, &len);
> +			need_clean = 1;
>  		}
>  	}
>  
> --
> Best regards,
> Artem Bityutskiy (Битюцкий Артём)
> 
>
Artem Bityutskiy April 10, 2009, 5:11 p.m. UTC | #2
On Fri, 2009-04-10 at 11:00 -0600, Eric Holmberg wrote:
> Thank you very much for your help so far. 
NP, this is all I can do now without having real NOR and
much time :-)

>  I am going to do two things:
>  1. Turn off write buffering which converts the NOR minimum I/O size from 1 to effectively 32 16-bit words (64 bytes) and re-run all of the tests.

Err, which buffering? Is this something at the flash driver
level? 

Note, UBIFS never talks to MTD level, it talks only
to UBI level, which talks to MTD. UBI will provide UBIFS
min. I/O unit size which it reads from mtd->writesize.
Then if it is < 8, UBIFS will use 8 instead. There is not
fundamental reason for this, it was just easier to implement
UBIFS this way. But this is not big deal.

In your case, as I see from your logs, min. I/O unit reported
from MTD to UBI is 1, and from UBI to UBIFS it is 1, but then
UBIFS will use 8 instead. This means all write commands UBIFS
will issue will be aligned to 8 and be multiple of 8.

So if you are talking about turning off some buffering on
MTD level, may be it is ok. But if you are talking about
UBIFS level - no need to do anything there.

>  2. While this is running, I'm going to start following the modifications and debugging path that you outlined.
> 
> I'll report back with findings and potential modifications -- as always, feel free to ping me if you don't hear anything!

OK.
Eric Holmberg April 10, 2009, 6:33 p.m. UTC | #3
> On Fri, 2009-04-10 at 11:00 -0600, Eric Holmberg wrote:
> > Thank you very much for your help so far. 
> NP, this is all I can do now without having real NOR and much time :-)
> 
> >  I am going to do two things:
> >  1. Turn off write buffering which converts the NOR minimum 
> I/O size from 1 to effectively 32 16-bit words (64 bytes) and 
> re-run all of the tests.
> 
> Err, which buffering? Is this something at the flash driver level? 

This is for the CFI flash interface.  The
drivers/mtd/chips/cfi_cmdset_0002.c driver has write buffers which is
uses to do a "block" write to the NOR flash which for my chip allows
writing up to 32 16-bit words.  This is what I used for the tests with
U-Boot and it is why you see patterns such as the one below.  The code I
used to write this pattern to flash wrote the entire block in 1 write.
The CFI driver then broke up the writes into 64-byte writes.
Apparently, it didn't do them in order (or the flash chip didn't), which
is why you have aa55aa0a values after ffffffff values.  Turning off the
write buffering in the CFI driver by setting FORCE_WORD_WRITE to 1
should solve this (although it will now probably be somewhere between
10x and 32x slower).


30352250  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
30352260  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
30352270  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
30352280  ffffffff ffffffff ffffffff ffffffff 
30352290  ffffffff ffffffff ffffffff ffffffff 
303522a0  ffffffff ffffffff ffffffff ffffffff 
303522b0  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
303522c0  ffffffff ffffffff ffffffff ffffffff 
303522d0  ffffffff ffffffff ffffffff ffffffff 

Does that make sense now, or am I on the wrong path?

-Eric

> 
> Note, UBIFS never talks to MTD level, it talks only to UBI 
> level, which talks to MTD. UBI will provide UBIFS min. I/O 
> unit size which it reads from mtd->writesize.
> Then if it is < 8, UBIFS will use 8 instead. There is not 
> fundamental reason for this, it was just easier to implement 
> UBIFS this way. But this is not big deal.
> 
> In your case, as I see from your logs, min. I/O unit reported 
> from MTD to UBI is 1, and from UBI to UBIFS it is 1, but then 
> UBIFS will use 8 instead. This means all write commands UBIFS 
> will issue will be aligned to 8 and be multiple of 8.
> 
> So if you are talking about turning off some buffering on MTD 
> level, may be it is ok. But if you are talking about UBIFS 
> level - no need to do anything there.
>
Artem Bityutskiy April 14, 2009, 6:11 a.m. UTC | #4
On Fri, 2009-04-10 at 12:33 -0600, Eric Holmberg wrote:
> > On Fri, 2009-04-10 at 11:00 -0600, Eric Holmberg wrote:
> > > Thank you very much for your help so far. 
> > NP, this is all I can do now without having real NOR and much time :-)
> > 
> > >  I am going to do two things:
> > >  1. Turn off write buffering which converts the NOR minimum 
> > I/O size from 1 to effectively 32 16-bit words (64 bytes) and 
> > re-run all of the tests.
> > 
> > Err, which buffering? Is this something at the flash driver level? 
> 
> This is for the CFI flash interface.  The
> drivers/mtd/chips/cfi_cmdset_0002.c driver has write buffers which is
> uses to do a "block" write to the NOR flash which for my chip allows
> writing up to 32 16-bit words. 

Oh, this is something from the CFI standard? Then we may just add this
knowledge to UBIFS: if this is NOR, then UBIFS knows that the up to 64
bytes may contain garbage.

>  This is what I used for the tests with
> U-Boot and it is why you see patterns such as the one below.  The code I
> used to write this pattern to flash wrote the entire block in 1 write.
> The CFI driver then broke up the writes into 64-byte writes.
> Apparently, it didn't do them in order (or the flash chip didn't), which
> is why you have aa55aa0a values after ffffffff values.  Turning off the
> write buffering in the CFI driver by setting FORCE_WORD_WRITE to 1
> should solve this (although it will now probably be somewhere between
> 10x and 32x slower).

Yes, it is worth disabling this and test. If it helps, we can add
some CFI/NOR-awareness to UBIFS.

> 30352250  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> 30352260  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> 30352270  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> 30352280  ffffffff ffffffff ffffffff ffffffff 
> 30352290  ffffffff ffffffff ffffffff ffffffff 
> 303522a0  ffffffff ffffffff ffffffff ffffffff 
> 303522b0  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> 303522c0  ffffffff ffffffff ffffffff ffffffff 
> 303522d0  ffffffff ffffffff ffffffff ffffffff 
> 
> Does that make sense now, or am I on the wrong path?

Yes, makes perfect sense for me.
Eric Holmberg April 14, 2009, 3:09 p.m. UTC | #5
> On Fri, 2009-04-10 at 12:33 -0600, Eric Holmberg wrote:
> > > On Fri, 2009-04-10 at 11:00 -0600, Eric Holmberg wrote:
> > > > Thank you very much for your help so far. 
> > > NP, this is all I can do now without having real NOR and 
> much time :-)
> > > 
> > > >  I am going to do two things:
> > > >  1. Turn off write buffering which converts the NOR minimum 
> > > I/O size from 1 to effectively 32 16-bit words (64 bytes) and 
> > > re-run all of the tests.
> > > 
> > > Err, which buffering? Is this something at the flash 
> driver level? 
> > 
> > This is for the CFI flash interface.  The
> > drivers/mtd/chips/cfi_cmdset_0002.c driver has write 
> buffers which is
> > uses to do a "block" write to the NOR flash which for my chip allows
> > writing up to 32 16-bit words. 
> 
> Oh, this is something from the CFI standard? Then we may just add this
> knowledge to UBIFS: if this is NOR, then UBIFS knows that the up to 64
> bytes may contain garbage.

The write-buffer command is part of the CFI standard, but the size of
the buffer is up to the chip manufacturer.  For example, we have two NOR
Flash chips on our board and one has a write buffer size of 1 word (2
bytes) and the other is 32 words (64 bytes).  CFI auto-detects the
maximum write-buffer size and places the value in the structure element
cfi_ident::MaxBufWriteSize (located in mtd/cfi.h).  That could always be
used to determine the size of writes to flash, but maybe a UBI
configuration value that is set manually would be a better option?

> 
> >  This is what I used for the tests with
> > U-Boot and it is why you see patterns such as the one 
> below.  The code I
> > used to write this pattern to flash wrote the entire block 
> in 1 write.
> > The CFI driver then broke up the writes into 64-byte writes.
> > Apparently, it didn't do them in order (or the flash chip 
> didn't), which
> > is why you have aa55aa0a values after ffffffff values.  
> Turning off the
> > write buffering in the CFI driver by setting FORCE_WORD_WRITE to 1
> > should solve this (although it will now probably be 
> somewhere between
> > 10x and 32x slower).
> 
> Yes, it is worth disabling this and test. If it helps, we can add
> some CFI/NOR-awareness to UBIFS.
> 
> > 30352250  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> > 30352260  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> > 30352270  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> > 30352280  ffffffff ffffffff ffffffff ffffffff 
> > 30352290  ffffffff ffffffff ffffffff ffffffff 
> > 303522a0  ffffffff ffffffff ffffffff ffffffff 
> > 303522b0  aa55aa0a aa55aa0a aa55aa0a aa55aa0a 
> > 303522c0  ffffffff ffffffff ffffffff ffffffff 
> > 303522d0  ffffffff ffffffff ffffffff ffffffff 
> > 
> > Does that make sense now, or am I on the wrong path?
> 
> Yes, makes perfect sense for me.


I ran a corruption test on 3 different boards which used an application
that writes to the flash continuously (doing read, write, and rename
operations on a UBIFS root file system) and then a script would randomly
remove power.

Here are the results for NOR flash with a block-size of 64 bytes -- the
data currently points to the block write size of 64 bytes being the
issue since changing it to 1 eliminated the corruption.  I'm going to
run one more test where I force it to 8 bytes (based upon your comment
that UBIFS allows up to 8-bytes to be garbage).  If fails, then there is
a different issue causing the problem.


Test #1 - FORCE_WORD_WRITE = 1
----------------------------------------------
cfi_cmdset_0002.c FORCE_WORD_WRITE is 1 (true) which disables block
writes to the NOR flash.  This fixed the problem as no corruption has
occurred after 96 hours of power cycling (over 6000 power cycles). 

Test #2 - Corrupt Empty Block Recovery
-------------------------------------------------
cfi_cmdset_0002.c FORCE_WORD_WRITE is 0 (false) and added the code that
you graciously provided to correct the corrupt-empty space LEB.  This
worked great for recovery of the corrupt empty space, but then
additional corruption occurred at which point it looks like the
super-block got changed to an orphan node (type 11) - see below.

Corruption occurred after approximately 2 hours of operation
(approximately 130 power cycles).

[42949375.790000] UBIFS error (pid 1): ubifs_read_node: bad node type
(11 but expected 6)
[42949375.800000] UBIFS error (pid 1): ubifs_read_node: bad node at LEB
0:0
[42949375.810000] List of all partitions:
[42949375.810000] 1f00         16 mtdblock0 (driver?)
[42949375.820000] 1f01          8 mtdblock1 (driver?)
[42949375.820000] 1f02          8 mtdblock2 (driver?)
[42949375.830000] 1f03         32 mtdblock3 (driver?)
[42949375.830000] 1f04        960 mtdblock4 (driver?)
[42949375.840000] 1f05       2048 mtdblock5 (driver?)
[42949375.840000] 1f06       2048 mtdblock6 (driver?)
[42949375.850000] 1f07      28672 mtdblock7 (driver?)
[42949375.850000] No filesystem could mount root, tried:  ubifs
[42949375.860000] Kernel panic - not syncing: VFS: Unable to mount root
fs on unknown-block(0,0)

LEB 0
316a0000: 23494255 00000001 00000000 01000000    UBI#............
316a0010: 40000000 80000000 00000000 00000000    ...@............
316a0020: 00000000 00000000 00000000 00000000    ................
316a0030: 00000000 00000000 00000000 889bc4d4    ................
316a0040: 21494255 00000101 00000000 00000000    UBI!............
316a0050: 00000000 00000000 00000000 00000000    ................
316a0060: 00000000 00000000 00000000 28000000    ...............(
316a0070: 00000000 00000000 00000000 cf33cc9a    ..............3.
316a0080: 06101831 6525e297 0000554e 00000000    1.....%eNU......
316a0090: 00000028 0000000b 00000007 80000000    (...............
316a00a0: 00000da6 00000000 06101831 7f52d274    ........1...t.R.
316a00b0: 00005741 00000000 00000028 0000000b    AW......(.......
316a00c0: 00000009 80000000 00000db6 00000000    ................
316a00d0: 06101831 52ecb032 00006def 00000000    1...2..R.m......
316a00e0: 00000028 0000000b 0000001c 80000000    (...............
316a00f0: 00000db3 00000000 ffffffff ffffffff    ................
316a0100: ffffffff ffffffff ffffffff ffffffff    ................
316a0110: ffffffff ffffffff ffffffff ffffffff    ................


Test #3 - Control
-------------------------------------------------
Stock kernel 2.6.27.  Corrupt empty-space failure occurred within 2
hours of running.

[42949375.720000] UBIFS: recovery needed
[42949375.780000] UBIFS error (pid 1): ubifs_scan: corrupt empty space
at LEB 6:14912, expected 0xFF, got 0x0
[42949375.790000] UBIFS error (pid 1): ubifs_scanned_corruption:
corrupted data at LEB 6:14912
[42949375.810000] UBIFS error (pid 1): ubifs_scan: LEB 6 scanning failed
[42949375.850000] UBIFS error (pid 1): ubifs_recover_leb: corrupt empty
space at LEB 6:224
[42949375.860000] UBIFS error (pid 1): ubifs_scanned_corruption:
corrupted data at LEB 6:224
[42949375.890000] UBIFS error (pid 1): ubifs_recover_leb: LEB 6 scanning
failed
[42949375.900000] VFS: Cannot open root device "ubi0:rootfs" or
unknown-block(0,0)
[42949375.900000] Please append a correct "root=" boot option; here are
the available partitions:
[42949375.910000] 1f00         16 mtdblock0 (driver?)
[42949375.920000] 1f01          8 mtdblock1 (driver?)
[42949375.920000] 1f02          8 mtdblock2 (driver?)
[42949375.930000] 1f03         32 mtdblock3 (driver?)
[42949375.930000] 1f04        960 mtdblock4 (driver?)
[42949375.930000] 1f05       2048 mtdblock5 (driver?)
[42949375.940000] 1f06       2048 mtdblock6 (driver?)
[42949375.940000] 1f07      28672 mtdblock7 (driver?)
[42949375.950000] Kernel panic - not syncing: VFS: Unable to mount root
fs on unknown-block(0,0)



> Turning off the
> > write buffering in the CFI driver by setting FORCE_WORD_WRITE to 1
> > should solve this (although it will now probably be 
> somewhere between
> > 10x and 32x slower).
> 
> Yes, it is worth disabling this and test. If it helps, we can add
> some CFI/NOR-awareness to UBIFS.

Next Steps
----------
I'm going to run a test with the write-buffer size set to 8 bytes.  If
that works, then I think the next task is to see how to add the
CFI/NOR-awareness to UBIFS.

Thanks again -- we're making progress!

Eric Holmberg
Artem Bityutskiy April 14, 2009, 3:45 p.m. UTC | #6
On Tue, 2009-04-14 at 09:09 -0600, Eric Holmberg wrote:
> The write-buffer command is part of the CFI standard, but the size of
> the buffer is up to the chip manufacturer.  For example, we have two NOR
> Flash chips on our board and one has a write buffer size of 1 word (2
> bytes) and the other is 32 words (64 bytes).  CFI auto-detects the
> maximum write-buffer size and places the value in the structure element
> cfi_ident::MaxBufWriteSize (located in mtd/cfi.h).  That could always be
> used to determine the size of writes to flash, but maybe a UBI
> configuration value that is set manually would be a better option?

I wonder how do I determine that the flash is flash is CFI flash...
UBI uses struct mtd_info, and uses the ->type field, if it is
MTD_NORFLASH, then it assumes the flash is NOR. But it has not
idea if it is a CFI flash or not...

I think adding an UBI option is not very good because it adds
yet another complication users would have to understand. It is
nicer to hide it from user.

Do you know what is the maximum possible buffer size? If it is 64,
we may just teach UBI assume 64 for all NORs. This should be good
enough.

> I ran a corruption test on 3 different boards which used an application
> that writes to the flash continuously (doing read, write, and rename
> operations on a UBIFS root file system) and then a script would randomly
> remove power.

OK.

> Here are the results for NOR flash with a block-size of 64 bytes -- the
> data currently points to the block write size of 64 bytes being the
> issue since changing it to 1 eliminated the corruption.  I'm going to
> run one more test where I force it to 8 bytes (based upon your comment
> that UBIFS allows up to 8-bytes to be garbage).  If fails, then there is
> a different issue causing the problem.

OK, let's see.

> Test #1 - FORCE_WORD_WRITE = 1
> ----------------------------------------------
> cfi_cmdset_0002.c FORCE_WORD_WRITE is 1 (true) which disables block
> writes to the NOR flash.  This fixed the problem as no corruption has
> occurred after 96 hours of power cycling (over 6000 power cycles). 

Good!

> Test #2 - Corrupt Empty Block Recovery
> -------------------------------------------------
> cfi_cmdset_0002.c FORCE_WORD_WRITE is 0 (false) and added the code that
> you graciously provided to correct the corrupt-empty space LEB.  This
> worked great for recovery of the corrupt empty space, but then
> additional corruption occurred at which point it looks like the
> super-block got changed to an orphan node (type 11) - see below.

Hmm, sounds familiar. I saw an error like this. Did you pull
the latest UBIFS changes from the UBIFS git tree? Please, pull
them from
git://git.infradead.org/~dedekind/ubifs-v2.6.27.git
(you use 2.6.27, AFAIK).

But may be there is a yet another place we need to change.

> Corruption occurred after approximately 2 hours of operation
> (approximately 130 power cycles).
> 
> [42949375.790000] UBIFS error (pid 1): ubifs_read_node: bad node type
> (11 but expected 6)
> [42949375.800000] UBIFS error (pid 1): ubifs_read_node: bad node at LEB
> 0:0

How about enabling debugging (no debugging messages, just debugging).
See
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_how_send_bugreport
item 3.

And please, try to boot with "ignore_loglevel" option in your command
line. This makes sure you see _all_ UBIFS messages on your console,
when it fails. There will be more useful info, e.g., stack dump
and node dump (see fs/ubifs/io.c, 'ubifs_read_node()').

> Test #3 - Control
> -------------------------------------------------
> Stock kernel 2.6.27.  Corrupt empty-space failure occurred within 2
> hours of running.
> 
> [42949375.720000] UBIFS: recovery needed
> [42949375.780000] UBIFS error (pid 1): ubifs_scan: corrupt empty space
> at LEB 6:14912, expected 0xFF, got 0x0
> [42949375.790000] UBIFS error (pid 1): ubifs_scanned_corruption:
> corrupted data at LEB 6:14912
> [42949375.810000] UBIFS error (pid 1): ubifs_scan: LEB 6 scanning failed
> [42949375.850000] UBIFS error (pid 1): ubifs_recover_leb: corrupt empty
> space at LEB 6:224
> [42949375.860000] UBIFS error (pid 1): ubifs_scanned_corruption:
> corrupted data at LEB 6:224
> [42949375.890000] UBIFS error (pid 1): ubifs_recover_leb: LEB 6 scanning
> failed

Similar. Please, pull the latest fixes. And next time attach
_all_ UBIFS messages, which you should get when you have
"ignore_loglevel".

> Next Steps
> ----------
> I'm going to run a test with the write-buffer size set to 8 bytes.  If
> that works, then I think the next task is to see how to add the
> CFI/NOR-awareness to UBIFS.

Yes, makes sense, lets see.
Artem Bityutskiy April 14, 2009, 3:53 p.m. UTC | #7
On Tue, 2009-04-14 at 18:45 +0300, Artem Bityutskiy wrote:
> > Corruption occurred after approximately 2 hours of operation
> > (approximately 130 power cycles).
> > 
> > [42949375.790000] UBIFS error (pid 1): ubifs_read_node: bad node type
> > (11 but expected 6)
> > [42949375.800000] UBIFS error (pid 1): ubifs_read_node: bad node at LEB
> > 0:0
> 
> How about enabling debugging (no debugging messages, just debugging).
> See
> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_how_send_bugreport
> item 3

Note, not "extra checks" as well. Just debugging. It will add some
light-weight assertions plus verbose error reporting which include
dumps, etc.
Jamie Lokier April 14, 2009, 6 p.m. UTC | #8
Artem Bityutskiy wrote:
> > This is for the CFI flash interface.  The
> > drivers/mtd/chips/cfi_cmdset_0002.c driver has write buffers which is
> > uses to do a "block" write to the NOR flash which for my chip allows
> > writing up to 32 16-bit words. 
> 
> Oh, this is something from the CFI standard? Then we may just add this
> knowledge to UBIFS: if this is NOR, then UBIFS knows that the up to 64
> bytes may contain garbage.

I think the block write size is only used when you _submit_ a write of
that many words or more.

So it would be correct for UBIFS to know that writes of N > 64(*)
bytes may be broken into blocks, with corruption distributed
anywhere within each of those blocks, but not more than one block.

But it doesn't mean the minimum safe write size is 64(*), because if
UBIFS writes (say) 16 bytes for a small update, then the corruption
should be limited to just those 16 bytes.

If you wrote a 16-byte item which encodes "and the next item I write
will be 16-byte too", then you know if you see >16 corrupt bytes after
the item that it's not due to power failure.  This even with a 64 byte
buffer on the flash chip, because you know you will have done only a
16 byte write in that position.

I don't know if it's useful for UBIFS and its block scanning
algorithm to consider item sizes in that much detail.

The main thing is you can write smaller items safely, so you don't
have to pad them to 64 bytes and wear out the flash faster.  But
scanning may need to use a 64 byte window to decide the corruption type.

That means MTD and UBI should export two values:

    - Maximum block write size which may be affected by power failure / reset.
      (Maybe that needs an alignment too.)

    - Minimum write size for padding written items.
      (Is that assumed aligned?)

For this particular NOR, the two values would be 64 bytes and 1 byte.
I don't think it's specific to CFI.

-- Jamie
Artem Bityutskiy April 15, 2009, 6 a.m. UTC | #9
On Tue, 2009-04-14 at 19:00 +0100, Jamie Lokier wrote:
> Artem Bityutskiy wrote:
> > > This is for the CFI flash interface.  The
> > > drivers/mtd/chips/cfi_cmdset_0002.c driver has write buffers which is
> > > uses to do a "block" write to the NOR flash which for my chip allows
> > > writing up to 32 16-bit words. 
> > 
> > Oh, this is something from the CFI standard? Then we may just add this
> > knowledge to UBIFS: if this is NOR, then UBIFS knows that the up to 64
> > bytes may contain garbage.
> 
> I think the block write size is only used when you _submit_ a write of
> that many words or more.

Yes, this is my understanding too.

> So it would be correct for UBIFS to know that writes of N > 64(*)
> bytes may be broken into blocks, with corruption distributed
> anywhere within each of those blocks, but not more than one block.

I though the corruption will be inside the last block, because they
are written sequentially.

> But it doesn't mean the minimum safe write size is 64(*), because if
> UBIFS writes (say) 16 bytes for a small update, then the corruption
> should be limited to just those 16 bytes.

Right, I did not mean that. I meant that we can teach recovery code
to understand that the corruption may be withing 64-bytes.

> If you wrote a 16-byte item which encodes "and the next item I write
> will be 16-byte too", then you know if you see >16 corrupt bytes after
> the item that it's not due to power failure.  This even with a 64 byte
> buffer on the flash chip, because you know you will have done only a
> 16 byte write in that position.
> 
> I don't know if it's useful for UBIFS and its block scanning
> algorithm to consider item sizes in that much detail.

No, does not seem to be very useful.

> The main thing is you can write smaller items safely, so you don't
> have to pad them to 64 bytes and wear out the flash faster.  But
> scanning may need to use a 64 byte window to decide the corruption type.

Right. I assumed the same, may be I just did not put it clearly. But
thanks for this suggestion.

> That means MTD and UBI should export two values:
> 
>     - Maximum block write size which may be affected by power failure / reset.
>       (Maybe that needs an alignment too.)

Yup. MTD does not inform about this ATM, though.

>     - Minimum write size for padding written items.
>       (Is that assumed aligned?)

Right.

> For this particular NOR, the two values would be 64 bytes and 1 byte.
> I don't think it's specific to CFI.

Agree.
Eric Holmberg April 15, 2009, 3:17 p.m. UTC | #10
> On Tue, 2009-04-14 at 19:00 +0100, Jamie Lokier wrote:
> > Artem Bityutskiy wrote:
> > > > This is for the CFI flash interface.  The
> > > > drivers/mtd/chips/cfi_cmdset_0002.c driver has write 
> buffers which is
> > > > uses to do a "block" write to the NOR flash which for 
> my chip allows
> > > > writing up to 32 16-bit words. 
> > > 
> > > Oh, this is something from the CFI standard? Then we may 
> just add this
> > > knowledge to UBIFS: if this is NOR, then UBIFS knows that 
> the up to 64
> > > bytes may contain garbage.
> > 
> > I think the block write size is only used when you _submit_ 
> a write of
> > that many words or more.
> 
> Yes, this is my understanding too.
> 
> > So it would be correct for UBIFS to know that writes of N > 64(*)
> > bytes may be broken into blocks, with corruption distributed
> > anywhere within each of those blocks, but not more than one block.
> 
> I though the corruption will be inside the last block, because they
> are written sequentially.

Both are correct.  The corruption will be in the last block written to
the physical flash, but if the original UBIFS block write size is >
FLASH_MAX_BLOCK_WRITE_SIZE, then the MTD CFI driver will split it into
multiple smaller blocks.  The CFI code appears to write out the split
blocks sequentially from lowest address to highest address, but the
actual physical flash chip doesn't always write the flash words out
sequentially.

So, if we assume that the MTD CFI driver will always split blocks into
write-buffer sized blocks and write them sequentially (this is how the
code is currently written), then the maximum corruption due to an
interrupted write will be FLASH_MAX_BLOCK_WRITE_SIZE (which is 64 bytes
for my case).

> Right. I assumed the same, may be I just did not put it clearly. But
> thanks for this suggestion.
> 
> > That means MTD and UBI should export two values:
> > 
> >     - Maximum block write size which may be affected by 
> power failure / reset.
> >       (Maybe that needs an alignment too.)
> 
> Yup. MTD does not inform about this ATM, though.
> 
> >     - Minimum write size for padding written items.
> >       (Is that assumed aligned?)
> 
> Right.
> 
> > For this particular NOR, the two values would be 64 bytes 
> and 1 byte.
> > I don't think it's specific to CFI.
> 
> Agree.

I'll take a look at what is involved in exporting the two values.

-Eric
Jamie Lokier April 15, 2009, 4:09 p.m. UTC | #11
Eric Holmberg wrote:
> > On Tue, 2009-04-14 at 19:00 +0100, Jamie Lokier wrote:
> > > Artem Bityutskiy wrote:
> > > So it would be correct for UBIFS to know that writes of N > 64(*)
> > > bytes may be broken into blocks, with corruption distributed
> > > anywhere within each of those blocks, but not more than one block.
> > 
> > I though the corruption will be inside the last block, because they
> > are written sequentially.
> 
> Both are correct.  The corruption will be in the last block written to
> the physical flash, but if the original UBIFS block write size is >
> FLASH_MAX_BLOCK_WRITE_SIZE, then the MTD CFI driver will split it into
> multiple smaller blocks.  The CFI code appears to write out the split
> blocks sequentially from lowest address to highest address, but the
> actual physical flash chip doesn't always write the flash words out
> sequentially.

We all have the same idea, that's good :-)

> So, if we assume that the MTD CFI driver will always split blocks into
> write-buffer sized blocks and write them sequentially (this is how the
> code is currently written), then the maximum corruption due to an
> interrupted write will be FLASH_MAX_BLOCK_WRITE_SIZE (which is 64 bytes
> for my case).

Is this really different from NAND and it's page writes?

Do the CFI block writes have to be aligned (to 64 bytes) offset, or
can the 64 bytes start at any word position?

> I'll take a look at what is involved in exporting the two values.

Good plan :-)

-- Jamie
Artem Bityutskiy April 15, 2009, 4:12 p.m. UTC | #12
On Wed, 2009-04-15 at 17:09 +0100, Jamie Lokier wrote:
> Do the CFI block writes have to be aligned (to 64 bytes) offset, or
> can the 64 bytes start at any word position?

Yeah, good question. I wanted to ask the same.
Eric Holmberg April 15, 2009, 4:32 p.m. UTC | #13
> Is this really different from NAND and it's page writes?
> Do the CFI block writes have to be aligned (to 64 bytes) offset, or
> can the 64 bytes start at any word position?

Looking through the data sheet again, it looks like there is the concept
of a page for block writes in this particular NOR flash where a page
starts on any address evenly divisible by 0x20 (32-byte block).  The CFI
driver handles this when splitting up the blocks.

Here's the datasheet for reference:
 * http://www.spansion.com/datasheets/s29gl-p_00_a11_e.pdf?page=28

Thinking about it, this shouldn't affect the maximum possible corruption
count, since a fully-aligned write buffer is the worst case (a possible
of a <=64-byte corruption).  If the block is split between pages, then
32-bytes would occur in the first write and 32-bytes in the second
write, so the maximum corruption would be <=32 bytes.

Does this match with your thoughts?

-Eric

P.S.  I'm working on applying the latest patches and running a
corruption test where I limit the write-buffer size of 8 bytes.  So far,
so good.
Jamie Lokier April 15, 2009, 4:44 p.m. UTC | #14
Eric Holmberg wrote:
> > Is this really different from NAND and it's page writes?
> > Do the CFI block writes have to be aligned (to 64 bytes) offset, or
> > can the 64 bytes start at any word position?
> 
> Looking through the data sheet again, it looks like there is the concept
> of a page for block writes in this particular NOR flash where a page
> starts on any address evenly divisible by 0x20 (32-byte block).  The CFI
> driver handles this when splitting up the blocks.
> 
> Here's the datasheet for reference:
>  * http://www.spansion.com/datasheets/s29gl-p_00_a11_e.pdf?page=28
> 
> Thinking about it, this shouldn't affect the maximum possible corruption
> count, since a fully-aligned write buffer is the worst case (a possible
> of a <=64-byte corruption).  If the block is split between pages, then
> 32-bytes would occur in the first write and 32-bytes in the second
> write, so the maximum corruption would be <=32 bytes.
>
> Does this match with your thoughts?

Yes.  Another parameter could be exported by MTD: the max block
corruption alignment, 32 in this case.  Then UBIFS's scanner (or other
tools) could be a little more accurate.

I doubt if it's important to scan more accurately, but it would be a
good parameter to export from MTD anyway, while adding the other one,
just to describe the chip properly for tools, diagnostics etc.

Unrelated to this chip: I vaguely remember, aren't there some flash
chips which can do more than one block write in parallel, or a write
in parallel with an erase to a different block?

-- Jamie
Nicolas Pitre April 15, 2009, 6:26 p.m. UTC | #15
On Wed, 15 Apr 2009, Jamie Lokier wrote:

> Unrelated to this chip: I vaguely remember, aren't there some flash
> chips which can do more than one block write in parallel, or a write
> in parallel with an erase to a different block?

Some NOR flash parts can do a write _or_ an erase in one block while 
performing any amount of reads in other blocks.  At least that's what we 
support.  If some parts allow for concurrent writes we currently don't 
support that.


Nicolas
Jamie Lokier April 15, 2009, 6:38 p.m. UTC | #16
Nicolas Pitre wrote:
> On Wed, 15 Apr 2009, Jamie Lokier wrote:
> 
> > Unrelated to this chip: I vaguely remember, aren't there some flash
> > chips which can do more than one block write in parallel, or a write
> > in parallel with an erase to a different block?
> 
> Some NOR flash parts can do a write _or_ an erase in one block while 
> performing any amount of reads in other blocks.  At least that's what we 
> support.  If some parts allow for concurrent writes we currently don't 
> support that.

I don't remember if it was NOR, NAND or something else, but I remember
reading about some flash which supports 1 concurrent write and 1
erase, and thinking "oh that's clever, it means you can do streaming
writes or rapid fsync/database commits without long pauses for erasing".

Of course you can do that with two flash chips side by side :-)

Can MTD and/or UBI join two chips to look like a single partition in
that way and avoid pauses for erase by writing to the other part?

Thanks,
-- Jamie
Eric Holmberg April 15, 2009, 7:33 p.m. UTC | #17
> -----Original Message-----
> From: Jamie Lokier [mailto:jamie@shareable.org] 
> Sent: Wednesday, April 15, 2009 12:38 PM
> To: Nicolas Pitre
> Cc: Eric Holmberg; Urs Muff; linux-mtd@lists.infradead.org; 
> Adrian Hunter
> Subject: Re: UBIFS Corrupt during power failure
> 
> Nicolas Pitre wrote:
> > On Wed, 15 Apr 2009, Jamie Lokier wrote:
> > 
> > > Unrelated to this chip: I vaguely remember, aren't there 
> some flash
> > > chips which can do more than one block write in parallel, 
> or a write
> > > in parallel with an erase to a different block?
> > 
> > Some NOR flash parts can do a write _or_ an erase in one 
> block while 
> > performing any amount of reads in other blocks.  At least 
> that's what we 
> > support.  If some parts allow for concurrent writes we 
> currently don't 
> > support that.
> 
> I don't remember if it was NOR, NAND or something else, but I remember
> reading about some flash which supports 1 concurrent write and 1
> erase, and thinking "oh that's clever, it means you can do streaming
> writes or rapid fsync/database commits without long pauses 
> for erasing".

The evolution seems to be:
 1. Allow erase / program suspend to do a read from a different PEB (the
chip I'm using supports this)
 2. Allow simultaneous read while either erasing or programming a
different PEB
 3. Allow parallel operations on different flash banks
 4. Combine NOR and NAND onto the same chip

My understanding is that the parallel operations are only valid on
different flash banks, where a flash bank could be thought of
conceptually as a separate flash chip.  I'm no flash memory expert by
any means, so I'm sure there are some other systems out there.

> 
> Of course you can do that with two flash chips side by side :-)
> 
> Can MTD and/or UBI join two chips to look like a single partition in
> that way and avoid pauses for erase by writing to the other part?

The CONFIG_MTD_CONCAT option will join multiple chips together into a
single MTD device, but I haven't looked into the code to see if it
allows simultaneous operations on the separate chips.

Nothing like RAID0 on flash :)

-Eric
Nicolas Pitre April 15, 2009, 8:15 p.m. UTC | #18
On Wed, 15 Apr 2009, Eric Holmberg wrote:
> From: Jamie Lokier [mailto:jamie@shareable.org] 
> > I don't remember if it was NOR, NAND or something else, but I remember
> > reading about some flash which supports 1 concurrent write and 1
> > erase, and thinking "oh that's clever, it means you can do streaming
> > writes or rapid fsync/database commits without long pauses 
> > for erasing".
> 
> The evolution seems to be:
>  1. Allow erase / program suspend to do a read from a different PEB (the
> chip I'm using supports this)
>  2. Allow simultaneous read while either erasing or programming a
> different PEB
>  3. Allow parallel operations on different flash banks
>  4. Combine NOR and NAND onto the same chip

Most NOR chips do allow for #1 and we support it.

Recent Intel NOR parts allow for #2 and we support it.

It is already possible to do #3, whether you have multiple 
identical chips mapped 
contigously mapped, or through the MTD concat layer.

I think #4 is rather odd and unprobable.

> My understanding is that the parallel operations are only valid on
> different flash banks, where a flash bank could be thought of
> conceptually as a separate flash chip.  I'm no flash memory expert by
> any means, so I'm sure there are some other systems out there.

A single bank made of several contigous chips (and not the parallel chip 
arrangement often used to have a larger bus) already offers parallel 
operation possibilities, as long as parallel accesses have a 
sufficiently large offset between them to target different chips.

> > Of course you can do that with two flash chips side by side :-)

Beware.  The "two flash chips side by side" often means your chips are 
put in parallel on the bus, having one chip providing the lowest 16 
data bits, and the other chip providing the upper 16 bits.  This allows 
for a 2x read throughput, however this means that the erase block size 
is also doubled, while erase latency remains the same.

> > Can MTD and/or UBI join two chips to look like a single partition in
> > that way and avoid pauses for erase by writing to the other part?
> 
> The CONFIG_MTD_CONCAT option will join multiple chips together into a
> single MTD device, but I haven't looked into the code to see if it
> allows simultaneous operations on the separate chips.

It does.

> Nothing like RAID0 on flash :)

Well, the concat layer does not offer any kind of stripe management. I 
think someone posted a set of patches a while ago to add such a 
capability though.


Nicolas
Jamie Lokier April 15, 2009, 8:46 p.m. UTC | #19
Nicolas Pitre wrote:
> > > Can MTD and/or UBI join two chips to look like a single partition in
> > > that way and avoid pauses for erase by writing to the other part?
> > 
> > The CONFIG_MTD_CONCAT option will join multiple chips together into a
> > single MTD device, but I haven't looked into the code to see if it
> > allows simultaneous operations on the separate chips.
> 
> It does.
> 
> > Nothing like RAID0 on flash :)
> 
> Well, the concat layer does not offer any kind of stripe management. I 
> think someone posted a set of patches a while ago to add such a 
> capability though.

If MTD concat write/erase operations can run in parallel, then the
powerfail-corruption-block that UBIFS is scanning for will be one
block per underlying device, not one block for the whole MTD.

That means a single number describing the powerfail-corruption-block
is not enough for MTD concat devices.

However provided they are distinct erase blocks, maybe this does not
affect UBIFS's classification of dead blocks.

If they are arranged as stripes, the MTD concat's overall
powerfail-corruption-block size must be set to the maximum of the
underlying MTD's value and the strip size.  Alignment is left as an
exercise...

-- Jamie
Artem Bityutskiy April 16, 2009, 5:46 a.m. UTC | #20
On Wed, 2009-04-15 at 10:32 -0600, Eric Holmberg wrote:
> > Is this really different from NAND and it's page writes?
> > Do the CFI block writes have to be aligned (to 64 bytes) offset, or
> > can the 64 bytes start at any word position?
> 
> Looking through the data sheet again, it looks like there is the concept
> of a page for block writes in this particular NOR flash where a page
> starts on any address evenly divisible by 0x20 (32-byte block).  The CFI
> driver handles this when splitting up the blocks.
> 
> Here's the datasheet for reference:
>  * http://www.spansion.com/datasheets/s29gl-p_00_a11_e.pdf?page=28
> 
> Thinking about it, this shouldn't affect the maximum possible corruption
> count, since a fully-aligned write buffer is the worst case (a possible
> of a <=64-byte corruption).  If the block is split between pages, then
> 32-bytes would occur in the first write and 32-bytes in the second
> write, so the maximum corruption would be <=32 bytes.
> 
> Does this match with your thoughts?

No sure. I'm thinking how to extend the MTD device model. Jamie already
suggested this. At the moment we have:

1. eraseblock
2. Min. I/O unit size, which is mtd->writesize in MTD, and
ubi->min_io_size in UBI. This corresponds to NAND page, and 1 byte in
NOR.
3. There are also sub-pages in case of NAND, but I consider them as a
kind of hack. UBI does not expose information about them, and UBIFS does
not use them.

Now obviously, we need to extend this model. I would suggest to
introduce a notion of "max. I/O size". It would be:

1. 64-bytes in case of Eric's NOR. This would be taken from CFI info.
2. If we ever have a striping layer, which can interleave between 2 or
   more chips, then max. I/O size will be N * ubi->min_io_size.

Thoughts?

> P.S.  I'm working on applying the latest patches and running a
> corruption test where I limit the write-buffer size of 8 bytes.  So far,
> so good.

Right. I think I even know which commit fixed that "orphan nodes are
treated as superblock" bug:

http://git.infradead.org/users/dedekind/ubifs-v2.6.27.git?a=commit;h=e7e59cefd76778583ebc6d9b1ebc249feaba1d15

But I strongly suggest to take all patches. We are not very interested
to dig old code base, should something go wrong.
Artem Bityutskiy April 16, 2009, 5:51 a.m. UTC | #21
On Wed, 2009-04-15 at 13:33 -0600, Eric Holmberg wrote:
> > I don't remember if it was NOR, NAND or something else, but I remember
> > reading about some flash which supports 1 concurrent write and 1
> > erase, and thinking "oh that's clever, it means you can do streaming
> > writes or rapid fsync/database commits without long pauses 
> > for erasing".
> 
> The evolution seems to be:
>  1. Allow erase / program suspend to do a read from a different PEB (the
> chip I'm using supports this)
>  2. Allow simultaneous read while either erasing or programming a
> different PEB
>  3. Allow parallel operations on different flash banks
>  4. Combine NOR and NAND onto the same chip
> 
> My understanding is that the parallel operations are only valid on
> different flash banks, where a flash bank could be thought of
> conceptually as a separate flash chip.  I'm no flash memory expert by
> any means, so I'm sure there are some other systems out there.

As Nikolas noted, intel guys sent an mtdstripe layer implementation
here, but for some reasons they did not make it into mainline. That
level could interleave between several chips. E.g., you have 2 NANDs,
then that layer could present them as one virtual device with twice as
large eraseblock size and twice as large page size. And you get 2x
speed.
Jamie Lokier April 16, 2009, 9:34 p.m. UTC | #22
Artem Bityutskiy wrote:
> On Wed, 2009-04-15 at 10:32 -0600, Eric Holmberg wrote:
> > Looking through the data sheet again, it looks like there is the concept
> > of a page for block writes in this particular NOR flash where a page
> > starts on any address evenly divisible by 0x20 (32-byte block).  The CFI
> > driver handles this when splitting up the blocks.
> > 
> > Here's the datasheet for reference:
> >  * http://www.spansion.com/datasheets/s29gl-p_00_a11_e.pdf?page=28
> > 
> > Thinking about it, this shouldn't affect the maximum possible corruption
> > count, since a fully-aligned write buffer is the worst case (a possible
> > of a <=64-byte corruption).  If the block is split between pages, then
> > 32-bytes would occur in the first write and 32-bytes in the second
> > write, so the maximum corruption would be <=32 bytes.

I didn't find any reference to 32 bytes or 16 words in the datasheet.
0x20 only appears once, in the sample code:

    /* NOTES: Write buffer programming limited to 16 words. */
    /*        All addresses to be written to the flash in   */
    /*        one operation must be within the same flash   */
    /*        page. A flash page begins at addresses        */
    /*        evenly divisible by 0x20.                     */

But notice the sample code also limits the write buffer to 16 words /
32 bytes, while the datasheet says the write buffer is 32 words / 64
bytes.  So it looks to me like the sample code is for another device
with smaller write buffer, and the "32-byte page" is spurious and not
really applicable to this device.

> 1. eraseblock
> 2. Min. I/O unit size, which is mtd->writesize in MTD, and
> ubi->min_io_size in UBI. This corresponds to NAND page, and 1 byte in
> NOR.

I guess 1 byte in NOR because you can overwrite a word to set the other byte?
Logically min_io_size should be 1 bit :-)

> 3. There are also sub-pages in case of NAND, but I consider them as a
> kind of hack. UBI does not expose information about them, and UBIFS does
> not use them.

UBI FAQ (http://www.linux-mtd.infradead.org/faq/ubi.html#L_find_min_io_size)

    UBI: physical eraseblock size:   131072 bytes (128 KiB)
    UBI: logical eraseblock size:    129024 bytes
    UBI: smallest flash I/O unit:    2048
    UBI: sub-page size:              512

    Note, if sup-page size was not printed, the flash is not NAND
    flash or NAND flash which does not have sub-pages.

UBI does not expose information about sub-pages?

Googling for "NAND sub-page" didn't help explain them much.  Can you
recommend a URL, just so I can understand NAND sub-pages?

> Now obviously, we need to extend this model. I would suggest to
> introduce a notion of "max. I/O size". It would be:
> 
> 1. 64-bytes in case of Eric's NOR. This would be taken from CFI info.
> 2. If we ever have a striping layer, which can interleave between 2 or
>    more chips, then max. I/O size will be N * ubi->min_io_size.
> 
> Thoughts?

0. It's more accurate to call it "max parallel write size".
   That NOR chip has a read page size too, which is different :-)

1. Alignment, or can we assume alignment is the same as its size?

2. If striping uses larger stripes (the same way as RAID uses 1MB
   stripes instead of 1 sector stripes), then this value needs to be
   max(N * strip_size, N * ubi->min_io_size), because the chip block
   writes done in parallel are not contiguous in the combined MTD.

3. 2 assumes that striping works like this:

       Start write at offset P to chip 0, chip 1, chip 2, chip 3.
       Wait for _all_ chips to finish.
       Start write at offset P+block_size to chip 0, chip 1, chip 2, chip 3.
       Wait for _all_ chips to finish.
       etc.

   But if striping is implemented in a more relaxed way to get higher
   performance, it will do this:

       Start write at offset P to chip 0, chip 1, chip 2, chip 3.
       Wait for any chip to finish.
       Start write at offset P+block_size on the chip which finished.
       Wait for any chip to finish.
       Start write at next block on the chip which finished.
       Wait for any chip to finish.
       etc.

   That makes the range of parallel writes, and so
   corruption-on-power-loss, even larger than max(N * strip_size, N *
   block_size).  The range is as large as the whole write, if one chip
   is writing much faster than the others, so it cannot be represented
   by a small number.

-- Jamie
Artem Bityutskiy April 17, 2009, 8:56 a.m. UTC | #23
Jamie, thanks for feedback!

On Thu, 2009-04-16 at 22:34 +0100, Jamie Lokier wrote:
> > 1. eraseblock
> > 2. Min. I/O unit size, which is mtd->writesize in MTD, and
> > ubi->min_io_size in UBI. This corresponds to NAND page, and 1 byte in
> > NOR.
> 
> I guess 1 byte in NOR because you can overwrite a word to set the other byte?
> Logically min_io_size should be 1 bit :-)
> 
> > 3. There are also sub-pages in case of NAND, but I consider them as a
> > kind of hack. UBI does not expose information about them, and UBIFS does
> > not use them.
> 
> UBI FAQ (http://www.linux-mtd.infradead.org/faq/ubi.html#L_find_min_io_size)
> 
>     UBI: physical eraseblock size:   131072 bytes (128 KiB)
>     UBI: logical eraseblock size:    129024 bytes
>     UBI: smallest flash I/O unit:    2048
>     UBI: sub-page size:              512
> 
>     Note, if sup-page size was not printed, the flash is not NAND
>     flash or NAND flash which does not have sub-pages.
> 
> UBI does not expose information about sub-pages?

It prints about them, just for info. But the UBI "front-ent" API
does not contain sub-page info.

> Googling for "NAND sub-page" didn't help explain them much.  Can you
> recommend a URL, just so I can understand NAND sub-pages?

There is info at the MTD web site. But for your convenience, I've
also added this:

http://www.linux-mtd.infradead.org/faq/ubi.html#L_subpage

> > Now obviously, we need to extend this model. I would suggest to
> > introduce a notion of "max. I/O size". It would be:
> > 
> > 1. 64-bytes in case of Eric's NOR. This would be taken from CFI info.
> > 2. If we ever have a striping layer, which can interleave between 2 or
> >    more chips, then max. I/O size will be N * ubi->min_io_size.
> > 
> > Thoughts?
> 
> 0. It's more accurate to call it "max parallel write size".
>    That NOR chip has a read page size too, which is different :-)
> 
> 1. Alignment, or can we assume alignment is the same as its size?

Yes, I think so.

> 2. If striping uses larger stripes (the same way as RAID uses 1MB
>    stripes instead of 1 sector stripes), then this value needs to be
>    max(N * strip_size, N * ubi->min_io_size), because the chip block
>    writes done in parallel are not contiguous in the combined MTD.

OK.

> 
> 3. 2 assumes that striping works like this:
> 
>        Start write at offset P to chip 0, chip 1, chip 2, chip 3.
>        Wait for _all_ chips to finish.
>        Start write at offset P+block_size to chip 0, chip 1, chip 2, chip 3.
>        Wait for _all_ chips to finish.
>        etc.

Right.

>    But if striping is implemented in a more relaxed way to get higher
>    performance, it will do this:
> 
>        Start write at offset P to chip 0, chip 1, chip 2, chip 3.
>        Wait for any chip to finish.
>        Start write at offset P+block_size on the chip which finished.
>        Wait for any chip to finish.
>        Start write at next block on the chip which finished.
>        Wait for any chip to finish.
>        etc.

Yeah...

>    That makes the range of parallel writes, and so
>    corruption-on-power-loss, even larger than max(N * strip_size, N *
>    block_size).  The range is as large as the whole write, if one chip
>    is writing much faster than the others, so it cannot be represented
>    by a small number.

Then I guess we should just introduce mtd->max_corruption ? This would
mean maximum amount of bytes corruption may span in vase of power cuts?
Artem Bityutskiy April 17, 2009, 8:58 a.m. UTC | #24
On Thu, 2009-04-16 at 22:34 +0100, Jamie Lokier wrote:
> > 1. eraseblock
> > 2. Min. I/O unit size, which is mtd->writesize in MTD, and
> > ubi->min_io_size in UBI. This corresponds to NAND page, and 1 byte in
> > NOR.
> 
> I guess 1 byte in NOR because you can overwrite a word to set the other byte?
> Logically min_io_size should be 1 bit :-)

Right. But I'm talking more from UBI/UBIFS perspective. We do not
need bits, so for us this is 1 byte. But this is not essential now,
I guess.
Jamie Lokier April 17, 2009, 1:51 p.m. UTC | #25
Artem Bityutskiy wrote:
> Jamie, thanks for feedback!

Thanks for the new sections in the UBI FAQ!
> Yeah...
> 
> >    That makes the range of parallel writes, and so
> >    corruption-on-power-loss, even larger than max(N * strip_size, N *
> >    block_size).  The range is as large as the whole write, if one chip
> >    is writing much faster than the others, so it cannot be represented
> >    by a small number.
> 
> Then I guess we should just introduce mtd->max_corruption ? This would
> mean maximum amount of bytes corruption may span in vase of power cuts?

With that name maybe whoever implements striping will remember think
about parallelism and limit it :-)

  /* Max size of corrupted block when a write command is interrupted
     by reset or power failure. */
  u32 max_write_corruption;

-- Jamie




> 
> -- 
> Best regards,
> Artem Bityutskiy (Битюцкий Артём)
>
Artem Bityutskiy April 17, 2009, 2:36 p.m. UTC | #26
On Fri, 2009-04-17 at 14:51 +0100, Jamie Lokier wrote:
> Artem Bityutskiy wrote:
> > Jamie, thanks for feedback!
> 
> Thanks for the new sections in the UBI FAQ!
> > Yeah...
> > 
> > >    That makes the range of parallel writes, and so
> > >    corruption-on-power-loss, even larger than max(N * strip_size, N *
> > >    block_size).  The range is as large as the whole write, if one chip
> > >    is writing much faster than the others, so it cannot be represented
> > >    by a small number.
> > 
> > Then I guess we should just introduce mtd->max_corruption ? This would
> > mean maximum amount of bytes corruption may span in vase of power cuts?
> 
> With that name maybe whoever implements striping will remember think
> about parallelism and limit it :-)
> 
>   /* Max size of corrupted block when a write command is interrupted
>      by reset or power failure. */
>   u32 max_write_corruption;

Yeah, let's wait for Eric's results and then will work on
extending MTD device model with this parameter.
Eric Holmberg April 17, 2009, 11:49 p.m. UTC | #27
> > > 
> > > Then I guess we should just introduce mtd->max_corruption 
> ? This would
> > > mean maximum amount of bytes corruption may span in vase 
> of power cuts?
> > 
> > With that name maybe whoever implements striping will remember think
> > about parallelism and limit it :-)
> > 
> >   /* Max size of corrupted block when a write command is interrupted
> >      by reset or power failure. */
> >   u32 max_write_corruption;

I like this suggestion -- good variable name.

> 
> Yeah, let's wait for Eric's results and then will work on
> extending MTD device model with this parameter.
> 

As suggested, I patched my 2.6.27 kernel with the latest from
http://git.infradead.org/users/dedekind/ubifs-v2.6.27.git (includes all
updates up to and including fhe fix-recovery bug,
http://git.infradead.org/users/dedekind/ubifs-v2.6.27.git?a=commit;h=e14
4c1c037f1c6f7c687de5a2cd375cb40dfe71e).

I have the unit running with a maximum write buffer of 8 bytes (the NOR
flash chip is capable of 64 bytes).

I was seeing 4 different failure scenarios with the base 2.6.27 code,
but now I am only seeing one remaining failure after 30+ hours of power
cycling.  I added a stack dump this afternoon that will let me pinpoint
exactly what is happening, but haven't seen the failure, yet.

The failure happens when I get two corrupt empty LEB's.  I believe the
scenario is that an erase is interrupted and on the next boot, while the
file system is being recovered, another power failure occurs.

I can erase one of the LEB's manually in U-Boot and the file system
recovers properly.

I'm going to leave the units running over the weekend and see what is
waiting for me Monday morning.

Thanks for your help so far and have a great weekend!

-Eric

P.S.  I am scheduled to work on some higher-priority items next week, so
I won't be able to work on the max_write_corruption code.
Stefan Roese May 15, 2009, 7:16 a.m. UTC | #28
Hi Eric,

On Saturday 18 April 2009 01:49:52 Eric Holmberg wrote:
> > Yeah, let's wait for Eric's results and then will work on
> > extending MTD device model with this parameter.
>
> As suggested, I patched my 2.6.27 kernel with the latest from
> http://git.infradead.org/users/dedekind/ubifs-v2.6.27.git (includes all
> updates up to and including fhe fix-recovery bug,
> http://git.infradead.org/users/dedekind/ubifs-v2.6.27.git?a=commit;h=e14
> 4c1c037f1c6f7c687de5a2cd375cb40dfe71e).
>
> I have the unit running with a maximum write buffer of 8 bytes (the NOR
> flash chip is capable of 64 bytes).

How exactly did you do this? In cfi_cmdset_0002.c?

> I was seeing 4 different failure scenarios with the base 2.6.27 code,
> but now I am only seeing one remaining failure after 30+ hours of power
> cycling.  I added a stack dump this afternoon that will let me pinpoint
> exactly what is happening, but haven't seen the failure, yet.
>
> The failure happens when I get two corrupt empty LEB's.  I believe the
> scenario is that an erase is interrupted and on the next boot, while the
> file system is being recovered, another power failure occurs.
>
> I can erase one of the LEB's manually in U-Boot and the file system
> recovers properly.
>
> I'm going to leave the units running over the weekend and see what is
> waiting for me Monday morning.

Do you have an update for this? What's the current status on your system now? 
Which patches did you apply to work reliably with the Spansion FLASH?

I'm asking since we are seeing a similar issue on one of our boards equipped 
with the S29GL512P. This simple script triggers problems upon the next mount:

---
mount -t ubifs ubi0:testvolume /mnt
sync
reboot -n -f
---

The next mount will result most of the time in this:

UBIFS: recovery needed
UBIFS error (pid 406): ubifs_scan: corrupt empty space at LEB 3:130320
UBIFS error (pid 406): ubifs_scanned_corruption: corrupted data at LEB 
3:130320
UBIFS error (pid 406): ubifs_scan: LEB 3 scanning failed
UBIFS error (pid 406): ubifs_recover_leb: corrupt empty space at LEB 3:32
UBIFS error (pid 406): ubifs_scanned_corruption: corrupted data at LEB 3:32
UBIFS error (pid 406): ubifs_recover_leb: LEB 3 scanning failed
mount: Structure needs cleaning

This is without the patch from this thread included (in recovery.c). With this 
patch included the recovery is successful all the time, as far as we can see 
right now. But I'm wondering if we really need to disable the write buffer in 
the CFI driver or reduce the write buffer to 8.

Thanks.

Best regards,
Stefan

=====================================================================
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-0 Fax: +49-8142-66989-80  Email: office@denx.de
=====================================================================
Artem Bityutskiy June 3, 2009, 8:08 a.m. UTC | #29
On Fri, 2009-05-15 at 09:16 +0200, Stefan Roese wrote:
> Do you have an update for this? What's the current status on your system now? 
> Which patches did you apply to work reliably with the Spansion FLASH?
> 
> I'm asking since we are seeing a similar issue on one of our boards equipped 
> with the S29GL512P. This simple script triggers problems upon the next mount:

FYI. Since this issue has not been resolved due to lack of information,
I've updated UBIFS FAQ and put a note about this issue there:
http://www.linux-mtd.infradead.org/faq/ubifs.html#L_powercut
Stefan Roese June 3, 2009, 8:25 a.m. UTC | #30
On Wednesday 03 June 2009 10:08:30 Artem Bityutskiy wrote:
> On Fri, 2009-05-15 at 09:16 +0200, Stefan Roese wrote:
> > Do you have an update for this? What's the current status on your system
> > now? Which patches did you apply to work reliably with the Spansion
> > FLASH?
> >
> > I'm asking since we are seeing a similar issue on one of our boards
> > equipped with the S29GL512P. This simple script triggers problems upon
> > the next mount:
>
> FYI. Since this issue has not been resolved due to lack of information,
> I've updated UBIFS FAQ and put a note about this issue there:
> http://www.linux-mtd.infradead.org/faq/ubifs.html#L_powercut

Yes, I already noticed this. Thanks.

Best regards,
Stefan
Eric Holmberg June 3, 2009, 1:50 p.m. UTC | #31
> On Fri, 2009-05-15 at 09:16 +0200, Stefan Roese wrote:
> > Do you have an update for this? What's the current status 
> on your system now? 
> > Which patches did you apply to work reliably with the 
> Spansion FLASH?
> > 
> > I'm asking since we are seeing a similar issue on one of 
> our boards equipped 
> > with the S29GL512P. This simple script triggers problems 
> upon the next mount:
> 
> FYI. Since this issue has not been resolved due to lack of 
> information,
> I've updated UBIFS FAQ and put a note about this issue there:
> http://www.linux-mtd.infradead.org/faq/ubifs.html#L_powercut
> 
> -- 
> Best regards,
> Artem Bityutskiy (Битюцкий Артём)

Thanks Artem - I appreciate the honesty in the FAQ.

Sorry for the delays on getting you real information - I am working on it as much as I can.  Due to our project schedule here, I can only work on this a few minutes a day.

I have reproduced the CRC error, but looking at the data (shown below), I am not sure what data is expected in LEB 1 at the failed position.  I don't see anything that indicates that the write-buffer behavior that I have avoided by limiting the write-buffer size to 8 bytes is causing the problem.

[42949375.500000] UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x714960f4, read 0x3dae4f0a
[42949375.510000] UBIFS error (pid 1): ubifs_check_node: bad node at LEB 1:89600
[42949375.520000] UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:89600
[42949375.540000] UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
[42949375.580000] UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node

LEB 1:89600 refers to address 0x31c75e00 for the NOR flash and looks like it contains nothing but zeros.

31c75e00: 00000000 00000000 00000000 00000000    ................
31c75e10: 00000000 00000000 00000000 00000000    ................
31c75e20: 00000000 00000000 00000000 00000000    ................
31c75e30: 00000000 00000000 00000000 00000000    ................
31c75e40: 00000000 00000000 00000000 00000000    ................
31c75e50: 00000000 00000000 00000000 00000000    ................
31c75e60: 00000000 00000000 00000000 00000000    ................
31c75e70: 00000000 00000000 00000000 00000000    ................
31c75e80: 06101831 3dae4f0a 000ecc9b 00000000    1....O.=........

Since this is the root file system and is 28MB in size, I am working on creating a smaller file system and writing a fixed test pattern to it.  I will provide the dd images of these files along with log files as soon as possible which will hopefully be next Monday (June 8).

If you have any addition suggestions or requests, please let me know.

Regards,

Eric Holmberg
Trimble Navigation
Artem Bityutskiy June 7, 2009, 10:16 a.m. UTC | #32
On Wed, 2009-06-03 at 07:50 -0600, Eric Holmberg wrote:
> Sorry for the delays on getting you real information - I am working on it as much as I can.  Due to our project schedule here, I can only work on this a few minutes a day.
> 
> I have reproduced the CRC error, but looking at the data (shown below), I am not sure what data is expected in LEB 1 at the failed position.  I don't see anything that indicates that the write-buffer behavior that I have avoided by limiting the write-buffer size to 8 bytes is causing the problem.
> 
> [42949375.500000] UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x714960f4, read 0x3dae4f0a
> [42949375.510000] UBIFS error (pid 1): ubifs_check_node: bad node at LEB 1:89600
> [42949375.520000] UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:89600
> [42949375.540000] UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
> [42949375.580000] UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node
> 
> LEB 1:89600 refers to address 0x31c75e00 for the NOR flash and looks like it contains nothing but zeros.
> 
> 31c75e00: 00000000 00000000 00000000 00000000    ................
> 31c75e10: 00000000 00000000 00000000 00000000    ................
> 31c75e20: 00000000 00000000 00000000 00000000    ................
> 31c75e30: 00000000 00000000 00000000 00000000    ................
> 31c75e40: 00000000 00000000 00000000 00000000    ................
> 31c75e50: 00000000 00000000 00000000 00000000    ................
> 31c75e60: 00000000 00000000 00000000 00000000    ................
> 31c75e70: 00000000 00000000 00000000 00000000    ................
> 31c75e80: 06101831 3dae4f0a 000ecc9b 00000000    1....O.=........
> 
> Since this is the root file system and is 28MB in size, I am working on creating a smaller file system and writing a fixed test pattern to it.  I will provide the dd images of these files along with log files as soon as possible which will hopefully be next Monday (June 8).
> 
> If you have any addition suggestions or requests, please let me know.

Well, I would be cool to have full UBIFS debugging output, or better
the image of the partition.
news July 28, 2009, 12:01 p.m. UTC | #33
Artem Bityutskiy schrieb:
> On Wed, 2009-06-03 at 07:50 -0600, Eric Holmberg wrote:
>   
>> Sorry for the delays on getting you real information - I am working on it as much as I can.  Due to our project schedule here, I can only work on this a few minutes a day.
>>
>> I have reproduced the CRC error, but looking at the data (shown below), I am not sure what data is expected in LEB 1 at the failed position.  I don't see anything that indicates that the write-buffer behavior that I have avoided by limiting the write-buffer size to 8 bytes is causing the problem.
>>
>> [42949375.500000] UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x714960f4, read 0x3dae4f0a
>> [42949375.510000] UBIFS error (pid 1): ubifs_check_node: bad node at LEB 1:89600
>> [42949375.520000] UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:89600
>> [42949375.540000] UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
>> [42949375.580000] UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node
>>
>> LEB 1:89600 refers to address 0x31c75e00 for the NOR flash and looks like it contains nothing but zeros.
>>
>> 31c75e00: 00000000 00000000 00000000 00000000    ................
>> 31c75e10: 00000000 00000000 00000000 00000000    ................
>> 31c75e20: 00000000 00000000 00000000 00000000    ................
>> 31c75e30: 00000000 00000000 00000000 00000000    ................
>> 31c75e40: 00000000 00000000 00000000 00000000    ................
>> 31c75e50: 00000000 00000000 00000000 00000000    ................
>> 31c75e60: 00000000 00000000 00000000 00000000    ................
>> 31c75e70: 00000000 00000000 00000000 00000000    ................
>> 31c75e80: 06101831 3dae4f0a 000ecc9b 00000000    1....O.=........
>>
>> Since this is the root file system and is 28MB in size, I am working on creating a smaller file system and writing a fixed test pattern to it.  I will provide the dd images of these files along with log files as soon as possible which will hopefully be next Monday (June 8).
>>
>> If you have any addition suggestions or requests, please let me know.
>>     
>
> Well, I would be cool to have full UBIFS debugging output, or better
> the image of the partition.
>
>   
We have similar problems with a SPANSION falsh (S29GL01GP).
I think the reason of the problem is a feature of the chip.

I reduced the problem to the MTD access (without ubi/ubifs).
We noticed toggle flash-bit(s) after power off during a write cycle.
The toggle flash-bit(s) may occure after power of during an sector-erase
too.

Simple testsequence:
* flash_erase ...
* cp testfile /dev/mtd0
- automatic or manuel power off during the cp
Check the flash after reboot (e.g md5sum /dev/mtd0 helps).

We used the default settings from the CFI (MaxBufWriteSize=6 == 64 byte
buffer).
Adrian Hunter July 28, 2009, 12:24 p.m. UTC | #34
news wrote:
> Artem Bityutskiy schrieb:
>> On Wed, 2009-06-03 at 07:50 -0600, Eric Holmberg wrote:
>>   
>>> Sorry for the delays on getting you real information - I am working on it as much as I can.  Due to our project schedule here, I can only work on this a few minutes a day.
>>>
>>> I have reproduced the CRC error, but looking at the data (shown below), I am not sure what data is expected in LEB 1 at the failed position.  I don't see anything that indicates that the write-buffer behavior that I have avoided by limiting the write-buffer size to 8 bytes is causing the problem.
>>>
>>> [42949375.500000] UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x714960f4, read 0x3dae4f0a
>>> [42949375.510000] UBIFS error (pid 1): ubifs_check_node: bad node at LEB 1:89600
>>> [42949375.520000] UBIFS error (pid 1): ubifs_scanned_corruption: corrupted data at LEB 1:89600
>>> [42949375.540000] UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
>>> [42949375.580000] UBIFS error (pid 1): ubifs_recover_master_node: failed to recover master node
>>>
>>> LEB 1:89600 refers to address 0x31c75e00 for the NOR flash and looks like it contains nothing but zeros.
>>>
>>> 31c75e00: 00000000 00000000 00000000 00000000    ................
>>> 31c75e10: 00000000 00000000 00000000 00000000    ................
>>> 31c75e20: 00000000 00000000 00000000 00000000    ................
>>> 31c75e30: 00000000 00000000 00000000 00000000    ................
>>> 31c75e40: 00000000 00000000 00000000 00000000    ................
>>> 31c75e50: 00000000 00000000 00000000 00000000    ................
>>> 31c75e60: 00000000 00000000 00000000 00000000    ................
>>> 31c75e70: 00000000 00000000 00000000 00000000    ................
>>> 31c75e80: 06101831 3dae4f0a 000ecc9b 00000000    1....O.=........
>>>
>>> Since this is the root file system and is 28MB in size, I am working on creating a smaller file system and writing a fixed test pattern to it.  I will provide the dd images of these files along with log files as soon as possible which will hopefully be next Monday (June 8).
>>>
>>> If you have any addition suggestions or requests, please let me know.
>>>     
>> Well, I would be cool to have full UBIFS debugging output, or better
>> the image of the partition.
>>
>>   
> We have similar problems with a SPANSION falsh (S29GL01GP).
> I think the reason of the problem is a feature of the chip.
> 
> I reduced the problem to the MTD access (without ubi/ubifs).
> We noticed toggle flash-bit(s) after power off during a write cycle.
> The toggle flash-bit(s) may occure after power of during an sector-erase
> too.
> 
> Simple testsequence:
> * flash_erase ...
> * cp testfile /dev/mtd0
> - automatic or manuel power off during the cp
> Check the flash after reboot (e.g md5sum /dev/mtd0 helps).
> 
> We used the default settings from the CFI (MaxBufWriteSize=6 == 64 byte
> buffer).

Artem is on holiday.

Have you read the rest of this thread?  Does any of it help you?
Artem Bityutskiy Aug. 9, 2009, 4:59 a.m. UTC | #35
On 07/28/2009 03:01 PM, news wrote:
> We have similar problems with a SPANSION falsh (S29GL01GP).
> I think the reason of the problem is a feature of the chip.
>
> I reduced the problem to the MTD access (without ubi/ubifs).
> We noticed toggle flash-bit(s) after power off during a write cycle.
> The toggle flash-bit(s) may occure after power of during an sector-erase
> too.
>
> Simple testsequence:
> * flash_erase ...
> * cp testfile /dev/mtd0
> - automatic or manuel power off during the cp
> Check the flash after reboot (e.g md5sum /dev/mtd0 helps).
>
> We used the default settings from the CFI (MaxBufWriteSize=6 == 64 byte
> buffer).

This was fixed. See the latest UBI/UBIFS git trees. The fixes are also
in the back-port trees:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source
diff mbox

Patch

diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
index 1066297..9afa056 100644
--- a/fs/ubifs/recovery.c
+++ b/fs/ubifs/recovery.c
@@ -675,9 +675,10 @@  struct ubifs_scan_leb *ubifs_recover_leb(struct ubifs_info *c, int lnum,
 			clean_buf(c, &buf, lnum, &offs, &len);
 			need_clean = 1;
 		} else {
-			ubifs_err("corrupt empty space at LEB %d:%d",
-				  lnum, offs);
-			goto corrupted;
+			ubifs_warn("ignore corrupt empty space at LEB %d:%d",
+				   lnum, offs);
+			clean_buf(c, &buf, lnum, &offs, &len);
+			need_clean = 1;
 		}
 	}