diff mbox

UBIFS: don't fail on -EBADMSG when fixing free space

Message ID 1305751647-3122-1-git-send-email-bengardiner@nanometrics.ca
State New, archived
Headers show

Commit Message

Ben Gardiner May 18, 2011, 8:47 p.m. UTC
The free space fixup was introduced so that UBIFS on NAND flashes where the
empty pages at the end need to be dropped when flashing or else -74
(aka -EBADMSG) failures will occur -- but due to the limited functionality of
flash-programming facilities the emtpy pages cannot be dropped.

In fixup_leb, which is called during free space fixup, the now-expected return
code of -EBADMSG is treated as a failure.

Don't fail when -EBADMSG is encountered during ubi_read.

Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
CC: Matthew L. Creech <mlcreech@gmail.com>

---
I tested this patch on-top-of l2 head : 'bbf2b37 UBIFS: fix extremely rare
mount failure' using Matthew's 'mkfs.ubifs: add "-F" option for "free-space
fixup"' patch to create the UBIFS image with the flag set.

The test system was a da850evm, the image was flashed with U-boot's plain-old
'nand write' and the rootfs is mounted with 'ubi.mtd=X,2048 root=ubi0:rootfs
rootfstype=ubifs' bootargs.

Without this patch the first attemp to start fixing free space fails
with -EBADMSG:

UBIFS: start fixing up free space
UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes from PEB 18:4096, read 4096 bytes
VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0)
[...]

---
 fs/ubifs/sb.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

Comments

Matthew L. Creech May 18, 2011, 9:41 p.m. UTC | #1
On Wed, May 18, 2011 at 4:47 PM, Ben Gardiner
<bengardiner@nanometrics.ca> wrote:
> +       /*
> +        * Don't fail on -EBADMSG since these are precisely the error codes that
> +        * are returned by ubi_red in the cases where free-space fix-ups are
> +        * required.
> +        */

Hi Ben,

Off-hand, I don't see how this comment can be true.  It seems like
we'll have 2 cases:

1. The pages in question are actually erased, in which case they'll be
0xff-filled by ubi_read() without error, or
2. The pages were programmed with all 0xff data, in which case the ECC
info should be correct

There's a third possibility, that the ECC info is bad because there's
a real error, but that's not something we can reliably recover from.
Am I overlooking another scenario?
Ben Gardiner May 19, 2011, 1:28 p.m. UTC | #2
Hi Matthew,

On Wed, May 18, 2011 at 5:41 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> On Wed, May 18, 2011 at 4:47 PM, Ben Gardiner
> <bengardiner@nanometrics.ca> wrote:
>> +       /*
>> +        * Don't fail on -EBADMSG since these are precisely the error codes that
>> +        * are returned by ubi_red in the cases where free-space fix-ups are
>> +        * required.
>> +        */
>
> Hi Ben,
>
> Off-hand, I don't see how this comment can be true.  It seems like
> we'll have 2 cases:
>
> 1. The pages in question are actually erased, in which case they'll be
> 0xff-filled by ubi_read() without error, or
> 2. The pages were programmed with all 0xff data, in which case the ECC
> info should be correct
>
> There's a third possibility, that the ECC info is bad because there's
> a real error, but that's not something we can reliably recover from.
> Am I overlooking another scenario?

Perhaps I am running into a nuance of the davinci nand driver. Back
when you started the 'Programming ubinized images' thread [1] some of
you problem description caught me eye:

On Mon, Apr 25, 2011 at 2:37 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> We encountered one case in which we were re-flashing a device for
> testing using U-Boot's "nand erase", and got the "ubi_io_read: error
> -74" error from the FAQ.  That's no big deal, since we never do this
> in the field, and clearly "nand erase" isn't something we'd want to do
> even without this problem since it loses erase-counter info.

Because the "ubi_io_read: error -74 (ECC error)" is precisely what I
am encountering on my hardware when I do not flash with a utility that
drops empty pages at the end of eraseblocks. I imagined that this was
also the case for you. But I have also read that there are
peculiarities of the davinci nand driver (both in u-boot and linux).

So, at least on my hardware, the -74 error is expected when the 0xff
pages are not dropped and so without the 'err != -EBADMSG' exception
the free space fixup will cause the volume to fail mount for me:

UBIFS: start fixing up free space
UBI error: ubi_io_read: error -74 (ECC error) while reading 4096 bytes
from PEB 18:4096, read 4096 bytes
VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0)
[...]

Whereas, when the same image is flashed using a u-boot 'nand write'
variant that drops 0xff pages [2] there are no such -74 errors
encountered.

UBIFS: start fixing up free space
UBIFS: free space fixup complete
UBIFS: mounted UBI device 0, volume 1, name "rootfs"
[...]

The nand I am using is the micron part that ships with the da850evm;
it has eraseblocks of 128KiB and pages of 2048. Here is a dump of
eraseblock 18 from the ubinized image:

00240000  55 42 49 23 01 00 00 00  00 00 00 00 00 00 00 00  |UBI#............|
00240010  00 00 08 00 00 00 10 00  04 6c 24 45 00 00 00 00  |.........l$E....|
00240020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00240030  00 00 00 00 00 00 00 00  00 00 00 00 38 be e8 75  |............8..u|
00240040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00240800  55 42 49 21 01 01 00 00  00 00 00 01 00 00 00 01  |UBI!............|
00240810  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00240830  00 00 00 00 00 00 00 00  00 00 00 00 5f 7f ac 08  |............_...|
00240840  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00241000  31 18 10 06 5f 3f 84 94  65 32 01 00 00 00 00 00  |1..._?..e2......|
00241010  00 02 00 00 07 00 00 00  53 2a 00 00 00 00 00 00  |........S*......|
00241020  00 00 00 00 00 00 00 00  02 00 00 00 03 00 00 00  |................|
00241030  f0 01 00 00 e0 8a 01 00  44 00 00 00 e3 01 00 00  |........D.......|
00241040  f0 01 00 00 00 90 01 00  b0 c7 18 00 00 00 00 00  |................|
00241050  00 18 05 00 00 00 00 00  00 49 05 00 00 00 00 00  |.........I......|
00241060  50 77 8a 03 00 00 00 00  78 fc 04 00 00 00 00 00  |Pw......x.......|
00241070  78 b8 01 00 00 00 00 00  08 00 00 00 12 0a 00 00  |x...............|
00241080  08 00 00 00 00 10 00 00  08 00 00 00 1e 0a 00 00  |................|
00241090  00 00 00 00 00 00 00 00  0b 00 00 00 01 00 00 00  |................|
002410a0  0d 00 00 00 f1 01 00 00  00 00 00 00 00 00 00 00  |................|
002410b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00241200  31 18 10 06 ef de 0e ee  00 00 00 00 00 00 00 00  |1...............|
00241210  1c 00 00 00 05 00 00 00  e4 05 00 00 00 00 00 00  |................|
00241220  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00241800  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00260000  55 42 49 23 01 00 00 00  00 00 00 00 00 00 00 00  |UBI#............|
00260010  00 00 08 00 00 00 10 00  04 6c 24 45 00 00 00 00  |.........l$E....|
00260020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00260030  00 00 00 00 00 00 00 00  00 00 00 00 38 be e8 75  |............8..u|
00260040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00260800  55 42 49 21 01 01 00 00  00 00 00 01 00 00 00 02  |UBI!............|
[...]

The data between 0x00241800 and 0x00260000 is all 0xff, so there are
trailing empty pages in this block.

Here is a u-boot nand-dump of eraseblock 18 + 0x1800 when flashed
using the usual u-boot 'nand write':
        ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
[...]
        ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
OOB:
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        3f 27 56 f5 29 d8 61 d9
        9d 14 3f 27 56 f5 29 d8
        61 d9 9d 14 3f 27 56 f5
        29 d8 61 d9 9d 14 3f 27
        56 f5 29 d8 61 d9 9d 14

and here it is again when flashed with the 'nand write' variant that
drops 0xff pages [2]:

        ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
[...]
        ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
OOB:
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff
        ff ff ff ff ff ff ff ff

The former state of eraseblock 18 causes "ubi_io_read: error -74 (ECC
error)" during free-space-fixup and the latter does not.

So I guess my particular situation is a problem with the davinci nand
driver's ECC for 0xFF data and is _not_ covered by the free space
fixup? It would be really nice if the free space fixup supported both
1) setups that are such that the flash cannot be written to more than
once and 2) setups that are such that they return bogus -74 errors.
Both are caused by not dropping trailing 0xff pages when writing.

Best Regards,
Ben Gardiner

[1] http://article.gmane.org/gmane.linux.drivers.mtd/34890
[2] http://article.gmane.org/gmane.comp.boot-loaders.u-boot/98740

---
Nanometrics Inc.
http://www.nanometrics.ca
Matthew L. Creech May 19, 2011, 3:59 p.m. UTC | #3
On Thu, May 19, 2011 at 9:28 AM, Ben Gardiner
<bengardiner@nanometrics.ca> wrote:
>
> So I guess my particular situation is a problem with the davinci nand
> driver's ECC for 0xFF data and is _not_ covered by the free space
> fixup?
>

Yeah, that's what it sounds like.

Can you reproduce the -EBADMSG with a simpler test case that leaves
UBI/UBIFS out of the equation?  Maybe enable write-verify in U-Boot
and try using "nand write" again, or use mtd-utils "nandwrite" from an
initramfs image then try to read back the data -- that should give a
simpler test case that shows whether the ECC handling of empty space
is wrong.

If so, that probably needs to be fixed at the MTD level, since it will
affect more than this particular UBIFS case.
Artem Bityutskiy May 20, 2011, 6:21 a.m. UTC | #4
[Rmoving Adiran Hunter from CC - he's left Nokia and his e-mail does not
work anyway]

Hi,

On Thu, 2011-05-19 at 09:28 -0400, Ben Gardiner wrote:
> On Mon, Apr 25, 2011 at 2:37 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> > We encountered one case in which we were re-flashing a device for
> > testing using U-Boot's "nand erase", and got the "ubi_io_read: error
> > -74" error from the FAQ.  That's no big deal, since we never do this
> > in the field, and clearly "nand erase" isn't something we'd want to do
> > even without this problem since it loses erase-counter info.
> 
> Because the "ubi_io_read: error -74 (ECC error)" is precisely what I
> am encountering on my hardware when I do not flash with a utility that
> drops empty pages at the end of eraseblocks. I imagined that this was
> also the case for you. But I have also read that there are
> peculiarities of the davinci nand driver (both in u-boot and linux).
> 
> So, at least on my hardware, the -74 error is expected when the 0xff
> pages are not dropped and so without the 'err != -EBADMSG' exception
> the free space fixup will cause the volume to fail mount for me:

I am confused. The fix Matthew made is about the following situation:

1. You have completely erased flash (MTD partition) - no one has ever
written there. If you now read the flash, you'll get all 0xFFs with no
errors.

2. You use a "dumb" flasher to program an UBI image. This flasher will
write empty NAND pages "as is". If you now read the flash after the
"dumb" programming, you should have no errors.

3. You mount UBIFS for the very first time. It tries to fix up your
flash. Whatever eraseblock UBIFS reads, it should not encounter any
error.

Isn't it weird that a freshly programmed flash cannot be read without
-EBADMSG (ECC correction failure).

Why Matthew's patches are needed then? They are needed to _prevent_
UBIFS from writing to the NAND pages which have been programmed with all
0xFFs.
Artem Bityutskiy May 20, 2011, 6:29 a.m. UTC | #5
On Thu, 2011-05-19 at 09:28 -0400, Ben Gardiner wrote:
> So, at least on my hardware, the -74 error is expected when the 0xff
> pages are not dropped and so without the 'err != -EBADMSG' exception
> the free space fixup will cause the volume to fail mount for me:

Still confused. If you have empty and erased flash, then you program it
with all 0xFFs, you should be able to read it with no errors. If this is
not the case for you, you should fix the driver. UBIFS cannot help in
this case - we consider it as "broken flash driver" case.

I have not read your e-mail carefully because of limited amount of time,
but with quick reading I became confused - too much information :-)
Ben Gardiner May 24, 2011, 2:33 p.m. UTC | #6
Hi Artem,

On Fri, May 20, 2011 at 2:21 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Thu, 2011-05-19 at 09:28 -0400, Ben Gardiner wrote:
>> [...]
>>
>> Because the "ubi_io_read: error -74 (ECC error)" is precisely what I
>> am encountering on my hardware when I do not flash with a utility that
>> drops empty pages at the end of eraseblocks. I imagined that this was
>> also the case for you. But I have also read that there are
>> peculiarities of the davinci nand driver (both in u-boot and linux).
>>
>> So, at least on my hardware, the -74 error is expected when the 0xff
>> pages are not dropped and so without the 'err != -EBADMSG' exception
>> the free space fixup will cause the volume to fail mount for me:
>
> I am confused. The fix Matthew made is about the following situation:

Sorry for the confusion. In the first 'Programming ubinized images'
thread I saw that Matthew mentioned seeing lots of -74 errors so I was
assuming that his and my problems are one and the same. Now that I
have tested his patch I am seeing that this is not the case.

> 1. You have completely erased flash (MTD partition) - no one has ever
> written there. If you now read the flash, you'll get all 0xFFs with no
> errors.
>
> 2. You use a "dumb" flasher to program an UBI image. This flasher will
> write empty NAND pages "as is". If you now read the flash after the
> "dumb" programming, you should have no errors.
>
> 3. You mount UBIFS for the very first time. It tries to fix up your
> flash. Whatever eraseblock UBIFS reads, it should not encounter any
> error.

Yes I agree with the context for the patch. I can say so far that on
da850evm 1. occurs as you described whereas for 2. and 3. ECC errors
are encountered on read.

> Isn't it weird that a freshly programmed flash cannot be read without
> -EBADMSG (ECC correction failure).

Yes it is very weird. :) I had been assuming up to this point that the
-74 errors were the result of writing the 0xff pages when they should
be dropped. I can see now that this is not the case.

On Fri, May 20, 2011 at 2:29 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> On Thu, 2011-05-19 at 09:28 -0400, Ben Gardiner wrote:
>> So, at least on my hardware, the -74 error is expected when the 0xff
>> pages are not dropped and so without the 'err != -EBADMSG' exception
>> the free space fixup will cause the volume to fail mount for me:
>
> Still confused. If you have empty and erased flash, then you program it
> with all 0xFFs, you should be able to read it with no errors.

Right. In retrospect this seems to be a self-evident statement :)
I had been previously attributing the -74 errors to a combination of
1) writing 0xff pages when one should not and 2) lack of subpage
writing support workaround with -O 2048. But it is clearly (to me now)
a reading operation failure.

> If this is
> not the case for you, you should fix the driver. UBIFS cannot help in
> this case - we consider it as "broken flash driver" case.

Thank you. Yes, this appears to be the case and I'm glad to hear
confirmation of this from an expert such as yourself.

> I have not read your e-mail carefully because of limited amount of time,
> but with quick reading I became confused - too much information :-)

No problem. Thank you very much for what you have read and for your
insights into the ongoing NAND flash problems on da850evm.

Best Regards,
Ben Gardiner

---
Nanometrics Inc.
http://www.nanometrics.ca
diff mbox

Patch

diff --git a/fs/ubifs/sb.c b/fs/ubifs/sb.c
index c606f01..6e7da54 100644
--- a/fs/ubifs/sb.c
+++ b/fs/ubifs/sb.c
@@ -1,3 +1,4 @@ 
+#define DEBUG
 /*
  * This file is part of UBIFS.
  *
@@ -679,7 +680,12 @@  static int fixup_leb(struct ubifs_info *c, int lnum, int len)
 
 	dbg_mnt("fixup LEB %d, data len %d", lnum, len);
 	err = ubi_read(c->ubi, lnum, c->sbuf, 0, len);
-	if (err)
+	/*
+	 * Don't fail on -EBADMSG since these are precisely the error codes that
+	 * are returned by ubi_red in the cases where free-space fix-ups are
+	 * required.
+	 */
+	if (err && err != -EBADMSG)
 		return err;
 
 	return ubi_leb_change(c->ubi, lnum, c->sbuf, len, UBI_UNKNOWN);