Patchwork [2/2] UBIFS: add unstable pages problem description

Submitter Artem Bityutskiy
Date Oct. 18, 2010, 10:02 a.m.
Message ID <1287396125-1890-1-git-send-email-dedekind1@gmail.com>
Permalink /patch/68158/
State New

Comments

Artem Bityutskiy - Oct. 18, 2010, 10:02 a.m.
From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>

Describe a problem reported by Matthieu CASTET which is currently
not handled by UBIFS.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 fs/ubifs/replay.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)
Artem Bityutskiy - Oct. 19, 2010, 7:57 a.m.
On Mon, 2010-10-18 at 13:02 +0300, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> 
> Describe a problem reported by Matthieu CASTET which is currently
> not handled by UBIFS.
> 
> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>

Matthieu, are you happy with this description? Does it properly reflect
your findings? Could you please correct it if not?

I'm starting to work on your problem. Since I do not have much time,
I'll do a little every day, but hope to come up with some patches this
week already. The thing is that it is a lot of work. We need to go
through a lot of UBI/UBIFS subsystems and analyze them.

Why a lot of work? Because we assumed everywhere that we can rely on the
CRC - if it is correct, we are safe. However, according to you, this is
not reliable for unstable pages - you have no guarantee that the next
time you read a page you will get correct data.
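
The concern can be illustrated with a small simulation. This is only a sketch under stated assumptions: `sim_page`, `sim_read` and `crc_ok_n_times` are hypothetical names invented here, the "unstable" behaviour (every second read returns one flipped bit) is a made-up model, and the plain CRC-32 merely stands in for the node CRC. None of this is UBI/UBIFS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE 2048

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int b = 0; b < 8; b++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical model of a NAND page that may be unstable. */
struct sim_page {
	uint8_t data[SIM_PAGE_SIZE];
	unsigned int reads;	/* number of reads performed so far */
	int unstable;		/* non-zero: reads are not repeatable */
};

/* Read the page; every second read of an unstable page flips one bit. */
static void sim_read(struct sim_page *p, uint8_t *out)
{
	memcpy(out, p->data, SIM_PAGE_SIZE);
	if (p->unstable && (p->reads % 2) == 1)
		out[0] ^= 0x40;
	p->reads++;
}

/* Returns 1 if n consecutive reads all match the expected CRC, else 0. */
static int crc_ok_n_times(struct sim_page *p, uint32_t expected, int n)
{
	uint8_t buf[SIM_PAGE_SIZE];

	for (int i = 0; i < n; i++) {
		sim_read(p, buf);
		if (crc32_calc(buf, SIM_PAGE_SIZE) != expected)
			return 0;
	}
	return 1;
}
```

In this model a single CRC check of an unstable page passes, while the very next check of the same page fails, which is why one good CRC cannot be taken as proof that the data is safe.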

Also, I do not have HW to test this, so I expect you to help by testing.
Are your testing set-ups kept ready? :-)
Matthieu CASTET - Oct. 20, 2010, 9:52 a.m.
Hi Artem,

Artem Bityutskiy wrote:
> On Mon, 2010-10-18 at 13:02 +0300, Artem Bityutskiy wrote:
>> From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
>>
>> Describe a problem reported by Matthieu CASTET which is currently
>> not handled by UBIFS.
>>
>> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> 
> Matthieu, are you happy with this description? Does it properly reflect
> your findings? Could you please correct it if not?
Yes, that seems correct.

> 
> I'm starting to work on your problem. Since I do not have much time,
> I'll do a little every day, but hope to come up with some patches this
> week already. The thing is that it is a lot of work. We need to go
> through a lot of UBI/UBIFS subsystems and analyze them.
> 
> Why a lot of work? Because we assumed everywhere that we can rely on the
> CRC - if it is correct, we are safe. However, according to you, this is
> not reliable for unstable pages - you have no guarantee that the next
> time you read a page you will get correct data.
> 
> Also, I do not have HW to test this, so I expect you to help by testing.
> Are your testing set-ups kept ready? :-)
> 
Yes, our boards are ready for testing.

But we can also send you flash chips or boards with the problem.
What flash/board do you have on your side?
Could you swap the NAND on your board (via a TSOP socket)?

We could send one of our boards, but the update side can be complex/tricky.

Some BeagleBoards may have the problem, but I am unable to test it.
On the BeagleBoard I have, I got strange ECC errors [1] even without a
reboot. Also, the driver looks strange (for example, it doesn't do bad
block scanning [2]). I ended up with unusable NAND [3]. Do you know if
there is a better version of the NAND driver for the Beagle (I use the
one from ubi-2.6)?

Matthieu

[1]
UBI error: ubi_io_read: error -74 (ECC error) while reading 4144 bytes from PEB 3:45056, read 4144 bytes
[...]
UBI error: do_sync_erase: cannot erase PEB 137, error -5


[2]
For each format I got:
ubiformat: formatting eraseblock 137 -- 53 % complete
ubiformat: error!: failed to erase eraseblock 137
            error 5 (Input/output error)
ubiformat: marking block 137 bad

[3]
# ubiformat /dev/mtd3 -y
ubiformat: mtd3 (nand), size 33554432 bytes (32.0 MiB), 256 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 255 -- 100 % complete
ubiformat: 255 eraseblocks have valid erase counter, mean value is 10
ubiformat: 1 bad eraseblocks found, numbers: 137
ubiformat: warning!: VID header and data offsets on flash are 2048 and 4096, which is different to requested offsets 512 and 28
ubiformat: use new offsets 512 and 2048? (yes/no)  yes
ubiformat: use offsets 512 and 2048
ubiformat: formatting eraseblock 255 -- 100 % complete

# ubiattach /dev/ubi_ctrl -m 3 -d 3
[  166.922119] UBI: attaching mtd3 to ubi3
[  166.926177] UBI: physical eraseblock size:   131072 bytes (128 KiB)
[  166.932495] UBI: logical eraseblock size:    129024 bytes
[  166.937927] UBI: smallest flash I/O unit:    2048
[  166.942657] UBI: sub-page size:              512
[  166.947326] UBI: VID header offset:          512 (aligned 512)
[  166.953186] UBI: data offset:                2048
[  166.958740] Correcting single bit ECC error at offset: 389, bit: 3
[  167.137695] UBI: max. sequence number:       0
[  167.142883] Correcting single bit ECC error at offset: 340, bit: 6
[  167.149108] ecc failure
[  167.151580] Correcting single bit ECC error at offset: 12, bit: 6
[  167.158325] ecc failure
[  167.160797] ecc failure
[  167.163269] Correcting single bit ECC error at offset: 44, bit: 6
[  167.170013] ecc failure
[  167.172485] ecc failure
[  167.175567] ecc failure
[  167.178039] ecc failure
[  167.181121] ecc failure
[  167.183593] ecc failure
[  167.186645] ecc failure
[  167.189147] ecc failure
[  167.192199] Correcting single bit ECC error at offset: 188, bit: 6
[  167.198455] Correcting single bit ECC error at offset: 196, bit: 6
[  167.205291] Correcting single bit ECC error at offset: 220, bit: 6
[  167.211517] Correcting single bit ECC error at offset: 228, bit: 6
[  167.218353] Correcting single bit ECC error at offset: 252, bit: 6
[  167.224578] Correcting single bit ECC error at offset: 260, bit: 6
[  167.231445] Correcting single bit ECC error at offset: 284, bit: 6
[  167.237670] Correcting single bit ECC error at offset: 292, bit: 6
[  167.244537] Correcting single bit ECC error at offset: 316, bit: 6
[  167.250762] Correcting single bit ECC error at offset: 324, bit: 6
[  167.256988] UBI error: ubi_io_read: error -74 (ECC error) while reading 22528 bytes from PEB 0:2048, read 22528 bytes
[  167.267700] [<c0034d5c>] (unwind_backtrace+0x0/0xf4) from [<c01db4dc>] (ubi_io_read+0x1b0/0x340)
[  167.276580] [<c01db4dc>] (ubi_io_read+0x1b0/0x340) from [<c01d1728>] (ubi_read_volume_table+0xbc/0xa44)
[  167.286071] [<c01d1728>] (ubi_read_volume_table+0xbc/0xa44) from [<c01d537c>] (ubi_attach_mtd_dev+0x674/0xcd0)
[  167.296173] [<c01d537c>] (ubi_attach_mtd_dev+0x674/0xcd0) from [<c01d5b80>] (ctrl_cdev_ioctl+0xec/0x164)
[  167.305725] [<c01d5b80>] (ctrl_cdev_ioctl+0xec/0x164) from [<c00d63d0>] (do_vfs_ioctl+0x7c/0x5f8)
[  167.314697] [<c00d63d0>] (do_vfs_ioctl+0x7c/0x5f8) from [<c00d6984>] (sys_ioctl+0x38/0x60)
[  167.323028] [<c00d6984>] (sys_ioctl+0x38/0x60) from [<c00300c0>] (ret_fast_syscall+0x0/0x30)
[  167.332214] Correcting single bit ECC error at offset: 340, bit: 6
[  167.338470] ecc failure
[  167.340942] Correcting single bit ECC error at offset: 12, bit: 6
[  167.347686] ecc failure
[  167.350158] ecc failure
[  167.352630] Correcting single bit ECC error at offset: 44, bit: 6
[  167.359375] ecc failure
[  167.361846] ecc failure
[  167.364929] ecc failure
[  167.367401] ecc failure
[  167.370452] ecc failure
[  167.372955] ecc failure
[  167.376007] ecc failure
[  167.378479] ecc failure
[  167.381561] Correcting single bit ECC error at offset: 188, bit: 6
[  167.387786] Correcting single bit ECC error at offset: 196, bit: 6
[  167.394653] Correcting single bit ECC error at offset: 220, bit: 6
[  167.400848] Correcting single bit ECC error at offset: 228, bit: 6
[  167.407714] Correcting single bit ECC error at offset: 252, bit: 6
[  167.413940] Correcting single bit ECC error at offset: 260, bit: 6
[  167.420806] Correcting single bit ECC error at offset: 284, bit: 6
[  167.427032] Correcting single bit ECC error at offset: 292, bit: 6
[  167.433868] Correcting single bit ECC error at offset: 316, bit: 6
[  167.440124] Correcting single bit ECC error at offset: 324, bit: 6
[  167.446350] UBI error: ubi_io_read: error -74 (ECC error) while reading 22528 bytes from PEB 1:2048, read 22528 bytes
[  167.457031] [<c0034d5c>] (unwind_backtrace+0x0/0xf4) from [<c01db4dc>] (ubi_io_read+0x1b0/0x340)
[  167.465911] [<c01db4dc>] (ubi_io_read+0x1b0/0x340) from [<c01d1728>] (ubi_read_volume_table+0xbc/0xa44)
[  167.475402] [<c01d1728>] (ubi_read_volume_table+0xbc/0xa44) from [<c01d537c>] (ubi_attach_mtd_dev+0x674/0xcd0)
[  167.485473] [<c01d537c>] (ubi_attach_mtd_dev+0x674/0xcd0) from [<c01d5b80>] (ctrl_cdev_ioctl+0xec/0x164)
[  167.495056] [<c01d5b80>] (ctrl_cdev_ioctl+0xec/0x164) from [<c00d63d0>] (do_vfs_ioctl+0x7c/0x5f8)
[  167.503997] [<c00d63d0>] (do_vfs_ioctl+0x7c/0x5f8) from [<c00d6984>] (sys_ioctl+0x38/0x60)
[  167.512329] [<c00d6984>] (sys_ioctl+0x38/0x60) from [<c00300c0>] (ret_fast_syscall+0x0/0x30)
[  167.520874] UBI error: vtbl_check: bad CRC at record 1: 0xf116c36b, not 0xb116c36b
[  167.528594] UBI error: vtbl_check: bad CRC at record 1: 0xf116c36b, not 0xb116c36b
[  167.536285] UBI error: process_lvol: both volume tables are corrupted
[  167.542877] UBI error: ubi_attach_mtd_dev: failed to attach by scanning, error -22
ubiattach: error!: cannot attach mtd3
            error 22 (Invalid argument)
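
Two numbers in the log above can be checked by hand, and the following sketch does so in C. The helper names `bit_diff32` and `leb_size` are made up for illustration and are not UBI functions. The two CRC values printed by vtbl_check differ in a single bit, which would be consistent with a bit-flip rather than with wholesale corruption (an interpretation, not something the log proves), and the logical eraseblock size reported by UBI equals the physical eraseblock size minus the data offset.

```c
#include <assert.h>
#include <stdint.h>

/* Count the bits in which two 32-bit values differ. */
static int bit_diff32(uint32_t a, uint32_t b)
{
	uint32_t x = a ^ b;
	int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Logical eraseblock size: physical eraseblock size minus data offset. */
static uint32_t leb_size(uint32_t peb_size, uint32_t data_offset)
{
	assert(data_offset < peb_size);
	return peb_size - data_offset;
}
```

With the values from the log, `bit_diff32(0xf116c36b, 0xb116c36b)` gives 1 and `leb_size(131072, 2048)` gives 129024, matching the "logical eraseblock size" line.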

Patch

diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index eed0fcf..e04d74a 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -32,6 +32,28 @@ 
  * larger is the journal, the more memory its index may consume.
  */
 
+/*
+ * Problem description: unstable pages after unclean power cut on NAND flashes.
+ *
+ * If a power cut happens while a NAND page program is ongoing, this page
+ * becomes unstable. The following situations are possible when we mount this
+ * flash next time and UBIFS reads the page.
+ *   o The page may look like it is empty, i.e., it contains only 0xFFs, but
+ *     if we write data there, the data become corrupted. I.e., when the data
+ *     are read, we may get ECC errors. Moreover, the page may be read with no
+ *     errors sometimes, with an ECC error next time, with a bit-flip next
+ *     time, etc.
+ *   o The page may have a bit-flip, but when it is read next time, it may
+ *     have ECC errors or no errors at all.
+ *   o A UBIFS node may have a correct CRC, but when it is read next time, it
+ *     may have a CRC error.
+ *
+ * IOW, these unstable pages are a disaster. UBIFS has to handle them correctly:
+ * never write to them and never rely on their contents.
+ *
+ * TODO: handle this for buds, log, orphan area, and master area.
+ */
+
 #include "ubifs.h"
 
 /*
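
The first case in the comment above (a page that reads as all 0xFF but is not safe to write) can be sketched as follows. This is a hypothetical illustration only: `page_looks_empty`, `first_empty_page` and `next_write_page` are invented names, and the "skip the first empty-looking page after an unclean power cut" policy is an assumption made for the example, not what the patch or UBIFS implements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every byte of the page reads as 0xFF, i.e. it looks erased. */
static int page_looks_empty(const uint8_t *page, size_t page_size)
{
	assert(page != NULL);
	for (size_t i = 0; i < page_size; i++)
		if (page[i] != 0xFF)
			return 0;
	return 1;
}

/* Index of the first page that looks empty, or npages if none does. */
static size_t first_empty_page(const uint8_t *leb, size_t page_size,
			       size_t npages)
{
	for (size_t p = 0; p < npages; p++)
		if (page_looks_empty(leb + p * page_size, page_size))
			return p;
	return npages;
}

/*
 * Hypothetical policy, not UBIFS code: after an unclean power cut the first
 * empty-looking page may have been mid-program, so skip it instead of
 * writing to it. Returns npages when no page is left to write.
 */
static size_t next_write_page(const uint8_t *leb, size_t page_size,
			      size_t npages, int unclean)
{
	size_t first = first_empty_page(leb, page_size, npages);

	if (unclean && first < npages)
		first++;
	return first;
}
```

The point of the sketch is only that "reads as 0xFF" and "safe to write" are different properties. Whether skipping a single page would be sufficient on real hardware is not something the thread establishes.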