Patchwork ubi deadlock on .36+

Submitter Artem Bityutskiy
Date Nov. 13, 2010, 1:15 p.m.
Message ID <1289654101.2218.51.camel@localhost>
Permalink /patch/71052/
State Accepted
Commit 276832d878d8a892ac7b40fd0ee07fe757e080c7

Comments

Artem Bityutskiy - Nov. 13, 2010, 1:15 p.m.
On Sat, 2010-11-13 at 14:37 +0200, Artem Bityutskiy wrote:
> On Thu, 2010-11-04 at 15:07 +0200, Grazvydas Ignotas wrote:
> > On Thu, Nov 4, 2010 at 9:29 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> > > On Wed, 2010-11-03 at 23:30 +0200, Grazvydas Ignotas wrote:
> > >> Hi,
> > >>
> > >> there seems to be some issue with NAND on my OMAP3 board that causes
> > >> CRC errors on 2.6.36 and 2.6.37-rc1. Those seem to be triggering a bug
> > >> in UBI that makes it loop forever (or very long) printing this:
> > >>
> > >> uncorrectable error :
> > >> UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes
> > >> from PEB 0:512, read 512 bytes
> > >> uncorrectable error :
> > >> UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes
> > >> from PEB 68:512, read 512 bytes
> > >> UBI: run torture test for PEB 68
> > >> UBI: PEB 68 passed torture test, do not mark it a bad
> > >>
> > >>
> > >> here is full log of one minute session, after which I killed power:
> > >> http://notaz.gp2x.de/misc/pnd/logs/linux_20101103_ubi_lockup
> > >
> > > Hmm, could you please enable UBI debugging and provide me the logs? See
> > > here for some hints:
> > > http://www.linux-mtd.infradead.org/doc/ubi.html#L_how_send_bugreport
> > 
> > done:
> > http://notaz.gp2x.de/misc/pnd/logs/linux_20101103_ubi_lockup2
> 
> But would it be possible to enable all UBI debugging messages?

While trying to figure out what is happening in your system, I realized
one possible scenario which may confuse UBI. I've added a patch below.
This probably won't fix your issue (but you could try), I need more time
to think about what was happening. But a log with all messages (not only
I/O) would help. Thanks.

From 703ba5f120644fefef3cfed46c0d8ccf6a15b4ee Mon Sep 17 00:00:00 2001
From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date: Sat, 13 Nov 2010 15:08:29 +0200
Subject: [PATCH] UBI: improve UBI robustness

Before reading data from the flash, deliberately corrupt the buffer we
are about to read into. The idea is to guard against the following
possible situation:

1. The buffer contains data from a previous operation, e.g., a read from
   another PEB. The data looks as expected, so if we just do not read
   anything and return, the caller would not notice. E.g., if we are
   reading a VID header, the buffer may contain a valid VID header from
   another PEB.
2. The driver is buggy and returns us success or -EBADMSG or
   -EUCLEAN, but it does not actually put any data into the buffer.

This may confuse UBI or upper layers - they may think the buffer
contains valid data while in fact it is just old data. This is
especially possible because UBI (and UBIFS) relies on CRC, and
treats data as correct even in case of ECC errors if the CRC is
correct.

Try to prevent this situation by changing the first byte of the
buffer.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 drivers/mtd/ubi/io.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)
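
The scenario in the commit message can be reproduced in a small user-space sketch. Everything below is hypothetical and only illustrates the idea: toy_crc32() stands in for the kernel's CRC helper, struct toy_hdr is a made-up header layout (not the real VID header), and buggy_read() models a driver that reports success without writing to the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t toy_crc32(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return ~crc;
}

/* Hypothetical header: a payload protected by a CRC over that payload. */
struct toy_hdr {
	uint8_t payload[12];
	uint32_t crc;
};

/* Build a header whose CRC matches its payload. */
static void toy_hdr_fill(struct toy_hdr *h, uint8_t seed)
{
	memset(h->payload, seed, sizeof(h->payload));
	h->crc = toy_crc32(h->payload, sizeof(h->payload));
}

/* The caller's only validity check: does the stored CRC match? */
static int toy_hdr_valid(const struct toy_hdr *h)
{
	return h->crc == toy_crc32(h->payload, sizeof(h->payload));
}

/* Models a buggy driver: claims success but never touches the buffer. */
static int buggy_read(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 0;
}

/*
 * Read wrapper. When poison is non-zero, flip the first byte of the
 * buffer before calling the driver, as the patch does in ubi_io_read().
 */
static int toy_io_read(void *buf, size_t len, int poison)
{
	if (poison)
		*((uint8_t *)buf) ^= 0xFF;
	return buggy_read(buf, len);
}
```

If a header is filled with toy_hdr_fill() and then "re-read" through toy_io_read() with poison set to 0, toy_hdr_valid() still accepts the stale contents. With poison set to 1 the flipped first byte no longer matches the stored CRC, so the stale data is rejected.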
Grazvydas Ignotas - Nov. 13, 2010, 2:23 p.m.
On Sat, Nov 13, 2010 at 3:15 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> While trying to figure out what is happening in your system, I realized
> one possible scenario which may confuse UBI. I've added a patch below.
> This probably won't fix your issue (but you could try), I need more time
> to think about what was happening. But a log with all messages (not only
> I/O) would help. Thanks.

Well I think I already know what's wrong with my driver - it has
subpage reads broken. So UBI tries to read a subpage, driver fails
there, then it runs a torture test on full PEB that passes (because
page reads work right), marks that PEB as good and retries the subpage
read that fails again, and the story repeats. Does that sound like a
reasonable scenario, or do you still want more debugging logs?
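
The cycle described above can be modelled in a few lines. This is a toy model of the description in this message only, not UBI code: all names are hypothetical, and the max_attempts cap exists only so the sketch terminates (nothing in this thread says UBI has such a cap).

```c
#include <stdbool.h>

/* Same numeric value as EBADMSG (74), the -74 seen in the log above. */
enum { TOY_EBADMSG = 74 };

/* Toy driver: full-page reads work, subpage reads always fail. */
static int toy_read(bool subpage)
{
	return subpage ? -TOY_EBADMSG : 0;
}

/* The torture test only uses full-page I/O, so it always passes here. */
static bool toy_torture_passes(void)
{
	return toy_read(false) == 0;
}

/*
 * Retry a subpage read, running the torture test after each failure.
 * Stores the number of read attempts in *attempts and returns the
 * last error code (0 on success).
 */
static int toy_read_with_recovery(int max_attempts, int *attempts)
{
	int err = -TOY_EBADMSG;

	*attempts = 0;
	while (*attempts < max_attempts) {
		(*attempts)++;
		err = toy_read(true);
		if (err == 0)
			break;		/* read succeeded */
		if (!toy_torture_passes())
			break;		/* PEB would be treated as bad */
		/* PEB judged good, so the same failing read is retried. */
	}
	return err;
}
```

In this model neither exit condition is ever met: the subpage read never succeeds and the torture test never fails, so the loop always runs until the artificial cap is reached.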
Artem Bityutskiy - Nov. 14, 2010, 7:50 a.m.
On Sat, 2010-11-13 at 16:23 +0200, Grazvydas Ignotas wrote:
> On Sat, Nov 13, 2010 at 3:15 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> > While trying to figure out what is happening in your system, I realized
> > one possible scenario which may confuse UBI. I've added a patch below.
> > This probably won't fix your issue (but you could try), I need more time
> > to think about what was happening. But a log with all messages (not only
> > I/O) would help. Thanks.
> 
> Well I think I already know what's wrong with my driver - it has
> subpage reads broken. So UBI tries to read a subpage, driver fails
> there, then it runs a torture test on full PEB that passes (because
> page reads work right), marks that PEB as good and retries the subpage
> read that fails again, and the story repeats. Does that sound like a
> reasonable scenario, or do you still want more debugging logs?

Yeah, obviously you have driver problems, I'm just interested in
improving UBI's resilience.

Patch

diff --git a/drivers/mtd/ubi/io.c b/drivers/mtd/ubi/io.c
index c2960ac..9ab1a33 100644
--- a/drivers/mtd/ubi/io.c
+++ b/drivers/mtd/ubi/io.c
@@ -146,6 +146,28 @@  int ubi_io_read(const struct ubi_device *ubi, void *buf, int pnum, int offset,
 	if (err)
 		return err;
 
+	/*
+	 * Deliberately corrupt the buffer to improve robustness. Indeed, if we
+	 * do not do this, the following may happen:
+	 * 1. The buffer contains data from previous operation, e.g., read from
+	 *    another PEB previously. The data looks like expected, e.g., if we
+	 *    just do not read anything and return - the caller would not
+	 *    notice this. E.g., if we are reading a VID header, the buffer may
+	 *    contain a valid VID header from another PEB.
+	 * 2. The driver is buggy and returns us success or -EBADMSG or
+	 *    -EUCLEAN, but it does not actually put any data to the buffer.
+	 *
+	 * This may confuse UBI or upper layers - they may think the buffer
+	 * contains valid data while in fact it is just old data. This is
+	 * especially possible because UBI (and UBIFS) relies on CRC, and
+	 * treats data as correct even in case of ECC errors if the CRC is
+	 * correct.
+	 *
+	 * Try to prevent this situation by changing the first byte of the
+	 * buffer.
+	 */
+	*((uint8_t *)buf) ^= 0xFF;
+
 	addr = (loff_t)pnum * ubi->peb_size + offset;
 retry:
 	err = ubi->mtd->read(ubi->mtd, addr, len, &read, buf);