diff mbox

ubifs became broken on contigous power-fails

Message ID 1274617419.22999.24.camel@localhost
State New, archived
Headers show

Commit Message

Artem Bityutskiy May 23, 2010, 12:23 p.m. UTC
On Sun, 2010-05-23 at 14:28 +0300, Artem Bityutskiy wrote:
> On Tue, 2010-05-11 at 18:43 +0400, Alexander Pazdnikov wrote:
> > Hello.
> > 
> > We are stress-testing 8 devices by power loss in 5 minutes interval.
> > Device uses sqlite database to store collected data, every 1 minute accumulated data (500-1000 records) is stored into database in transaction.
> > 
> > ubifs (ubi2:dbfs on /usr/local/ecom/db bellow) with database on 6 of 8 devices after different time (1-3 days) became broken.
> > 
> > Any advise for futher debugging or solving this problem is highly appriciated.
> > 
> > 
> > kernel 2.6.32.12
> > 
> > suspicious ->       reserved GC LEB:     -1
> > 
> > # cat /proc/mtd
> > dev:    size   erasesize  name
> > mtd0: 00020000 00020000 "bootstrap"
> > mtd1: 00080000 00020000 "uboot"
> > mtd2: 00020000 00020000 "uboot_env1"
> > mtd3: 00020000 00020000 "uboot_env2"
> > mtd4: 02000000 00020000 "ubi_main"
> > mtd5: 02000000 00020000 "ubi_var"
> > mtd6: 0bf00000 00020000 "ubi_database"
> > 
> > 
> > mounting ubi2:dbfs on startup 
> > [   14.328117] UBIFS: recovery needed
> > [   53.941378] UBIFS error (pid 462): ubifs_rcvry_gc_commit: could not find a dirty LEB
> 
> This is must be a bug. UBIFS should always have space for GC. I will
> think how we can track this down, although I have a very limited amount
> of time.

'ubifs_rcvry_gc_commit()' is poorly documented and I do not remember how
it works already (argh! and it was written by Adrian :-) ), but I cooked
up a patch for you. I'll try to look at this issue some more later and
make sure I really understand how 'ubifs_rcvry_gc_commit()' works. But
for now, try the following patch:

>From 019acd23e01e33406eb325fadb9bd89fd4e8c374 Mon Sep 17 00:00:00 2001
From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date: Sun, 23 May 2010 15:20:21 +0300
Subject: [PATCH] UBIFS: try harder to find gc_leb

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 fs/ubifs/recovery.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

Comments

Alexander Pazdnikov May 24, 2010, 6:49 a.m. UTC | #1
Artem, thank you very much, now applying your patches and making new stress-tests :-)

Also last holidays have caught another issue.

We've modified our program to primary mount ubi2:dbfs in RO, then sleep 3 secs, then remount in RW.
Today have found half (5/10) of devices with the error : No space left on device
If mount by hands this stalled fs primary in RW mode, then everything goes OK.
If drop down mount "mount ubi2:dbfs in RO, then sleep 3 secs", leave only mount in RW, everything also goes OK.

[    6.576168] UBIFS: recovery needed
[    6.665035] UBIFS: recovery deferred
[    6.668941] UBIFS: mounted UBI device 2, volume 0, name "dbfs"
[    6.674801] UBIFS: mounted read-only
[    6.678707] UBIFS: file system size:   189407232 bytes (184968 KiB, 180 MiB, 1468 LEBs)
[    6.687496] UBIFS: journal size:       9418752 bytes (9198 KiB, 8 MiB, 73 LEBs)
[    6.694332] UBIFS: media format:       w4/r0 (latest is w4/r0)
[    6.700191] UBIFS: default compressor: lzo
[    6.705074] UBIFS: reserved for root:  4952683 bytes (4836 KiB)
[    9.787104] UBIFS: completing deferred recovery
Remounting ubi2:dbfs on /usr/local/ecom/db failed: No space left on device

# df
Filesystem                Size      Used Available Use% Mounted on
ubi0:rootfs              13.2M      7.2M      6.0M  55% /
tmpfs                     4.0K         0      4.0K   0% /dev
shm                       1.0M         0      1.0M   0% /dev/shm
tmpfs                     1.0M     12.0K   1012.0K   1% /tmp
tmpfs                     1.0M      4.0K   1020.0K   0% /root
ubi1:logs                11.5M      2.4M      8.4M  22% /var/log
ubi1:tmp                 13.4M     16.0K     12.7M   0% /usr/local/ecom/tmp
ubi0:local                2.1M     44.0K      1.9M   2% /var/local
ubi:config                2.0M     68.0K      1.9M   3% /usr/local/ecom/conf
ubi2:dbfs               169.2M     18.7M    145.8M  11% /usr/local/ecom/db
Artem Bityutskiy May 24, 2010, 8:18 a.m. UTC | #2
On Mon, 2010-05-24 at 12:49 +0600, Alexander Pazdnikov wrote:
> Artem, thank you very much, now applying your patches and making new stress-tests :-)
> 
> Also last holidays have caught another issue.
> 
> We've modified our program to primary mount ubi2:dbfs in RO, then sleep 3 secs, then remount in RW.
> Today have found half (5/10) of devices with the error : No space left on device
> If mount by hands this stalled fs primary in RW mode, then everything goes OK.
> If drop down mount "mount ubi2:dbfs in RO, then sleep 3 secs", leave only mount in RW, everything also goes OK.
> 
> [    6.576168] UBIFS: recovery needed
> [    6.665035] UBIFS: recovery deferred
> [    6.668941] UBIFS: mounted UBI device 2, volume 0, name "dbfs"
> [    6.674801] UBIFS: mounted read-only
> [    6.678707] UBIFS: file system size:   189407232 bytes (184968 KiB, 180 MiB, 1468 LEBs)
> [    6.687496] UBIFS: journal size:       9418752 bytes (9198 KiB, 8 MiB, 73 LEBs)
> [    6.694332] UBIFS: media format:       w4/r0 (latest is w4/r0)
> [    6.700191] UBIFS: default compressor: lzo
> [    6.705074] UBIFS: reserved for root:  4952683 bytes (4836 KiB)
> [    9.787104] UBIFS: completing deferred recovery
> Remounting ubi2:dbfs on /usr/local/ecom/db failed: No space left on device
> 
> # df
> Filesystem                Size      Used Available Use% Mounted on
> ubi0:rootfs              13.2M      7.2M      6.0M  55% /
> tmpfs                     4.0K         0      4.0K   0% /dev
> shm                       1.0M         0      1.0M   0% /dev/shm
> tmpfs                     1.0M     12.0K   1012.0K   1% /tmp
> tmpfs                     1.0M      4.0K   1020.0K   0% /root
> ubi1:logs                11.5M      2.4M      8.4M  22% /var/log
> ubi1:tmp                 13.4M     16.0K     12.7M   0% /usr/local/ecom/tmp
> ubi0:local                2.1M     44.0K      1.9M   2% /var/local
> ubi:config                2.0M     68.0K      1.9M   3% /usr/local/ecom/conf
> ubi2:dbfs               169.2M     18.7M    145.8M  11% /usr/local/ecom/db

Could you enable debugging messages when mounging it after power cuts?
The messages generate a lot of noise, so you need to disable them before
starting the real test. The idea is to get UBIFS debugging messages when
errors occur.

You can read about how to enable/disable the messages here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_how_send_bugreport
and in the
Documentation/filesystems/ubifs.txt

The problems you see are probably budgeting bugs. Please, read this
section to get some idea what is the UBIFS budgeting subsystem about:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_spaceacc

I have a test which reproduces budgeting bugs here:
git://git.infradead.org/users/dedekind/ubifs-userspace.git

see the "freespace" test and the 'freespace.sh' script which runs the
freespace test on various NAND/NOR size and geometry flashes.

I never had time to really dig this and fix. Would be nice if you could
do this because I will unlikely have time in the nearest future, but I
can help you.

One think you can do is to make sure you do not fill UBIFS completely,
then you will avoid hitting these budgeting bugs, also known as ENOSPC
issues (I think btrfs uses this terminology).

E.g., if you work under a normal user, you can reserve more space for
the root (see -R mkfs.ubifs option). But of course, ideally we should
just fix the issues.
Alexander Pazdnikov May 24, 2010, 9:50 a.m. UTC | #3
> Could you enable debugging messages when mounging it after power cuts?
> The messages generate a lot of noise, so you need to disable them before
> starting the real test. The idea is to get UBIFS debugging messages when
> errors occur.
Sorry, I've hurried up, and after applying your pathes reflash all of the 
devices with new firmware.
But I'll setup some devices with all firmware to catch this issue again.

> I never had time to really dig this and fix. Would be nice if you could
> do this because I will unlikely have time in the nearest future, but I
> can help you.
Thank you, I'll continue testings and can supply with debug information.

> One think you can do is to make sure you do not fill UBIFS completely,
> then you will avoid hitting these budgeting bugs, also known as ENOSPC
> issues (I think btrfs uses this terminology).
In the issue above there is 11% used space and 89% free space.
I've got this issue only when mounting fs in RO and then remounting it in RW.
Artem Bityutskiy May 24, 2010, 9:56 a.m. UTC | #4
On Mon, 2010-05-24 at 15:50 +0600, Alexander Pazdnikov wrote:
> > Could you enable debugging messages when mounging it after power cuts?
> > The messages generate a lot of noise, so you need to disable them before
> > starting the real test. The idea is to get UBIFS debugging messages when
> > errors occur.
> Sorry, I've hurried up, and after applying your pathes reflash all of the 
> devices with new firmware.
> But I'll setup some devices with all firmware to catch this issue again.
> 
> > I never had time to really dig this and fix. Would be nice if you could
> > do this because I will unlikely have time in the nearest future, but I
> > can help you.
> Thank you, I'll continue testings and can supply with debug information.
> 
> > One think you can do is to make sure you do not fill UBIFS completely,
> > then you will avoid hitting these budgeting bugs, also known as ENOSPC
> > issues (I think btrfs uses this terminology).
> In the issue above there is 11% used space and 89% free space.
> I've got this issue only when mounting fs in RO and then remounting it in RW.

OK. Try my patch. Also, it is worth running your tests with UBIFS extra
self checks enabled.
Artem Bityutskiy June 12, 2010, 2:22 p.m. UTC | #5
On Mon, 2010-05-24 at 12:56 +0300, Artem Bityutskiy wrote:
> On Mon, 2010-05-24 at 15:50 +0600, Alexander Pazdnikov wrote:
> > > Could you enable debugging messages when mounging it after power cuts?
> > > The messages generate a lot of noise, so you need to disable them before
> > > starting the real test. The idea is to get UBIFS debugging messages when
> > > errors occur.
> > Sorry, I've hurried up, and after applying your pathes reflash all of the 
> > devices with new firmware.
> > But I'll setup some devices with all firmware to catch this issue again.
> > 
> > > I never had time to really dig this and fix. Would be nice if you could
> > > do this because I will unlikely have time in the nearest future, but I
> > > can help you.
> > Thank you, I'll continue testings and can supply with debug information.
> > 
> > > One think you can do is to make sure you do not fill UBIFS completely,
> > > then you will avoid hitting these budgeting bugs, also known as ENOSPC
> > > issues (I think btrfs uses this terminology).
> > In the issue above there is 11% used space and 89% free space.
> > I've got this issue only when mounting fs in RO and then remounting it in RW.
> 
> OK. Try my patch. Also, it is worth running your tests with UBIFS extra
> self checks enabled.

Did it help?
Alexander Pazdnikov June 15, 2010, 5:02 a.m. UTC | #6
В сообщении от Суббота 12 июня 2010 20:22:55 автор Artem Bityutskiy написал:
> On Mon, 2010-05-24 at 12:56 +0300, Artem Bityutskiy wrote:
> > On Mon, 2010-05-24 at 15:50 +0600, Alexander Pazdnikov wrote:
> > > > Could you enable debugging messages when mounging it after power
> > > > cuts? The messages generate a lot of noise, so you need to disable
> > > > them before starting the real test. The idea is to get UBIFS
> > > > debugging messages when errors occur.
> > >
> > > Sorry, I've hurried up, and after applying your pathes reflash all of
> > > the devices with new firmware.
> > > But I'll setup some devices with all firmware to catch this issue
> > > again.
> > >
> > > > I never had time to really dig this and fix. Would be nice if you
> > > > could do this because I will unlikely have time in the nearest
> > > > future, but I can help you.
> > >
> > > Thank you, I'll continue testings and can supply with debug
> > > information.
> > >
> > > > One think you can do is to make sure you do not fill UBIFS
> > > > completely, then you will avoid hitting these budgeting bugs, also
> > > > known as ENOSPC issues (I think btrfs uses this terminology).
> > >
> > > In the issue above there is 11% used space and 89% free space.
> > > I've got this issue only when mounting fs in RO and then remounting it
> > > in RW.
> >
> > OK. Try my patch. Also, it is worth running your tests with UBIFS extra
> > self checks enabled.
> 
> Did it help?
> 
Artem, thank you very much, yes - it helped, all of 10 testing devices 
survived with repeatable powerfails (1,2,3,4,5,6,7,8,9,10 minutes in this 
sequence) for a week.
Now we leave 3 devices for this test.

Appologize me, please, I've not tested again for the RO-RW recovery on remount 
with this patch to supply you with debug info.
Artem Bityutskiy July 13, 2010, 3:58 a.m. UTC | #7
On Tue, 2010-06-15 at 11:02 +0600, Alexander Pazdnikov wrote:
> Artem, thank you very much, yes - it helped, all of 10 testing devices 
> survived with repeatable powerfails (1,2,3,4,5,6,7,8,9,10 minutes in this 
> sequence) for a week.
> Now we leave 3 devices for this test.

OK, I finally pushed the fix to the ubifs-2.6.git tree and will merge it
next merge window:

http://git.infradead.org/ubifs-2.6.git/commit/6fb4374f6b1b3932f3acfe9d353568d3d8599cad
diff mbox

Patch

diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
index 109c6ea..5e2a438 100644
--- a/fs/ubifs/recovery.c
+++ b/fs/ubifs/recovery.c
@@ -1063,8 +1063,10 @@  int ubifs_rcvry_gc_commit(struct ubifs_info *c)
 	}
 	err = ubifs_find_dirty_leb(c, &lp, wbuf->offs, 2);
 	if (err) {
-		if (err == -ENOSPC)
-			dbg_err("could not find a dirty LEB");
+		if (err == -ENOSPC) {
+			dbg_rcvry("could not find a dirty LEB");
+			goto find_free;
+		}
 		return err;
 	}
 	ubifs_assert(!(lp.flags & LPROPS_INDEX));