Patchwork ubifs : corruption after power cut test

Submitter Artem Bityutskiy
Date July 13, 2010, 2:24 p.m.
Message ID <1279031064.31639.90.camel@localhost>
Permalink /patch/58781/
State New

Comments

Artem Bityutskiy - July 13, 2010, 2:24 p.m.
On Tue, 2010-07-13 at 11:24 +0200, Matthieu CASTET wrote:
> Matthieu CASTET wrote:
> > Matthieu CASTET wrote:
> >> Hi,
> >>
> >> we found some bug in our driver. Now there are no more ubifs errors when
> >> there is an uncorrectable ecc error (they should happen in the last
> >> (interrupted) written page).
> >>
> >> But now we got "validate_master: bad master node at offset 69632 error
> >> 7" [1].
> > Note that gc_lnum == -1 in this case.
> > Also, this didn't happen on power cut.
> > The scenario was:
> > - power cut
> > - mount fs [1]
> > - do some fs operations
> > - umount fs quickly (9 seconds after mount in this case) [2]
> > - mount fs [3]
> > 
> > The problem seems to be that gc_lnum == -1 is not handled in mount, or
> > that it shouldn't happen in umount.
> > 
> The attached patch tries to support mount with gc_lnum == -1.
> 
> Does it look sane?

I did not give it much thought, but I do not see how master node can end
up with gc_lnum = -1 in it, and it seems we assumed this cannot happen.
Could you please add this hack to your kernel? It should catch the
situations when we write gc_lnum == -1 to the master node and print the
stack dump, which should give some idea about the code-path which causes
it.
Matthieu CASTET - July 13, 2010, 3:10 p.m.
Artem Bityutskiy wrote:
> On Tue, 2010-07-13 at 11:24 +0200, Matthieu CASTET wrote:
>> Matthieu CASTET wrote:
>>> Matthieu CASTET wrote:
>>>> Hi,
>>>>
>>>> we found some bug in our driver. Now there are no more ubifs errors when
>>>> there is an uncorrectable ecc error (they should happen in the last
>>>> (interrupted) written page).
>>>>
>>>> But now we got "validate_master: bad master node at offset 69632 error
>>>> 7" [1].
>>> Note that gc_lnum == -1 in this case.
>>> Also, this didn't happen on power cut.
>>> The scenario was:
>>> - power cut
>>> - mount fs [1]
>>> - do some fs operations
>>> - umount fs quickly (9 seconds after mount in this case) [2]
>>> - mount fs [3]
>>>
>>> The problem seems to be that gc_lnum == -1 is not handled in mount, or
>>> that it shouldn't happen in umount.
>>>
>> The attached patch tries to support mount with gc_lnum == -1.
>>
>> Does it look sane?
> 
> I did not give it much thought, but I do not see how master node can end
> up with gc_lnum = -1 in it, and it seems we assumed this cannot happen.
> Could you please add this hack to your kernel? It should catch the
> situations when we write gc_lnum == -1 to the master node and print the
> stack dump, which should give some idea about the code-path which causes
> it.
OK, thanks, I will run it.

When checking the code, I saw that switch_gc_head can set c->gc_lnum to -1.

In ubifs_put_super, we set c->mst_node->gc_lnum to c->gc_lnum and write
the master node.
Can't ubifs_put_super run after switch_gc_head has set gc_lnum to -1?

Matthieu
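
The sequence suspected above can be modeled in user space as follows. This is only a sketch under stated assumptions, not the kernel code: the model_* structures and functions are hypothetical stand-ins for struct ubifs_info, switch_gc_head and ubifs_put_super, reduced to the single gc_lnum field under discussion.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the master node. */
struct model_mst_node {
	int32_t gc_lnum;	/* copy of the GC LEB number to be written */
};

/* Hypothetical stand-in for the in-memory file system state. */
struct model_info {
	int gc_lnum;			/* current GC LEB, -1 if none is reserved */
	struct model_mst_node mst;	/* master node image */
};

/* Models the step attributed to switch_gc_head: the GC LEB is given up. */
static void model_switch_gc_head(struct model_info *c)
{
	c->gc_lnum = -1;
}

/*
 * Models the copy attributed to ubifs_put_super: the in-memory value is
 * stored in the master node without any check. Returns 0 if the stored
 * value is a plausible LEB number and -1 if a negative value was stored.
 */
static int model_put_super(struct model_info *c)
{
	c->mst.gc_lnum = c->gc_lnum;
	return c->mst.gc_lnum < 0 ? -1 : 0;
}
```

In this model, calling model_put_super() right after model_switch_gc_head() stores -1 in the master node image, which is the state the next mount would then have to validate.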
Matthieu CASTET - July 28, 2010, 7:40 a.m.
Hi,

Matthieu CASTET wrote:
> Artem Bityutskiy wrote:
>> On Tue, 2010-07-13 at 11:24 +0200, Matthieu CASTET wrote:
>>> Matthieu CASTET wrote:
>>>> Matthieu CASTET wrote:
>>>>> Hi,
>>>>>
>>>>> we found some bug in our driver. Now there are no more ubifs errors when
>>>>> there is an uncorrectable ecc error (they should happen in the last
>>>>> (interrupted) written page).
>>>>>
>>>>> But now we got "validate_master: bad master node at offset 69632 error
>>>>> 7" [1].
>>>> Note that gc_lnum == -1 in this case.
>>>> Also, this didn't happen on power cut.
>>>> The scenario was:
>>>> - power cut
>>>> - mount fs [1]
>>>> - do some fs operations
>>>> - umount fs quickly (9 seconds after mount in this case) [2]
>>>> - mount fs [3]
>>>>
>>>> The problem seems to be that gc_lnum == -1 is not handled in mount, or
>>>> that it shouldn't happen in umount.
>>>>
>>> The attached patch tries to support mount with gc_lnum == -1.
>>>
>>> Does it look sane?
>> I did not give it much thought, but I do not see how master node can end
>> up with gc_lnum = -1 in it, and it seems we assumed this cannot happen.
>> Could you please add this hack to your kernel? It should catch the
>> situations when we write gc_lnum == -1 to the master node and print the
>> stack dump, which should give some idea about the code-path which causes
>> it.
> OK, thanks, I will run it.
> 
> When checking the code, I saw that switch_gc_head can set c->gc_lnum to -1.
> 
> In ubifs_put_super, we set c->mst_node->gc_lnum to c->gc_lnum and write
> the master node.
> Can't ubifs_put_super run after switch_gc_head has set gc_lnum to -1?
> 
I managed to reproduce it; the backtrace is in [1].

Matthieu

[1]
# UBIFS: recovery completed
UBIFS: mounted UBI device 3, volume 0, name "test"
UBIFS: file system size:   30474240 bytes (29760 KiB, 29 MiB, 240 LEBs)
UBIFS: journal size:       1523712 bytes (1488 KiB, 1 MiB, 12 LEBs)
UBIFS: media format:       w4/r0 (latest is w4/r0)
UBIFS: default compressor: lzo
UBIFS: reserved for root:  1439373 bytes (1405 KiB)
checking all files...
++++++ power failure detected, cleaning up tmpfile (262415 bytes)
### round 0 : 16 seconds
UBIFS: un-mount UBI device 3, volume 0
ubifs_write_master: gc_lnum is -1!
[<c00279f0>] (dump_stack+0x0/0x14) from [<c00d64c4>] (ubifs_write_master+0x170/0x1b0)
[<c00d6354>] (ubifs_write_master+0x0/0x1b0) from [<c00ce264>] (ubifs_put_super+0x1a0/0x1d8)
  r7:c7a7e000 r6:00000003 r5:c795c124 r4:c795c100
[<c00ce0c4>] (ubifs_put_super+0x0/0x1d8) from [<c007ed20>] (generic_shutdown_super+0x78/0xfc)
  r8:00000000 r7:c780cf38 r6:c780cf20 r5:c01b08bc r4:c7a9d400
[<c007eca8>] (generic_shutdown_super+0x0/0xfc) from [<c007ede8>] (kill_anon_super+0x18/0x34)
  r5:c022739c r4:0000000b
[<c007edd0>] (kill_anon_super+0x0/0x34) from [<c007ee7c>] (deactivate_super+0x48/0x60)
  r4:c7a9d400
[<c007ee34>] (deactivate_super+0x0/0x60) from [<c0093998>] (mntput_no_expire+0x64/0xc8)
  r5:c7a9d400 r4:c780cf20
[<c0093934>] (mntput_no_expire+0x0/0xc8) from [<c009456c>] (sys_umount+0x58/0x31c)
  r5:c780cf38 r4:c780cf18
[<c0094514>] (sys_umount+0x0/0x31c) from [<c0023c00>] (ret_fast_syscall+0x0/0x2c)
UBIFS error (pid 285): validate_master: bad master node at offset 104448 error 7
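
The log pairs "error 7" with a master node that was written with gc_lnum == -1. A range check of roughly the following shape would reject such a node at mount time. This is a sketch, not the kernel's validate_master: the model_* names are hypothetical, the geometry values used with it are made up, and mapping code 7 to this particular check is an assumption drawn from the messages in this thread.

```c
/* Hypothetical volume geometry; neither value is taken from the log. */
struct model_geom {
	int leb_cnt;	/* total number of LEBs in the volume */
	int main_first;	/* first LEB of the main area */
};

/*
 * Returns 0 if gc_lnum lies inside the main area, and 7 otherwise.
 * The value 7 only mirrors the error number printed in the log.
 */
static int model_check_gc_lnum(const struct model_geom *g, int gc_lnum)
{
	if (gc_lnum >= g->leb_cnt || gc_lnum < g->main_first)
		return 7;
	return 0;
}
```

With a made-up geometry of leb_cnt = 256 and main_first = 16, a gc_lnum of 20 passes, while -1 (or any value below 16 or at or above 256) is rejected with 7.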

Patch

diff --git a/fs/ubifs/master.c b/fs/ubifs/master.c
index 28beaee..8277f64 100644
--- a/fs/ubifs/master.c
+++ b/fs/ubifs/master.c
@@ -378,6 +378,15 @@  int ubifs_write_master(struct ubifs_info *c)
 	c->mst_offs = offs;
 	c->mst_node->highest_inum = cpu_to_le64(c->highest_inum);
 
+	{
+		/* Temporary hack for Matthieu */
+		int gc_lnum = le32_to_cpu(c->mst_node->gc_lnum);
+		if (gc_lnum < 0) {
+			printk(KERN_CRIT "%s: gc_lnum is %d!\n", __func__, gc_lnum);
+			dump_stack();
+		}
+	}
+
 	err = ubifs_write_node(c, c->mst_node, len, lnum, offs, UBI_SHORTTERM);
 	if (err)
 		return err;
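
The hack converts the little-endian master node field back to a CPU integer and tests it as a signed value. The user-space sketch below shows why a stored -1 reads back as negative. model_put_le32 and model_get_le32 are hypothetical portable stand-ins for the kernel's cpu_to_le32() and le32_to_cpu(), and model_read_gc_lnum is an illustrative helper, not a kernel function.

```c
#include <stdint.h>

/* Store a 32-bit value as four little-endian bytes. */
static void model_put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Load a 32-bit value from four little-endian bytes. */
static uint32_t model_get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}

/*
 * Interpret the stored field as a signed LEB number. The conversion is
 * spelled out so that it does not rely on implementation-defined
 * unsigned-to-signed casts.
 */
static int32_t model_read_gc_lnum(const uint8_t raw[4])
{
	uint32_t v = model_get_le32(raw);

	if (v <= (uint32_t)INT32_MAX)
		return (int32_t)v;
	return -(int32_t)(UINT32_MAX - v) - 1;
}
```

Storing (uint32_t)-1 produces four 0xff bytes, and reading them back through model_read_gc_lnum() gives a negative value, which is the condition the gc_lnum < 0 test in the hack looks for.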