| Message ID | 1282501832.16502.97.camel@brekeke |
|---|---|
| State | Accepted |
| Commit | 3a8fa0ed3e1a8dc54f9297d2a2292b87e03e21b5 |
On Sun, Aug 22, 2010 at 2:30 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> Could you please dump LEB 5586 and check whether it is really erased or
> not? Please, apply the following patch:
> ...

Hi Artem,

I applied your patch to 2.6.35 and booted that on the same bad device for which I posted a boot log. But to my surprise, it didn't give any errors at all - the boot process completed normally, and it appears to be working fine so far. I had booted the kernel from memory, so I pulled the power and let it go through the normal boot process (using the 2.6.31 kernel that's built in to the firmware) - that also works fine.

I checked my sanity by trying another bad device, and the same thing happened there - booting 2.6.35 somehow "fixed" my problem. Unfortunately, this means I can no longer tell what happened to the original block you were interested in. I'll try to dig up another bad device, and use your patch with an older kernel version to see what happens there.

Is it at all possible that the error was caused by something at the UBI or MTD layer which was fixed between 2.6.34 and 2.6.35, and booting up with 2.6.35 "touched" something that made it work again even after reverting to an older kernel? Sounds pretty far-fetched, but I don't know how else to explain these suddenly-recovered devices...
Hi,

On Tue, 2010-08-24 at 18:38 -0400, Matthew L. Creech wrote:
> I applied your patch to 2.6.35 and booted that on the same bad device
> for which I posted a boot log.  But to my surprise, it didn't give any
> errors at all - the boot process completed normally, and it appears to
> be working fine so far.  I had booted the kernel from memory, so I
> pulled the power and let it go through the normal boot process (using
> the 2.6.31 kernel that's built in to the firmware) - that also works
> fine.
>
> I checked my sanity by trying another bad device, and the same thing
> happened there - booting 2.6.35 somehow "fixed" my problem.
> Unfortunately, this means I can no longer tell what happened to the
> original block you were interested in.  I'll try to dig up another bad
> device, and use your patch with an older kernel version to see what
> happens there.

This is interesting. BTW, if you use 2.6.31, you should in any case apply the patches from the ubifs-v2.6.31 back-port tree - there were some good fixes.

> Is it at all possible that the error was caused by something at the
> UBI or MTD layer which was fixed between 2.6.34 and 2.6.35, and
> booting up with 2.6.35 "touched" something that made it work again
> even after reverting to an older kernel?  Sounds pretty far-fetched,
> but I don't know how else to explain these suddenly-recovered
> devices...

OK :-)

Artem.
On Tue, Aug 24, 2010 at 6:38 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
>
> I applied your patch to 2.6.35 and booted that on the same bad device
> for which I posted a boot log.  But to my surprise, it didn't give any
> errors at all - the boot process completed normally, and it appears to
> be working fine so far.  I had booted the kernel from memory, so I
> pulled the power and let it go through the normal boot process (using
> the 2.6.31 kernel that's built in to the firmware) - that also works
> fine.
>

Another device had similar UBIFS corruption, so this time we stayed with 2.6.34 and added the mapping status to that error message. This is the output:

UBIFS error (pid 455): ubifs_read_node: bad node type (255 but expected 1)
UBI DBG (pid 455): ubi_is_mapped: test LEB 0:7746
UBIFS error (pid 455): ubifs_read_node: bad node at LEB 7746:110360, LEB mapping status 0
Not a node, first 24 bytes:
00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ........................
Call Trace:
[c7b69bf0] [c0008410] show_stack+0x54/0x134 (unreliable)
[c7b69c30] [c00cfeb8] ubifs_read_node+0x2f0/0x308
[c7b69c60] [c00ef4a8] ubifs_tnc_read_node+0x60/0x188
[c7b69ca0] [c00d34a4] ubifs_tnc_locate+0xd0/0x1d8
[c7b69d00] [c00c5818] do_readpage+0x1d0/0x4b8
[c7b69d50] [c00c7410] ubifs_readpage+0x3ec/0x400
--- Exception: 901 at flush_dcache_icache_page+0x24/0x30
    LR = flush_dcache_icache_page+0x20/0x30
[c7b69db0] [c0048c44] generic_file_aio_read+0x454/0x630 (unreliable)
[c7b69e50] [c00698d4] do_sync_read+0xa4/0xe0
[c7b69ef0] [c0069f5c] vfs_read+0xc4/0x16c
[c7b69f10] [c006a29c] sys_read+0x4c/0x80
[c7b69f40] [c000f40c] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfb45b4c
    LR = 0xfb45b34

Is this what was expected?

FYI, this particular device only has corruption on a file that's not boot-critical, so it continues the boot process somewhat normally.
A partial log (cut off once it seemed that nothing else useful was happening) is here, if that's of any use:

http://mcreech.com/work/ubifs-mapping-status.txt

It's also worth mentioning that when enabling UBI & UBIFS self-checks on this device, the boot process halts after some errors are detected. The pertinent portion is:

UBIFS error (pid 1): scan_check_cb: bad accounting of LEB 31: free 0, dirty 560 flags 0x0, should be free 129024, dirty 0
(pid 1) start dumping LEB 31
UBIFS DBG (pid 1): ubifs_start_scan: scan LEB 31:0
UBI DBG (pid 1): ubi_leb_read: read 129024 bytes from LEB 0:31:0
UBI DBG (pid 1): ubi_eba_read_leb: read 129024 bytes from offset 0 of LEB 0:31 (unmapped)
UBIFS DBG (pid 1): ubifs_scan: look at LEB 31:0 (129024 bytes left)
UBIFS DBG (pid 1): ubifs_scan_a_node: hit empty space
UBIFS DBG (pid 1): ubifs_end_scan: stop scanning LEB 31 at offset 0
LEB 31 has 0 nodes ending at 0
(pid 1) finish dumping LEB 31
UBIFS error (pid 1): do_commit: commit failed, error -22
UBIFS warning (pid 1): ubifs_ro_mode: switched to read-only mode, error -22
Call Trace:
[c7827d00] [c0008410] show_stack+0x54/0x134 (unreliable)
[c7827d40] [c00ce774] ubifs_ro_mode+0x60/0x70
[c7827d50] [c00daf40] do_commit+0x5f0/0x5fc
[c7827dd0] [c00eab14] ubifs_rcvry_gc_commit+0x440/0x46c
[c7827e10] [c00cc900] ubifs_get_sb+0xe60/0x1734
[c7827ea0] [c006c794] vfs_kern_mount+0x68/0xf0
[c7827ed0] [c006c85c] do_kern_mount+0x40/0xf0
[c7827f00] [c0080dd8] do_mount+0x634/0x6a0
[c7827f50] [c0080f28] sys_mount+0x90/0xcc
[c7827f80] [c02cdec8] mount_block_root+0x108/0x284
[c7827fd0] [c02ce288] prepare_namespace+0xac/0x1e8
[c7827fe0] [c02cd820] kernel_init+0x144/0x154
[c7827ff0] [c000f230] kernel_thread+0x4c/0x68

I've no idea if this is related or not, but there's a full log of that run available here:

http://mcreech.com/work/ubifs-self-checks.txt

Please let me know if there are any other tests you'd like run on this device. Otherwise, we'll probably try booting 2.6.35 and see what happens. Thanks!
Hi,

On Tue, 2010-08-31 at 11:36 -0400, Matthew L. Creech wrote:
> UBIFS error (pid 455): ubifs_read_node: bad node type (255 but expected 1)
> UBI DBG (pid 455): ubi_is_mapped: test LEB 0:7746
> UBIFS error (pid 455): ubifs_read_node: bad node at LEB 7746:110360,
> LEB mapping status 0

OK, so this LEB is unmapped.

> Not a node, first 24 bytes:
> 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff ff  ........................
> Call Trace:
> [c7b69bf0] [c0008410] show_stack+0x54/0x134 (unreliable)
> [c7b69c30] [c00cfeb8] ubifs_read_node+0x2f0/0x308
> [c7b69c60] [c00ef4a8] ubifs_tnc_read_node+0x60/0x188
> [c7b69ca0] [c00d34a4] ubifs_tnc_locate+0xd0/0x1d8
> [c7b69d00] [c00c5818] do_readpage+0x1d0/0x4b8
> [c7b69d50] [c00c7410] ubifs_readpage+0x3ec/0x400
> --- Exception: 901 at flush_dcache_icache_page+0x24/0x30
>     LR = flush_dcache_icache_page+0x20/0x30
> [c7b69db0] [c0048c44] generic_file_aio_read+0x454/0x630 (unreliable)
> [c7b69e50] [c00698d4] do_sync_read+0xa4/0xe0
> [c7b69ef0] [c0069f5c] vfs_read+0xc4/0x16c
> [c7b69f10] [c006a29c] sys_read+0x4c/0x80
> [c7b69f40] [c000f40c] ret_from_syscall+0x0/0x38
> --- Exception: c01 at 0xfb45b4c
>     LR = 0xfb45b34
>
> Is this what was expected?

Well, it is what I thought it would be.

> FYI, this particular device only has corruption on a file that's not
> boot-critical, so it continues the boot process somewhat normally.  A
> partial log (cut off once it seemed that nothing else useful was
> happening) is here, if that's of any use:
>
> http://mcreech.com/work/ubifs-mapping-status.txt
>
> It's also worth mentioning that when enabling UBI & UBIFS self checks
> on this device, the boot process halts after some errors are detected.
> The pertinent portion is:
>
> UBIFS error (pid 1): scan_check_cb: bad accounting of LEB 31: free 0,
> dirty 560 flags 0x0, should be free 129024, dirty 0
> (pid 1) start dumping LEB 31
> UBIFS DBG (pid 1): ubifs_start_scan: scan LEB 31:0
> UBI DBG (pid 1): ubi_leb_read: read 129024 bytes from LEB 0:31:0
> UBI DBG (pid 1): ubi_eba_read_leb: read 129024 bytes from offset 0 of
> LEB 0:31 (unmapped)
> UBIFS DBG (pid 1): ubifs_scan: look at LEB 31:0 (129024 bytes left)
> UBIFS DBG (pid 1): ubifs_scan_a_node: hit empty space
> UBIFS DBG (pid 1): ubifs_end_scan: stop scanning LEB 31 at offset 0
> LEB 31 has 0 nodes ending at 0
> (pid 1) finish dumping LEB 31
> UBIFS error (pid 1): do_commit: commit failed, error -22
> UBIFS warning (pid 1): ubifs_ro_mode: switched to read-only mode, error -22
> Call Trace:
> [c7827d00] [c0008410] show_stack+0x54/0x134 (unreliable)
> [c7827d40] [c00ce774] ubifs_ro_mode+0x60/0x70
> [c7827d50] [c00daf40] do_commit+0x5f0/0x5fc
> [c7827dd0] [c00eab14] ubifs_rcvry_gc_commit+0x440/0x46c
> [c7827e10] [c00cc900] ubifs_get_sb+0xe60/0x1734
> [c7827ea0] [c006c794] vfs_kern_mount+0x68/0xf0
> [c7827ed0] [c006c85c] do_kern_mount+0x40/0xf0
> [c7827f00] [c0080dd8] do_mount+0x634/0x6a0
> [c7827f50] [c0080f28] sys_mount+0x90/0xcc
> [c7827f80] [c02cdec8] mount_block_root+0x108/0x284
> [c7827fd0] [c02ce288] prepare_namespace+0xac/0x1e8
> [c7827fe0] [c02cd820] kernel_init+0x144/0x154
> [c7827ff0] [c000f230] kernel_thread+0x4c/0x68
>
> I've no idea if this is related or not, but there's a full log of that
> run available here:
>
> http://mcreech.com/work/ubifs-self-checks.txt
>
> Please let me know if there are any other tests you'd like run on this
> device.  Otherwise, we'll probably try booting 2.6.35 and see what
> happens.

I need to take some time and carefully look at this. And think. Please, make a copy of the contents of your flash, if you can.

From your side, what would be helpful is if you tried to figure out how to reproduce this.
Since I do not have your HW I cannot reproduce this, so the only thing I can do is ask you to reproduce the problem with various debugging patches.

CCing Adrian - maybe he has ideas.
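[For reference, one way to take the flash copy requested above is with the mtd-utils tools, assuming they are installed on the target. The partition number below is a placeholder, and option spellings vary somewhat between mtd-utils versions - check `nanddump --help` on your build:

    # Find which MTD partition backs the UBI volume
    cat /proc/mtd

    # Dump the raw NAND contents of that partition to a file
    # (add the OOB-dumping option of your mtd-utils version if
    # you also want the out-of-band/spare area)
    nanddump -f /tmp/mtd3-backup.img /dev/mtd3

The resulting image can then be copied off the device over the network for offline analysis.]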
On Wed, 2010-09-01 at 21:57 +0300, Artem Bityutskiy wrote:
> > Please let me know if there are any other tests you'd like run on this
> > device.  Otherwise, we'll probably try booting 2.6.35 and see what
> > happens.
>
> I need to take some time and carefully look at this. And think. Please,
> make a copy of the contents of your flash, if you can.
>
> From your side what would be helpful is if you tried to figure out how
> to reproduce this.  Since I do not have your HW I cannot reproduce this,
> so the only thing I can do is to ask you to reproduce the problem with
> various debugging patches.

Matthew, I've sent a series of UBI patches which may help us find the root cause of your issues.

The other idea which would definitely help is to create a debugging patch which will track all erasures of PEBs and store them somewhere. I do not know which tracing/debugging tools you have - if you have some fast tracing, you can just send this info via your tracing interface. But if you don't, you can use another flash chip, or another partition on your flash, and store the info there.

Here is my thinking: the UBIFS index points to an unmapped LEB. There are 2 possibilities: either the index is incorrect, or someone - UBI or UBIFS - mistakenly unmapped a needed LEB. I personally think it is most probably the latter. So we need to gather information about all unmap operations:

1. Which LEB and which PEB is unmapped. The best place to get this info is the 'do_sync_erase()' function, but it does not have the LEB number, so we need to add an 'int lnum' parameter there and amend the callers as well. It is some work, but should not be too difficult.

2. Then we need to know who unmapped the LEB - we need the stack dump. Normally we use the 'dump_stack()' function to print a stack dump, but it prints to the log buffer. So we need a function which gives us an array of addresses which we can then save and later transform to symbols. Or we need a function which gives us a string containing an array of addresses.
We probably need to implement it. But I think kmemleak is doing something like that - we can look there. But also, make sure no one in UBI uses mtd->erase directly, just in case. I think all erases should go via 'do_sync_erase()'.

3. Most erasures are done in the background thread, so the stack dump will point to the background thread, which is not informative at all. This means we should also print PEB/LEB/stackdump in 'schedule_erase()' to track all places where the erasure is initiated.

So, for each PEB erase / LEB unmap we store 1, 2 and 3 somewhere. When we hit the UBIFS bug, we can see how the LEB was unmapped and how the PEB was erased - this should give us an idea of what happened.

Do you think you can do something like this? I do not think I have time in the near future for this. What do you think?

But of course, if you learn how to reproduce this - that would be great.
On Mon, Sep 6, 2010 at 5:17 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> The other idea which would definitely help is to create a debugging
> patch which will track all erasures of PEBs and store them somewhere.  I
> do not know which tracing/debugging tools you have - if you have some
> fast tracing, you can just send this info via your tracing interface.  But
> if you don't, you can use another flash chip, or another partition on your
> flash, and store the info there.
> ...

This makes sense; however, I'm not sure of a good way to store this info. I don't have any hardware debugging tools (BDI, etc.), though I could probably get my hands on one if you think it would help. Creating another flash partition and saving data there could work, although I'm not familiar with how to do that safely/cleanly from within the kernel (I could give it a try though).

The brute-force method would be to just dump all of this info out to the serial console, and I'll leave a minicom session running on a PC to capture everything that happens. I can't be certain how long it would take to get a freshly-formatted device into this state, but the quantity of output isn't a problem if I'm capturing to a PC.

I'll probably have time to set this test up in the next few days, but it may be weeks until the test device goes bad (if it ever goes bad at all). As far as what code should be tested, do you want me to just pull a copy of the latest ubifs-2.6 tree? Or apply these patches to something more stable?

Thanks for your help Artem!
On Tue, 2010-09-07 at 11:59 -0400, Matthew L. Creech wrote:
> On Mon, Sep 6, 2010 at 5:17 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > The other idea which would definitely help is to create a debugging
> > patch which will track all erasures of PEBs and store them somewhere.
> ...
>
> This makes sense, however I'm not sure of a good way to store this
> info.  I don't have any hardware debugging tools (BDI, etc.) though I
> could probably get my hands on one if you think it would help.
> Creating another flash partition and saving data there could work,
> although I'm not familiar with how to do that safely/cleanly from
> within the kernel (I could give it a try though).

Well, the MTD API provides all the basic operations. But I guess this is more complex, so I'd leave this as a last-resort method.

> The brute-force method would be to just dump all of this info out to
> the serial console,

Serial is too slow. If this bug is about race conditions, serial will make it unreproducible. And it will slow down stuff too much, so you'd run the same amount of operations tens of times slower.

Take a look at netconsole - using ethernet will be faster. It is really easy to set up - just do not forget to switch off/configure iptables. See Documentation/networking/netconsole.txt

> and I'll leave a minicom session running on a PC
> to capture everything that happens.  I can't be certain how long it
> would take to get a freshly-formatted device into this state, but the
> quantity of output isn't a problem if I'm capturing to a PC.

Yeah, s/minicom/netconsole/, and capture the stuff in a file. This will make stuff simpler; at least the 'dump_stack()' issue I told you about will go away.
> I'll probably have time to set this test up in the next few days, but
> it may be weeks until the test device goes bad (if it ever goes bad at
> all).  As far as what code should be tested, do you want me to just
> pull a copy of the latest ubifs-2.6 tree?

This is probably not so important. Use the ubifs you have, but I'd like to also see it. And send the debugging patch (which you'll need to prepare) to me for review and comments.

> Or apply these patches to
> something more stable?

I think the ubifs base where you already saw the issue will be the best.

> Thanks for your help Artem!

NP, feel free to ask questions.
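[For reference, a minimal netconsole setup of the kind suggested above might look as follows. The addresses, ports and MAC are placeholders, and the exact parameter syntax is documented in Documentation/networking/netconsole.txt; note also that netcat's listen syntax differs between the GNU and OpenBSD variants:

    # On the target, with netconsole built in: kernel command line entry
    # sending from local port 6665 on eth0 to the capture PC at
    # 192.168.0.2:6666
    netconsole=6665@192.168.0.1/eth0,6666@192.168.0.2/00:11:22:33:44:55

    # Or, with netconsole built as a module, configure it at runtime:
    modprobe netconsole netconsole=6665@192.168.0.1/eth0,6666@192.168.0.2/00:11:22:33:44:55

    # On the capture PC: netconsole sends UDP, so listen and log to a file
    # (and make sure iptables is not dropping the packets, as noted above)
    nc -u -l 6666 | tee ubifs-debug.log
]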
On Tue, 2010-09-07 at 20:17 +0300, Artem Bityutskiy wrote:
> > I'll probably have time to set this test up in the next few days, but
> > it may be weeks until the test device goes bad (if it ever goes bad at
> > all).  As far as what code should be tested, do you want me to just
> > pull a copy of the latest ubifs-2.6 tree?
>
> This is probably not so important. Use the ubifs you have, but I'd like
> to also see it. And send the debugging patch to me for review and
> comments.

Err, I also think you should try my UBI patches, which may help. But they may themselves have a bug. Maybe 2 setups - one with one of your trees, and the other with the latest ubifs + those patches?
diff --git a/fs/ubifs/io.c b/fs/ubifs/io.c
index bcf5a16..9432431 100644
--- a/fs/ubifs/io.c
+++ b/fs/ubifs/io.c
@@ -815,7 +815,8 @@ int ubifs_read_node(const struct ubifs_info *c, void *buf, int type, int len,
 	return 0;
 
 out:
-	ubifs_err("bad node at LEB %d:%d", lnum, offs);
+	ubifs_err("bad node at LEB %d:%d, LEB mapping status %d", lnum, offs,
+		  ubi_is_mapped(c->ubi, lnum));
 	dbg_dump_node(c, buf);
 	dbg_dump_stack();
 	return -EINVAL;