Message ID: D7B1B5F4F3F27A4CB073BF422331203F2A18A7474A@Exchange1.lawo.de
State: Superseded
On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
> Hi,
>
> I made a "hardcore test" with:
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &
>
> It took about 2-3 hours until I had an error (two times):

This test ran overnight without any error on my imx51 board. :-\

Bill's great analysis showed that it may be a linked list corruption in rw_semaphore.
Thorsten, can you please enable CONFIG_DEBUG_LIST?
Also try whether you can trigger the issue with lock debugging enabled.

Thanks,
//richard
> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:

>> I made a "hardcore test" with:
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.1>; sync; done &
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.2>; sync; done &
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.3>; sync; done &

>> It took about 2-3 hours until I had an error (two times):

On 5 Feb 2014, richard@nod.at wrote:

> This test ran overnight without any error on my imx51 board. :-\

> Bill's great analysis showed that it may be a linked list corruption
> in rw_semaphore. Thorsten, can you please enable CONFIG_DEBUG_LIST?
> Also try whether you can trigger the issue with lock debugging
> enabled.

I am trying to run the same test. I have 'fastmap' enabled and
'kmemleak'. I have various occurrences of these two,

unreferenced object 0xc2c06e50 (size 24):
  comm "sync", pid 2941, jiffies 855335 (age 6354.950s)
  hex dump (first 24 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 79 00 00 00  ZZZZZZZZZZZZy...
    07 00 00 00 5a 5a 5a a5                          ....ZZZ.
  backtrace:
    [<c019a544>] kmem_cache_alloc+0x10c/0x1a0
    [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
    [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
    [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
    [<c02a51a4>] ubi_leb_write+0xc4/0xe0
    [<c0200f38>] ubifs_leb_write+0x9c/0x130
    [<c020b28c>] ubifs_log_start_commit+0x230/0x3f4
    [<c020c368>] do_commit+0x134/0x870
    [<c01fbfa0>] ubifs_sync_fs+0x88/0x9c
    [<c01c30bc>] __sync_filesystem+0x74/0x98
    [<c01a2860>] iterate_supers+0x9c/0x104
    [<c01c31f4>] sys_sync+0x3c/0x68
    [<c0129300>] ret_fast_syscall+0x0/0x1c
    [<ffffffff>] 0xffffffff
unreferenced object 0xc2c06df0 (size 24):
  comm "flush-ubifs_0_3", pid 260, jiffies 867487 (age 6233.430s)
  hex dump (first 24 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 7a 00 00 00  ZZZZZZZZZZZZz...
    1e 00 00 00 5a 5a 5a a5                          ....ZZZ.
  backtrace:
    [<c019a544>] kmem_cache_alloc+0x10c/0x1a0
    [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
    [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
    [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
    [<c02a50c0>] ubi_leb_map+0x70/0x90
    [<c020125c>] ubifs_leb_map+0x74/0x100
    [<c020af44>] ubifs_add_bud_to_log+0x1f4/0x30c
    [<c01f4830>] make_reservation+0x2e0/0x3e0
    [<c01f53e8>] ubifs_jnl_write_data+0xfc/0x25c
    [<c01f838c>] do_writepage+0x88/0x260
    [<c017b368>] __writepage+0x18/0x84
    [<c017b98c>] write_cache_pages+0x1b4/0x3ac
    [<c01bead4>] writeback_single_inode+0x9c/0x258
    [<c01befac>] writeback_sb_inodes+0xbc/0x180
    [<c01bf6c0>] writeback_inodes_wb+0x7c/0x178
    [<c01bfa00>] wb_writeback+0x244/0x2ac

It is a 'cache', so I am suspicious of kmemleak (also my Linux is old
[kmemleak] with the Ubi/UbiFs/Mtd patches). However, I just wondered if
Thorsten has posted a .config somewhere? I am testing on an IMX25
system as well and trying to replicate with his test. The Linux version
is different as well. I suspect Richard will have tried with 'fastmap'
as well? Are you running without 'fastmap', Thorsten? I will let my
system run over night. Maybe just,

$ grep -E 'MTD|UBI' .config | grep -v '^#'

is fine for your config? Or maybe a full config to pastebin or someplace?

I am pretty sure that http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
is not the cause of this issue, although it is a good thing to be aware
of. You can apply the patch in the crosstool-ng directory to fix
gcc-4.8. It is quite possible that the FSL/Linaro people have done
this. The 4.8.2 release doesn't seem to come with this patch in the
vanilla tarball. Also, I have had this occur with gcc 4.7.

Especially, this same sort of issue has been occurring for some time
(before gcc 4.8's release on 2013-03-22). Another memory issue was a
suspect, but it has now been fixed and we still seem to have the issue.

Fwiw,
Bill Pringlemeir.
On 05.02.2014 22:45, Bill Pringlemeir wrote:
>
>> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
>
>>> I made a "hardcore test" with:
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.1>; sync; done &
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.2>; sync; done &
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.3>; sync; done &
>
>>> It took about 2-3 hours until I had an error (two times):
>
> On 5 Feb 2014, richard@nod.at wrote:
>
>> This test ran overnight without any error on my imx51 board. :-\
>
>> Bill's great analysis showed that it may be a linked list corruption
>> in rw_semaphore. Thorsten, can you please enable CONFIG_DEBUG_LIST?
>> Also try whether you can trigger the issue with lock debugging
>> enabled.
>
> I am trying to run the same test. I have 'fastmap' enabled and
> 'kmemleak'. I have various occurrences of these two,

Did you backport all UBI/Fastmap fixes?
Fastmap saw some memory leak fixes.

> unreferenced object 0xc2c06e50 (size 24):
>   comm "sync", pid 2941, jiffies 855335 (age 6354.950s)
>   hex dump (first 24 bytes):
>     5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 79 00 00 00  ZZZZZZZZZZZZy...
>     07 00 00 00 5a 5a 5a a5                          ....ZZZ.
>   backtrace:
>     [<c019a544>] kmem_cache_alloc+0x10c/0x1a0

Can you please decode the line number?

>     [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
>     [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
>     [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
>     [<c02a51a4>] ubi_leb_write+0xc4/0xe0
>     [<c0200f38>] ubifs_leb_write+0x9c/0x130
>     [<c020b28c>] ubifs_log_start_commit+0x230/0x3f4
>     [<c020c368>] do_commit+0x134/0x870
>     [<c01fbfa0>] ubifs_sync_fs+0x88/0x9c
>     [<c01c30bc>] __sync_filesystem+0x74/0x98
>     [<c01a2860>] iterate_supers+0x9c/0x104
>     [<c01c31f4>] sys_sync+0x3c/0x68
>     [<c0129300>] ret_fast_syscall+0x0/0x1c
>     [<ffffffff>] 0xffffffff
> unreferenced object 0xc2c06df0 (size 24):
>   comm "flush-ubifs_0_3", pid 260, jiffies 867487 (age 6233.430s)
>   hex dump (first 24 bytes):
>     5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 7a 00 00 00  ZZZZZZZZZZZZz...
>     1e 00 00 00 5a 5a 5a a5                          ....ZZZ.
>   backtrace:
>     [<c019a544>] kmem_cache_alloc+0x10c/0x1a0

Here too.

>     [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
>     [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
>     [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
>     [<c02a50c0>] ubi_leb_map+0x70/0x90
>     [<c020125c>] ubifs_leb_map+0x74/0x100
>     [<c020af44>] ubifs_add_bud_to_log+0x1f4/0x30c
>     [<c01f4830>] make_reservation+0x2e0/0x3e0
>     [<c01f53e8>] ubifs_jnl_write_data+0xfc/0x25c
>     [<c01f838c>] do_writepage+0x88/0x260
>     [<c017b368>] __writepage+0x18/0x84
>     [<c017b98c>] write_cache_pages+0x1b4/0x3ac
>     [<c01bead4>] writeback_single_inode+0x9c/0x258
>     [<c01befac>] writeback_sb_inodes+0xbc/0x180
>     [<c01bf6c0>] writeback_inodes_wb+0x7c/0x178
>     [<c01bfa00>] wb_writeback+0x244/0x2ac
>
> It is a 'cache', so I am suspicious of kmemleak (also my Linux is old
> [kmemleak] with the Ubi/UbiFs/Mtd patches). However, I just wondered if
> Thorsten has posted a .config somewhere? I am testing on an IMX25
> system as well and trying to replicate with his test. The Linux version
> is different as well. I suspect Richard will have tried with 'fastmap'
> as well? Are you running without 'fastmap', Thorsten? I will let my
> system run over night. Maybe just,
>
> $ grep -E 'MTD|UBI' .config | grep -v '^#'

I think the issue is not related to fastmap because we have reports for
kernels older than 3.7. Fastmap got merged in 3.7.

Thanks,
//richard
On 5 Feb 2014, richard@nod.at wrote:

> I think the issue is not related to fastmap because we have reports
> for kernels older than 3.7. Fastmap got merged in 3.7.

Yes, of course. I meant fastmap may prevent the issue from happening.
So, I wondered about Thorsten's configuration.

Fwiw,
Bill Pringlemeir.
Hi,
> On 5 Feb 2014, richard@nod.at wrote:
>> I think the issue is not related to fastmap because we have reports
>> for kernels older than 3.7. Fastmap got merged in 3.7.

> ________________________________________
> From: Bill Pringlemeir [bpringlemeir@nbsps.com]
> Sent: Wednesday, 5 February 2014 23:23

> Yes, of course. I meant fastmap may prevent the issue from happening.
> So, I wondered about Thorsten's configuration.

On 6 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> we run on a 3.6.11 kernel, so no fastmap.
>
> grep -E 'MTD|UBI' .config | grep -v '^#' gives

mxc_nand has 52689931 interrupts and several bit flips, and I didn't
have a crash over night. I think that 'kmemleak' and/or fastmap changes
the behaviour (or maybe some other option). I will retry with options
closer to yours.

Fwiw,
Bill Pringlemeir.
Hi,

> -----Original Message-----
> From: Richard Weinberger [mailto:richard@nod.at]
> Sent: Wednesday, 5 February 2014 09:29
> To: Wiedemer, Thorsten (Lawo AG)
> Cc: linux-mtd@lists.infradead.org; Bill Pringlemeir
> Subject: Re: UBI leb_write_unlock NULL pointer Oops (continuation)
>
> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
>> Hi,
>>
>> I made a "hardcore test" with:
>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &
>>
>> It took about 2-3 hours until I had an error (two times):
>
> This test ran overnight without any error on my imx51 board. :-\
>
> Bill's great analysis showed that it may be a linked list corruption in rw_semaphore.
> Thorsten, can you please enable CONFIG_DEBUG_LIST?
> Also try whether you can trigger the issue with lock debugging enabled.
>
> Thanks,
> //richard

Short update (I was out of office the rest of last week). I compiled
the kernel with the debug flags for debug list and lock alloc. The
kernel compiled with gcc-4.8.2 didn't start (no output on the serial
console and reboot of the system). I didn't try (yet) to find out what
happens at startup.

I compiled the same kernel (and same config) with gcc-4.4.4. The write
test has now been running for over 16 hours without error. The next
step is to find out whether this is due to a changed timing because of
the debug flags or if it's the compiler.

Best regards,
Thorsten
>> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
>>> I made a "hardcore test" with:
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &
>>> It took about 2-3 hours until I had an error (two times):

>> -----Original Message-----
>> From: Richard Weinberger [mailto:richard@nod.at]
>> This test ran overnight without any error on my imx51 board. :-\
>> Thorsten, can you please enable CONFIG_DEBUG_LIST?
>> Also try whether you can trigger the issue with lock debugging
>> enabled.

On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> Short update (I was out of office the rest of last week). I compiled
> the kernel with the debug flags for debug list and lock alloc. The
> kernel compiled with gcc-4.8.2 didn't start (no output on the serial
> console and reboot of the system). I didn't try (yet) to find out
> what happens at startup.

You don't need to enable the 'lock alloc' debugging; just the 'debug
list' as Richard suggested. One at a time would work and give clues if
you can reproduce it.

> I compiled the same kernel (and same config) with gcc-4.4.4. The write
> test has now been running for over 16 hours without error. The next
> step is to find out whether this is due to a changed timing because of
> the debug flags or if it's the compiler.

I ran a test as per the above on an IMX25. mxc_nand has 448179139
interrupts, with about 6 bit flips and one torture test. It has been
running for about four days. I am using gcc 4.7.3 (crosstool-ng) and
backports to 2.6.36.

I think that the issue is not related to an MTD driver and/or UBI/UbiFS
directly. It is more likely an architecture issue and maybe some API
inconsistency. It could be compiler related; however, it seems many
people have seen the issue on various ARM926 systems (different Linux
versions, different compilers, and different MTD drivers).

User space tasks running in parallel with the test may play a role. Did
you turn CONFIG_PREEMPT off? I think memory pressure and other effects
(not related to UBI/UbiFS) may be needed to trigger the issue. We don't
normally see this on our systems. The one time it happened, an
application developer ran some 'ls -R' or 'find .' in parallel with a
file-intensive feature in our application. I haven't found a test to
reproduce it reliably.

Fwiw,
Bill Pringlemeir.
Hi,

> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>
>> Short update (I was out of office the rest of last week). I compiled
>> the kernel with the debug flags for debug list and lock alloc. The
>> kernel compiled with gcc-4.8.2 didn't start (no output on the serial
>> console and reboot of the system). I didn't try (yet) to find out
>> what happens at startup.
>
> You don't need to enable the 'lock alloc' debugging; just the 'debug
> list' as Richard suggested. One at a time would work and give clues if
> you can reproduce it.

I tested this, compiled with gcc-4.4.4. I had an error one time, but
there was no bug report for list handling, only a kernel oops:

Internal error: Oops: 17 [#1] PREEMPT ARM
Modules linked in:
CPU: 0    Tainted: G O (3.6.11 #26)
PC is at __up_write+0x38/0x168
LR is at __up_write+0x20/0x168
pc : [<c0234df0>]    lr : [<c0234dd8>]    psr: a0000093
sp : c726bc20  ip : c79c83bc  fp : 60000013
r10: c71fed3c  r9 : 00000001  r8 : c71fed38
r7 : 00000000  r6 : c726a000  r5 : 00000000  r4 : c71fed20
r3 : 00000001  r2 : c726a000  r1 : 00000000  r0 : 00000002
Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: 87104000  DAC: 00000015
Process RAVENNA_streame (pid: 852, stack limit = 0xc726a270)
Stack: (0xc726bc20 to 0xc726c000)
bc20: c7aeb000 c71fed20 c79bc800 c726a000 00000656 00000001 00000001 00000656
bc40: c7aeb000 c02dfc1c c79bc800 00000000 c79c4600 000004ea 00000001 c02e0c14
bc60: 00000800 00000000 c726bc70 c01e9800 32353128 32323132 74613135 31202c61
bc80: 31200029 00010029 00000001 c7ac6920 000002c2 c0238650 c053a278 c071efda
bca0: c071e000 c071eff3 c071efd7 0001c800 00000656 c7aeb000 00000800 00000800
bcc0: 00000001 c72cc000 0001c800 c02ded34 0001c800 00000800 c726a000 c7b32000
bce0: c7aeb000 0001c800 00000656 c01e6104 00000800 c0048054 c7ac6960 c7b32000
bd00: c7ac6920 c7b32000 00000358 00000355 00002910 c01e669c 00000800 00000000
bd20: 00000000 c7b32000 c726bd90 c72cc000 00000355 c726bde0 c7b32000 00000000
bd40: 00000001 c7ac6920 000006f0 c01d8b88 c726bd94 c70599a0 00000656 0001c800
bd60: 32353128 2c303132 74616420 32202c61 00000029 c022f74c 00000005 c05a4fc0
bd80: 00000001 c05a4fc0 c05a53c0 c05a53c0 00000325 00000001 a0000013 c05a53c0
bda0: c7b32000 c7578820 00001000 20000002 00025292 c071e000 00000005 c01dcacc
bdc0: 00001000 c05a56e0 c726a000 c7578820 00000000 c01dcd10 00000000 00004830
bde0: 00025292 20000002 c75788d4 c05a53c0 00000002 c726be4c 00000000 00000002
be00: c75788d4 c00a052c c726bed8 c00a1420 0000000e 00004830 00000000 ffffffff
be20: c726a000 c00a0518 c75788d4 c00cb538 00000002 00000001 00000001 c726be50
be40: 00000041 00000005 00000000 c05a56e0 c05a4fc0 c05a53c0 c0696480 c06952a0
be60: c7578820 c757887c 00000001 c00cb488 00004830 c7578820 00000000 c0099598
be80: 00004830 00000005 00000000 00000000 c75788d4 c726bed8 7fffffff 00000000
bea0: 00000000 00000000 b5bfe13c c00a16ec 91827364 c726beb4 c726beb4 c726bebc
bec0: c726bebc 00000000 00000000 c75788d4 ffffffff c0098dc0 7ffffffd 00000000
bee0: 00000000 00000000 ffffffff 7fffffff 00000001 00000000 00000005 c0098e08
bf00: ffffffff 7fffffff 00000001 00000000 00000000 c7578820 c7b32000 ffffffff
bf20: 7fffffff c0014428 c726a000 c01dceb4 ffffffff 7fffffff b5bf9860 ffffffff
bf40: 7fffffff b5bf9860 00000076 c00f4364 ffffffff 7fffffff 00000000 c00c9fc0
bf60: 00000000 ffffffff 7fffffff c00f438c ffffffff 7fffffff 00000000 c00cab74
bf80: 00000000 00000000 c7bb1ec0 c00f45d8 0000000a 00000001 b5bf9860 0000000a
bfa0: 00004830 c00142a0 0000000a 00004830 0000000a 00000002 b5bff4f4 00000000
bfc0: 0000000a 00004830 b5bf9860 00000076 0003ecb8 000505a0 b5bfeae0 b5bfe13c
bfe0: 00000000 b5bf95d8 b6f55f8c b6f56fe4 80000010 0000000a ffffffff ffffffff
[<c0234df0>] (__up_write+0x38/0x168) from [<c02dfc1c>] (leb_write_unlock+0xbc/0x13c)
[<c02dfc1c>] (leb_write_unlock+0xbc/0x13c) from [<c02e0c14>] (ubi_eba_write_leb+0xa0/0x53c)
[<c02e0c14>] (ubi_eba_write_leb+0xa0/0x53c) from [<c02ded34>] (ubi_leb_write+0xe4/0xe8)
[<c02ded34>] (ubi_leb_write+0xe4/0xe8) from [<c01e6104>] (ubifs_leb_write+0x9c/0x128)
[<c01e6104>] (ubifs_leb_write+0x9c/0x128) from [<c01e669c>] (ubifs_wbuf_write_nolock+0x358/0x6f8)
[<c01e669c>] (ubifs_wbuf_write_nolock+0x358/0x6f8) from [<c01d8b88>] (ubifs_jnl_write_data+0x1a0/0x298)
[<c01d8b88>] (ubifs_jnl_write_data+0x1a0/0x298) from [<c01dcacc>] (do_writepage+0x8c/0x224)
[<c01dcacc>] (do_writepage+0x8c/0x224) from [<c00a052c>] (__writepage+0x14/0x64)
[<c00a052c>] (__writepage+0x14/0x64) from [<c00a1420>] (write_cache_pages+0x1cc/0x458)
[<c00a1420>] (write_cache_pages+0x1cc/0x458) from [<c00a16ec>] (generic_writepages+0x40/0x60)
[<c00a16ec>] (generic_writepages+0x40/0x60) from [<c0098dc0>] (__filemap_fdatawrite_range+0x64/0x6c)
[<c0098dc0>] (__filemap_fdatawrite_range+0x64/0x6c) from [<c0098e08>] (filemap_write_and_wait_range+0x40/0x6c)
[<c0098e08>] (filemap_write_and_wait_range+0x40/0x6c) from [<c01dceb4>] (ubifs_fsync+0x44/0xb8)
[<c01dceb4>] (ubifs_fsync+0x44/0xb8) from [<c00f4364>] (vfs_fsync_range+0x40/0x44)
[<c00f4364>] (vfs_fsync_range+0x40/0x44) from [<c00f438c>] (vfs_fsync+0x24/0x2c)
[<c00f438c>] (vfs_fsync+0x24/0x2c) from [<c00f45d8>] (do_fsync+0x28/0x50)
[<c00f45d8>] (do_fsync+0x28/0x50) from [<c00142a0>] (ret_fast_syscall+0x0/0x2c)
Code: e48a7004 e5985004 e15a0005 0a000029 (e595300c)
---[ end trace 8ee04e42747b7c3c ]---
note: RAVENNA_streame[852] exited with preempt_count 2

Today I compiled the kernel with the debug lock alloc flags. The test
has already been running for some hours, but no error. It sounds
perhaps strange, but there seems to be an issue with max/mean erase
count. To do the debug list tests I changed my hardware. The bug
occurred rather quickly, after some minutes. On the system I used for
my overnight tests, I never saw the bug again, even with a kernel I had
used before to reproduce the bug several times.

>> I compiled the same kernel (and same config) with gcc-4.4.4. The
>> write test has now been running for over 16 hours without error. The
>> next step is to find out whether this is due to a changed timing
>> because of the debug flags or if it's the compiler.

This was the hardware which never showed the bug again ... As mentioned
above, I had the bug also with gcc-4.4.4 on a newer hardware.

> User space tasks running in parallel with the test may play a role.
> Did you turn CONFIG_PREEMPT off? I think memory pressure and other
> effects (not related to UBI/UbiFS) may be needed to trigger the issue.
> We don't normally see this on our systems. The one time it happened,
> an application developer ran some 'ls -R' or 'find .' in parallel with
> a file-intensive feature in our application. I haven't found a test to
> reproduce it reliably.

I have not yet tested with preemption off, but that test will run this
night.

> Fwiw,
> Bill Pringlemeir.

Thanks,
Thorsten
On 12.02.2014 16:18, Wiedemer, Thorsten (Lawo AG) wrote:
>> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>>
>>> Short update (I was out of office the rest of last week). I compiled
>>> the kernel with the debug flags for debug list and lock alloc. The
>>> kernel compiled with gcc-4.8.2 didn't start (no output on the serial
>>> console and reboot of the system). I didn't try (yet) to find out
>>> what happens at startup.
>>
>> You don't need to enable the 'lock alloc' debugging; just the 'debug
>> list' as Richard suggested. One at a time would work and give clues
>> if you can reproduce it.
>
> I tested this, compiled with gcc-4.4.4. I had an error one time, but
> there was no bug report for list handling, only a kernel oops:

Are you able to trigger the bug with function tracing enabled?
If so, please add the ftrace_dump_on_oops kernel parameter, so that we
can see what happened before the oops.

Thanks,
//richard
>> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>>> Short update (I was out of office the rest of last week). I compiled
>>> the kernel with the debug flags for debug list and lock alloc. The
>>> kernel compiled with gcc-4.8.2 didn't start (no output on the serial
>>> console and reboot of the system). I didn't try (yet) to find out
>>> what happens at startup.

> Bill Pringlemeir wrote:
>> You don't need to enable the 'lock alloc' debugging; just the 'debug
>> list' as Richard suggested. One at a time would work and give clues
>> if you can reproduce it.

On 12 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> I tested this, compiled with gcc-4.4.4. I had an error one time, but
> there was no bug report for list handling, only a kernel oops:

> Internal error: Oops: 17 [#1] PREEMPT ARM
> Modules linked in:
> CPU: 0    Tainted: G O (3.6.11 #26)
> PC is at __up_write+0x38/0x168
> LR is at __up_write+0x20/0x168
> pc : [<c0234df0>]    lr : [<c0234dd8>]    psr: a0000093
> sp : c726bc20  ip : c79c83bc  fp : 60000013
> r10: c71fed3c  r9 : 00000001  r8 : c71fed38
> r7 : 00000000  r6 : c726a000  r5 : 00000000  r4 : c71fed20
> r3 : 00000001  r2 : c726a000  r1 : 00000000  r0 : 00000002
> Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
> Control: 0005317f  Table: 87104000  DAC: 00000015
> Process RAVENNA_streame (pid: 852, stack limit = 0xc726a270)
> Stack: (0xc726bc20 to 0xc726c000)
> (__up_write+0x38/0x168) from (leb_write_unlock+0xbc/0x13c)
> (leb_write_unlock+0xbc/0x13c) from (ubi_eba_write_leb+0xa0/0x53c)
> (ubi_eba_write_leb+0xa0/0x53c) from (ubi_leb_write+0xe4/0xe8)
> (ubi_leb_write+0xe4/0xe8) from (ubifs_leb_write+0x9c/0x128)
> (ubifs_leb_write+0x9c/0x128) from (ubifs_wbuf_write_nolock+0x358/0x6f8)
> (ubifs_wbuf_write_nolock+0x358/0x6f8) from (ubifs_jnl_write_data+0x1a0/0x298)
> (ubifs_jnl_write_data+0x1a0/0x298) from (do_writepage+0x8c/0x224)
> (do_writepage+0x8c/0x224) from (__writepage+0x14/0x64)
> (__writepage+0x14/0x64) from (write_cache_pages+0x1cc/0x458)
> (write_cache_pages+0x1cc/0x458) from (generic_writepages+0x40/0x60)
> (generic_writepages+0x40/0x60) from (__filemap_fdatawrite_range+0x64/0x6c)
> (__filemap_fdatawrite_range+0x64/0x6c) from (filemap_write_and_wait_range+0x40/0x6c)
> (filemap_write_and_wait_range+0x40/0x6c) from (ubifs_fsync+0x44/0xb8)
> (ubifs_fsync+0x44/0xb8) from (vfs_fsync_range+0x40/0x44)
> (vfs_fsync_range+0x40/0x44) from (vfs_fsync+0x24/0x2c)
> (vfs_fsync+0x24/0x2c) from (do_fsync+0x28/0x50)
> (do_fsync+0x28/0x50) from (ret_fast_syscall+0x0/0x2c)
> Code: e48a7004 e5985004 e15a0005 0a000029 (e595300c)
> ---[ end trace 8ee04e42747b7c3c ]---
> note: RAVENNA_streame[852] exited with preempt_count 2

$ printf "\x04\x70\x8a\xe4\x04\x50\x98\xe5\x05\x00\x5a\xe1\x29\x00\x00\x0a\x0c\x30\x95\xe5" > crash.dump
$ objdump --disassemble-all -m arm -b binary crash.dump

crash.dump:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   e48a7004        str     r7, [sl], #4
   4:   e5985004        ldr     r5, [r8, #4]
   8:   e15a0005        cmp     sl, r5
   c:   0a000029        beq     0xb8
  10:   e595300c        ldr     r3, [r5, #12]

'r5' is NULL. It seems to be the same symptom. If you run your ARM
objdump with -S on either vmlinux or '__up_write', it will help confirm
that it is the list corrupted again. The assembler above should match.

What is 'RAVENNA_streame'? Is this your standard test and not the '8k
binary' copy test, or are you doing the copy test with this process
also running?

We have 'IRQs off', which makes sense for __up_write. Trying
'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
what went on before.

It might also make sense to dump some 'rwsem_waiter' nodes on the
error? It looks like '__up_write' might normally have an empty list?
Certainly a non-empty 'rwsem_waiter' is going to trigger the condition
more often? I guess I can look to see what might cause this, even if I
can not reproduce it. The 'preempt_count' has been two every time you
have had this; is that true?

> Today I compiled the kernel with the debug lock alloc flags. The test
> has already been running for some hours, but no error. It sounds
> perhaps strange, but there seems to be an issue with max/mean erase
> count. To do the debug list tests I changed my hardware. The bug
> occurred rather quickly, after some minutes. On the system I used for
> my overnight tests, I never saw the bug again, even with a kernel I
> had used before to reproduce the bug several times.

>>> I compiled the same kernel (and same config) with gcc-4.4.4. The
>>> write test has now been running for over 16 hours without error.
>>> The next step is to find out whether this is due to a changed timing
>>> because of the debug flags or if it's the compiler.

> This was the hardware which never showed the bug again ... As
> mentioned above, I had the bug also with gcc-4.4.4 on a newer
> hardware.

I stopped the test on my device at about 1/2 billion mxc_nand
interrupts. My hardware is older and probably has well-worn flash. The
age of the flash can determine how long an erase/write takes. It is
possible the flash chip is altering timing in a way which exposes the
bug, i.e. how long it takes for the MTD to read, write or erase. I
might try your 'triple-8k file test' on a device with newer flash.

I don't think this is a compiler issue, although different compilers
will produce slightly different code with different timing. People have
seen this on many different systems with different Linux versions and
compilers. That said, verification is always good.

Fwiw,
Bill Pringlemeir.
On 12 Feb 2014, bpringlemeir@nbsps.com wrote:

> We have 'IRQs off', which makes sense for __up_write. Trying
> 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
> what went on before. It might also make sense to dump some
> 'rwsem_waiter' nodes on the error? It looks like '__up_write' might
> normally have an empty list? Certainly a non-empty 'rwsem_waiter' is
> going to trigger the condition more often? I guess I can look to see
> what might cause this, even if I can not reproduce it. The
> 'preempt_count' has been two every time you have had this; is that true?

Wouldn't a smaller MTD trigger the condition more often? It looks like
the locking is done per erase block, and several files in the same
erase block with simultaneous reading/writing will trigger this kind of
effect?

Does that sound right, Richard? Will it matter if I use a fixed or
dynamic volume size? Can I make a small UBI/UbiFS MTD partition and use
that for testing? My dynamic partition is about 200MB big. Usually we
never come near filling it, so there is lots of opportunity to use
other erase blocks.

Fwiw,
Bill Pringlemeir.
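For context on the per-erase-block locking discussed above: UBI keeps one lock node per locked logical eraseblock in a small red-black tree (the "ltree"). The following is only a rough, self-contained sketch of that node, paraphrased from drivers/mtd/ubi/ubi.h of that era; the field names also appear in the leb_write_unlock() code quoted later in the thread, but the types here are simplified stand-ins rather than the kernel's struct rb_node and struct rw_semaphore.

/* Sketch only, not the kernel source: shape of UBI's per-LEB lock node. */
#include <stdio.h>

struct rb_node_stub      { void *parent, *left, *right; };
struct rw_semaphore_stub { int activity; void *wait_list; };

struct ubi_ltree_entry {
	struct rb_node_stub rb;         /* node in the ubi->ltree red-black tree */
	int vol_id;                     /* volume the LEB belongs to */
	int lnum;                       /* logical eraseblock number */
	int users;                      /* tasks currently holding or waiting for the lock */
	struct rw_semaphore_stub mutex; /* the per-LEB read/write semaphore */
};

int main(void)
{
	/* Nothing to compute; the point is that one such node exists per
	 * locked (vol_id, lnum) pair and is freed when 'users' drops to 0. */
	printf("per-LEB lock node stub size: %zu bytes\n",
	       sizeof(struct ubi_ltree_entry));
	return 0;
}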
On 12.02.2014 19:21, Bill Pringlemeir wrote:
>
> On 12 Feb 2014, bpringlemeir@nbsps.com wrote:
>
>> We have 'IRQs off', which makes sense for __up_write. Trying
>> 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
>> what went on before. It might also make sense to dump some
>> 'rwsem_waiter' nodes on the error? It looks like '__up_write' might
>> normally have an empty list? Certainly a non-empty 'rwsem_waiter' is
>> going to trigger the condition more often? I guess I can look to see
>> what might cause this, even if I can not reproduce it. The
>> 'preempt_count' has been two every time you have had this; is that true?
>
> Wouldn't a smaller MTD trigger the condition more often? It looks like
> the locking is done per erase block, and several files in the same
> erase block with simultaneous reading/writing will trigger this kind of
> effect?
>
> Does that sound right, Richard? Will it matter if I use a fixed or
> dynamic volume size? Can I make a small UBI/UbiFS MTD partition and use
> that for testing? My dynamic partition is about 200MB big. Usually we
> never come near filling it, so there is lots of opportunity to use
> other erase blocks.

Yeah, I had the same idea and set up an MTD using nandsim.
So far I was unable to trigger the issue.
Let's wait for more results from Thorsten.

Thanks,
//richard
> On 12.02.2014 19:21, Bill Pringlemeir wrote:
>> Does that sound right, Richard? Will it matter if I use a fixed or
>> dynamic volume size? Can I make a small UBI/UbiFS MTD partition and use
>> that for testing? My dynamic partition is about 200MB big. Usually we
>> never come near filling it, so there is lots of opportunity to use
>> other erase blocks.

On 12 Feb 2014, richard@nod.at wrote:

> Yeah, I had the same idea and set up an MTD using nandsim.
> So far I was unable to trigger the issue.
> Let's wait for more results from Thorsten.

Ok, but I would like to try with one of my ARM926 systems. I think that
this is not a simple race. In rwsem-spinlock.c there are two places
where 'rwsem_waiter' structures are 'allocated': '__down_read' and
'__down_write_nested'. The 'rwsem_waiter' is allocated on the kernel
stack and the current task is halted. It is patched into the list of
the 'rwsem' rooted in UBI's ubi_ltree_entry. I.e., the list
'allocations' are spread across various tasks' thread_info. If a task
is killed and/or its 'sp' is not in a good state, we may get a weird
value linked into the 'rwsem' list.

On the ARM926 there are no lock-free primitives except 'swp', and in
order to do a read-modify-write the platform locks interrupts. However,
UbiFS/UBI may be called via a data fault handler, and those can not be
locked. If this is the case, you will never be able to replicate it on
anything but an ARM926 or an architecture with this type of atomic
primitives. This is why I was originally cross-posting to the ARM list.

I think having a better test case would be constructive? At least we
should explore the fact that it is an 'arch dependent' issue? All of
the cases reported seem to be confined to the ARM926. It seems like
many readers/writers on a small UBI partition would be the best type of
test case. I am not sure if UbiFS will actually call UBI to write/read
during a data fault, or is this deferred somehow?

Thanks,
Bill Pringlemeir.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/rwsem-spinlock.c#n123
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/rwsem-spinlock.c#n189
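To make the on-stack waiter pattern described above concrete, here is a small userspace analogue. This is not kernel code; the names and list helpers merely mirror kernel/rwsem-spinlock.c, and the spinlock and scheduling parts are left out. Each blocked task links a node that lives in its own stack frame into the semaphore's wait list, so if a frame is reused or scribbled on while the node is still linked, the list ends up pointing at garbage, which is consistent with the NULL waiter seen in the oopses.

/* Userspace sketch only: a stripped-down analogue of the waiter-on-stack
 * pattern in kernel/rwsem-spinlock.c.  No locking or scheduling is modelled;
 * the point is merely where the list node lives. */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

struct rw_semaphore {			/* simplified: no spinlock, no activity count */
	struct list_head wait_list;
};

struct rwsem_waiter {
	struct list_head list;
	int flags;			/* e.g. RWSEM_WAITING_FOR_WRITE */
};

/* A "blocked task": the waiter node is a local variable, so it exists only
 * for as long as this stack frame does (the kernel would schedule() here
 * until woken, keeping the frame alive). */
static void block_on(struct rw_semaphore *sem)
{
	struct rwsem_waiter waiter;	/* on-stack allocation, as in __down_write */

	waiter.flags = 1;
	list_add_tail(&waiter.list, &sem->wait_list);
	/* returning without unlinking leaves a dangling node behind */
}

int main(void)
{
	struct rw_semaphore sem;

	sem.wait_list.next = sem.wait_list.prev = &sem.wait_list;  /* empty list */
	block_on(&sem);

	/* sem.wait_list.next now points into a dead stack frame; any walker
	 * (such as __rwsem_do_wake) would read whatever reuses that memory. */
	printf("wait_list.next = %p (dangling stack pointer)\n",
	       (void *)sem.wait_list.next);
	return 0;
}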
Hello,

I am a colleague of Thorsten. We tried to activate CONFIG_PREEMPT_NONE
and the error occurred again.

Unable to handle kernel NULL pointer dereference at virtual address 0000000c
pgd = c7370000
[0000000c] *pgd=8717e831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] ARM
Modules linked in: ravenna_portctrl(O) ravenna_sync(O) ravenna_core(O) dallis_metering(O) dallis_audiorouter(O) dallis_multidrop(O) dallis_gpio(O) ravenna_fpga_core(O) marvell_lawo(O) i2c_imx fec fec_imx25_init [last unloaded: ravenna_fpga_loader]
CPU: 0    Tainted: G O (3.6.11 #1)
PC is at __up_write+0x34/0x16c
LR is at leb_write_unlock+0x30/0xb8
pc : [<c0205a3c>]    lr : [<c02a6bf8>]    psr: a0000093
sp : c72ffe28  ip : c79b03bc  fp : c79b9800
r10: c72fff9c  r9 : 00000499  r8 : c720cc7c
r7 : c720cc78  r6 : 00000499  r5 : c79b9800  r4 : c720cc60
r3 : 00000000  r2 : 60000013  r1 : 00000000  r0 : c720cc78
Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: 87370000  DAC: 00000015
Process sync (pid: 7976, stack limit = 0xc72fe270)
Stack: (0xc72ffe28 to 0xc7300000)
fe20:                   c7b2e000 60000013 00000800 c720cc60 c79b9800 00000499
fe40: c7b2e000 00000800 00000499 c72fff9c c79b9800 c02a6bf8 00014800 000007c8
fe60: 0000a800 c02a7268 00000800 c0044aa0 00000000 00000000 c72fe000 60000093
fe80: 00000031 00000020 000c54e7 00000001 00000000 00000000 c79faa00 00000499
fea0: c7b2e000 00000031 00000001 c00dc620 00000000 00000031 00000031 c006612c
fec0: c72fe000 00000000 c04f4c68 00000800 0000a800 00000499 c7b2e000 00000800
fee0: c72fe000 c72fff9c 00000000 c02a61e0 0000a800 00000800 c7b28000 c7b28000
ff00: c7b2e000 0000a800 00000499 c01b96e0 00000800 00000540 c7b28000 c72fe000
ff20: c7ab7490 c7b28000 00000800 00000540 c7b28000 c01ba2ac 00000800 c7402668
ff40: c781d000 00000090 00000002 c7ab74b4 c7ab7490 c01b4be8 c7abf000 c7abf040
ff60: c7abf400 c04f9710 c00dc620 c00dc648 60008400 c00b8fd4 00000001 00000000
ff80: 00000001 be870e24 00000024 c0012b88 00000000 c00dc6f4 00000000 00000001
ffa0: 000d632c c0012a00 000d632c 00000001 00000000 be870e24 000ce4f4 000a1a38
ffc0: 000d632c 00000001 be870e24 00000024 000007c6 00000000 b6f64000 00000000
ffe0: b6ece0e0 be870c7c 000a1a48 b6ece0ec 60000010 00000000 00000000 00000000
[<c0205a3c>] (__up_write+0x34/0x16c) from [<c02a6bf8>] (leb_write_unlock+0x30/0xb8)
[<c02a6bf8>] (leb_write_unlock+0x30/0xb8) from [<c02a7268>] (ubi_eba_write_leb+0xac/0x660)
[<c02a7268>] (ubi_eba_write_leb+0xac/0x660) from [<c02a61e0>] (ubi_leb_write+0xdc/0xf0)
[<c02a61e0>] (ubi_leb_write+0xdc/0xf0) from [<c01b96e0>] (ubifs_leb_write+0x6c/0x128)
[<c01b96e0>] (ubifs_leb_write+0x6c/0x128) from [<c01ba2ac>] (ubifs_wbuf_sync_nolock+0xf8/0x324)
[<c01ba2ac>] (ubifs_wbuf_sync_nolock+0xf8/0x324) from [<c01b4be8>] (ubifs_sync_fs+0x58/0x90)
[<c01b4be8>] (ubifs_sync_fs+0x58/0x90) from [<c00dc648>] (sync_fs_one_sb+0x28/0x2c)
[<c00dc648>] (sync_fs_one_sb+0x28/0x2c) from [<c00b8fd4>] (iterate_supers+0x98/0xb4)
[<c00b8fd4>] (iterate_supers+0x98/0xb4) from [<c00dc6f4>] (sys_sync+0x48/0x98)
[<c00dc6f4>] (sys_sync+0x48/0x98) from [<c0012a00>] (ret_fast_syscall+0x0/0x2c)
Code: e5903004 e58d2004 e1580003 0a00002a (e593200c)
---[ end trace 9e9e220ab164ad69 ]---

We were also able to reproduce the error with the DEBUG_LOCK_ALLOC flag
activated, with the following results.

UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
[<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
[<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
[<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
[<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
[<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
        magic          0x6101831
        crc            0x8ae44db2
        node_type      0 (inode node)
        group_type     0 (no node group)
        sqnum          20716
        len            160
        key            (954, inode)
        creat_sqnum    20501
        size           288
        nlink          4
        atime          1392296839.0
        mtime          1392296832.0
        ctime          1392296839.0
        uid            0
        gid            0
        mode           16893
        flags          0x1
        xattr_cnt      0
        xattr_size     0
        xattr_names    0
        compr_type     0x1
        data len       0
UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
[<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
[<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
[<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
[<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
[<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
        magic          0x6101831
        crc            0x8ae44db2
        node_type      0 (inode node)
        group_type     0 (no node group)
        sqnum          20716
        len            160
        key            (954, inode)
        creat_sqnum    20501
        size           288
        nlink          4
        atime          1392296839.0
        mtime          1392296832.0
        ctime          1392296839.0
        uid            0
        gid            0
        mode           16893
        flags          0x1
        xattr_cnt      0
        xattr_size     0
        xattr_names    0
        compr_type     0x1
        data len       0

After the error occurred the first time, it seems that every UBI file
access triggers the error message.

Best regards,
Emanuel
Hello,

We were able to reproduce the error with function tracing enabled and
got the following output. It was too big for a single pastebin, so I
split it into different pastes.

Function Trace Part 1  http://pastebin.com/ugBsJH83
Function Trace Part 2  http://pastebin.com/tBCjLXLj
Function Trace Part 3  http://pastebin.com/FgUZcN4e
Function Trace Part 4  http://pastebin.com/nebyz4XU

> We were also able to reproduce the error with the DEBUG_LOCK_ALLOC
> flag activated, with the following results.
>
> UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
> UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
> [<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
> [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
> [<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
> [<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
> [<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
> UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
> [<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
> [<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
> [<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
> [<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
> [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
> [<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
> [<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
> [<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
>         magic          0x6101831
>         crc            0x8ae44db2
>         node_type      0 (inode node)
>         group_type     0 (no node group)
>         sqnum          20716
>         len            160
>         key            (954, inode)
>         creat_sqnum    20501
>         size           288

I think the error we have seen in this case is covered by the following
patch and is not related:

http://lists.infradead.org/pipermail/linux-mtd/2013-June/047359.html

We ported the patches to our kernel and will try to reproduce it again.

Best regards,
Emanuel
Hi,

I'm back again now.

> Bill Pringlemeir wrote:
>
> $ printf "\x04\x70\x8a\xe4\x04\x50\x98\xe5\x05\x00\x5a\xe1\x29\x00\x00\x0a\x0c\x30\x95\xe5" > crash.dump
> $ objdump --disassemble-all -m arm -b binary crash.dump
>
> crash.dump:     file format binary
>
> Disassembly of section .data:
>
> 00000000 <.data>:
>    0:   e48a7004        str     r7, [sl], #4
>    4:   e5985004        ldr     r5, [r8, #4]
>    8:   e15a0005        cmp     sl, r5
>    c:   0a000029        beq     0xb8
>   10:   e595300c        ldr     r3, [r5, #12]
>
> 'r5' is NULL. It seems to be the same symptom. If you run your ARM
> objdump with -S on either vmlinux or '__up_write', it will help confirm
> that it is the list corrupted again. The assembler above should match.

I don't have objdump running on my ARM system at the moment, but for
rwsem-spinlock.c compiled with debug info, objdump -S -D gives for
__up_write():

...
        sem->activity = 0;
 29c:   e3a07000        mov     r7, #0
 2a0:   e1a0a008        mov     sl, r8
 2a4:   e48a7004        str     r7, [sl], #4
 2a8:   e5985004        ldr     r5, [r8, #4]
        if (!list_empty(&sem->wait_list))
 2ac:   e15a0005        cmp     sl, r5
 2b0:   0a000029        beq     35c <__up_write+0xe0>
        /* if we are allowed to wake writers try to grant a single write lock
         * if there's a writer at the front of the queue
         * - we leave the 'waiting count' incremented to signify potential
         *   contention */
        if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
 2b4:   e595300c        ldr     r3, [r5, #12]
...

Seems to match ...

> What is 'RAVENNA_streame'? Is this your standard test and not the '8k
> binary' copy test, or are you doing the copy test with this process
> also running?

This is an application which runs in parallel to our copy test. Over
the last days, Emanuel set up another test environment which seems to
reproduce the error more reliably (at least on some hardware, not on
all). At the moment there are proprietary applications running in
parallel, but I'll try to strip it down to a sequence which I can
provide to you, if you like.

> We have 'IRQs off', which makes sense for __up_write. Trying
> 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
> what went on before. It might also make sense to dump some
> 'rwsem_waiter' nodes on the error? It looks like '__up_write' might
> normally have an empty list? Certainly a non-empty 'rwsem_waiter' is
> going to trigger the condition more often? I guess I can look to see
> what might cause this, even if I can not reproduce it. The
> 'preempt_count' has been two every time you have had this; is that true?

We could reproduce the error now with function tracing enabled, so we
have two hopefully valuable traces. But they are rather big (around 4MB
each). Shall I use pastebin and cut them into several pieces to provide
them? Or send them off-list as email attachments? The trace Emanuel
posted Wednesday may not be valuable. Perhaps there is a (different)
error triggered due to memory pressure caused by the function tracing.

Best regards,
Thorsten
>> Bill Pringlemeir wrote:
>> Disassembly of section .data:
>> 00000000 <.data>:
>>    0:   e48a7004        str     r7, [sl], #4
>>    4:   e5985004        ldr     r5, [r8, #4]
>>    8:   e15a0005        cmp     sl, r5
>>    c:   0a000029        beq     0xb8
>>   10:   e595300c        ldr     r3, [r5, #12]
>> 'r5' is NULL. It seems to be the same symptom. If you run your ARM
>> objdump with -S on either vmlinux or '__up_write', it will help
>> confirm that it is the list corrupted again. The assembler above
>> should match.

On 20 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> I don't have objdump running on my ARM system at the moment, but for
> rwsem-spinlock.c compiled with debug info, objdump -S -D gives for
> __up_write():
> ...
>         sem->activity = 0;
>  29c:   e3a07000        mov     r7, #0
>  2a0:   e1a0a008        mov     sl, r8
>  2a4:   e48a7004        str     r7, [sl], #4
>  2a8:   e5985004        ldr     r5, [r8, #4]
>         if (!list_empty(&sem->wait_list))
>  2ac:   e15a0005        cmp     sl, r5
>  2b0:   0a000029        beq     35c <__up_write+0xe0>
>         /* if we are allowed to wake writers try to grant a single write lock
>          * if there's a writer at the front of the queue
>          * - we leave the 'waiting count' incremented to signify potential
>          *   contention */
>         if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
>  2b4:   e595300c        ldr     r3, [r5, #12]
> ...
> Seems to match ...

It doesn't matter where it runs. I just want to make sure it is always
the 'waiter' variable.

>> What is 'RAVENNA_streame'? Is this your standard test and not the
>> '8k binary' copy test, or are you doing the copy test with this
>> process also running?

> This is an application which runs in parallel to our copy test. Over
> the last days, Emanuel set up another test environment which seems to
> reproduce the error more reliably (at least on some hardware, not on
> all). At the moment there are proprietary applications running in
> parallel, but I'll try to strip it down to a sequence which I can
> provide to you, if you like.

I think scheduling is important to this issue, that is why I asked.

> We could reproduce the error now with function tracing enabled, so we
> have two hopefully valuable traces. But they are rather big (around
> 4MB each). Shall I use pastebin and cut them into several pieces to
> provide them? Or send them off-list as email attachments? The trace
> Emanuel posted Wednesday may not be valuable. Perhaps there is a
> (different) error triggered due to memory pressure caused by the
> function tracing.

After looking, the allocation is not due to memory pressure. It is due
to different tasks waiting on the rwsem with 'waiter' allocated on the
stack; I guess the task is gone, handling a signal or something else.
However, the function traces are great. As you note, they are rather
big, so it will take anyone some time to analyze them.

You could alter '__rwsem_do_wake',

 static inline struct rw_semaphore *
 __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 {
        struct rwsem_waiter *waiter;
        struct task_struct *tsk;
        int woken;

        waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+       if (!waiter) {
+               printk("Bad rwsem\n");
+               printk("activity is %d.\n", sem->activity);
+               BUG();
+       }

        if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
                if (wakewrite)
 ...

or something like that.

 * the rw-semaphore definition
 * - if activity is 0 then there are no active readers or writers
 * - if activity is +ve then that is the number of active readers
 * - if activity is -1 then there is one active writer
 * - if wait_list is not empty, then there are processes waiting...

It seems inconsistent to have a non-empty list with activity as 0 as
well? The above is trying to trace when we find a 'NULL' in the
'wait_list', which always seems to be the issue, but probably not the
root cause.

You can also put similar code in '__rwsem_wake_one_writer' if you
instead get the 'up_read()' fault.

Fwiw,
Bill Pringlemeir.
On 20 Feb 2014, bpringlemeir@nbsps.com wrote:

> You could alter '__rwsem_do_wake',
>
>  static inline struct rw_semaphore *
>  __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
>  {
>         struct rwsem_waiter *waiter;
>         struct task_struct *tsk;
>         int woken;
>
>         waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> +       if (!waiter) {
> +               printk("Bad rwsem\n");
> +               printk("activity is %d.\n", sem->activity);
> +               BUG();
> +       }
>
>         if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
>                 if (wakewrite)
>
> ... or something like that.
>
> * the rw-semaphore definition
> * - if activity is 0 then there are no active readers or writers
> * - if activity is +ve then that is the number of active readers
> * - if activity is -1 then there is one active writer
> * - if wait_list is not empty, then there are processes waiting...
>
> It seems inconsistent to have a non-empty list with activity as 0 as
> well? The above is trying to trace when we find a 'NULL' in the
> 'wait_list', which always seems to be the issue, but probably not the
> root cause.
>
> You can also put similar code in '__rwsem_wake_one_writer' if you
> instead get the 'up_read()' fault.

Sorry, this code has to go in the callers before they write to
'activity'.

 void __up_write(struct rw_semaphore *sem)
 {
        unsigned long flags;
+       struct rwsem_waiter *waiter;

        raw_spin_lock_irqsave(&sem->wait_lock, flags);

+       waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+       if (!waiter) {
+               printk("Bad rwsem\n");
+               printk("activity is %d.\n", sem->activity);
+               BUG();
+       }
        sem->activity = 0;
        if (!list_empty(&sem->wait_list))
                sem = __rwsem_do_wake(sem, 1);

        raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
 }

__up_read() is the other one. Or you wait to see if the function traces
are helpful.
Hi,

> Thorsten Wiedemer wrote:
>
> We could reproduce the error now with function tracing enabled, so we
> have two hopefully valuable traces. But they are rather big (around
> 4MB each). Shall I use pastebin and cut them into several pieces to
> provide them? Or send them off-list as email attachments?

Here I have the last parts of the two hopefully "valuable" traces. The
pieces show what happens between the leb_write_lock() and the
leb_write_unlock() of the process which triggers the oops. If this is
not enough, I can also provide more ... Take care with the PIDs; there
are several processes with the same name running. Both traces did not
result from identical test cases, so there are some different processes
running.

The part of the first trace fits in one paste:

http://pastebin.com/TL3yNVcw

The part of the second trace is bigger than 500k, so it is split into
two pastes:

http://pastebin.com/1qMt0Pqg
http://pastebin.com/ZLXU17wx

We'll have a closer look at that.

Regards,
Thorsten
> The part of the first trace fits in one paste:
>
> http://pastebin.com/TL3yNVcw
>
> The part of the second trace is bigger than 500k, so it is split into
> two pastes:
>
> http://pastebin.com/1qMt0Pqg
> http://pastebin.com/ZLXU17wx
>
> We'll have a closer look at that.
>
> Regards,
> Thorsten

Hello mtd-linux mailing list,

My name is Benoit and I work together with Thorsten and Emanuel on
this. Even though I am not a kernel expert, I tried to take a deep look
at the two ftraces we've got, in order to find differences and
similarities and maybe get a clue about what happened differently in
the failed operations. I put my effort into looking at what happens
between leb_write_lock and leb_write_unlock all over the traces (not
only the partial traces committed to pastebin).

What I can already tell is that in both traces I can see ubifs_orphan_*
operations in very similar ways, and only once in each trace:

1- One process invokes ubifs_add_orphan
   ... some processes invoke the ubifs api ...
2- One other process (a sync) invokes ubifs_orphan_start_commit
   ... some processes invoke the ubifs api ...
3- A third process invokes ubifs_delete_orphan
   ... some processes invoke the ubifs api ...
4- The process in which the failure finally arises invokes
   leb_write_lock (starting from here you can follow the trace in the
   first pastebin, line 112)
   ... some processes invoke the ubifs api ...
5- The sync process of 2- invokes ubifs_orphan_end_commit (line 614)
   and gets interrupted by other processes before it finishes (line 734
   and several processes after that point). This process resumes at
   line 2336.
   ... some processes invoke the ubifs api in the meanwhile ...
6- The failing process finally comes to leb_write_unlock (line 4332)
   and goes to the Oops.

I will continue to take a deep look at the traces to see if I find
things more closely related to the semaphore you pointed out.

Regards,
Benoit Quiniou
On 21 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> Here I have the last parts of the two hopefully "valuable" traces. The
> pieces show what happens between the leb_write_lock() and the
> leb_write_unlock() of the process which triggers the oops. If this is
> not enough, I can also provide more ... Take care with the PIDs; there
> are several processes with the same name running. Both traces did not
> result from identical test cases, so there are some different
> processes running.

> The part of the first trace fits in one paste:

> http://pastebin.com/TL3yNVcw

I am not sure if this is a complete trace? I don't understand why we
would start off with 'leb_write_unlock'... However,

$ grep -E 'leb|ltree' ubi.crash.raw    # slight edits
  sync-7493  0....  1348758715us : leb_write_unlock <-ubi_eba_write_leb
  sync-7493  0....  1348758719us : add_preempt_count <-leb_write_unlock
  sync-7493  0...1  1348758724us : up_write <-leb_write_unlock
  sync-7493  0...1  1348758739us : kfree <-leb_write_unlock
  sync-7493  0...1  1348758746us : sub_preempt_count <-leb_write_unlock
  sync-7493  0....  1348758954us : ubifs_leb_write <-ubifs_wbuf_sync_nolock
  sync-7493  0....  1348758960us : ubi_leb_write <-ubifs_leb_write
  sync-7493  0....  1348758965us : ubi_eba_write_leb <-ubi_leb_write
  sync-7493  0....  1348758969us : leb_write_lock <-ubi_eba_write_leb
  sync-7493  0....  1348758973us : ltree_add_entry <-leb_write_lock
  sync-7493  0....  1348758977us : kmem_cache_alloc_trace <-ltree_add_entry
  sync-7493  0....  1348758983us : add_preempt_count <-ltree_add_entry
  sync-7493  0...1  1348758989us : sub_preempt_count <-ltree_add_entry
  sync-7493  0....  1348758994us : kfree <-ltree_add_entry
  sync-7493  0....  1348758998us : down_write <-leb_write_lock
  sync-7493  0....  1348759010us : ubi_io_write <-ubi_eba_write_leb
 < many reschedules, but sync-7493 still active. >
  sync-7492  0....  1348761397us : leb_write_unlock <-ubi_eba_write_leb
  sync-7492  0....  1348761400us : add_preempt_count <-leb_write_unlock
  sync-7492  0...1  1348761406us : up_write <-leb_write_unlock
  sync-7492  0...1  1348761419us : kfree <-leb_write_unlock
  sync-7492  0...1  1348761425us : sub_preempt_count <-leb_write_unlock
  sync-7492  0....  1348761580us : ubifs_leb_write <-ubifs_write_node
  sync-7492  0....  1348761585us : ubi_leb_write <-ubifs_leb_write
  sync-7492  0....  1348761591us : ubi_eba_write_leb <-ubi_leb_write
  sync-7492  0....  1348761595us : leb_write_lock <-ubi_eba_write_leb
  sync-7492  0....  1348761599us : ltree_add_entry <-leb_write_lock
  sync-7492  0....  1348761603us : kmem_cache_alloc_trace <-ltree_add_entry
  sync-7492  0....  1348761609us : add_preempt_count <-ltree_add_entry
  sync-7492  0...1  1348761615us : sub_preempt_count <-ltree_add_entry
  sync-7492  0....  1348761619us : kfree <-ltree_add_entry
  sync-7492  0....  1348761622us : down_write <-leb_write_lock
  sync-7492  0....  1348761635us : ubi_io_write <-ubi_eba_write_leb
  sync-7492  0....  1348771081us : leb_write_unlock <-ubi_eba_write_leb
  sync-7492  0....  1348771084us : add_preempt_count <-leb_write_unlock
  sync-7492  0...1  1348771090us : up_write <-leb_write_unlock
  sync-7492  0...1  1348771102us : kfree <-leb_write_unlock
  sync-7492  0...1  1348771109us : sub_preempt_count <-leb_write_unlock
  sync-7492  0....  1348771269us : ubifs_leb_write <-ubifs_write_node
  sync-7492  0....  1348771275us : ubi_leb_write <-ubifs_leb_write
  sync-7492  0....  1348771280us : ubi_eba_write_leb <-ubi_leb_write
  sync-7492  0....  1348771285us : leb_write_lock <-ubi_eba_write_leb
  sync-7492  0....  1348771289us : ltree_add_entry <-leb_write_lock
  sync-7492  0....  1348771292us : kmem_cache_alloc_trace <-ltree_add_entry
  sync-7492  0....  1348771299us : add_preempt_count <-ltree_add_entry
  sync-7492  0...1  1348771304us : sub_preempt_count <-ltree_add_entry
  sync-7492  0....  1348771308us : kfree <-ltree_add_entry
  sync-7492  0....  1348771311us : down_write <-leb_write_lock
  sync-7492  0....  1348771324us : ubi_io_write <-ubi_eba_write_leb
 < many reschedules and sync-7493 still in ubi_eba_write_leb >
  sync-7493  0....  1348781410us : leb_write_unlock <-ubi_eba_write_leb
  sync-7493  0....  1348781413us : add_preempt_count <-leb_write_unlock
  sync-7493  0...1  1348781418us : up_write <-leb_write_unlock
  ...

I am not sure that we can know which LEBs are involved. However, I see
cases for a double free and other issues. I think we need
'atomic_dec_and_test()' on the leb->users and to check this in the tree
lookup. For instance, in 'leb_write_unlock()', the call to 'up_write()'
can cause a reschedule. Say we enter 'leb_write_unlock()' with
'users = 2'. The count will decrement to one and then we may reschedule
in 'up_write'. The 2nd UBI task may decrement users, inspect the count
and call 'kfree'. Then we return to the original 'leb_write_unlock()',
we will inspect 'le->users' and it will be zero.

static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
{
	struct ubi_ltree_entry *le;

	spin_lock(&ubi->ltree_lock);
	le = ltree_lookup(ubi, vol_id, lnum);
	le->users -= 1;
	ubi_assert(le->users >= 0);
	up_write(&le->mutex);          /* can cause reschedule */
	if (le->users == 0) {
		rb_erase(&le->rb, &ubi->ltree);
		kfree(le);
	}
	spin_unlock(&ubi->ltree_lock);
}

Maybe there are better kernel designs/APIs that can wrap this. However,
I don't think the 'ubi_ltree_entry' is perfectly race free.

On the IMX series, the MTD driver does a 'read-modify-write' to support
sub-pages; the driver can not write a sub-page at a time. The MTD
driver is also interrupt driven, and several reschedules happened
during ubi_io_write().

I don't think the spin_lock() will do anything on a UP system like the
ARM926's that have encountered this issue. My compiler is reloading
'le->users' in the 'if' path.

It seems odd that the 'wait_list' in the rwsem would always be NULL in
this case. However, the wake may have rescheduled the blocked task and
the stack was re-used and set to NULL/zero. I think races with
'le->users' are an issue; I don't know that it is this issue. I think
it is also possible for 'ltree_lookup()' to return an entry with
'le->users = 0', which would indicate someone is about to call
'kfree()'. I guess this is what Artem suspected?

Fwiw,
Bill Pringlemeir.
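Purely as an illustration of the 'atomic_dec_and_test()' idea mentioned above (this is not the UBI code and not a proposed kernel patch; the names are made up, and whether such a change is even needed is questioned in the reply that follows), the "free only if we were the last user" pattern can be sketched in userspace with C11 atomics:

/* Hypothetical userspace sketch of an atomic "last user frees" pattern.
 * It is not drivers/mtd/ubi code; it only shows the semantics of an
 * atomic_dec_and_test()-style teardown. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct le_stub {
	atomic_int users;		/* stand-in for le->users */
	/* the rb-tree node and rw_semaphore are omitted */
};

/* Drop one reference; only the caller that takes the count to zero frees
 * the node, so two concurrent unlockers cannot both run the teardown. */
static void put_le(struct le_stub *le)
{
	if (atomic_fetch_sub(&le->users, 1) == 1)
		free(le);		/* previous value 1: we were the last user */
}

int main(void)
{
	struct le_stub *le = malloc(sizeof(*le));

	atomic_init(&le->users, 2);	/* two writers holding the LEB, as in the trace */
	put_le(le);			/* 2 -> 1: nothing freed */
	put_le(le);			/* 1 -> 0: freed exactly once */
	puts("node freed exactly once");
	return 0;
}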
On 21.02.2014 18:53, Bill Pringlemeir wrote:
> On 21 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>
>> Here I have the last parts of the two hopefully "valuable" traces. The
>> pieces show what happens between the leb_write_lock() and the
>> leb_write_unlock() of the process which triggers the oops.

[snip]

> I don't think the spin_lock() will do anything on a UP system like the
> ARM926's that have encountered this issue.

Also on UP a spin_lock disables preemption.
So, le->users is protected.

Thanks,
//richard
> On 21.02.2014 18:53, Bill Pringlemeir wrote:
>> I don't think the spin_lock() will do anything on a UP system like the
>> ARM926's that have encountered this issue.

On 21 Feb 2014, richard@nod.at wrote:
> Also on UP a spin_lock disables preemption.
> So, le->users is protected.

Thanks, now it makes sense ...

> sync-7493 0.... 1348758977us : kmem_cache_alloc_trace <-ltree_add_entry
> sync-7493 0.... 1348758983us : add_preempt_count <-ltree_add_entry
> sync-7493 0...1 1348758989us : sub_preempt_count <-ltree_add_entry
> sync-7493 0.... 1348758994us : kfree <-ltree_add_entry

... that those are the spinlocks. I say something dumb and get to learn
something.

It looks like ubi_eba_write_leb() is running retries when this happens
in the pre-empted sync-7493, and then sync-7492 goes in and modifies the
tree and some entries. If the ltree lock is different from the leb write
mutex, could the 'down_write' and 'up_write' use different mutexes? I
guess the 'lnum' is the same and the 'users' count should prevent this.
I don't see any clues from the function tracing then.

Also, it seems that 'sem->wait_list.next' is NULL and this is in the
space allocated by the 'ltree_entry'. I.e., the head of the list has
NULL; I was thinking something within the list was NULL.

Regards,
Bill Pringlemeir.
>> On 21.02.2014 18:53, Bill Pringlemeir wrote:
>>> I don't think the spin_lock() will do anything on a UP system like
>>> the ARM926's that have encountered this issue.

> On 21 Feb 2014, richard@nod.at wrote:
>> Also on UP a spin_lock disables preemption. So, le->users is
>> protected.

On 21 Feb 2014, bpringlemeir@nbsps.com wrote:
> Thanks, now it makes sense ...

[snip]

> Also, it seems that 'sem->wait_list.next' is NULL and this is in the
> space allocated by the 'ltree_entry'. I.e., the head of the list has
> NULL; I was thinking something within the list was NULL.

I think 'ubi_eba_copy_leb()' may have an issue.

int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
		     struct ubi_vid_hdr *vid_hdr)
{
	...

	err = leb_write_trylock(ubi, vol_id, lnum);

static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
{
	...
	le = ltree_add_entry(ubi, vol_id, lnum);	/* users + 1 */
	if (IS_ERR(le))
		return PTR_ERR(le);
	if (down_write_trylock(&le->mutex))
		return 0;

	/* Contention, cancel */
	spin_lock(&ubi->ltree_lock);
	le->users -= 1;					/* users - 1 */
	...
	spin_unlock(&ubi->ltree_lock);

	return 1;
}

	if (err)...

	if (vol->eba_tbl[lnum] != from) {
		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
		       vol_id, lnum, from, vol->eba_tbl[lnum]);
		err = MOVE_CANCEL_RACE;
		goto out_unlock_leb;
	}
	...
out_unlock_leb:
	leb_write_unlock(ubi, vol_id, lnum);		/* users - 1 */

Didn't the count have to bump up in this case? This isn't in Thorsten's
traces, but neither are any 'down_read' or 'up_read' traces; maybe
everything is in cache?

Regards,
Bill Pringlemeir.
On 22.02.2014 01:49, Bill Pringlemeir wrote:
> I think 'ubi_eba_copy_leb()' may have an issue.

[snip]

> Didn't the count have to bump up in this case? This isn't in Thorsten's
> traces, but neither are any 'down_read' or 'up_read' traces; maybe
> everything is in cache?

Hmm, I'm not sure whether I was able to follow your thought. But
leb_write_unlock() is balanced with leb_write_trylock() in
ubi_eba_copy_leb(), which makes perfect sense to me. What exactly is
the problem?

Thanks,
//richard
On 22 Feb 2014, richard@nod.at wrote:
> On 22.02.2014 01:49, Bill Pringlemeir wrote:
>> I think 'ubi_eba_copy_leb()' may have an issue.

[snip]

>> Didn't the count have to bump up in this case? This isn't in Thorsten's
>> traces, but neither are any 'down_read' or 'up_read' traces; maybe
>> everything is in cache?

> Hmm, I'm not sure whether I was able to follow your thought. But
> leb_write_unlock() is balanced with leb_write_trylock() in
> ubi_eba_copy_leb(), which makes perfect sense to me. What exactly is
> the problem?

There are two things that must be balanced: the reference count
'ubi_ltree_entry->users' and the rw_semaphore down/up. You are right,
the trylock needs to be balanced by the 'leb_write_unlock'. However,
'leb_write_trylock()' has already decremented 'users' in preparation
to move the 'lnum'. In the case of contention, 'ubi_eba_copy_leb' then
bails and does the 'leb_write_unlock()', which balances the 'trylock'
but unbalances the 'users' reference count (I added some comments on
those lines in the earlier mail).

I was postulating that the down/up paths are not on the same semaphore.
There is nothing in the code to check this. It is possible for someone
to come in and recycle the 'lnum' and create a new rw_semaphore. Here,
the down/up are done on different rw_semaphores and this will cause
issues.

I was looking at using a 'kref' instead of 'users' and then I found this
path.

Fwiw,
Bill Pringlemeir.
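For reference, a rough sketch of the kref-style bookkeeping mentioned
above is shown here. This is not a proposed patch: the structure and
helper names (everything except the kref, rbtree and rwsem APIs
themselves) are hypothetical, and the locking needed around the rb-tree
removal is only hinted at in a comment.

/*
 * Hypothetical sketch only: 'users' replaced by a struct kref so that
 * the decrement and the "last user frees it" decision become a single
 * atomic operation.
 */
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

struct ltree_entry_sketch {
	struct rb_node rb;
	int vol_id;
	int lnum;
	struct kref ref;		/* replaces 'int users' */
	struct rw_semaphore mutex;
};

static void ltree_entry_release(struct kref *ref)
{
	struct ltree_entry_sketch *le =
		container_of(ref, struct ltree_entry_sketch, ref);

	/* The real code would also need rb_erase(&le->rb, &ubi->ltree)
	 * here, with ubi->ltree_lock held around the kref_put(). */
	kfree(le);
}

/* Unlock path: release the rwsem first, then drop our reference. */
static void leb_write_unlock_sketch(struct ltree_entry_sketch *le)
{
	up_write(&le->mutex);
	kref_put(&le->ref, ltree_entry_release);
}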
On 24.02.2014 16:09, Bill Pringlemeir wrote:
> On 22 Feb 2014, richard@nod.at wrote:
>> Hmm, I'm not sure whether I was able to follow your thought. But
>> leb_write_unlock() is balanced with leb_write_trylock() in
>> ubi_eba_copy_leb(), which makes perfect sense to me. What exactly is
>> the problem?

[snip]

> There are two things that must be balanced: the reference count
> 'ubi_ltree_entry->users' and the rw_semaphore down/up. You are right,
> the trylock needs to be balanced by the 'leb_write_unlock'. However,
> 'leb_write_trylock()' has already decremented 'users' in preparation
> to move the 'lnum'. In the case of contention, 'ubi_eba_copy_leb' then
> bails and does the 'leb_write_unlock()', which balances the 'trylock'
> but unbalances the 'users' reference count (I added some comments on
> those lines in the earlier mail).

My first thought here is: "If this is true, why does
ubi_assert(le->users >= 0) not trigger?"

Thanks,
//richard
>> On 22 Feb 2014, richard@nod.at wrote:
>>> Hmm, I'm not sure whether I was able to follow your thought. But
>>> leb_write_unlock() is balanced with leb_write_trylock() in
>>> ubi_eba_copy_leb(), which makes perfect sense to me. What exactly is
>>> the problem?

> On 24.02.2014 16:09, Bill Pringlemeir wrote:
>> There are two things that must be balanced: the reference count
>> 'ubi_ltree_entry->users' and the rw_semaphore down/up. You are right,
>> the trylock needs to be balanced by the 'leb_write_unlock'. However,
>> 'leb_write_trylock()' has already decremented 'users' in preparation
>> to move the 'lnum'. In the case of contention, 'ubi_eba_copy_leb' then
>> bails and does the 'leb_write_unlock()', which balances the 'trylock'
>> but unbalances the 'users' reference count (I added some comments on
>> those lines in the earlier mail).

On 24 Feb 2014, richard@nod.at wrote:
> My first thought here is: "If this is true, why does
> ubi_assert(le->users >= 0) not trigger?"

A call to 'ltree_add_entry()' may add a completely new entry for the
'lnum'. The context switches may happen at any point where the spinlock
is not held. Here is ubi_eba_copy_leb() with just the mutex and the
reference count:

  leb_write_trylock -> ltree_add_entry(ubi, vol_id, lnum)
                       creates a new entry or finds the old one.
  /* could reschedule here... */
  leb_write_trylock -> down_write_trylock: we have the write rwsem.
  /* could reschedule here... */
  leb_write_trylock -> take the spinlock and decrement 'users'.
  /* could reschedule here... */
  on 'if (vol->eba_tbl[lnum] != from)' another thread has this
  'ltree_entry', so the count is > 1.
  /* could reschedule here... */
  call leb_write_unlock() and destroy an in-use ltree_entry.

Anyone calling 'ltree_add_entry' may create a new entry. Also, as the
entry has been freed, the memory will be recycled and 'users' could be
anything in a freed node.

It is puzzling, if this is related to the problem that Thorsten and
others have seen, that the 'assert' never fires. However, this path
seems to violate the reference count and double-decrements. I am pretty
sure it is an issue, although it may be unrelated and latent (never
triggered). However, some of the same 'suspects' are involved, so I
think it is a possibility to explore.

Fwiw,
Bill Pringlemeir.
On 24 Feb 2014, bpringlemeir@nbsps.com wrote:

> /* could reschedule here... */
> on 'if (vol->eba_tbl[lnum] != from)' another thread has this
> 'ltree_entry', so the count is > 1.
> /* could reschedule here... */

Sorry, maybe that is not true. I don't know enough about the logic of
ubi_eba_copy_leb(). Another thread has at least changed the physical
map. It may or may not still have the rw_semaphore.
On 24.02.2014 16:48, Bill Pringlemeir wrote:
> On 24 Feb 2014, bpringlemeir@nbsps.com wrote:
>
>> /* could reschedule here... */
>> on 'if (vol->eba_tbl[lnum] != from)' another thread has this
>> 'ltree_entry', so the count is > 1.
>> /* could reschedule here... */
>
> Sorry, maybe that is not true. I don't know enough about the logic of
> ubi_eba_copy_leb(). Another thread has at least changed the physical
> map. It may or may not still have the rw_semaphore.

Thorsten and Emanuel from Lawo AG kindly gave me access to one of their
boards, such that I was able to work directly on the issue.

The solution is rather trivial. Two commits were missing in their tree:

commit 8afd500cb52a5d00bab4525dd5a560d199f979b9
Author: Adam Thomas <adamthomas1111@gmail.com>
Date:   Sat Feb 2 22:35:08 2013 +0000

    UBIFS: fix double free of ubifs_orphan objects

and

commit 2928f0d0c5ebd6c9605c0d98207a44376387c298
Author: Adam Thomas <adamthomas1111@gmail.com>
Date:   Sat Feb 2 22:32:31 2013 +0000

    UBIFS: fix use of freed ubifs_orphan objects

Bill, I'm very sure this also fixes the issue you face.

It is not obvious that these two commits fix the issue. I've spent some
time figuring out what exactly happens. In most cases up_write() crashes
because "waiter" in __rwsem_do_wake() or __rwsem_wake_one_writer()
somehow became NULL. It turned out that sem->wait_list gets overwritten.

"UBIFS: fix double free of ubifs_orphan objects" fixes more than a
double free. It addresses an issue where an orphan stays too long on the
orphan list and therefore could be free()'ed twice. But another issue is
that in ubifs_orphan_start_commit() an already free()'ed orphan is set
to NULL, namely by the line "*last = NULL;". This is the line of code
where the memory corruption happens. I've double-checked it by adding a
few trace_printk() and BUG_ON().

Thanks,
//richard
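The corruption mechanism described above can be pictured with a small
user-space sketch: a cached pointer into an object that has already been
freed, whose memory is then reused by an unrelated structure, gets
written through and silently clears a field of the new owner. The types
and names below are stand-ins, not the UBIFS orphan code, and whether
the allocator reuses the exact chunk is allocator-dependent, so the
addresses are printed to check.

/*
 * Illustration only -- not the UBIFS code.  'fake_orphan' stands in for
 * struct ubifs_orphan and 'fake_ltree_entry' for the object whose
 * rw_semaphore wait_list head ends up NULL.  The delayed "*last = NULL"
 * store lands in memory that has been freed and reused.
 */
#include <stdio.h>
#include <stdlib.h>

struct fake_orphan {
	struct fake_orphan *next;
	long inum;
};

struct fake_ltree_entry {
	void *wait_list_next;	/* stands in for sem->wait_list.next */
	int users;
};

int main(void)
{
	struct fake_orphan *o = malloc(sizeof(*o));
	struct fake_orphan **last;
	struct fake_ltree_entry *le;
	void *old_addr;

	o->next = NULL;
	o->inum = 42;
	last = &o->next;	/* pointer into 'o' is cached ... */

	old_addr = o;
	free(o);		/* ... but 'o' is freed too early */

	/* An allocation of similar size will often get the same memory
	 * back (allocator-dependent; compare the printed addresses). */
	le = malloc(sizeof(*le));
	le->wait_list_next = le;	/* a valid, non-NULL list head */
	le->users = 1;

	printf("orphan was at %p, new entry is at %p\n", old_addr, (void *)le);

	*last = NULL;		/* the delayed store corrupts 'le' */

	printf("wait_list_next is now %p\n", le->wait_list_next);
	free(le);
	return 0;
}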
On 5 Mar 2014, richard@nod.at wrote:

> Thorsten and Emanuel from Lawo AG kindly gave me access to one of
> their boards, such that I was able to work directly on the issue.

> The solution is rather trivial. Two commits were missing in their tree:

> commit 8afd500cb52a5d00bab4525dd5a560d199f979b9
> Author: Adam Thomas <adamthomas1111@gmail.com>
> Date:   Sat Feb 2 22:35:08 2013 +0000

>     UBIFS: fix double free of ubifs_orphan objects

> and

> commit 2928f0d0c5ebd6c9605c0d98207a44376387c298
> Author: Adam Thomas <adamthomas1111@gmail.com>
> Date:   Sat Feb 2 22:32:31 2013 +0000

>     UBIFS: fix use of freed ubifs_orphan objects

> Bill, I'm very sure this also fixes the issue you face.

I haven't seen this issue recently. It was only with older versions. I
have both of those commits as well and wasn't able to reproduce it. So I
have no issues.

Still, the logic of ubi_eba_copy_leb() seems really weird to me. I don't
understand

	err = leb_write_trylock(ubi, vol_id, lnum);

followed by

	leb_write_unlock(ubi, vol_id, lnum);

It looks like a double reference count decrement. Maybe they operate on
different ltree nodes, but then the locks mismatch.

Fwiw,
Bill Pringlemeir.
On 5 Mar 2014, bpringlemeir@nbsps.com wrote:

> On 5 Mar 2014, richard@nod.at wrote:

>> The solution is rather trivial.

[snip]

> Still, the logic of ubi_eba_copy_leb() seems really weird to me. I
> don't understand

> It looks like a double reference count decrement. Maybe they operate on
> different ltree nodes, but then the locks mismatch.

	if (down_write_trylock(&le->mutex))
		return 0;

Ah, I misunderstood. The path where the lock is taken doesn't decrement
the 'users' count; I was thinking of '0' as fail.

Regards,
Bill Pringlemeir.
On 05.03.2014 22:42, Bill Pringlemeir wrote:
> On 5 Mar 2014, bpringlemeir@nbsps.com wrote:
>
>> On 5 Mar 2014, richard@nod.at wrote:
>
>>> The solution is rather trivial.
>
> [snip]
>
>> Still, the logic of ubi_eba_copy_leb() seems really weird to me. I
>> don't understand
>
>> It looks like a double reference count decrement. Maybe they operate on
>> different ltree nodes, but then the locks mismatch.
>
> 	if (down_write_trylock(&le->mutex))
> 		return 0;
>
> Ah, I misunderstood. The path where the lock is taken doesn't decrement
> the 'users' count; I was thinking of '0' as fail.

Welcome to UNIX, where 0 is success. :-)

Thanks,
//richard
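The two conventions really are opposite, which is what caused the
confusion here: the kernel's down_write_trylock() returns non-zero when
the lock was acquired and 0 on contention, while the user-space POSIX
rwlock follows the "0 is success" rule. A tiny user-space illustration
of the latter (the pthread calls are real; the scenario is contrived):

/*
 * pthread_rwlock_trywrlock() returns 0 on success and an error number
 * (EBUSY here, since the lock is already write-locked) on contention --
 * the inverse of the kernel's down_write_trylock() convention.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
	int err;

	err = pthread_rwlock_trywrlock(&lock);	/* 0 == got the lock */
	printf("first trywrlock:  %d\n", err);

	err = pthread_rwlock_trywrlock(&lock);	/* EBUSY: already locked */
	printf("second trywrlock: %d (EBUSY is %d)\n", err, EBUSY);

	pthread_rwlock_unlock(&lock);
	pthread_rwlock_destroy(&lock);
	return 0;
}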
diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
index 0e11671d..48fd2aa 100644
--- a/drivers/mtd/ubi/eba.c
+++ b/drivers/mtd/ubi/eba.c
@@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
 
 	spin_lock(&ubi->ltree_lock);
 	le = ltree_lookup(ubi, vol_id, lnum);
+	ubi_assert(le);
 	le->users -= 1;
 	ubi_assert(le->users >= 0);
 	up_write(&le->mutex);