
AW: UBI leb_write_unlock NULL pointer Oops (continuation)

Message ID D7B1B5F4F3F27A4CB073BF422331203F2A18A7474A@Exchange1.lawo.de
State Superseded

Commit Message

Wiedemer, Thorsten (Lawo AG) Feb. 4, 2014, 5:01 p.m. UTC
Hi,

I made a "hardcore test" with:
$ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
$ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
$ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &

It took about 2-3 hours until I had an error (two times):

First time:
Internal error: Oops: 17 [#1] PREEMPT ARM
Modules linked in:
CPU: 0    Tainted: G           O  (3.6.11 #1)                                   
PC is at __up_read+0x50/0xdc                                                    
LR is at __up_read+0x1c/0xdc                                                    
pc : [<c0229358>]    lr : [<c0229324>]    psr: 00000093                         
sp : c7363c70  ip : 00100100  fp : c7344c00                                     
r10: 00000000  r9 : 000001e5  r8 : 0000046d                                     
r7 : c79b4800  r6 : 0000046d  r5 : 60000013  r4 : c72fe138                      
r3 : 00000000  r2 : ffffffff  r1 : 00000001  r0 : 00200200                      
Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user              
Control: 0005317f  Table: 8736c000  DAC: 00000015                               
Process cp (pid: 6608, stack limit = 0xc7362270)                                
...
Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)                            
---[ end trace 25fc3fca34038efb ]---                                            
note: cp[6608] exited with preempt_count 2                                      

The stack dump was cut off in my serial terminal window, so it's not complete; I removed it here.


Second time:

Internal error: Oops: 17 [#1] PREEMPT ARM       
Modules linked in:
CPU: 0    Tainted: G           O  (3.6.11 #1)                                                                                                                                   
PC is at __up_read+0x50/0xdc                                                                                                                                                    
LR is at __up_read+0x1c/0xdc                                                                                                                                                    
pc : [<c0229358>]    lr : [<c0229324>]    psr: 00000093                                                                                                                         
sp : c7bffc70  ip : 00100100  fp : c7268440                                                                                                                                     
r10: 00000000  r9 : 00000999  r8 : 00000480                                                                                                                                     
r7 : c79b4800  r6 : 00000480  r5 : 60000013  r4 : c7137178                                                                                                                      
r3 : 00000000  r2 : ffffffff  r1 : 00000001  r0 : 00200200                                                                                                                      
Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user                                                                                                              
Control: 0005317f  Table: 87168000  DAC: 00000015                                                                                                                               
Process cp (pid: 1276, stack limit = 0xc7bfe270)                                                                                                                                
Stack: (0xc7bffc70 to 0xc7c00000)                                                                                                                                               
fc60:                                     c7bffc80 c7137160 c7bfe000 c02d0728                                                                                                   
fc80: 00000000 c79fba00 c79b4800 00000000 00000001 c02d0a98 0000004c 00000010                                                                                                   
fca0: 00000007 00000006 c7bfe000 00000000 00000000 c79fba00 00000001 0000065a                                                                                                   
fcc0: 0001781c c7268440 00000002 00000001 c754d9b4 c02cfaa8 000177d0 0000004c                                                                                                   
fce0: 00000000 c7ae7000 000177d0 0000004c 000177d0 00000480 0000004c c01da88c                                                                                                   
fd00: 0000004c 00000000 c7ae7000 c710aca8 00000480 000177d0 c7268440 c7ae7000                                                                                                   
fd20: 00000480 c01dc4d4 0000004c 00000000 c7266420 c7268440 c7ae7000 00000002                                                                                                   
fd40: c7ae7000 c7571260 00000001 c01f9c48 00000480 000177d0 00000000 c7bffe18                                                                                                   
fd60: c7266680 c01f35e4 00000030 00000000 a0000093 c7800180 0000004c c00c1304                                                                                                   
fd80: 00000058 c01dec14 00000000 c7266420 c754d9b4 c7268440 c7266420 c754d9b4                                                                                                   
fda0: c754d9b8 c01ddd48 c754d9b8 c7ae7000 c7bffe14 c7bffe18 c7bffe88 c754d9b8                                                                                                   
fdc0: c7571260 c01dddd4 34613139 00000029 00000190 c022d138 c0538880 c7bffdec                                                                                                   
fde0: 0000ffff c7ae7000 00000001 c7bffe88 c7ae71a8 c754d9b0 c7571260 00000001                                                                                                   
fe00: 00000190 c01e0898 c754d9b0 20000013 00000001 00000002 c72663c0 2c383633                                                                                                   
fe20: 72696420 72746e65 30202c79 65366178 34613139 c7580029 c0656f20 c71f9800
fe40: c71f9800 c00c1ca4 c71f9800 c7ae7000 c754d9b0 c7abd490 c758b240 c01ce8c4
fe60: c7bffe84 00000000 c71f9850 00000000 00000050 000000a0 0000004c 000000a0
fe80: 00000480 00018800 00075088 4a6e91a4 00000000 c7ae718c 00000000 c7571260
fea0: c758b240 c754d998 c7ae7000 00000158 00000001 c758b380 c75713a0 c01d3070
fec0: 00000001 00000000 00000050 00000000 c05659a8 00000001 c7570c40 00000000
fee0: c7bfe000 00200020 00000000 00000000 00000000 00000278 00000007 c754d998
ff00: c758b240 c7571260 0000000a c0012f48 c7bfe000 00000000 00000002 c00d2aa4
ff20: b6fc24d0 000e0cb8 00000000 00000000 c754d998 c00d2c44 00000004 c716c000
ff40: c7818b30 c7570c40 aa6e91a4 00000013 c716c004 c004a008 00000000 c716c000
ff60: c758b240 00000000 00000002 00000000 00000000 00000000 000200c1 000081ed
ff80: 00000022 00000700 00003164 be913eee 000081ed 00000003 be913eee b6fc24d0
ffa0: 00000011 c0012dc0 be913eee b6fc24d0 be913eee 00000002 00000011 000f5c20
ffc0: be913eee b6fc24d0 00000011 0000000a be913eee 00000002 be913edb 00000002
ffe0: b6f276d0 be913a24 000b96ac b6f276dc 60000010 be913eee 00000000 00000000
[<c0229358>] (__up_read+0x50/0xdc) from [<c02d0728>] (leb_read_unlock+0x74/0xec)
[<c02d0728>] (leb_read_unlock+0x74/0xec) from [<c02d0a98>] (ubi_eba_read_leb+0x218/0x41c)
[<c02d0a98>] (ubi_eba_read_leb+0x218/0x41c) from [<c02cfaa8>] (ubi_leb_read+0xa4/0x12c)
[<c02cfaa8>] (ubi_leb_read+0xa4/0x12c) from [<c01da88c>] (ubifs_leb_read+0x24/0x88)
[<c01da88c>] (ubifs_leb_read+0x24/0x88) from [<c01dc4d4>] (ubifs_read_node+0x98/0x2a4)
[<c01dc4d4>] (ubifs_read_node+0x98/0x2a4) from [<c01f9c48>] (ubifs_tnc_read_node+0x4c/0x140)
[<c01f9c48>] (ubifs_tnc_read_node+0x4c/0x140) from [<c01ddd48>] (matches_name.isra.23+0x94/0xd8)
[<c01ddd48>] (matches_name.isra.23+0x94/0xd8) from [<c01dddd4>] (resolve_collision+0x48/0x334)
[<c01dddd4>] (resolve_collision+0x48/0x334) from [<c01e0898>] (ubifs_tnc_remove_nm+0x78/0x128)
[<c01e0898>] (ubifs_tnc_remove_nm+0x78/0x128) from [<c01ce8c4>] (ubifs_jnl_update+0x2cc/0x608)
[<c01ce8c4>] (ubifs_jnl_update+0x2cc/0x608) from [<c01d3070>] (ubifs_unlink+0x14c/0x268)
[<c01d3070>] (ubifs_unlink+0x14c/0x268) from [<c00d2aa4>] (vfs_unlink+0x78/0x104)
[<c00d2aa4>] (vfs_unlink+0x78/0x104) from [<c00d2c44>] (do_unlinkat+0x114/0x168)
[<c00d2c44>] (do_unlinkat+0x114/0x168) from [<c0012dc0>] (ret_fast_syscall+0x0/0x2c)
Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)
---[ end trace 786c7bb100a792ee ]---
note: cp[1276] exited with preempt_count 2

Unfortunately, this was not yet with the "ubi_assert()" added to the kernel.

Not the same error as before, but perhaps the same root cause?

Regards,
Thorsten

Comments

Richard Weinberger Feb. 5, 2014, 8:29 a.m. UTC | #1
On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
> Hi,
> 
> I made a "hardcore test" with:
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &
> 
> It took about 2-3 hours until I had an error (two times):

This test ran over night without any error on my imx51 board. :-\

Bill's great analysis showed that it may be a linked list corruption in rw_semaphore.
Thorsten, can you please enable CONFIG_DEBUG_LIST?
Also try whether you can trigger the issue with lock debugging enabled.

Thanks,
//richard
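For context: with CONFIG_DEBUG_LIST the kernel's list helpers validate the
neighbouring pointers before linking a node in, so a corrupted rw_semaphore
wait list should be reported at the point of corruption rather than at a
later dereference.  A condensed sketch of the checks in lib/list_debug.c
from kernels of that era (not the verbatim source):

/* With CONFIG_DEBUG_LIST, linking a node first verifies that the
 * neighbours still point at each other; a corrupted wait list would
 * trigger a WARN here instead of oopsing later in __up_read().
 */
void __list_add(struct list_head *new, struct list_head *prev,
                struct list_head *next)
{
        WARN(next->prev != prev,
             "list_add corruption. next->prev should be prev (%p), "
             "but was %p. (next=%p).\n", prev, next->prev, next);
        WARN(prev->next != next,
             "list_add corruption. prev->next should be next (%p), "
             "but was %p. (prev=%p).\n", next, prev->next, prev);
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}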
Bill Pringlemeir Feb. 5, 2014, 9:45 p.m. UTC | #2
> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:

>> I made a "hardcore test" with:
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.1>; sync; done &
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.2>; sync; done &
>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.3>; sync; done &

>> It took about 2-3 hours until I had an error (two times):

On  5 Feb 2014, richard@nod.at wrote:

> This test ran over night without any error on my imx51 board. :-\

> Bill's great analysis showed that it may be a linked list corruption
> in rw_semaphore.  Thorsten, can you please enable CONFIG_DEBUG_LIST?
> Also try whether you can trigger the issue with lock debugging
> enabled.

I am trying to run the same test.  I have 'fastmap' enabled and
'kmemleak'.  I have various occurrences of these two,

unreferenced object 0xc2c06e50 (size 24):
  comm "sync", pid 2941, jiffies 855335 (age 6354.950s)
  hex dump (first 24 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 79 00 00 00  ZZZZZZZZZZZZy...
    07 00 00 00 5a 5a 5a a5                          ....ZZZ.
  backtrace:
    [<c019a544>] kmem_cache_alloc+0x10c/0x1a0
    [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
    [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
    [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
    [<c02a51a4>] ubi_leb_write+0xc4/0xe0
    [<c0200f38>] ubifs_leb_write+0x9c/0x130
    [<c020b28c>] ubifs_log_start_commit+0x230/0x3f4
    [<c020c368>] do_commit+0x134/0x870
    [<c01fbfa0>] ubifs_sync_fs+0x88/0x9c
    [<c01c30bc>] __sync_filesystem+0x74/0x98
    [<c01a2860>] iterate_supers+0x9c/0x104
    [<c01c31f4>] sys_sync+0x3c/0x68
    [<c0129300>] ret_fast_syscall+0x0/0x1c
    [<ffffffff>] 0xffffffff
unreferenced object 0xc2c06df0 (size 24):
  comm "flush-ubifs_0_3", pid 260, jiffies 867487 (age 6233.430s)
  hex dump (first 24 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 7a 00 00 00  ZZZZZZZZZZZZz...
    1e 00 00 00 5a 5a 5a a5                          ....ZZZ.
  backtrace:
    [<c019a544>] kmem_cache_alloc+0x10c/0x1a0
    [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
    [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
    [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
    [<c02a50c0>] ubi_leb_map+0x70/0x90
    [<c020125c>] ubifs_leb_map+0x74/0x100
    [<c020af44>] ubifs_add_bud_to_log+0x1f4/0x30c
    [<c01f4830>] make_reservation+0x2e0/0x3e0
    [<c01f53e8>] ubifs_jnl_write_data+0xfc/0x25c
    [<c01f838c>] do_writepage+0x88/0x260
    [<c017b368>] __writepage+0x18/0x84
    [<c017b98c>] write_cache_pages+0x1b4/0x3ac
    [<c01bead4>] writeback_single_inode+0x9c/0x258
    [<c01befac>] writeback_sb_inodes+0xbc/0x180
    [<c01bf6c0>] writeback_inodes_wb+0x7c/0x178
    [<c01bfa00>] wb_writeback+0x244/0x2ac

It is allocated from a kmem 'cache', so I am suspicious of kmemleak here
(also, my Linux is old, with kmemleak and the Ubi/UbiFs/Mtd patches
backported).  However, I just wondered whether Thorsten has posted a
.config somewhere?  I am testing on an IMX25 system and trying to
replicate with his test, although my Linux version is different.  I
suspect Richard will have tried with 'fastmap' as well?  Are you running
without 'fastmap', Thorsten?  I will let my system run over night.
Maybe just,

 $ grep -E 'MTD|UBI' .config | grep -v '^#'

is fine for your config?  Or maybe a full config to pastebin or
someplace?

I am pretty sure that http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
is not the cause of this issue; although it is a good thing to be aware
of.  You can apply the patch in the crosstool-ng directory to fix
gcc-4.8.  It is quite possible that the FSL/Linaro people have done
this.  The 4.8.2 doesn't seem to come with this patch in the vanilla
tarball.

Also, I have had this occur with gcc 4.7.  In particular, this same sort
of issue has been occurring for some time (before gcc 4.8's release on
2013-03-22).  Another memory issue was a suspect, but it has since been
fixed and we still seem to have the issue.

Fwiw,
Bill Pringlemeir.
Richard Weinberger Feb. 5, 2014, 10:13 p.m. UTC | #3
On 05.02.2014 22:45, Bill Pringlemeir wrote:
> 
>> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
> 
>>> I made a "hardcore test" with:
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.1>; sync; done &
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.2>; sync; done &
>>> $ while [ 1 ]; do cp <8kByte_file> tmp/<8kByte_file.3>; sync; done &
> 
>>> It took about 2-3 hours until I had an error (two times):
> 
> On  5 Feb 2014, richard@nod.at wrote:
> 
>> This test ran over night without any error on my imx51 board. :-\
> 
>> Bill's great analysis showed that it may be a linked list corruption
>> in rw_semaphore.  Thorsten, can you please enable CONFIG_DEBUG_LIST?
>> Also try whether you can trigger the issue with lock debugging
>> enabled.
> 
> I am trying to run the same test.  I have 'fastmap' enabled and
> 'kmemleak'.  I have various occurrences of these two,

Did you backport all UBI/Fastmap fixes?
Fastmap saw some memory leak fixes.

> unreferenced object 0xc2c06e50 (size 24):
>   comm "sync", pid 2941, jiffies 855335 (age 6354.950s)
>   hex dump (first 24 bytes):
>     5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 79 00 00 00  ZZZZZZZZZZZZy...
>     07 00 00 00 5a 5a 5a a5                          ....ZZZ.
>   backtrace:
>     [<c019a544>] kmem_cache_alloc+0x10c/0x1a0

Can you please decode the line number?

>     [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
>     [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
>     [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
>     [<c02a51a4>] ubi_leb_write+0xc4/0xe0
>     [<c0200f38>] ubifs_leb_write+0x9c/0x130
>     [<c020b28c>] ubifs_log_start_commit+0x230/0x3f4
>     [<c020c368>] do_commit+0x134/0x870
>     [<c01fbfa0>] ubifs_sync_fs+0x88/0x9c
>     [<c01c30bc>] __sync_filesystem+0x74/0x98
>     [<c01a2860>] iterate_supers+0x9c/0x104
>     [<c01c31f4>] sys_sync+0x3c/0x68
>     [<c0129300>] ret_fast_syscall+0x0/0x1c
>     [<ffffffff>] 0xffffffff
> unreferenced object 0xc2c06df0 (size 24):
>   comm "flush-ubifs_0_3", pid 260, jiffies 867487 (age 6233.430s)
>   hex dump (first 24 bytes):
>     5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 7a 00 00 00  ZZZZZZZZZZZZz...
>     1e 00 00 00 5a 5a 5a a5                          ....ZZZ.
>   backtrace:
>     [<c019a544>] kmem_cache_alloc+0x10c/0x1a0

Here too.

>     [<c02b2b6c>] ubi_update_fastmap+0xdc/0x14f4
>     [<c02ac204>] ubi_wl_get_peb+0x28/0xbc
>     [<c02a64c0>] ubi_eba_write_leb+0x23c/0x884
>     [<c02a50c0>] ubi_leb_map+0x70/0x90
>     [<c020125c>] ubifs_leb_map+0x74/0x100
>     [<c020af44>] ubifs_add_bud_to_log+0x1f4/0x30c
>     [<c01f4830>] make_reservation+0x2e0/0x3e0
>     [<c01f53e8>] ubifs_jnl_write_data+0xfc/0x25c
>     [<c01f838c>] do_writepage+0x88/0x260
>     [<c017b368>] __writepage+0x18/0x84
>     [<c017b98c>] write_cache_pages+0x1b4/0x3ac
>     [<c01bead4>] writeback_single_inode+0x9c/0x258
>     [<c01befac>] writeback_sb_inodes+0xbc/0x180
>     [<c01bf6c0>] writeback_inodes_wb+0x7c/0x178
>     [<c01bfa00>] wb_writeback+0x244/0x2ac
> 
> It is allocated from a kmem 'cache', so I am suspicious of kmemleak here
> (also, my Linux is old, with kmemleak and the Ubi/UbiFs/Mtd patches
> backported).  However, I just wondered whether Thorsten has posted a
> .config somewhere?  I am testing on an IMX25 system and trying to
> replicate with his test, although my Linux version is different.  I
> suspect Richard will have tried with 'fastmap' as well?  Are you running
> without 'fastmap', Thorsten?  I will let my system run over night.
> Maybe just,
> 
>  $ grep -E 'MTD|UBI' .config | grep -v '^#'

I think the issue is not related to fastmap because we have reports for kernels older
than 3.7.
Fastmap got merged in 3.7.

Thanks,
//richard
Bill Pringlemeir Feb. 5, 2014, 10:23 p.m. UTC | #4
On  5 Feb 2014, richard@nod.at wrote:


> I think the issue is not related to fastmap because we have reports
> for kernels older than 3.7.  Fastmap got merged in 3.7.

Yes, of course.  I meant fastmap may prevent the issue from happening.
So, I wondered about Thorsten's configuration.

Fwiw,
Bill Pringlemeir.
Wiedemer, Thorsten (Lawo AG) Feb. 6, 2014, 1:05 p.m. UTC | #5
Hi,
Bill Pringlemeir Feb. 6, 2014, 4 p.m. UTC | #6
> On  5 Feb 2014, richard@nod.at wrote:

>> I think the issue is not related to fastmap because we have reports
>> for kernels older than 3.7.  Fastmap got merged in 3.7.

> ________________________________________
> From: Bill Pringlemeir [bpringlemeir@nbsps.com]
> Sent: Wednesday, 5 February 2014 23:23

> Yes, of course.  I meant fastmap may prevent the issue from happening.
> So, I wondered about Thorsten's configuration.

On  6 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> we run on a 3.6.11 kernel, so no fastmap.
>
> grep -E 'MTD|UBI' .config | grep -v '^#'   gives

mxc_nand has 52689931 interrupts and several bit flips, and I didn't have
a crash over night.  I think that 'kmemleak' and/or fastmap changes the
behaviour (or maybe some other option).  I will retry with options
closer to yours.

Fwiw,
Bill Pringlemeir.
Wiedemer, Thorsten (Lawo AG) Feb. 11, 2014, 8:01 a.m. UTC | #7
Hi,

> -----Original Message-----
> From: Richard Weinberger [mailto:richard@nod.at] 
> Sent: Wednesday, 5 February 2014 09:29
> To: Wiedemer, Thorsten (Lawo AG)
> Cc: linux-mtd@lists.infradead.org; Bill Pringlemeir
> Subject: Re: UBI leb_write_unlock NULL pointer Oops (continuation) 
>
> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:
>>  Hi,
>>  
>>  I made a "hardcore test" with:
>>  $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
>>  $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
>>  $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &
>>  
>>  It took about 2-3 hours until I had an error (two times):
>
> This test ran over night without any error on my imx51 board. :-\
>
> Bill's great analysis showed that it may be a linked list corruption in rw_semaphore.
> Thorsten, can you please enable CONFIG_DEBUG_LIST?
> Also try whether you can trigger the issue with lock debugging enabled.
>
> Thanks,
> //richard

short update (I was out of office the rest of last week).
I compiled the kernel with the debug flags for debug list and lock alloc.
The kernel compiled with gcc-4.8.2 didn't start (no output on serial console and reboot of the system).
I didn't try (yet) to find out what happens at startup.

I compiled the same kernel (and same config) with gcc-4.4.4. The write test has now been running for over 16 hours without error.
The next step is to find out whether this is due to changed timing because of the debug flags or if it's the compiler.

Best regards,
Thorsten
Bill Pringlemeir Feb. 11, 2014, 3:25 p.m. UTC | #8
>> On 04.02.2014 18:01, Wiedemer, Thorsten (Lawo AG) wrote:

>>> I made a "hardcore test" with:
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.1>; sync; done &
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.2>; sync; done &
>>> $ while [ 1 ]; do cp <file_of_8kByte_size> tmp/<file_of_8kByte_size.3>; sync; done &

>>> It took about 2-3 hours until I had an error (two times):

>> -----Ursprüngliche Nachricht-----
>> Von: Richard Weinberger [mailto:richard@nod.at] 

>> This test ran over night without any error on my imx51 board. :-\

>> Thorsten, can you please enable CONFIG_DEBUG_LIST?
>> Also try whether you can trigger the issue with lock debugging
>> enabled.

On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> short update (I was out of office the rest of last week).  I compiled
> the kernel with the debug flags for debug list and lock alloc.  The
> kernel compiled with gcc-4.8.2 didn't start (no output on serial
> console and reboot of the system).  I didn't try (yet) to find out
> what happens at startup.

You don't need to enable the 'lock alloc' debugging; Just the 'debug
list' as Richard suggested.  One at a time would work and give clues if
you can reproduce it.

> I compiled the same kernel (and same config) with gcc-4.4.4. The write
> test has now been running for over 16 hours without error.  The next
> step is to find out whether this is due to changed timing because of
> the debug flags or if it's the compiler.

I ran a test as per the above on an IMX25, and mxc_nand has 448179139
interrupts, with about 6 bit flips and one torture test.  It has been
running for about four days.  I am using gcc 4.7.3 (crosstool-ng) and
backports to 2.6.36.

I think that the issue is not related to an MTD driver and/or UBI/UbiFS
directly.  It is more likely an architecture issue and maybe some API
inconsistency.  It could be compiler related, however, it seems many
people have seen the issue on various ARM926 systems (different Linux
versions, different compilers, and different MTD drivers).

User space tasks running in parallel with the test may play a role.  Did
you turn CONFIG_PREEMPT off?  I think memory pressure and other effects
(not related to UBI/UbiFS) may be needed to trigger the issue.  We don't
normally see this on our systems.  The one time it happened, an
application developer ran some 'ls -R' or 'find .' in parallel with a
file-intensive feature in our application.  I haven't found a test to
reproduce it reliably.

Fwiw,
Bill Pringlemeir.
Wiedemer, Thorsten (Lawo AG) Feb. 12, 2014, 3:18 p.m. UTC | #9
Hi,


> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>
>> short update (I was out of office the rest of last week).  I compiled 
>>  the kernel with the debug flags for debug list and lock alloc.  The 
>> kernel compiled with gcc-4.8.2 didn't start (no output on serial 
>>  console and reboot of the system).  I didn't try (yet) to find out 
>>  what happens at startup.
>
> You don't need to enable the 'lock alloc' debugging; Just the 'debug list' as Richard suggested.  One at a time would work and give clues if you 
> can reproduce it.


I tested this, compiled with gcc-4.4.4. I had an error one time, but there was no bug report for list handling, only a kernel oops:

Internal error: Oops: 17 [#1] PREEMPT ARM
Modules linked in:
CPU: 0    Tainted: G           O  (3.6.11 #26)                                                                                                                                  
PC is at __up_write+0x38/0x168                                                                                                                                                  
LR is at __up_write+0x20/0x168                                                                                                                                                  
pc : [<c0234df0>]    lr : [<c0234dd8>]    psr: a0000093                                                                                                                         
sp : c726bc20  ip : c79c83bc  fp : 60000013                                                                                                                                     
r10: c71fed3c  r9 : 00000001  r8 : c71fed38                                                                                                                                     
r7 : 00000000  r6 : c726a000  r5 : 00000000  r4 : c71fed20                                                                                                                      
r3 : 00000001  r2 : c726a000  r1 : 00000000  r0 : 00000002                                                                                                                      
Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user                                                                                                              
Control: 0005317f  Table: 87104000  DAC: 00000015                                                                                                                               
Process RAVENNA_streame (pid: 852, stack limit = 0xc726a270)                                                                                                                    
Stack: (0xc726bc20 to 0xc726c000)                                                                                                                                               
bc20: c7aeb000 c71fed20 c79bc800 c726a000 00000656 00000001 00000001 00000656                                                                                                   
bc40: c7aeb000 c02dfc1c c79bc800 00000000 c79c4600 000004ea 00000001 c02e0c14                                                                                                   
bc60: 00000800 00000000 c726bc70 c01e9800 32353128 32323132 74613135 31202c61                                                                                                   
bc80: 31200029 00010029 00000001 c7ac6920 000002c2 c0238650 c053a278 c071efda                                                                                                   
bca0: c071e000 c071eff3 c071efd7 0001c800 00000656 c7aeb000 00000800 00000800                                                                                                   
bcc0: 00000001 c72cc000 0001c800 c02ded34 0001c800 00000800 c726a000 c7b32000                                                                                                   
bce0: c7aeb000 0001c800 00000656 c01e6104 00000800 c0048054 c7ac6960 c7b32000                                                                                                   
bd00: c7ac6920 c7b32000 00000358 00000355 00002910 c01e669c 00000800 00000000                                                                                                   
bd20: 00000000 c7b32000 c726bd90 c72cc000 00000355 c726bde0 c7b32000 00000000                                                                                                   
bd40: 00000001 c7ac6920 000006f0 c01d8b88 c726bd94 c70599a0 00000656 0001c800                                                                                                   
bd60: 32353128 2c303132 74616420 32202c61 00000029 c022f74c 00000005 c05a4fc0                                                                                                   
bd80: 00000001 c05a4fc0 c05a53c0 c05a53c0 00000325 00000001 a0000013 c05a53c0                                                                                                   
bda0: c7b32000 c7578820 00001000 20000002 00025292 c071e000 00000005 c01dcacc                                                                                                   
bdc0: 00001000 c05a56e0 c726a000 c7578820 00000000 c01dcd10 00000000 00004830                                                                                                   
bde0: 00025292 20000002 c75788d4 c05a53c0 00000002 c726be4c 00000000 00000002                                                                                                   
be00: c75788d4 c00a052c c726bed8 c00a1420 0000000e 00004830 00000000 ffffffff                                                                                                   
be20: c726a000 c00a0518 c75788d4 c00cb538 00000002 00000001 00000001 c726be50                                                                                                   
be40: 00000041 00000005 00000000 c05a56e0 c05a4fc0 c05a53c0 c0696480 c06952a0                                                                                                   
be60: c7578820 c757887c 00000001 c00cb488 00004830 c7578820 00000000 c0099598                                                                                                   
be80: 00004830 00000005 00000000 00000000 c75788d4 c726bed8 7fffffff 00000000                                                                                                   
bea0: 00000000 00000000 b5bfe13c c00a16ec 91827364 c726beb4 c726beb4 c726bebc                                                                                                   
bec0: c726bebc 00000000 00000000 c75788d4 ffffffff c0098dc0 7ffffffd 00000000                                                                                                   
bee0: 00000000 00000000 ffffffff 7fffffff 00000001 00000000 00000005 c0098e08                                                                                                   
bf00: ffffffff 7fffffff 00000001 00000000 00000000 c7578820 c7b32000 ffffffff                                                                                                   
bf20: 7fffffff c0014428 c726a000 c01dceb4 ffffffff 7fffffff b5bf9860 ffffffff                                                                                                   
bf40: 7fffffff b5bf9860 00000076 c00f4364 ffffffff 7fffffff 00000000 c00c9fc0                                                                                                   
bf60: 00000000 ffffffff 7fffffff c00f438c ffffffff 7fffffff 00000000 c00cab74                                                                                                   
bf80: 00000000 00000000 c7bb1ec0 c00f45d8 0000000a 00000001 b5bf9860 0000000a                                                                                                   
bfa0: 00004830 c00142a0 0000000a 00004830 0000000a 00000002 b5bff4f4 00000000                                                                                                   
bfc0: 0000000a 00004830 b5bf9860 00000076 0003ecb8 000505a0 b5bfeae0 b5bfe13c                                                                                                   
bfe0: 00000000 b5bf95d8 b6f55f8c b6f56fe4 80000010 0000000a ffffffff ffffffff                                                                                                   
[<c0234df0>] (__up_write+0x38/0x168) from [<c02dfc1c>] (leb_write_unlock+0xbc/0x13c)                                                                                            
[<c02dfc1c>] (leb_write_unlock+0xbc/0x13c) from [<c02e0c14>] (ubi_eba_write_leb+0xa0/0x53c)                                                                                     
[<c02e0c14>] (ubi_eba_write_leb+0xa0/0x53c) from [<c02ded34>] (ubi_leb_write+0xe4/0xe8)                                                                                         
[<c02ded34>] (ubi_leb_write+0xe4/0xe8) from [<c01e6104>] (ubifs_leb_write+0x9c/0x128)                                                                                           
[<c01e6104>] (ubifs_leb_write+0x9c/0x128) from [<c01e669c>] (ubifs_wbuf_write_nolock+0x358/0x6f8)                                                                               
[<c01e669c>] (ubifs_wbuf_write_nolock+0x358/0x6f8) from [<c01d8b88>] (ubifs_jnl_write_data+0x1a0/0x298)                                                                         
[<c01d8b88>] (ubifs_jnl_write_data+0x1a0/0x298) from [<c01dcacc>] (do_writepage+0x8c/0x224)                                                                                     
[<c01dcacc>] (do_writepage+0x8c/0x224) from [<c00a052c>] (__writepage+0x14/0x64)                                                                                                
[<c00a052c>] (__writepage+0x14/0x64) from [<c00a1420>] (write_cache_pages+0x1cc/0x458)                                                                                          
[<c00a1420>] (write_cache_pages+0x1cc/0x458) from [<c00a16ec>] (generic_writepages+0x40/0x60)                                                                                   
[<c00a16ec>] (generic_writepages+0x40/0x60) from [<c0098dc0>] (__filemap_fdatawrite_range+0x64/0x6c)                                                                            
[<c0098dc0>] (__filemap_fdatawrite_range+0x64/0x6c) from [<c0098e08>] (filemap_write_and_wait_range+0x40/0x6c)                                                                  
[<c0098e08>] (filemap_write_and_wait_range+0x40/0x6c) from [<c01dceb4>] (ubifs_fsync+0x44/0xb8)                                                                                 
[<c01dceb4>] (ubifs_fsync+0x44/0xb8) from [<c00f4364>] (vfs_fsync_range+0x40/0x44)                                                                                              
[<c00f4364>] (vfs_fsync_range+0x40/0x44) from [<c00f438c>] (vfs_fsync+0x24/0x2c)                                                                                                
[<c00f438c>] (vfs_fsync+0x24/0x2c) from [<c00f45d8>] (do_fsync+0x28/0x50)                          
[<c00f45d8>] (do_fsync+0x28/0x50) from [<c00142a0>] (ret_fast_syscall+0x0/0x2c)                                                                                                 
Code: e48a7004 e5985004 e15a0005 0a000029 (e595300c)                                                                                                                            
---[ end trace 8ee04e42747b7c3c ]---                                                                                                                                            
note: RAVENNA_streame[852] exited with preempt_count 2                                                                                                                          

Today I compiled the kernel with the debug lock alloc flags. The test has already been running for some hours, but no error.
It perhaps sounds strange, but there seems to be a correlation with the max/mean erase count.
For the debug list tests I changed my hardware; there, the bug occurred rather quickly, after some minutes.
On the system I used for my overnight tests, I never saw the bug again, even with a kernel I had used before to reproduce the bug several times.

>> I compiled the same kernel (and same config) with gcc-4.4.4. The write 
>>  test has now been running for over 16 hours without error.  The next 
>>  step is to find out whether this is due to changed timing because of 
>>  the debug flags or if it's the compiler.

This was the hardware which never showed the bug again ...
As mentioned above, I had the bug also with gcc-4.4.4 on newer hardware.

> User space tasks running in parallel with the test may play a role.  Did
> you turn CONFIG_PREEMPT off?  I think memory pressure and other effects
> (not related to UBI/UbiFS) may be needed to trigger the issue.  We don't
> normally see this on our systems.  The one time it happened, an application
> developer ran some 'ls -R' or 'find .' in parallel with a file-intensive
> feature in our application.  I haven't found a test to reproduce it reliably.

I have not yet tested with preemption off, but that test will run tonight.

>
> Fwiw,
> Bill Pringlemeir.

Thanks,
Thorsten
Richard Weinberger Feb. 12, 2014, 5:46 p.m. UTC | #10
On 12.02.2014 16:18, Wiedemer, Thorsten (Lawo AG) wrote:
>> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
>>
>>> short update (I was out of office the rest of last week).  I compiled 
>>>  the kernel with the debug flags for debug list and lock alloc.  The 
>>> kernel compiled with gcc-4.8.2 didn't start (no output on serial 
>>>  console and reboot of the system).  I didn't try (yet) to find out 
>>>  what happens at startup.
>>
>> You don't need to enable the 'lock alloc' debugging; Just the 'debug list' as Richard suggested.  One at a time would work and give clues if you 
>> can reproduce it.
> 
> 
> I tested this, compiled with gcc-4.4.4. I had an error one time, but there was no bug report for list handling, only a kernel oops:

Are you able to trigger the bug with function tracing enabled?
If so, please add the ftrace_dump_on_oops kernel parameter, such that we see what happened before the oops.

Thanks,
//richard
Bill Pringlemeir Feb. 12, 2014, 6:11 p.m. UTC | #11
>> On 11 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

>>> short update (I was out of office the rest of last week).  I compiled
>>> the kernel with the debug flags for debug list and lock alloc.  The 
>>> kernel compiled with gcc-4.8.2 didn't start (no output on serial 
>>> console and reboot of the system).  I didn't try (yet) to find out 
>>> what happens at startup.

> Bill Pringlemeir wrote:
>> You don't need to enable the 'lock alloc' debugging; Just the 'debug
>> list' as Richard suggested.  One at a time would work and give clues
>> if you can reproduce it.

On 12 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> I tested this, compiled with gcc-4.4.4. I had an error one time, but
> there was no bug report for list handling, only a kernel oops:

> Internal error: Oops: 17 [#1] PREEMPT ARM
> Modules linked in:
> CPU: 0 Tainted: G O (3.6.11 #26)
> PC is at __up_write+0x38/0x168
> LR is at __up_write+0x20/0x168
> pc : [<c0234df0>]    lr : [<c0234dd8>]    psr: a0000093
> sp : c726bc20  ip : c79c83bc  fp : 60000013
> r10: c71fed3c  r9 : 00000001  r8 : c71fed38
> r7 : 00000000  r6 : c726a000  r5 : 00000000  r4 : c71fed20
> r3 : 00000001  r2 : c726a000  r1 : 00000000  r0 : 00000002
> Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
> Control: 0005317f  Table: 87104000  DAC: 00000015
> Process RAVENNA_streame (pid: 852, stack limit = 0xc726a270)
> Stack: (0xc726bc20 to 0xc726c000)
> (__up_write+0x38/0x168) from (leb_write_unlock+0xbc/0x13c)
> (leb_write_unlock+0xbc/0x13c) from (ubi_eba_write_leb+0xa0/0x53c)
> (ubi_eba_write_leb+0xa0/0x53c) from (ubi_leb_write+0xe4/0xe8)
> (ubi_leb_write+0xe4/0xe8) from (ubifs_leb_write+0x9c/0x128)
> (ubifs_leb_write+0x9c/0x128) from (ubifs_wbuf_write_nolock+0x358/0x6f8)
> (ubifs_wbuf_write_nolock+0x358/0x6f8) from (ubifs_jnl_write_data+0x1a0/0x298)
> (ubifs_jnl_write_data+0x1a0/0x298) from (do_writepage+0x8c/0x224)
> (do_writepage+0x8c/0x224) from (__writepage+0x14/0x64)
> (__writepage+0x14/0x64) from (write_cache_pages+0x1cc/0x458)
> (write_cache_pages+0x1cc/0x458) from (generic_writepages+0x40/0x60)
> (generic_writepages+0x40/0x60) from (__filemap_fdatawrite_range+0x64/0x6c)
> (__filemap_fdatawrite_range+0x64/0x6c) from \
       (filemap_write_and_wait_range+0x40/0x6c)
> (filemap_write_and_wait_range+0x40/0x6c) from (ubifs_fsync+0x44/0xb8)
> (ubifs_fsync+0x44/0xb8) from (vfs_fsync_range+0x40/0x44)
> (vfs_fsync_range+0x40/0x44) from (vfs_fsync+0x24/0x2c)
> (vfs_fsync+0x24/0x2c) from (do_fsync+0x28/0x50)
> (do_fsync+0x28/0x50) from (ret_fast_syscall+0x0/0x2c)
> Code: e48a7004 e5985004 e15a0005 0a000029 (e595300c)
> ---[ end trace 8ee04e42747b7c3c ]---
> note: RAVENNA_streame[852] exited with preempt_count 2

$ printf "\x04\x70\x8a\xe4\x04\x50\x98\xe5\x05\x00\x5a\xe1\x29\x00\x00\x0a\x0c\x30\x95\xe5" > crash.dump
$ objdump --disassemble-all -m arm -b binary crash.dump 

crash.dump:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   e48a7004        str     r7, [sl], #4
   4:   e5985004        ldr     r5, [r8, #4]
   8:   e15a0005        cmp     sl, r5
   c:   0a000029        beq     0xb8
  10:   e595300c        ldr     r3, [r5, #12]

'r5' is NULL.  It seems to be the same symptom.  If you run your ARM
objdump with -S on either vmlinux or '__up_write', it will help confirm
that it is the list corruption again.  The assembler above should match.
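
For reference, the matching source in lib/rwsem-spinlock.c of that era
looks roughly like this (a condensed sketch, not the verbatim file; with
the usual struct layout, 'flags' would sit at offset 12 of the on-stack
rwsem_waiter on ARM32, which lines up with the faulting 'ldr r3, [r5, #12]'):

struct rwsem_waiter {
        struct list_head list;          /* offset 0: linked into sem->wait_list */
        struct task_struct *task;       /* offset 8 on ARM32 */
        unsigned int flags;             /* offset 12: RWSEM_WAITING_FOR_{READ,WRITE} */
};

void __up_write(struct rw_semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->wait_lock, flags);

        sem->activity = 0;
        if (!list_empty(&sem->wait_list))
                sem = __rwsem_do_wake(sem, 1);  /* inlined; see below */

        raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
}

/* the first thing the inlined __rwsem_do_wake() does: */
        waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
        ...
        if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {  /* ldr r3, [r5, #12] */

Since 'list' is the first member, 'waiter' is simply sem->wait_list.next,
so a NULL or otherwise bogus next pointer faults exactly on that load.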

What is 'RAVENNA_streame'?  Is this your standard test and not the '8k
binary' copy test, or are you doing the copy test with this process also
running?

We have 'IRQs off', which makes sense for __up_write.  Trying
'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
what went on before.  It might also make sense to dump some
'rwsem_waiter' nodes on the error?  It looks like '__up_write' might
normally have an empty list?  Certainly a non-empty 'rwsem_waiter' list
is going to trigger the condition more often?  I guess I can look to see
what might cause this, even if I can not reproduce it.  The
'preempt_count' has been two every time you have this; is that true?
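
A hypothetical helper along these lines could be jammed in just before
the wake-up to dump the queued waiters (a sketch only: rwsem_dump_waiters()
is made up, it assumes the rwsem-spinlock variant with 'activity', and
walking an already-corrupted list can of course fault itself):

/* Hypothetical debug aid, not from any tree: print every on-stack
 * rwsem_waiter queued on the semaphore before __rwsem_do_wake()
 * dereferences the first one.  Call with sem->wait_lock held.
 */
static void rwsem_dump_waiters(struct rw_semaphore *sem)
{
        struct rwsem_waiter *w;

        printk(KERN_ERR "rwsem %p activity %d\n", sem, sem->activity);
        list_for_each_entry(w, &sem->wait_list, list)
                printk(KERN_ERR "  waiter %p task %s/%d flags 0x%x\n",
                       w, w->task ? w->task->comm : "?",
                       w->task ? task_pid_nr(w->task) : -1, w->flags);
}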

> Today I compiled the kernel with debug lock alloc flags. The test runs
> already for some hours, but no error.  It sounds perhaps strange, but
> there seems to be an issue with max/mean erase count.  To make the
> debug list tests I changed my hardware. The bug occurred rather
> quickly after some minutes.  On the system I used for my overnight
> tests, I never saw the bug again, even with a kernel I used before to
> reproduce the bug several times.

>>> I compiled the same kernel (and same config) with gcc-4.4.4. The
>>> write test runs now for over 16 hours without error.  Next step is
>>> to find out wether this is due to a changed timing because of the
>>> debug flags or if it's the compiler.

> This was the hardware which never showed the bug again ...  As
> mentioned above, I had the bug also with gcc-4.4.4 on a newer
> hardware.

I stopped the test on my device at about 1/2 billion mxc_nand
interrupts.  My hardware is older and probably has well-worn flash.  The
age of the flash can determine how long an erase/write takes.  It is
possible the flash chip is altering timing which exposes the bug, i.e.,
how long it takes for the MTD to read, write or erase.  I might try your
'triple-8k file test' on a device with newer flash.  I don't think this
is a compiler issue, although different compilers will produce slightly
different code with different timing.  People have seen this on many
different systems with different Linux versions and compilers.  That
said, verification is always good.

Fwiw,
Bill Pringlemeir.
Bill Pringlemeir Feb. 12, 2014, 6:21 p.m. UTC | #12
On 12 Feb 2014, bpringlemeir@nbsps.com wrote:

> We have 'IRQs off', which makes sense for __up_write.  Trying
> 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
> what went on before.  It might also make sense to dump some
> 'rwsem_waiter' nodes on the error?  It looks like '__up_write' might
> normally have an empty list?  Certainly a non-empty 'rwsem_waiter' list
> is going to trigger the condition more often?  I guess I can look to see
> what might cause this, even if I can not reproduce it.  The
> 'preempt_count' has been two every time you have this; is that true?

Wouldn't a smaller MTD trigger the condition more often?  It looks like
the locking is done per erase block, and several files in the same erase
block with simultaneous reading/writing would trigger this kind of
effect?
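
For reference, the per-eraseblock lock in question is UBI's lock-tree
entry; condensed from drivers/mtd/ubi/ubi.h it looks like this:

/* One entry per currently-locked logical eraseblock, kept in an
 * RB-tree keyed by (vol_id, lnum).  The rw_semaphore below is the
 * one whose wait list is corrupted in the traces above.
 */
struct ubi_ltree_entry {
        struct rb_node rb;              /* links the RB-tree of locked LEBs */
        int vol_id;                     /* volume ID of the locked LEB */
        int lnum;                       /* locked logical eraseblock number */
        int users;                      /* processes using or waiting on the LEB */
        struct rw_semaphore mutex;      /* read/write lock of the LEB */
};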

Does that sound right Richard?  Will it matter if I use a fixed or
dynamic volume size?  Can I make a small UBI/UbiFS MTD partition and use
that for testing?  My dynamic partition is about 200MB big.  Usually we
never come near filling it, so there is lots of opportunity to use other
erase blocks.

Fwiw,
Bill Pringlemeir.
Richard Weinberger Feb. 12, 2014, 8:48 p.m. UTC | #13
On 12.02.2014 19:21, Bill Pringlemeir wrote:
> 
> On 12 Feb 2014, bpringlemeir@nbsps.com wrote:
> 
>> We have 'IRQs off', which makes sense for __up_write.  Trying
>> 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out
>> what went on before.  It might also make sense to dump some
>> 'rwsem_waiter' nodes on the error?  It looks like '__up_write' might
>> normally have an empty list?  Certainly a non-empty 'rwsem_waiter' list
>> is going to trigger the condition more often?  I guess I can look to see
>> what might cause this, even if I can not reproduce it.  The
>> 'preempt_count' has been two every time you have this; is that true?
> 
> Wouldn't a smaller MTD trigger the condition more often?  It looks like
> the locking is done per erase block, and several files in the same erase
> block with simultaneous reading/writing would trigger this kind of
> effect?
> 
> Does that sound right Richard?  Will it matter if I use a fixed or
> dynamic volume size?  Can I make a small UBI/UbiFS MTD partition and use
> that for testing?  My dynamic partition is about 200MB big.  Usually we
> never come near filling it, so there is lots of opportunity to use other
> erase blocks.

Yeah, I had the same idea and setup a MTD using nandsim.
So far I was unable to trigger the issue.

Let's wait for more results from Thorsten.

Thanks,
//richard
Bill Pringlemeir Feb. 14, 2014, 5:11 p.m. UTC | #14
> On 12.02.2014 19:21, Bill Pringlemeir wrote:

>> Does that sound right Richard?  Will it matter if I use a fixed or
>> dynamic volume size?  Can I make a small UBI/UbiFS MTD partition and use
>> that for testing?  My dynamic partition is about 200MB big.  Usually we
>> never come near filling it, so there is lots of opportunity to use other
>> erase blocks.

On 12 Feb 2014, richard@nod.at wrote:

> Yeah, I had the same idea and setup a MTD using nandsim.
> So far I was unable to trigger the issue.

> Let's wait for more results from Thorsten.

Ok, but I would like to try with one of my ARM926 systems.  I think that
this is not a simple race.  In rwsem-spinlock.c there are two places
where a 'rwsem_waiter' is 'allocated': '__down_read' and
'__down_write_nested'.  The 'rwsem_waiter' is allocated on the kernel
stack and the current task is halted.  It is patched into the list of
the 'rwsem' rooted in ubi's ubi_ltree_entry.  I.e., the list
'allocations' are spread across various tasks' thread_info.
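
Condensed from lib/rwsem-spinlock.c (a sketch, not the verbatim source),
the read-side slow path shows exactly that: the waiter node is a local
variable on the blocking task's kernel stack:

void __sched __down_read(struct rw_semaphore *sem)
{
        struct rwsem_waiter waiter;     /* lives on THIS task's kernel stack */
        struct task_struct *tsk;
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->wait_lock, flags);

        if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
                sem->activity++;        /* granted without sleeping */
                raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
                return;
        }

        tsk = current;
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);

        waiter.task = tsk;
        waiter.flags = RWSEM_WAITING_FOR_READ;
        get_task_struct(tsk);

        /* from here the on-stack node is reachable by every other task */
        list_add_tail(&waiter.list, &sem->wait_list);

        raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

        /* sleep until __rwsem_do_wake() unlinks us and clears waiter.task */
        for (;;) {
                if (!waiter.task)
                        break;
                schedule();
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        }

        tsk->state = TASK_RUNNING;
}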

If a task is killed and/or the 'sp' is not in a good state, we may get a
weird value linked into the 'rwsem' list.  On the ARM926, there are no
lock-free primitives except 'swp', and in order to RMW the platform locks
interrupts.  However, UbiFS/UBI may be called via a data fault handler,
and these can not be locked.  If this is the case, you will never be
able to replicate it on anything but an ARM926 or an architecture with
this type of atomic primitives.  This is why I was originally cross
posting to the ARM list.
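
For reference, the pre-ARMv6 atomics in arch/arm/include/asm/atomic.h
(for __LINUX_ARM_ARCH__ < 6) make the read-modify-write atomic only by
disabling interrupts, roughly:

/* No ldrex/strex on ARMv5, so the RMW is protected against interrupts
 * by masking them around the update (condensed sketch).
 */
static inline int atomic_add_return(int i, atomic_t *v)
{
        unsigned long flags;
        int val;

        raw_local_irq_save(flags);
        val = v->counter;
        v->counter = val += i;
        raw_local_irq_restore(flags);

        return val;
}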

I think having a better test case would be constructive?  At least we
should explore whether it is an 'arch dependent' issue?  All of the
cases reported seem to be confined to the ARM926.

It seems like many readers/writer to a small UBI partition would be the
best type of test case.  I am not sure if UbiFS will actually call UBI
to write/read during a data fault, or is this deferred somehow?

Thanks,
Bill Pringlemeir.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/rwsem-spinlock.c#n123
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/rwsem-spinlock.c#n189
Ziegler, Emanuel (Lawo AG) Feb. 18, 2014, 8:25 a.m. UTC | #15
Hello

I am a colleague of Thorsten. We tried to activate CONFIG_PREEMPT_NONE and the error occurred again.

Unable to handle kernel NULL pointer dereference at virtual address 0000000c
pgd = c7370000
[0000000c] *pgd=8717e831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] ARM
Modules linked in: ravenna_portctrl(O) ravenna_sync(O) ravenna_core(O) dallis_metering(O) dallis_audiorouter(O) dallis_multidrop(O) dallis_gpio(O) ravenna_fpga_core(O) marvell_lawo(O) i2c_imx fec fec_imx25_init [last unloaded: ravenna_fpga_loader]
CPU: 0    Tainted: G           O  (3.6.11 #1)
PC is at __up_write+0x34/0x16c
LR is at leb_write_unlock+0x30/0xb8
pc : [<c0205a3c>]    lr : [<c02a6bf8>]    psr: a0000093
sp : c72ffe28  ip : c79b03bc  fp : c79b9800
r10: c72fff9c  r9 : 00000499  r8 : c720cc7c
r7 : c720cc78  r6 : 00000499  r5 : c79b9800  r4 : c720cc60
r3 : 00000000  r2 : 60000013  r1 : 00000000  r0 : c720cc78
Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: 87370000  DAC: 00000015
Process sync (pid: 7976, stack limit = 0xc72fe270)
Stack: (0xc72ffe28 to 0xc7300000)
fe20:                   c7b2e000 60000013 00000800 c720cc60 c79b9800 00000499
fe40: c7b2e000 00000800 00000499 c72fff9c c79b9800 c02a6bf8 00014800 000007c8
fe60: 0000a800 c02a7268 00000800 c0044aa0 00000000 00000000 c72fe000 60000093
fe80: 00000031 00000020 000c54e7 00000001 00000000 00000000 c79faa00 00000499
fea0: c7b2e000 00000031 00000001 c00dc620 00000000 00000031 00000031 c006612c
fec0: c72fe000 00000000 c04f4c68 00000800 0000a800 00000499 c7b2e000 00000800
fee0: c72fe000 c72fff9c 00000000 c02a61e0 0000a800 00000800 c7b28000 c7b28000
ff00: c7b2e000 0000a800 00000499 c01b96e0 00000800 00000540 c7b28000 c72fe000
ff20: c7ab7490 c7b28000 00000800 00000540 c7b28000 c01ba2ac 00000800 c7402668
ff40: c781d000 00000090 00000002 c7ab74b4 c7ab7490 c01b4be8 c7abf000 c7abf040
ff60: c7abf400 c04f9710 c00dc620 c00dc648 60008400 c00b8fd4 00000001 00000000
ff80: 00000001 be870e24 00000024 c0012b88 00000000 c00dc6f4 00000000 00000001
ffa0: 000d632c c0012a00 000d632c 00000001 00000000 be870e24 000ce4f4 000a1a38
ffc0: 000d632c 00000001 be870e24 00000024 000007c6 00000000 b6f64000 00000000
ffe0: b6ece0e0 be870c7c 000a1a48 b6ece0ec 60000010 00000000 00000000 00000000
[<c0205a3c>] (__up_write+0x34/0x16c) from [<c02a6bf8>] (leb_write_unlock+0x30/0xb8)
[<c02a6bf8>] (leb_write_unlock+0x30/0xb8) from [<c02a7268>] (ubi_eba_write_leb+0xac/0x660)
[<c02a7268>] (ubi_eba_write_leb+0xac/0x660) from [<c02a61e0>] (ubi_leb_write+0xdc/0xf0)
[<c02a61e0>] (ubi_leb_write+0xdc/0xf0) from [<c01b96e0>] (ubifs_leb_write+0x6c/0x128)
[<c01b96e0>] (ubifs_leb_write+0x6c/0x128) from [<c01ba2ac>] (ubifs_wbuf_sync_nolock+0xf8/0x324)
[<c01ba2ac>] (ubifs_wbuf_sync_nolock+0xf8/0x324) from [<c01b4be8>] (ubifs_sync_fs+0x58/0x90)
[<c01b4be8>] (ubifs_sync_fs+0x58/0x90) from [<c00dc648>] (sync_fs_one_sb+0x28/0x2c)
[<c00dc648>] (sync_fs_one_sb+0x28/0x2c) from [<c00b8fd4>] (iterate_supers+0x98/0xb4)
[<c00b8fd4>] (iterate_supers+0x98/0xb4) from [<c00dc6f4>] (sys_sync+0x48/0x98)
[<c00dc6f4>] (sys_sync+0x48/0x98) from [<c0012a00>] (ret_fast_syscall+0x0/0x2c)
Code: e5903004 e58d2004 e1580003 0a00002a (e593200c) 
---[ end trace 9e9e220ab164ad69 ]---


We were also able to reproduce the error with the DEBUG_LOCK_ALLOC flag activated, with the following results.

UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
[<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
[<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
[<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
[<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
[<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
        magic          0x6101831
        crc            0x8ae44db2
        node_type      0 (inode node)
        group_type     0 (no node group)
        sqnum          20716
        len            160
        key            (954, inode)
        creat_sqnum    20501
        size           288
        nlink          4
        atime          1392296839.0
        mtime          1392296832.0
        ctime          1392296839.0
        uid            0
        gid            0
        mode           16893
        flags          0x1
        xattr_cnt      0
        xattr_size     0
        xattr_names    0
        compr_type     0x1
        data len       0
UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
[<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
[<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
[<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
[<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
[<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
        magic          0x6101831
        crc            0x8ae44db2
        node_type      0 (inode node)
        group_type     0 (no node group)
        sqnum          20716
        len            160
        key            (954, inode)
        creat_sqnum    20501
        size           288
        nlink          4
        atime          1392296839.0
        mtime          1392296832.0
        ctime          1392296839.0
        uid            0
        gid            0
        mode           16893
        flags          0x1
        xattr_cnt      0
        xattr_size     0
        xattr_names    0
        compr_type     0x1
        data len       0

After the error occurred the first time, it seems that every UBI file access triggers the error message.

Best Regards 
Emanuel
Ziegler, Emanuel (Lawo AG) Feb. 19, 2014, 11:09 a.m. UTC | #16
Hello

We were able to reproduce the error with function trace enabled, and got the following output.
It was too big for pastebin, so I split it into several pastes.

Function Trace Part 1
http://pastebin.com/ugBsJH83
Function Trace Part 2
http://pastebin.com/tBCjLXLj
Function Trace Part 3
http://pastebin.com/FgUZcN4e
Function Trace Part 4
http://pastebin.com/nebyz4XU


> We were also able to reproduce the error with the DEBUG_LOCK_ALLOC flag activated, with the following results.
>
>UBIFS error (pid 7625): ubifs_readdir: cannot find next direntry, error -22
>UBIFS assert failed in ubifs_tnc_next_ent at 2776 (pid 7625)
>[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c)
>[<c01e0318>] (ubifs_tnc_next_ent+0x18c/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
>[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
>[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
>[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
>UBIFS error (pid 7625): ubifs_validate_entry: bad extended attribute entry node
>[<c001795c>] (unwind_backtrace+0x0/0xf0) from [<c01dce80>] (lnc_add_directly+0x78/0xc4)
>[<c01dce80>] (lnc_add_directly+0x78/0xc4) from [<c01dcf80>] (matches_name+0xb4/0xcc)
>[<c01dcf80>] (matches_name+0xb4/0xcc) from [<c01dcfd4>] (resolve_collision+0x3c/0x2ec)
>[<c01dcfd4>] (resolve_collision+0x3c/0x2ec) from [<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c)
>[<c01e02cc>] (ubifs_tnc_next_ent+0x140/0x19c) from [<c01d22d4>] (ubifs_readdir+0x308/0x508)
>[<c01d22d4>] (ubifs_readdir+0x308/0x508) from [<c00d5914>] (vfs_readdir+0x80/0xa4)
>[<c00d5914>] (vfs_readdir+0x80/0xa4) from [<c00d5ad0>] (sys_getdents64+0x64/0xc8)
>[<c00d5ad0>] (sys_getdents64+0x64/0xc8) from [<c0012a20>] (ret_fast_syscall+0x0/0x2c)
>        magic          0x6101831
>        crc            0x8ae44db2
>        node_type      0 (inode node)
>        group_type     0 (no node group)
>        sqnum          20716
>        len            160
>        key            (954, inode)
>        creat_sqnum    20501
>        size           288
I think the error we have seen in this case is addressed by the following patch, and is not related to the original oops:
http://lists.infradead.org/pipermail/linux-mtd/2013-June/047359.html

We have ported the patches to our kernel and will try to reproduce the issue again.


Best Regards
Emanuel
Wiedemer, Thorsten (Lawo AG) Feb. 20, 2014, 3:21 p.m. UTC | #17
Hi,

I'm back again now.

> Bill Pringlemeir wrote:
>
> $ printf "\x04\x70\x8a\xe4\x04\x50\x98\xe5\x05\x00\x5a\xe1\x29\x00\x00\x0a\x0c\x30\x95\xe5" > crash.dump $ objdump --disassemble-all -m arm -b binary crash.dump 
>
> crash.dump:     file format binary
>
>
> Disassembly of section .data:
>
> 00000000 <.data>:
>  0:   e48a7004        str     r7, [sl], #4
>    4:   e5985004        ldr     r5, [r8, #4]
>    8:   e15a0005        cmp     sl, r5
>    c:   0a000029        beq     0xb8
>   10:   e595300c        ldr     r3, [r5, #12]
>
> 'r5' is NULL.  It seems to be the same symptom.  If you run your ARM objdump with -S on either vmlinux or '__up_write', it will help confirm that it is the list corrupted again.  The assembler above should match.

I don't have objdump running on my ARM system at the moment, but with rwsem-spinlock.c compiled with debug info, objdump -S -D gives for __up_write():

...
	sem->activity = 0;
 29c:	e3a07000 	mov	r7, #0
 2a0:	e1a0a008 	mov	sl, r8

 2a4:	e48a7004 	str	r7, [sl], #4
 2a8:	e5985004 	ldr	r5, [r8, #4]
	if (!list_empty(&sem->wait_list))
 2ac:	e15a0005 	cmp	sl, r5
 2b0:	0a000029 	beq	35c <__up_write+0xe0>
	/* if we are allowed to wake writers try to grant a single write lock
	 * if there's a writer at the front of the queue
	 * - we leave the 'waiting count' incremented to signify potential
	 *   contention
	 */
	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
 2b4:	e595300c 	ldr	r3, [r5, #12]
{
...

Seems to match ...
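
For reference, the disassembly above came from something like this in the kernel
tree (CONFIG_DEBUG_INFO enabled; the cross compiler prefix is just our
toolchain, adjust as needed):

$ make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- lib/rwsem-spinlock.o
$ arm-none-linux-gnueabi-objdump -S -D lib/rwsem-spinlock.o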

> What is 'RAVENNA_streame'?  Is this your standard test and not the '8k binary' copy test or are you doing the copy test with this process also running?

This is an application which runs in parallel to our copy test. Over the last few days, Emanuel set up another test environment which seems to reproduce the error more reliably (at least on some boards, not on all).
At the moment, proprietary applications are running in parallel, but I'll try to strip the setup down to a sequence which I can provide to you, if you like.

> We have 'IRQs off', which makes sense for __up_write.  Trying 'ftrace_dump_on_oops' as Richard suggests would be helpful to find out what went on before.  It might also make sense to dump some 'rwsem_waiter' nodes on the error?  It looks like '__up_write' might normally have an empty list?  Certainly a non-empty 'rwsem_waiter' is going to trigger the condition more often?  I guess I can look to see what might cause this, even if I can not reproduce it.  The 'preempt_count' has been two every time you have this; is that true?

We could reproduce the error now with function tracing enabled, so we have two hopefully valuable traces. But they are rather big (around 4MB each). Shall I use pastebin and cut them into several pieces to provide them? Or off-list as email attachment?
The trace Emanuel posted Wednesday may not be valuable. Perhaps a (different) error is triggered due to memory pressure caused by the function tracing.


Best regards,
Thorsten
Bill Pringlemeir Feb. 20, 2014, 5:26 p.m. UTC | #18
>> Bill Pringlemeir wrote:

>> Disassembly of section .data:

>> 00000000 <.data>:
>> 0:   e48a7004        str     r7, [sl], #4
>> 4:   e5985004        ldr     r5, [r8, #4]
>> 8:   e15a0005        cmp     sl, r5
>> c:   0a000029        beq     0xb8
>> 10:   e595300c        ldr     r3, [r5, #12]

>> 'r5' is NULL.  It seems to be the same symptom.  If you run your ARM objdump
>> 	with -S on either vmlinux or '__up_write', it will help confirm that
>> 	it is the list corrupted again.  The assembler above should match.

On 20 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> I don't have running a objdump on my ARM system at the moment, but
> 	rwsem-spinlock.c compiled with debug info, objdump -S -D gives for
> 	__up_write():
> ...
> 	sem->activity = 0;
> 29c:	e3a07000 	mov	r7, #0
> 2a0:	e1a0a008 	mov	sl, r8

> 2a4:	e48a7004 	str	r7, [sl], #4
> 2a8:	e5985004 	ldr	r5, [r8, #4]
> 	if (!list_empty(&sem->wait_list))
> 2ac:	e15a0005 	cmp	sl, r5
> 2b0:	0a000029 	beq	35c <__up_write+0xe0>
> 	/* if we are allowed to wake writers try to grant a single write lock
> 	 * if there's a writer at the front of the queue
> 	 * - we leave the 'waiting count' incremented to signify potential
> 	 *   contention
> 	 */
> 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
> 2b4:	e595300c 	ldr	r3, [r5, #12]
> {
> ...

> Seems to match ...

It doesn't matter where it runs.  I just want to make sure it is always
the 'waiter' variable.

>> What is 'RAVENNA_streame'?  Is this your standard test and not the
>> '8k binary' copy test or are you doing the copy test with this
>> process also running?

> This is an application which runs in parallel to our copy test. Over
> the last few days, Emanuel set up another test environment which seems
> to reproduce the error more reliably (at least on some boards, not on
> all).  At the moment, proprietary applications are running in
> parallel, but I'll try to strip the setup down to a sequence which I
> can provide to you, if you like.

I think scheduling is important to this issue, that is why I asked.

> We could reproduce the error now with function tracing enabled, so we
> have two hopefully valuable traces. But they are rather big (around
> 4MB each). Shall I use pastebin and cut them into several pieces to
> provide them? Or off-list as email attachment?  The trace Emanuel
> posted Wednesday may not be valuable. Perhaps a (different) error is
> triggered due to memory pressure caused by the function tracing.

After looking, the allocation is not due to memory pressure.  It is due
to different tasks waiting on the rwsem with 'waiter' allocated on the
stack; I guess the task is gone, handling a signal or something
else. However, the function traces are great.  As you note they are
rather big, so it will take anyone some time to analyze them.

You could alter '__rwsem_do_wake',

static inline struct rw_semaphore *
__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
{
	struct rwsem_waiter *waiter;
	struct task_struct *tsk;
	int woken;

	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+       if(!waiter) {
+          printk("Bad rwsem\n");
+          printk("activity is %d.\n", sem->activity);
+          BUG();
+       }
	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
		if (wakewrite)

... or something like that.

 * the rw-semaphore definition
 * - if activity is 0 then there are no active readers or writers
 * - if activity is +ve then that is the number of active readers
 * - if activity is -1 then there is one active writer
 * - if wait_list is not empty, then there are processes waiting...

It seems inconsistent to have a non-empty list with activity as 0 as
well?  The above is trying to trace when we find a 'NULL' in the
'wait_list', which always seems to be the issue, but probably not the
root cause.

You can also put similar code in '__rwsem_wake_one_writer' if you
instead get the 'up_read()' fault.

Fwiw,
Bill Pringlemeir.
Bill Pringlemeir Feb. 20, 2014, 5:38 p.m. UTC | #19
On 20 Feb 2014, bpringlemeir@nbsps.com wrote:

> You could alter '__rwsem_do_wake',
>
> static inline struct rw_semaphore *
> __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
> {
> 	struct rwsem_waiter *waiter;
> 	struct task_struct *tsk;
> 	int woken;
>
> 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> +       if(!waiter) {
> +          printk("Bad rwsem\n");
> +          printk("activity is %d.\n", sem->activity);
> +          BUG();
> +       }
> 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> 		if (wakewrite)
>
> ... or something like that.
>
> * the rw-semaphore definition
> * - if activity is 0 then there are no active readers or writers
> * - if activity is +ve then that is the number of active readers
> * - if activity is -1 then there is one active writer
> * - if wait_list is not empty, then there are processes waiting...
>
> It seems inconsistent to have a non-empty list with activity as 0 as
> well?  The above is trying to trace when we find a 'NULL' in the
> 'wait_list', which always seems to be the issue, but probably not the
> root cause.
>
> You can also put similar code in '__rwsem_wake_one_writer' if you
> instead get the 'up_read()' fault.

Sorry, this code has to go in the callers before they write to
'activity'.

void __up_write(struct rw_semaphore *sem)
{
	unsigned long flags;
+	struct rwsem_waiter *waiter;


	raw_spin_lock_irqsave(&sem->wait_lock, flags);

+	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+       if(!waiter) {
+          printk("Bad rwsem\n");
+          printk("activity is %d.\n", sem->activity);
+          BUG();
+       }

	sem->activity = 0;
	if (!list_empty(&sem->wait_list))
		sem = __rwsem_do_wake(sem, 1);

	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
}

__up_read() is the other one.  Or you wait to see if the function traces
are helpful.
Wiedemer, Thorsten (Lawo AG) Feb. 21, 2014, 8:55 a.m. UTC | #20
Hi,

> Thorsten Wiedemer wrote:
> 
> We could reproduce the error now with function tracing enabled, so we have two hopefully valuable traces. But they are rather big (around 4MB each). Shall I use pastebin and cut them into several pieces to provide them? Or off-list as email attachment?
>

Here are the last parts of the two hopefully "valuable" traces. The pieces show what happens between the leb_write_lock() and the leb_write_unlock() of the process which triggers the oops.
If this is not enough, I can also provide more ...
Watch the PIDs; there are several processes with the same name running.
The two traces did not result from identical test cases, so there are some different processes running.

The part of the first trace fits in one paste:

http://pastebin.com/TL3yNVcw

The part of the second trace is bigger than 500k, so split into two pastes.

http://pastebin.com/1qMt0Pqg
http://pastebin.com/ZLXU17wx

We'll have a closer look at that.

Regards,
Thorsten
Quiniou, Benoit (Lawo AG) Feb. 21, 2014, 9:28 a.m. UTC | #21
> The part of the first trace fits in one paste:
> 
> http://pastebin.com/TL3yNVcw
> 
> The part of the second trace is bigger than 500k, so split into two pastes.
> 
> http://pastebin.com/1qMt0Pqg
> http://pastebin.com/ZLXU17wx
> 
> We'll have a closer look at that.
> 
> Regards,
> Thorsten

Hello mtd-linux mailing list,

My name is Benoit and I work together with Thorsten and Emanuel on this.
Even though I am not a kernel expert, I took a deep look into the two ftraces we've got, in order to find differences and similarities and maybe get a clue about what happened differently in the failing operations.
I put my effort into looking at what happens between leb_write_lock and leb_write_unlock all over the traces (not only the partial traces posted to pastebin).
What I can already tell is that in both traces I can see ubifs_orphan_ operations in very similar ways, and only once in each trace:
1- one process invokes ubifs_add_orphan
... some processes invoke the ubifs api...
2- another process (a sync) invokes ubifs_orphan_start_commit
... some processes invoke the ubifs api...
3- a third process invokes ubifs_delete_orphan
... some processes invoke the ubifs api...
4- the process in which the failure finally arises invokes leb_write_lock
(starting from here you can follow the trace in the first pastebin, line 112)
... some processes invoke the ubifs api...
5- the sync process from 2- invokes ubifs_orphan_end_commit (line 614) and gets interrupted by other processes before it finishes (line 734 and several processes after that point). This process resumes at line 2336
... some processes invoke the ubifs api in the meanwhile...
6- the failing process finally comes to leb_write_unlock (line 4332) and runs into the Oops.

I will continue to take a deep look at the traces to see if I find anything more closely related to the semaphore you pointed out.

Regards,

Benoit Quiniou
Bill Pringlemeir Feb. 21, 2014, 5:53 p.m. UTC | #22
On 21 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:

> Here are the last parts of the two hopefully "valuable" traces. The
> pieces show what happens between the leb_write_lock() and the
> leb_write_unlock() of the process which triggers the oops.  If this is
> not enough, I can also provide more ...  Watch the PIDs; there
> are several processes with the same name running.  The two traces did
> not result from identical test cases, so there are some different
> processes running.

> The part of the first trace fits in one paste:

> http://pastebin.com/TL3yNVcw

I am not sure if this is a complete trace?  I don't understand why we
would start off with 'leb_write_unlock'...  However,

$ grep -E 'leb|ltree' ubi.crash.raw # slight edits
  sync-7493    0.... 1348758715us : leb_write_unlock <-ubi_eba_write_leb
  sync-7493    0.... 1348758719us : add_preempt_count <-leb_write_unlock
  sync-7493    0...1 1348758724us : up_write <-leb_write_unlock
  sync-7493    0...1 1348758739us : kfree <-leb_write_unlock
  sync-7493    0...1 1348758746us : sub_preempt_count <-leb_write_unlock
  sync-7493    0.... 1348758954us : ubifs_leb_write <-ubifs_wbuf_sync_nolock
  sync-7493    0.... 1348758960us : ubi_leb_write <-ubifs_leb_write
  sync-7493    0.... 1348758965us : ubi_eba_write_leb <-ubi_leb_write
  sync-7493    0.... 1348758969us : leb_write_lock <-ubi_eba_write_leb
  sync-7493    0.... 1348758973us : ltree_add_entry <-leb_write_lock
  sync-7493    0.... 1348758977us : kmem_cache_alloc_trace <-ltree_add_entry
  sync-7493    0.... 1348758983us : add_preempt_count <-ltree_add_entry
  sync-7493    0...1 1348758989us : sub_preempt_count <-ltree_add_entry
  sync-7493    0.... 1348758994us : kfree <-ltree_add_entry
  sync-7493    0.... 1348758998us : down_write <-leb_write_lock
  sync-7493    0.... 1348759010us : ubi_io_write <-ubi_eba_write_leb
< many reschedules, but  sync-7493 still active.>
   sync-7492    0.... 1348761397us : leb_write_unlock <-ubi_eba_write_leb
   sync-7492    0.... 1348761400us : add_preempt_count <-leb_write_unlock
   sync-7492    0...1 1348761406us : up_write <-leb_write_unlock
   sync-7492    0...1 1348761419us : kfree <-leb_write_unlock
   sync-7492    0...1 1348761425us : sub_preempt_count <-leb_write_unlock
   sync-7492    0.... 1348761580us : ubifs_leb_write <-ubifs_write_node
   sync-7492    0.... 1348761585us : ubi_leb_write <-ubifs_leb_write
   sync-7492    0.... 1348761591us : ubi_eba_write_leb <-ubi_leb_write
   sync-7492    0.... 1348761595us : leb_write_lock <-ubi_eba_write_leb
   sync-7492    0.... 1348761599us : ltree_add_entry <-leb_write_lock
   sync-7492    0.... 1348761603us : kmem_cache_alloc_trace <-ltree_add_entry
   sync-7492    0.... 1348761609us : add_preempt_count <-ltree_add_entry
   sync-7492    0...1 1348761615us : sub_preempt_count <-ltree_add_entry
   sync-7492    0.... 1348761619us : kfree <-ltree_add_entry
   sync-7492    0.... 1348761622us : down_write <-leb_write_lock
   sync-7492    0.... 1348761635us : ubi_io_write <-ubi_eba_write_leb
   sync-7492    0.... 1348771081us : leb_write_unlock <-ubi_eba_write_leb
   sync-7492    0.... 1348771084us : add_preempt_count <-leb_write_unlock
   sync-7492    0...1 1348771090us : up_write <-leb_write_unlock
   sync-7492    0...1 1348771102us : kfree <-leb_write_unlock
   sync-7492    0...1 1348771109us : sub_preempt_count <-leb_write_unlock
   sync-7492    0.... 1348771269us : ubifs_leb_write <-ubifs_write_node
   sync-7492    0.... 1348771275us : ubi_leb_write <-ubifs_leb_write
   sync-7492    0.... 1348771280us : ubi_eba_write_leb <-ubi_leb_write
   sync-7492    0.... 1348771285us : leb_write_lock <-ubi_eba_write_leb
   sync-7492    0.... 1348771289us : ltree_add_entry <-leb_write_lock
   sync-7492    0.... 1348771292us : kmem_cache_alloc_trace <-ltree_add_entry
   sync-7492    0.... 1348771299us : add_preempt_count <-ltree_add_entry
   sync-7492    0...1 1348771304us : sub_preempt_count <-ltree_add_entry
   sync-7492    0.... 1348771308us : kfree <-ltree_add_entry
   sync-7492    0.... 1348771311us : down_write <-leb_write_lock
   sync-7492    0.... 1348771324us : ubi_io_write <-ubi_eba_write_leb
< many reschedules and sync-7493 still in ubi_eba_write_leb>
  sync-7493    0.... 1348781410us : leb_write_unlock <-ubi_eba_write_leb
  sync-7493    0.... 1348781413us : add_preempt_count <-leb_write_unlock
  sync-7493    0...1 1348781418us : up_write <-leb_write_unlock
...

I am not sure that we can know which LEBs are involved.  However, I see
cases for a double free and other issues.  I think we need
'atomic_dec_and_test()' on the leb->users and to check this in the tree
lookup.  For instance, in 'leb_write_unlock()', the call to 'up_write()'
can cause a reschedule.  Say we enter 'leb_write_unlock()' with 'users =
2'.  The count will decrement to one and then we may reschedule in
'up_write'.  The 2nd UBI task may decrement users and inspect the count
and call 'kfree'.  Then, when we return to the original 'leb_write_unlock()',
we will inspect 'le->users' and it will be zero.

static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
{
	struct ubi_ltree_entry *le;

	spin_lock(&ubi->ltree_lock);
	le = ltree_lookup(ubi, vol_id, lnum);
	le->users -= 1;
	ubi_assert(le->users >= 0);
	up_write(&le->mutex);         /* can cause reschedule */
	if (le->users == 0) {
		rb_erase(&le->rb, &ubi->ltree);
		kfree(le);
	}
	spin_unlock(&ubi->ltree_lock);
}

Maybe there are better kernel designs/APIs that can wrap this.
However, I don't think the 'ubi_ltree_entry' is perfectly race free.
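
To make the 'atomic_dec_and_test()' idea concrete, something like the
following -- only an untested sketch, and the atomic_t type for 'users'
is my change, not what the driver currently has:

static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
{
	struct ubi_ltree_entry *le;

	spin_lock(&ubi->ltree_lock);
	le = ltree_lookup(ubi, vol_id, lnum);
	up_write(&le->mutex);
	/* the decrement and the zero test are one atomic step, so no
	 * other task can slip in between the '-= 1' and the
	 * 'users == 0' check even if up_write() reschedules */
	if (atomic_dec_and_test(&le->users)) {
		rb_erase(&le->rb, &ubi->ltree);
		kfree(le);
	}
	spin_unlock(&ubi->ltree_lock);
}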

On the IMX series, the MTD driver does a 'read-modify-write' to support
sub-pages.  The driver cannot write a single subpage at a time.  The MTD
driver is also interrupt driven and several reschedules happened during
ubi_io_write().

I don't think the spin_lock() will do anything on a UP system like the
ARM926's that have encountered this issue.

My compiler is reloading 'le->users' in the 'if' path.  It seems odd
that the 'wait_list' in the rwsem would always be NULL in this case.
However, the wake may have rescheduled the blocked task and the stack
was re-used and set to NULL/zero.  I think races with 'le->users' are an
issue; I don't know that it is this issue.  I think it is also possible
for the 'ltree_lookup()' to return an entry with 'le->users = 0', which
would indicate someone is about to use 'kfree()'.  I guess this is what
Artem suspected?

Fwiw,
Bill Pringlemeir.
Richard Weinberger Feb. 21, 2014, 6:12 p.m. UTC | #23
Am 21.02.2014 18:53, schrieb Bill Pringlemeir:
> On 21 Feb 2014, Thorsten.Wiedemer@lawo.com wrote:
> 
>> Here are the last parts of the two hopefully "valuable" traces. The
>> pieces show what happens between the leb_write_lock() and the
>> leb_write_unlock() of the process which triggers the oops.  If this is
>> not enough, I can also provide more ...  Watch the PIDs; there
>> are several processes with the same name running.  The two traces did
>> not result from identical test cases, so there are some different
>> processes running.
> 
>> The part of the first trace fits in one paste:
> 
>> http://pastebin.com/TL3yNVcw
> 
> I am not sure if this is a complete trace?  I don't understand why we
> would start off with 'leb_write_unlock'...  However,
> 
> $ grep -E 'leb|ltree' ubi.crash.raw # slight edits
>   sync-7493    0.... 1348758715us : leb_write_unlock <-ubi_eba_write_leb
>   sync-7493    0.... 1348758719us : add_preempt_count <-leb_write_unlock
>   sync-7493    0...1 1348758724us : up_write <-leb_write_unlock
>   sync-7493    0...1 1348758739us : kfree <-leb_write_unlock
>   sync-7493    0...1 1348758746us : sub_preempt_count <-leb_write_unlock
>   sync-7493    0.... 1348758954us : ubifs_leb_write <-ubifs_wbuf_sync_nolock
>   sync-7493    0.... 1348758960us : ubi_leb_write <-ubifs_leb_write
>   sync-7493    0.... 1348758965us : ubi_eba_write_leb <-ubi_leb_write
>   sync-7493    0.... 1348758969us : leb_write_lock <-ubi_eba_write_leb
>   sync-7493    0.... 1348758973us : ltree_add_entry <-leb_write_lock
>   sync-7493    0.... 1348758977us : kmem_cache_alloc_trace <-ltree_add_entry
>   sync-7493    0.... 1348758983us : add_preempt_count <-ltree_add_entry
>   sync-7493    0...1 1348758989us : sub_preempt_count <-ltree_add_entry
>   sync-7493    0.... 1348758994us : kfree <-ltree_add_entry
>   sync-7493    0.... 1348758998us : down_write <-leb_write_lock
>   sync-7493    0.... 1348759010us : ubi_io_write <-ubi_eba_write_leb
> < many reschedules, but  sync-7493 still active.>
>    sync-7492    0.... 1348761397us : leb_write_unlock <-ubi_eba_write_leb
>    sync-7492    0.... 1348761400us : add_preempt_count <-leb_write_unlock
>    sync-7492    0...1 1348761406us : up_write <-leb_write_unlock
>    sync-7492    0...1 1348761419us : kfree <-leb_write_unlock
>    sync-7492    0...1 1348761425us : sub_preempt_count <-leb_write_unlock
>    sync-7492    0.... 1348761580us : ubifs_leb_write <-ubifs_write_node
>    sync-7492    0.... 1348761585us : ubi_leb_write <-ubifs_leb_write
>    sync-7492    0.... 1348761591us : ubi_eba_write_leb <-ubi_leb_write
>    sync-7492    0.... 1348761595us : leb_write_lock <-ubi_eba_write_leb
>    sync-7492    0.... 1348761599us : ltree_add_entry <-leb_write_lock
>    sync-7492    0.... 1348761603us : kmem_cache_alloc_trace <-ltree_add_entry
>    sync-7492    0.... 1348761609us : add_preempt_count <-ltree_add_entry
>    sync-7492    0...1 1348761615us : sub_preempt_count <-ltree_add_entry
>    sync-7492    0.... 1348761619us : kfree <-ltree_add_entry
>    sync-7492    0.... 1348761622us : down_write <-leb_write_lock
>    sync-7492    0.... 1348761635us : ubi_io_write <-ubi_eba_write_leb
>    sync-7492    0.... 1348771081us : leb_write_unlock <-ubi_eba_write_leb
>    sync-7492    0.... 1348771084us : add_preempt_count <-leb_write_unlock
>    sync-7492    0...1 1348771090us : up_write <-leb_write_unlock
>    sync-7492    0...1 1348771102us : kfree <-leb_write_unlock
>    sync-7492    0...1 1348771109us : sub_preempt_count <-leb_write_unlock
>    sync-7492    0.... 1348771269us : ubifs_leb_write <-ubifs_write_node
>    sync-7492    0.... 1348771275us : ubi_leb_write <-ubifs_leb_write
>    sync-7492    0.... 1348771280us : ubi_eba_write_leb <-ubi_leb_write
>    sync-7492    0.... 1348771285us : leb_write_lock <-ubi_eba_write_leb
>    sync-7492    0.... 1348771289us : ltree_add_entry <-leb_write_lock
>    sync-7492    0.... 1348771292us : kmem_cache_alloc_trace <-ltree_add_entry
>    sync-7492    0.... 1348771299us : add_preempt_count <-ltree_add_entry
>    sync-7492    0...1 1348771304us : sub_preempt_count <-ltree_add_entry
>    sync-7492    0.... 1348771308us : kfree <-ltree_add_entry
>    sync-7492    0.... 1348771311us : down_write <-leb_write_lock
>    sync-7492    0.... 1348771324us : ubi_io_write <-ubi_eba_write_leb
> < many reschedules and sync-7493 still in ubi_eba_write_leb>
>   sync-7493    0.... 1348781410us : leb_write_unlock <-ubi_eba_write_leb
>   sync-7493    0.... 1348781413us : add_preempt_count <-leb_write_unlock
>   sync-7493    0...1 1348781418us : up_write <-leb_write_unlock
> ...
> 
> I am not sure that we can know which LEBs are involved.  However, I see
> cases for a double free and other issues.  I think we need
> 'atomic_dec_and_test()' on the leb->users and to check this in the tree
> lookup.  For instance, in 'leb_write_unlock()', the call to 'up_write()'
> can cause a reschedule.  Say we enter 'leb_write_unlock()' with 'users =
> 2'.  The count will decrement to one and then we may reschedule in
> 'up_write'.  The 2nd UBI task may decrement users and inspect the count
> and call 'kfree'.  Then, when we return to the original 'leb_write_unlock()',
> we will inspect 'le->users' and it will be zero.
> 
> static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
> {
> 	struct ubi_ltree_entry *le;
> 
> 	spin_lock(&ubi->ltree_lock);
> 	le = ltree_lookup(ubi, vol_id, lnum);
> 	le->users -= 1;
> 	ubi_assert(le->users >= 0);
> 	up_write(&le->mutex);         /* can cause reschedule */
> 	if (le->users == 0) {
> 		rb_erase(&le->rb, &ubi->ltree);
> 		kfree(le);
> 	}
> 	spin_unlock(&ubi->ltree_lock);
> }
> 
> Maybe there are better kernel designs/APIs that can wrap this.
> However, I don't think the 'ubi_ltree_entry' is perfectly race free.
> 
> On the IMX series, the MTD driver does a 'read-modify-write' to support
> sub-pages.  The driver cannot write a single subpage at a time.  The MTD
> driver is also interrupt driven and several reschedules happened during
> ubi_io_write().
> 
> I don't think the spin_lock() will do anything on a UP system like the
> ARM926's that have encountered this issue.

Also on UP a spin_lock disables preemption.
So, le->users is protected.

Thanks,
//richard
Bill Pringlemeir Feb. 21, 2014, 7:45 p.m. UTC | #24
> Am 21.02.2014 18:53, schrieb Bill Pringlemeir:
>> I don't think the spin_lock() will do anything on a UP system like the
>> ARM926's that have encountered this issue.

On 21 Feb 2014, richard@nod.at wrote:

> Also on UP a spin_lock disables preemption.
> So, le->users is protected.

Thanks, now this makes sense...

>   sync-7493    0.... 1348758977us : kmem_cache_alloc_trace <-ltree_add_entry
>   sync-7493    0.... 1348758983us : add_preempt_count <-ltree_add_entry
>   sync-7493    0...1 1348758989us : sub_preempt_count <-ltree_add_entry
>   sync-7493    0.... 1348758994us : kfree <-ltree_add_entry

that those are the spinlocks.  I say something dumb and get to learn
something.
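
For my own notes: on a UP kernel with CONFIG_PREEMPT the spinlock pair
reduces to preemption control, roughly like this (hand illustration,
not the real header; the function names are made up):

static inline void up_spin_lock(spinlock_t *lock)
{
	preempt_disable();	/* the add_preempt_count in the trace */
}

static inline void up_spin_unlock(spinlock_t *lock)
{
	preempt_enable();	/* sub_preempt_count; may reschedule */
}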

It looks like ubi_eba_write_leb() is running retries when this happens
in the pre-empted sync-7493 and then sync-7492 goes in and modifies the
tree and some entries.

If the ltree lock is different than the leb write mutex, could the
'down_write' and 'up_write' use different mutexes?  I guess the 'lnum'
is the same and the 'users' count should prevent this.

I don't see any clues from the function tracing then.

Also, it seems that 'sem->wait_list.next' is NULL and this is in the
space allocated by the 'ltree_entry'.  Ie, the head of the list has
NULL; I was thinking something within the list was NULL.

Regards,
Bill Pringlemeir.
Bill Pringlemeir Feb. 22, 2014, 12:49 a.m. UTC | #25
>> Am 21.02.2014 18:53, schrieb Bill Pringlemeir:

>>> I don't think the spin_lock() will do anything on a UP system like
>>> the ARM926's that have encountered this issue.

> On 21 Feb 2014, richard@nod.at wrote:

>> Also on UP a spin_lock disables preemption.  So, le->users is
>> protected.

On 21 Feb 2014, bpringlemeir@nbsps.com wrote:

> Thanks, now this makes sense...

[snip]

> Also, it seems that 'sem->wait_list.next' is NULL and this is in the
> space allocated by the 'ltree_entry'.  Ie, the head of the list has
> NULL; I was thinking something within the list was NULL.

I think the 'ubi_eba_copy_leb()' may have an issue.

int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
		     struct ubi_vid_hdr *vid_hdr)
{
...

	err = leb_write_trylock(ubi, vol_id, lnum);

static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
{
..
	le = ltree_add_entry(ubi, vol_id, lnum);  /* users + 1 */
	if (IS_ERR(le))
		return PTR_ERR(le);
	if (down_write_trylock(&le->mutex))
		return 0;

	/* Contention, cancel */
	spin_lock(&ubi->ltree_lock);
	le->users -= 1;                           /* user - 1 */
...
	spin_unlock(&ubi->ltree_lock);           

	return 1;
}

	if (err)...

	if (vol->eba_tbl[lnum] != from) {
		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
		       vol_id, lnum, from, vol->eba_tbl[lnum]);
		err = MOVE_CANCEL_RACE;
		goto out_unlock_leb;
	}
...
out_unlock_leb:
	leb_write_unlock(ubi, vol_id, lnum);     /* user - 1 */


Didn't the count have to bump up in this case?  This isn't in Thorsten's
traces, but neither are any 'down_read' or 'up_read' traces; maybe
everything is in cache?

Regards,
Bill Pringlemeir.
Richard Weinberger Feb. 22, 2014, 8:32 a.m. UTC | #26
Am 22.02.2014 01:49, schrieb Bill Pringlemeir:
>>> Am 21.02.2014 18:53, schrieb Bill Pringlemeir:
> 
>>>> I don't think the spin_lock() will do anything on a UP system like
>>>> the ARM926's that have encountered this issue.
> 
>> On 21 Feb 2014, richard@nod.at wrote:
> 
>>> Also on UP a spin_lock disables preemption.  So, le->users is
>>> protected.
> 
> On 21 Feb 2014, bpringlemeir@nbsps.com wrote:
> 
>> Thanks, now this makes sense...
> 
> [snip]
> 
>> Also, it seems that 'sem->wait_list.next' is NULL and this is in the
>> space allocated by the 'ltree_entry'.  Ie, the head of the list has
>> NULL; I was thinking something within the list was NULL.
> 
> I think the 'ubi_eba_copy_leb()' may have an issue.
> 
> int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
> 		     struct ubi_vid_hdr *vid_hdr)
> {
> ...
> 
> 	err = leb_write_trylock(ubi, vol_id, lnum);
> 
> static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
> {
> ..
> 	le = ltree_add_entry(ubi, vol_id, lnum);  /* users + 1 */
> 	if (IS_ERR(le))
> 		return PTR_ERR(le);
> 	if (down_write_trylock(&le->mutex))
> 		return 0;
> 
> 	/* Contention, cancel */
> 	spin_lock(&ubi->ltree_lock);
> 	le->users -= 1;                           /* user - 1 */
> ...
> 	spin_unlock(&ubi->ltree_lock);           
> 
> 	return 1;
> }
> 
> 	if (err)...
> 
> 	if (vol->eba_tbl[lnum] != from) {
> 		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
> 		       vol_id, lnum, from, vol->eba_tbl[lnum]);
> 		err = MOVE_CANCEL_RACE;
> 		goto out_unlock_leb;
> 	}
> ...
> out_unlock_leb:
> 	leb_write_unlock(ubi, vol_id, lnum);     /* user - 1 */
> 
> 
> Didn't the count have to bump up in this case?  This isn't in Thorsten's
> traces, but neither are any 'down_read' or 'up_read' traces; maybe
> everything is in cache?

Hmm, I'm not sure whether I was able to follow your thought.
But leb_write_unlock() is balanced with leb_write_trylock() in ubi_eba_copy_leb(), which
makes perfect sense to me.
What exactly is the problem?

Thanks,
//richard
Bill Pringlemeir Feb. 24, 2014, 3:09 p.m. UTC | #27
On 22 Feb 2014, richard@nod.at wrote:

> Am 22.02.2014 01:49, schrieb Bill Pringlemeir:

>> I think the 'ubi_eba_copy_leb()' may have an issue.
>>
>> int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
>> 		     struct ubi_vid_hdr *vid_hdr)
>> {
>> ...
>>
>> 	err = leb_write_trylock(ubi, vol_id, lnum);
>>
>> static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
>> {
>> ..
>> 	le = ltree_add_entry(ubi, vol_id, lnum);  /* users + 1 */
>> 	if (IS_ERR(le))
>> 		return PTR_ERR(le);
>> 	if (down_write_trylock(&le->mutex))
>> 		return 0;
>>
>> 	/* Contention, cancel */
>> 	spin_lock(&ubi->ltree_lock);
>> 	le->users -= 1;                           /* user - 1 */
>> ...
>> 	spin_unlock(&ubi->ltree_lock);           
>>
>> 	return 1;
>> }
>>
>> 	if (err)...
>>
	if (vol->eba_tbl[lnum] != from) {
		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
		       vol_id, lnum, from, vol->eba_tbl[lnum]);
		err = MOVE_CANCEL_RACE;
		goto out_unlock_leb;
	}
...
out_unlock_leb:
	leb_write_unlock(ubi, vol_id, lnum);     /* user - 1 */


>> Didn't the count have to bump up in this case?  This isn't in Thorsten's
>> traces, but neither are any 'down_read' or 'up_read' traces; maybe
>> everything is in cache?

> Hmm, I'm not sure whether I was able to follow your thought.  But
> leb_write_unlock() is balanced with leb_write_trylock() in
> ubi_eba_copy_leb() which makes perfectly sense to me.  What exactly is
> the problem?

There are two things that must be balanced.  The 'reference count'
ubi_ltree_entry -> users and the rw_semaphore down/up.  You are right,
the trylock needs to be balanced by the 'leb_write_unlock'.  However,
the 'leb_write_trylock()' has already decremented 'users' in preparation
to move the 'lnum'.  However, in the case of contention,
'ubi_eba_copy_leb' bails and does the 'leb_write_unlock()', which
balances the 'trylock', but unbalances the 'users' reference count (I
added some comments on the lines).

I was postulating that the down/up paths are not on the same semaphore.
There is nothing in the code to check this.  It is possible for someone
to come in and recycle the 'lnum' and create a new rwsemaphore.  Here,
the down/up are done on different 'rwsemaphores' and this will cause
issues.

I was looking at using a 'kref' instead of 'users' and then I found this
path.

Fwiw,
Bill Pringlemeir.
Richard Weinberger Feb. 24, 2014, 3:36 p.m. UTC | #28
Am 24.02.2014 16:09, schrieb Bill Pringlemeir:
> On 22 Feb 2014, richard@nod.at wrote:
> 
>> Am 22.02.2014 01:49, schrieb Bill Pringlemeir:
> 
>>> I think the 'ubi_eba_copy_leb()' may have an issue.
>>>
>>> int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
>>> 		     struct ubi_vid_hdr *vid_hdr)
>>> {
>>> ...
>>>
>>> 	err = leb_write_trylock(ubi, vol_id, lnum);
>>>
>>> static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
>>> {
>>> ..
>>> 	le = ltree_add_entry(ubi, vol_id, lnum);  /* users + 1 */
>>> 	if (IS_ERR(le))
>>> 		return PTR_ERR(le);
>>> 	if (down_write_trylock(&le->mutex))
>>> 		return 0;
>>>
>>> 	/* Contention, cancel */
>>> 	spin_lock(&ubi->ltree_lock);
>>> 	le->users -= 1;                           /* user - 1 */
>>> ...
>>> 	spin_unlock(&ubi->ltree_lock);           
>>>
>>> 	return 1;
>>> }
>>>
>>> 	if (err)...
>>>
> 	if (vol->eba_tbl[lnum] != from) {
> 		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
> 		       vol_id, lnum, from, vol->eba_tbl[lnum]);
> 		err = MOVE_CANCEL_RACE;
> 		goto out_unlock_leb;
> 	}
> ...
> out_unlock_leb:
> 	leb_write_unlock(ubi, vol_id, lnum);     /* user - 1 */
> 
> 
>>> Didn't the count have to bump up in this case?  This isn't in Thorsten's
>>> traces, but neither are any 'down_read' or 'up_read' traces; maybe
>>> everything is in cache?
> 
>> Hmm, I'm not sure whether I was able to follow your thought.  But
>> leb_write_unlock() is balanced with leb_write_trylock() in
>> ubi_eba_copy_leb() which makes perfectly sense to me.  What exactly is
>> the problem?
> 
> There are two things that must be balanced.  The 'reference count'
> ubi_ltree_entry -> users and the rw_semaphore down/up.  You are right,
> the trylock needs to be balanced by the 'leb_write_unlock'.  However,
> the 'leb_write_trylock()' has already decremented 'users' in preparation
> to move the 'lnum'.  However, in the case of contention,
> 'ubi_eba_copy_leb' bails and does the 'leb_write_unlock()', which
> balances the 'trylock', but unbalances the 'users' reference count (I
> added some comments on the lines).

My first thought here is "If this is true, why does ubi_assert(le->users >= 0) not trigger"?

Thanks,
//richard
Bill Pringlemeir Feb. 24, 2014, 3:45 p.m. UTC | #29
>> On 22 Feb 2014, richard@nod.at wrote:

>>> Hmm, I'm not sure whether I was able to follow your thought.  But
>>> leb_write_unlock() is balanced with leb_write_trylock() in
>>> ubi_eba_copy_leb() which makes perfectly sense to me.  What exactly is
>>> the problem?

> Am 24.02.2014 16:09, schrieb Bill Pringlemeir:

>> There are two things that must be balanced.  The 'reference count'
>> ubi_ltree_entry -> users and the rw_semaphore down/up.  You are right,
>> the trylock needs to be balanced by the 'leb_write_unlock'.  However,
>> the 'leb_write_trylock()' has already decremented 'users' in preparation
>> to move the 'lnum'.  However, in the case of contention,
>> 'ubi_eba_copy_leb' bails and does the 'leb_write_unlock()', which
>> balances the 'trylock', but unbalances the 'users' reference count (I
>> added some comments on the lines).

On 24 Feb 2014, richard@nod.at wrote:

> My first thought here is "If this is true, why does
> ubi_assert(le->users >= 0) not trigger"?

A call to 'ltree_add_entry()' may add a completely new entry for the
'lnum'.  The context switches may happen at any point that the spinlock
is not held.

Here is ubi_eba_copy_leb() with just mutex and reference count.

leb_write_trylock -> ltree_add_entry(ubi, vol_id, lnum): creates a new
                        entry or finds the old one.
/* could reschedule here... */
leb_write_trylock -> down_write_trylock: now holds the write rwsem.
/* could reschedule here... */
leb_write_trylock -> takes the spin lock and decrements 'users'.
/* could reschedule here... */
on 'if (vol->eba_tbl[lnum] != from)' another thread has this
                        'ltree_entry', so the count is >1.
/* could reschedule here... */
call leb_write_unlock() and destroy the in-use ltree_entry.

Anyone calling 'ltree_add_entry' may create a new entry.  Also, as the
entry has been freed, the memory will be recycled and 'users' could be
anything in a freed node.  It is puzzling, if this is related to the
problem that Thorsten and others have seen, that the 'assert' never
fires.  However, this path seems to violate the reference count and
double decrements.  I am pretty sure it is an issue, although it may be
unrelated and latent (never triggered).  However, some of the same
'suspects' are involved so I think it is a possibility to explore.

Fwiw,
Bill Pringlemeir.
Bill Pringlemeir Feb. 24, 2014, 3:48 p.m. UTC | #30
On 24 Feb 2014, bpringlemeir@nbsps.com wrote:

> /* could reschedule here... */
> on 'if (vol->eba_tbl[lnum] != from)' another thread has this
> 'ltree_entry' so count is >1.
> /* could reschedule here... */

Sorry, maybe that is not true.  I don't know enough about the logic of
ubi_eba_copy_leb().  Another thread has at least changed the physical
map.  It may/may not still have the rwsemaphore.
Richard Weinberger March 5, 2014, 8:57 p.m. UTC | #31
Am 24.02.2014 16:48, schrieb Bill Pringlemeir:
> On 24 Feb 2014, bpringlemeir@nbsps.com wrote:
> 
>> /* could reschedule here... */
>> on 'if (vol->eba_tbl[lnum] != from)' another thread has this
>> 'ltree_entry' so count is >1.
>> /* could reschedule here... */
> 
> Sorry, maybe that is not true.  I don't know enough about the logic of
> ubi_eba_copy_leb().  Another thread has at least changed the physical
> map.  It may/may not still have the rwsemaphore.

Thorsten and Emanuel from Lawo AG kindly gave me access to one of their boards
such that I was able to work directly on the issue.

The solution is rather trivial.
Two commits were missing in their tree:

commit 8afd500cb52a5d00bab4525dd5a560d199f979b9
Author: Adam Thomas <adamthomas1111@gmail.com>
Date:   Sat Feb 2 22:35:08 2013 +0000

    UBIFS: fix double free of ubifs_orphan objects

and

commit 2928f0d0c5ebd6c9605c0d98207a44376387c298
Author: Adam Thomas <adamthomas1111@gmail.com>
Date:   Sat Feb 2 22:32:31 2013 +0000

    UBIFS: fix use of freed ubifs_orphan objects
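
For anyone else on a similar 3.6-based tree, re-applying them should be
roughly (in commit order; only tested on our setup):

$ git cherry-pick 2928f0d0c5eb 8afd500cb52a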

Bill, I'm very sure this fixes also the issue you face.

It is not obvious that these two commits fix the issue.
I've spent some time figuring out what exactly happens.

In most cases up_write() crashes because "waiter" in __rwsem_do_wake()
or __rwsem_wake_one_writer() somehow became NULL.
It turned out that sem->wait_list gets overwritten.

"UBIFS: fix double free of ubifs_orphan objects" fixes more than a double free.
It addresses an issue where an orphan stays too long on the orphan list and therefore could
be free()'ed twice. But another issue is that in ubifs_orphan_start_commit()
an already free()'ed is set to NULL.
Namely by the line "*last = NULL;".
This is the line of code where the memory corruption happens.
I've double checked it by adding a few trace_printk() and BUG_ON().
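
The shape of the code is roughly this (paraphrased from the 3.6-era
fs/ubifs/orphan.c, not a verbatim quote):

	last = &c->orph_cnext;
	list_for_each_entry(orphan, &c->orph_new, new_list) {
		/* chain up the new orphans for the commit */
		*last = orphan;
		last = &orphan->cnext;
	}
	/*
	 * If the last orphan on orph_new was already kfree()'d, 'last'
	 * points into freed memory here.  When that slab object has
	 * been reused for a ubi_ltree_entry, the store below lands
	 * inside the entry's rw_semaphore and zeroes wait_list.next --
	 * exactly the NULL 'waiter' the oops trips over.
	 */
	*last = NULL;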

Thanks,
//richard
Bill Pringlemeir March 5, 2014, 9:30 p.m. UTC | #32
On  5 Mar 2014, richard@nod.at wrote:

> Thorsten and Emanuel from Lawo AG kindly gave me access to one of
> their boards such that I was able to work directly on the issue.

> The solution is rather trivial.
> Two commits were missing in their tree:

> commit 8afd500cb52a5d00bab4525dd5a560d199f979b9
> Author: Adam Thomas <adamthomas1111@gmail.com>
> Date:   Sat Feb 2 22:35:08 2013 +0000

> UBIFS: fix double free of ubifs_orphan objects

> and

> commit 2928f0d0c5ebd6c9605c0d98207a44376387c298
> Author: Adam Thomas <adamthomas1111@gmail.com>
> Date:   Sat Feb 2 22:32:31 2013 +0000

> UBIFS: fix use of freed ubifs_orphan objects

> Bill, I'm very sure this fixes also the issue you face.

I haven't seen this issue recently.  It was only with older versions.  I
have both of those commits as well and wasn't able to reproduce it.  So
I have no issues.

Still, the logic of ubi_eba_copy_leb() seems really weird to me.  I
don't understand,

 	err = leb_write_trylock(ubi, vol_id, lnum);

followed by,

	leb_write_unlock(ubi, vol_id, lnum);

It looks like a double reference count decrement.  Maybe they operate on
different ltree nodes, but then the locks mismatch.

Fwiw,
Bill Pringlemeir.
Bill Pringlemeir March 5, 2014, 9:42 p.m. UTC | #33
On  5 Mar 2014, bpringlemeir@nbsps.com wrote:

> On  5 Mar 2014, richard@nod.at wrote:

>> The solution is rather trivial.

[snip]

> Still, the logic of ubi_eba_copy_leb() seems really weird to me.  I
> don't understand,

> It looks like a double reference count decrement.  Maybe they operate on
> different ltree nodes, but then the locks mismatch.

	if (down_write_trylock(&le->mutex))
		return 0;

Ah, I misunderstood.  The path where the lock is taken doesn't
decrement the 'users' count; I was thinking of '0' as fail.

Regards,
Bill Pringlemeir.
Richard Weinberger March 5, 2014, 11:11 p.m. UTC | #34
Am 05.03.2014 22:42, schrieb Bill Pringlemeir:
> On  5 Mar 2014, bpringlemeir@nbsps.com wrote:
> 
>> On  5 Mar 2014, richard@nod.at wrote:
> 
>>> The solution is rather trivial.
> 
> [snip]
> 
>> Still, the logic of ubi_eba_copy_leb() seems really weird to me.  I
>> don't understand,
> 
>> It looks like a double reference count decrement.  Maybe they operate on
>> different ltree nodes, but then the locks mismatch.
> 
> 	if (down_write_trylock(&le->mutex))
> 		return 0;
> 
> Ah, I misunderstood.  The path where the lock is taken doesn't
> decrement the 'users' count; I was thinking of '0' as fail.

Welcome to UNIX where 0 is success. :-)

Thanks,
//richard
diff mbox

Patch

diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
index 0e11671d..48fd2aa 100644
--- a/drivers/mtd/ubi/eba.c
+++ b/drivers/mtd/ubi/eba.c
@@ -301,6 +301,7 @@  static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)

        spin_lock(&ubi->ltree_lock);
        le = ltree_lookup(ubi, vol_id, lnum);
+       ubi_assert(le);
        le->users -= 1;
        ubi_assert(le->users >= 0);
        up_write(&le->mutex);