UBI leb_write_unlock NULL pointer Oops (continuation)

Message ID	52EF9FFE.4020405@nod.at
State	RFC
Headers	show Return-Path: <linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org> ppid: 12325, pid: 12329, t: 0.0860s scanners: attach: 1.3.1 clamav: 0.96.5/m: Message-ID: <52EF9FFE.4020405@nod.at> Date: Mon, 03 Feb 2014 14:56:14 +0100 From: Richard Weinberger <richard@nod.at> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: "Wiedemer, Thorsten (Lawo AG)" <Thorsten.Wiedemer@lawo.com> Subject: Re: UBI leb_write_unlock NULL pointer Oops (continuation) References: <D7B1B5F4F3F27A4CB073BF422331203F2A18997F1F@Exchange1.lawo.de> <CAFLxGvya5WXoKcYmOgeM_SmVVEht1jEzeLG9vHhwFudFU+Ny8A@mail.gmail.com> <D7B1B5F4F3F27A4CB073BF422331203F2A18997F8B@Exchange1.lawo.de> <52EF772D.8080207@nod.at> <D7B1B5F4F3F27A4CB073BF422331203F2A18DD7989@Exchange1.lawo.de> In-Reply-To: <D7B1B5F4F3F27A4CB073BF422331203F2A18DD7989@Exchange1.lawo.de> summary: Content analysis details: (-1.9 points) pts rule name description ---- ---------------------- -------------------------------------------------- -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Cc: "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org> Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-mtd" <linux-mtd-bounces@lists.infradead.org> Errors-To: linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org

Richard Weinberger Feb. 3, 2014, 1:56 p.m. UTC

Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
> Hi,
> 
> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about identical 70 devices.
> A test over the last weekend resultet In 6 devices showing the bug.
> What we have are multiple processes which write in different intervals some data on the device and sync it, because this data should be available after a power cut.
> Perhaps I can force the error more often in writing test processes with shorter write/sync intervals.
> 
> If I have further access to the "big" setup for some days, I will try to make a test without preemption.

Hmm, ok.
Please also apply this patch, just in case...


Thanks,
//richard

Artem Bityutskiy Feb. 4, 2014, 7:22 a.m. UTC | #1

On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
> > Hi,
> > 
> > I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about identical 70 devices.
> > A test over the last weekend resultet In 6 devices showing the bug.
> > What we have are multiple processes which write in different intervals some data on the device and sync it, because this data should be available after a power cut.
> > Perhaps I can force the error more often in writing test processes with shorter write/sync intervals.
> > 
> > If I have further access to the "big" setup for some days, I will try to make a test without preemption.
> 
> Hmm, ok.
> Please also apply this patch, just in case...
> 
> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
> index 0e11671d..48fd2aa 100644
> --- a/drivers/mtd/ubi/eba.c
> +++ b/drivers/mtd/ubi/eba.c
> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
> 
>  	spin_lock(&ubi->ltree_lock);
>  	le = ltree_lookup(ubi, vol_id, lnum);
> +	ubi_assert(le);
>  	le->users -= 1;
>  	ubi_assert(le->users >= 0);
>  	up_write(&le->mutex);

The UBI LEB locking is a bit over-designed, it could be simplified, may
be this could help looking for the problem.

The this report does really sound like there is something specific to
Thorsten's system which corrupts memory.

And it is difficult to debug this via the mailing list. Thorsten should
start adding various checks like this and try to come closer to the
root-cause.

Richard Weinberger Feb. 4, 2014, 7:46 a.m. UTC | #2

Am 04.02.2014 08:22, schrieb Artem Bityutskiy:
> On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
>> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
>>> Hi,
>>>
>>> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about identical 70 devices.
>>> A test over the last weekend resultet In 6 devices showing the bug.
>>> What we have are multiple processes which write in different intervals some data on the device and sync it, because this data should be available after a power cut.
>>> Perhaps I can force the error more often in writing test processes with shorter write/sync intervals.
>>>
>>> If I have further access to the "big" setup for some days, I will try to make a test without preemption.
>>
>> Hmm, ok.
>> Please also apply this patch, just in case...
>>
>> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
>> index 0e11671d..48fd2aa 100644
>> --- a/drivers/mtd/ubi/eba.c
>> +++ b/drivers/mtd/ubi/eba.c
>> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
>>
>>  	spin_lock(&ubi->ltree_lock);
>>  	le = ltree_lookup(ubi, vol_id, lnum);
>> +	ubi_assert(le);
>>  	le->users -= 1;
>>  	ubi_assert(le->users >= 0);
>>  	up_write(&le->mutex);
> 
> The UBI LEB locking is a bit over-designed, it could be simplified, may
> be this could help looking for the problem.
> 
> The this report does really sound like there is something specific to
> Thorsten's system which corrupts memory.

Thorsten sees:
Dec 25 03:59:22 kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0x74/0xf0) from [<c02d0d10>] (ubi_eba_write_leb+0x94/0x820

In July 2013 we got this report from a user:
[  300.554525] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0xa0/0xf4) from [<802850e0>] (ubi_eba_write_leb+0x568/0x80c)

In both cases we fault at address 0000000c and leb_write_unlock() was called by ubi_eba_write_leb().

Same user saw the issue also in the read path:

[   38.471134] Unable to handle kernel NULL pointer dereference at
virtual address 00000000
(leb_read_unlock+0xa0/0xf4) from [<80285cdc>] (ubi_eba_read_leb+0x404/0x480)

In that case the fault happened at 00000000 directly.

A bit too deterministic for a memory corruption IMHO.

> And it is difficult to debug this via the mailing list. Thorsten should
> start adding various checks like this and try to come closer to the
> root-cause.

Yeah.
We also need more oopses, maybe we can spot a pattern.

Thanks,
//richard

Artem Bityutskiy Feb. 4, 2014, 7:54 a.m. UTC | #3

On Tue, 2014-02-04 at 08:46 +0100, Richard Weinberger wrote:
> Am 04.02.2014 08:22, schrieb Artem Bityutskiy:
> > On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
> >> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
> >>> Hi,
> >>>
> >>> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about identical 70 devices.
> >>> A test over the last weekend resultet In 6 devices showing the bug.
> >>> What we have are multiple processes which write in different intervals some data on the device and sync it, because this data should be available after a power cut.
> >>> Perhaps I can force the error more often in writing test processes with shorter write/sync intervals.
> >>>
> >>> If I have further access to the "big" setup for some days, I will try to make a test without preemption.
> >>
> >> Hmm, ok.
> >> Please also apply this patch, just in case...
> >>
> >> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
> >> index 0e11671d..48fd2aa 100644
> >> --- a/drivers/mtd/ubi/eba.c
> >> +++ b/drivers/mtd/ubi/eba.c
> >> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
> >>
> >>  	spin_lock(&ubi->ltree_lock);
> >>  	le = ltree_lookup(ubi, vol_id, lnum);
> >> +	ubi_assert(le);
> >>  	le->users -= 1;
> >>  	ubi_assert(le->users >= 0);
> >>  	up_write(&le->mutex);
> > 
> > The UBI LEB locking is a bit over-designed, it could be simplified, may
> > be this could help looking for the problem.
> > 
> > The this report does really sound like there is something specific to
> > Thorsten's system which corrupts memory.

May be. Although sometimes corruptions are also deterministics - a
buffer over-run at the same place causes the same side effects etc.

But in any case, the only way I know to deal with this issues is start
putting various prints and assertions, and trying to come closer to the
root-cause. Sometimes bisecting helps, but this case would be difficult
to bisect because the reproducability is hard. Indeed, one may think
that there is no failure duding a day, so the commit as 'good' while it
may be actually 'bad', the bug just happen to not manifest itself
quickly enough.

Bill Pringlemeir Feb. 4, 2014, 3:45 p.m. UTC | #4

>>>> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):

>>>>> I can reproduce it fairly regularly, but not really "quickly". At
>>>>> the moment, I can use a setup of about identical 70 devices.  A
>>>>> test over the last weekend resultet In 6 devices showing the bug.
>>>>> What we have are multiple processes which write in different
>>>>> intervals some data on the device and sync it, because this data
>>>>> should be available after a power cut.  Perhaps I can force the
>>>>> error more often in writing test processes with shorter write/sync
>>>>> intervals.
>>>>>
>>>>> If I have further access to the "big" setup for some days, I will
>>>>> try to make a test without preemption.

>>> On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:

>>>> Hmm, ok.
>>>> Please also apply this patch, just in case...

I don't think this patch will help.

On  4 Feb 2014, dedekind1 at gmail.com wrote:

> May be. Although sometimes corruptions are also deterministics - a
> buffer over-run at the same place causes the same side effects etc.

> But in any case, the only way I know to deal with this issues is start
> putting various prints and assertions, and trying to come closer to the
> root-cause. Sometimes bisecting helps, but this case would be difficult
> to bisect because the reproducability is hard. Indeed, one may think
> that there is no failure duding a day, so the commit as 'good' while it
> may be actually 'bad', the bug just happen to not manifest itself
> quickly enough.

I have seen the same crash on a 2.6.36 system with all of the UbiFs/Ubi
backported.  It is also an IMX25 based system.

We have,

 PC is at __up_write+0x3c/0x1a8  
 LR is at __up_write+0x24/0x1a8  
 pc : [<c0229400>]    lr : [<c02293e8>]    psr: a0000093  
 sp : c7225cc8  ip : 00020000  fp : c79fba00  
 r10: 00000523  r9 : 00000001  r8 : c7b33000  
 r7 : c793a800  r6 : c7bd473c  r5 : c7bd4738  r4 : c7bd4720  
 r3 : 00000000  r2 : 00000002  r1 : 00000001  r0 : 00000002  
 Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user  

 Code: e4863004 e5953004 e1560003 0a00002a (e593200c)

I run this,

   $ printf "\x04\x30\x86\xe4"
            "\x04\x30\x95\xe5" 
            "\x03\x00\x56\xe1"
            "\x2a\x00\x00\x0a"
            "\x0c\x20\x93\xe5" > crash.dump 
   $ objdump --disassemble-all -b binary -m arm crash.dump 
   crash.dump:     file format binary

   Disassembly of section .data:

   00000000 <.data>:
      0:   e4863004        str     r3, [r6], #4
      4:   e5953004        ldr     r3, [r5, #4]
      8:   e1560003        cmp     r6, r3
      c:   0a00002a        beq     0xbc
     10:   e593200c        ldr     r2, [r3, #12]

The values 'r6' and 'r5' are pointers and they are far from non-NULL and
look like good kernel data pointers.  Something in the list is 'NULL'.
There is a load of 'r3' as NULL and then the use of '[r3, #12]' which
gives the crash address of '0xc'.

Here is the objdump with source interspersed for my build,

        sem->activity = 0;
 350:   e3a0a000        mov     sl, #0
 354:   e1a05000        mov     r5, r0
 358:   e485a004        str     sl, [r5], #4
 * list_empty - tests whether a list is empty
 * @head: the list to test.
 */
static inline int list_empty(const struct list_head *head)
{
        return head->next == head;
 35c:   e5903004        ldr     r3, [r0, #4]
        if (!list_empty(&sem->wait_list))
 360:   e1550003        cmp     r5, r3
 364:   0a00002b        beq     418 <__up_write+0xfc>
        /* if we are allowed to wake writers try to grant a single write lock
         * if there's a writer at the front of the queue
         * - we leave the 'waiting count' incremented to signify potential
         *   contention
         */
        if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
 368:   e593200c        ldr     r2, [r3, #12]
 36c:   e2124002        ands    r4, r2, #2
 370:   0a000006        beq     390 <__up_write+0x74>
 374:   ea000034        b       44c <__up_write+0x130>

The compiler picks different registers,

  r3/sl+r3, r5/r0, r6/r5 but the code is the same.

The 'rw_semaphore' is 

    struct rw_semaphore {
            __s32		activity;
            struct list_head	wait_list;
    };

So, the 'wait_list' is non-NULL, the rw_semaphore is non-NULL, but
'wait_list->next' is NULL.  This seems to be very consistent with this
'oops'.

It seems that the "ltree_lock" doesn't protect the use of the
ltree_lookup() versus insertions and deletions?  Ie, ltree_lookup() may
return non-NULL, but an interrupt/pagefault before a 'le->users +/- =
1;' may mean the node is released?  On a UP system, does 'spin_lock()'
actually do anything?  The rw_semaphore uses spin_lock_irqsave().
However, that doesn't make sense as I think this occurs mainly on a
ARM926 system.

The ARM926 systems do not have proper 'lock free' idioms like
'ldrex/strex' and they try to do atomic operations by locking
interrupts.  I think that UbiFs/UBI maybe called on a 'data fault' or
'program fault' (in user space) when memory pressure is present.  I have
seen this occur in some sound drivers where the data source is coming
from disk (or maybe the driver uses vmalloc() or something).  So I think
on occasion, the ltree_lookup() may not work or there is something weird
with the atomic primatives and data/page faults.

Fwiw,
Bill Pringlemeir.

Bill Pringlemeir Feb. 4, 2014, 5:05 p.m. UTC | #5

On  4 Feb 2014, bpringlemeir@nbsps.com wrote:

> The ARM926 systems do not have proper 'lock free' idioms like
> 'ldrex/strex' and they try to do atomic operations by locking
> interrupts.  I think that UbiFs/UBI maybe called on a 'data fault' or
> 'program fault' (in user space) when memory pressure is present.  I have
> seen this occur in some sound drivers where the data source is coming
> from disk (or maybe the driver uses vmalloc() or something).  So I think
> on occasion, the ltree_lookup() may not work or there is something weird
> with the atomic primatives and data/page faults.


 https://www.google.ca/#q=site:infradead.org+leb_write_unlock+oops

 http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html

 at91sam9g20 - arm926, different MTD driver. Linux 3.6.9

 Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)

   0:   e5903004        ldr     r3, [r0, #4]
   4:   e58d2004        str     r2, [sp, #4]
   8:   e1560003        cmp     r6, r3
   c:   0a00002a        beq     0xbc
  10:   e593200c        ldr     r2, [r3, #12]

The code sequence looks identical and the Oops trace, etc is the same.
People from Pengutronix also indicated seeing the same type of Opps; I
think they deal with the IMX, but maybe this was on another board.

Regards,
Bill Pringlemeir.

Bill Pringlemeir Feb. 4, 2014, 7:57 p.m. UTC | #6

On  4 Feb 2014, bpringlemeir@nbsps.com wrote:

> http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html
>
> at91sam9g20 - arm926, different MTD driver. Linux 3.6.9
>
> Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)
>
> 0:   e5903004        ldr     r3, [r0, #4]
> 4:   e58d2004        str     r2, [sp, #4]
> 8:   e1560003        cmp     r6, r3
> c:   0a00002a        beq     0xbc
> 10:   e593200c        ldr     r2, [r3, #12]
>
> The code sequence looks identical and the Oops trace, etc is the same.
> People from Pengutronix also indicated seeing the same type of Opps; I
> think they deal with the IMX, but maybe this was on another board.

>>>> schrieb Wiedemer, Thorsten (Lawo AG):

> Ehmm, OK, OK, even with the changes in kernel, ubi_assert() in
> leb_write_unlock() wouldn't have triggered ...

Another up_read() crash,

 http://lists.infradead.org/pipermail/linux-mtd/2013-July/047512.html

 Code: e1530001 0a000016 e3e01000 e5801000 (e8930003)

 00000000 <.data>:
    0:   e1530001        cmp     r3, r1
    4:   0a000016        beq     0x64
    8:   e3e01000        mvn     r1, #0
    c:   e5801000        str     r1, [r0]
   10:   e8930003        ldm     r3, {r0, r1}

Thorsten's Oops,

 Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)

 00000000 <.data>:
    0:   e3e02000        mvn     r2, #0
    4:   e5842000        str     r2, [r4]
    8:   e59fc084        ldr     ip, [pc, #132]  ; 0x94
    c:   e59f0084        ldr     r0, [pc, #132]  ; 0x98
   10:   e8930006        ldm     r3, {r1, r2}

The registers are different, but the instruction sequence is similar.
In my ARM926 build, the __up_read() is,

static inline int list_empty(const struct list_head *head)
{
        return head->next == head;
 250:   e1a01000        mov     r1, r0
 254:   e5b12004        ldr     r2, [r1, #4]!
 258:   e1520001        cmp     r2, r1
 25c:   0a000017        beq     2c0 <__up_read+0xb0>
__rwsem_wake_one_writer(struct rw_semaphore *sem)
{
        struct rwsem_waiter *waiter;
        struct task_struct *tsk;

        sem->activity = -1;
 260:   e3e01000        mvn     r1, #0
 264:   e5801000        str     r1, [r0]
 * in an undefined state.
 */
#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
 268:   e8920003        ldm     r2, {r0, r1}
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
        next->prev = prev;
 26c:   e5801004        str     r1, [r0, #4]
        prev->next = next;
 270:   e5810000        str     r0, [r1]


This is the same symptom,

  __rwsem_wake_one_writer(struct rw_semaphore *sem)
  {
...
	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
	list_del(&waiter->list);

The sem->wait_list is non-NULL, but the 'sem->wait_list.next' is NULL. I
would suggest you try with 'DEBUG_LOCK_ALLOC' or something like this.
The crash points are not the failure, it is when we insert a
rw_semaphore of 'NULL' or use some memory that is already freed.

Fwiw,
Bill Pringlemeir.

Richard Weinberger Feb. 4, 2014, 8:07 p.m. UTC | #7

Am 04.02.2014 20:57, schrieb Bill Pringlemeir:
> On  4 Feb 2014, bpringlemeir@nbsps.com wrote:
> 
>> http://lists.infradead.org/pipermail/linux-mtd/2013-May/046907.html
>>
>> at91sam9g20 - arm926, different MTD driver. Linux 3.6.9
>>
>> Code: e5903004 e58d2004 e1560003 0a00002a (e593200c)
>>
>> 0:   e5903004        ldr     r3, [r0, #4]
>> 4:   e58d2004        str     r2, [sp, #4]
>> 8:   e1560003        cmp     r6, r3
>> c:   0a00002a        beq     0xbc
>> 10:   e593200c        ldr     r2, [r3, #12]
>>
>> The code sequence looks identical and the Oops trace, etc is the same.
>> People from Pengutronix also indicated seeing the same type of Opps; I
>> think they deal with the IMX, but maybe this was on another board.
> 
>>>>> schrieb Wiedemer, Thorsten (Lawo AG):
> 
>> Ehmm, OK, OK, even with the changes in kernel, ubi_assert() in
>> leb_write_unlock() wouldn't have triggered ...
> 
> Another up_read() crash,
> 
>  http://lists.infradead.org/pipermail/linux-mtd/2013-July/047512.html
> 
>  Code: e1530001 0a000016 e3e01000 e5801000 (e8930003)
> 
>  00000000 <.data>:
>     0:   e1530001        cmp     r3, r1
>     4:   0a000016        beq     0x64
>     8:   e3e01000        mvn     r1, #0
>     c:   e5801000        str     r1, [r0]
>    10:   e8930003        ldm     r3, {r0, r1}
> 
> Thorsten's Oops,
> 
>  Code: e3e02000 e5842000 e59fc084 e59f0084 (e8930006)
> 
>  00000000 <.data>:
>     0:   e3e02000        mvn     r2, #0
>     4:   e5842000        str     r2, [r4]
>     8:   e59fc084        ldr     ip, [pc, #132]  ; 0x94
>     c:   e59f0084        ldr     r0, [pc, #132]  ; 0x98
>    10:   e8930006        ldm     r3, {r1, r2}
> 
> The registers are different, but the instruction sequence is similar.
> In my ARM926 build, the __up_read() is,
> 
> static inline int list_empty(const struct list_head *head)
> {
>         return head->next == head;
>  250:   e1a01000        mov     r1, r0
>  254:   e5b12004        ldr     r2, [r1, #4]!
>  258:   e1520001        cmp     r2, r1
>  25c:   0a000017        beq     2c0 <__up_read+0xb0>
> __rwsem_wake_one_writer(struct rw_semaphore *sem)
> {
>         struct rwsem_waiter *waiter;
>         struct task_struct *tsk;
> 
>         sem->activity = -1;
>  260:   e3e01000        mvn     r1, #0
>  264:   e5801000        str     r1, [r0]
>  * in an undefined state.
>  */
> #ifndef CONFIG_DEBUG_LIST
> static inline void list_del(struct list_head *entry)
> {
>         __list_del(entry->prev, entry->next);
>  268:   e8920003        ldm     r2, {r0, r1}
>  * This is only for internal list manipulation where we know
>  * the prev/next entries already!
>  */
> static inline void __list_del(struct list_head * prev, struct list_head * next)
> {
>         next->prev = prev;
>  26c:   e5801004        str     r1, [r0, #4]
>         prev->next = next;
>  270:   e5810000        str     r0, [r1]
> 
> 
> This is the same symptom,
> 
>   __rwsem_wake_one_writer(struct rw_semaphore *sem)
>   {
> ...
> 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> 	list_del(&waiter->list);
> 
> The sem->wait_list is non-NULL, but the 'sem->wait_list.next' is NULL. I
> would suggest you try with 'DEBUG_LOCK_ALLOC' or something like this.
> The crash points are not the failure, it is when we insert a
> rw_semaphore of 'NULL' or use some memory that is already freed.

CONFIG_DEBUG_LIST please.

Thanks,
//richard

UBI leb_write_unlock NULL pointer Oops (continuation)

Commit Message

Comments

Patch