diff mbox

net: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected in skb_array_produce

Message ID 50038580.20299907.1486634551103.JavaMail.zimbra@redhat.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jason Wang Feb. 9, 2017, 10:02 a.m. UTC
----- Original Message -----
> Hello,
> 
> I've got the following report while running syzkaller fuzzer on mmotm
> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
> 

[...]

> 
> other info that might help us debug this:
> 
>  Possible interrupt unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&r->consumer_lock)->rlock);
>                                local_irq_disable();
>                                lock(&(&r->producer_lock)->rlock);
>                                lock(&(&r->consumer_lock)->rlock);
>   <Interrupt>
>     lock(&(&r->producer_lock)->rlock);
> 

Thanks a lot for the testing.

Looks like we could address this by using skb_array_consume_bh() instead.

Could you pls verify if the following patch works?

Comments

Dmitry Vyukov Feb. 9, 2017, 10:49 a.m. UTC | #1
On Thu, Feb 9, 2017 at 11:02 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> ----- Original Message -----
>> Hello,
>>
>> I've got the following report while running syzkaller fuzzer on mmotm
>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>>
>
> [...]
>
>>
>> other info that might help us debug this:
>>
>>  Possible interrupt unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&r->consumer_lock)->rlock);
>>                                local_irq_disable();
>>                                lock(&(&r->producer_lock)->rlock);
>>                                lock(&(&r->consumer_lock)->rlock);
>>   <Interrupt>
>>     lock(&(&r->producer_lock)->rlock);
>>
>
> Thanks a lot for the testing.
>
> Looks like we could address this by using skb_array_consume_bh() instead.
>
> Could you pls verify if the following patch works?

No, I can't test it, sorry. This happened once on bots. And bots
currently test only upstream versions.


> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8a7d6b9..a97c00d 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -520,7 +520,7 @@ static void tun_queue_purge(struct tun_file *tfile)
>  {
>         struct sk_buff *skb;
>
> -       while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
> +       while ((skb = skb_array_consume_bh(&tfile->tx_array)) != NULL)
>                 kfree_skb(skb);
>
>         skb_queue_purge(&tfile->sk.sk_write_queue);
> @@ -1458,7 +1458,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
>         struct sk_buff *skb = NULL;
>         int error = 0;
>
> -       skb = skb_array_consume(&tfile->tx_array);
> +       skb = skb_array_consume_bh(&tfile->tx_array);
>         if (skb)
>                 goto out;
>         if (noblock) {
> @@ -1470,7 +1470,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
>         current->state = TASK_INTERRUPTIBLE;
>
>         while (1) {
> -               skb = skb_array_consume(&tfile->tx_array);
> +               skb = skb_array_consume_bh(&tfile->tx_array);
>                 if (skb)
>                         break;
>                 if (signal_pending(current)) {
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Michael S. Tsirkin Feb. 9, 2017, 5:50 p.m. UTC | #2
On Thu, Feb 09, 2017 at 11:49:30AM +0100, Dmitry Vyukov wrote:
> On Thu, Feb 9, 2017 at 11:02 AM, Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > ----- Original Message -----
> >> Hello,
> >>
> >> I've got the following report while running syzkaller fuzzer on mmotm
> >> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> >> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
> >>
> >
> > [...]
> >
> >>
> >> other info that might help us debug this:
> >>
> >>  Possible interrupt unsafe locking scenario:
> >>
> >>        CPU0                    CPU1
> >>        ----                    ----
> >>   lock(&(&r->consumer_lock)->rlock);
> >>                                local_irq_disable();
> >>                                lock(&(&r->producer_lock)->rlock);
> >>                                lock(&(&r->consumer_lock)->rlock);
> >>   <Interrupt>
> >>     lock(&(&r->producer_lock)->rlock);
> >>
> >
> > Thanks a lot for the testing.
> >
> > Looks like we could address this by using skb_array_consume_bh() instead.
> >
> > Could you pls verify if the following patch works?
> 
> No, I can't test it, sorry. This happened once on bots. And bots
> currently test only upstream versions.

Which trees are tested? Will linux-next help?

> 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 8a7d6b9..a97c00d 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -520,7 +520,7 @@ static void tun_queue_purge(struct tun_file *tfile)
> >  {
> >         struct sk_buff *skb;
> >
> > -       while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
> > +       while ((skb = skb_array_consume_bh(&tfile->tx_array)) != NULL)
> >                 kfree_skb(skb);
> >
> >         skb_queue_purge(&tfile->sk.sk_write_queue);
> > @@ -1458,7 +1458,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
> >         struct sk_buff *skb = NULL;
> >         int error = 0;
> >
> > -       skb = skb_array_consume(&tfile->tx_array);
> > +       skb = skb_array_consume_bh(&tfile->tx_array);
> >         if (skb)
> >                 goto out;
> >         if (noblock) {
> > @@ -1470,7 +1470,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
> >         current->state = TASK_INTERRUPTIBLE;
> >
> >         while (1) {
> > -               skb = skb_array_consume(&tfile->tx_array);
> > +               skb = skb_array_consume_bh(&tfile->tx_array);
> >                 if (skb)
> >                         break;
> >                 if (signal_pending(current)) {
> >
> > --
> > You received this message because you are subscribed to the Google Groups "syzkaller" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
Dmitry Vyukov Feb. 9, 2017, 5:55 p.m. UTC | #3
On Thu, Feb 9, 2017 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Feb 09, 2017 at 11:49:30AM +0100, Dmitry Vyukov wrote:
>> On Thu, Feb 9, 2017 at 11:02 AM, Jason Wang <jasowang@redhat.com> wrote:
>> >
>> >
>> > ----- Original Message -----
>> >> Hello,
>> >>
>> >> I've got the following report while running syzkaller fuzzer on mmotm
>> >> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>> >> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>> >>
>> >
>> > [...]
>> >
>> >>
>> >> other info that might help us debug this:
>> >>
>> >>  Possible interrupt unsafe locking scenario:
>> >>
>> >>        CPU0                    CPU1
>> >>        ----                    ----
>> >>   lock(&(&r->consumer_lock)->rlock);
>> >>                                local_irq_disable();
>> >>                                lock(&(&r->producer_lock)->rlock);
>> >>                                lock(&(&r->consumer_lock)->rlock);
>> >>   <Interrupt>
>> >>     lock(&(&r->producer_lock)->rlock);
>> >>
>> >
>> > Thanks a lot for the testing.
>> >
>> > Looks like we could address this by using skb_array_consume_bh() instead.
>> >
>> > Could you pls verify if the following patch works?
>>
>> No, I can't test it, sorry. This happened once on bots. And bots
>> currently test only upstream versions.
>
> Which trees are tested? Will linux-next help?

Linus tree, linux-next and mmotm at the moment.
Michael S. Tsirkin Feb. 9, 2017, 6:10 p.m. UTC | #4
On Thu, Feb 09, 2017 at 05:02:31AM -0500, Jason Wang wrote:
> 
> 
> ----- Original Message -----
> > Hello,
> > 
> > I've got the following report while running syzkaller fuzzer on mmotm
> > (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> > remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
> > 
> 
> [...]
> 
> > 
> > other info that might help us debug this:
> > 
> >  Possible interrupt unsafe locking scenario:
> > 
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&(&r->consumer_lock)->rlock);
> >                                local_irq_disable();
> >                                lock(&(&r->producer_lock)->rlock);
> >                                lock(&(&r->consumer_lock)->rlock);
> >   <Interrupt>
> >     lock(&(&r->producer_lock)->rlock);
> > 
> 
> Thanks a lot for the testing.
> 
> Looks like we could address this by using skb_array_consume_bh() instead.
> 
> Could you pls verify if the following patch works?

I think we should use _bh for the produce call as well,
since resizing takes the producer lock.


> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8a7d6b9..a97c00d 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -520,7 +520,7 @@ static void tun_queue_purge(struct tun_file *tfile)
>  {
>         struct sk_buff *skb;
>  
> -       while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
> +       while ((skb = skb_array_consume_bh(&tfile->tx_array)) != NULL)
>                 kfree_skb(skb);
>  
>         skb_queue_purge(&tfile->sk.sk_write_queue);
> @@ -1458,7 +1458,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
>         struct sk_buff *skb = NULL;
>         int error = 0;
>  
> -       skb = skb_array_consume(&tfile->tx_array);
> +       skb = skb_array_consume_bh(&tfile->tx_array);
>         if (skb)
>                 goto out;
>         if (noblock) {
> @@ -1470,7 +1470,7 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
>         current->state = TASK_INTERRUPTIBLE;
>  
>         while (1) {
> -               skb = skb_array_consume(&tfile->tx_array);
> +               skb = skb_array_consume_bh(&tfile->tx_array);
>                 if (skb)
>                         break;
>                 if (signal_pending(current)) {
Michael S. Tsirkin Feb. 9, 2017, 6:10 p.m. UTC | #5
On Thu, Feb 09, 2017 at 06:55:41PM +0100, Dmitry Vyukov wrote:
> On Thu, Feb 9, 2017 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Feb 09, 2017 at 11:49:30AM +0100, Dmitry Vyukov wrote:
> >> On Thu, Feb 9, 2017 at 11:02 AM, Jason Wang <jasowang@redhat.com> wrote:
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> Hello,
> >> >>
> >> >> I've got the following report while running syzkaller fuzzer on mmotm
> >> >> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> >> >> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
> >> >>
> >> >
> >> > [...]
> >> >
> >> >>
> >> >> other info that might help us debug this:
> >> >>
> >> >>  Possible interrupt unsafe locking scenario:
> >> >>
> >> >>        CPU0                    CPU1
> >> >>        ----                    ----
> >> >>   lock(&(&r->consumer_lock)->rlock);
> >> >>                                local_irq_disable();
> >> >>                                lock(&(&r->producer_lock)->rlock);
> >> >>                                lock(&(&r->consumer_lock)->rlock);
> >> >>   <Interrupt>
> >> >>     lock(&(&r->producer_lock)->rlock);
> >> >>
> >> >
> >> > Thanks a lot for the testing.
> >> >
> >> > Looks like we could address this by using skb_array_consume_bh() instead.
> >> >
> >> > Could you pls verify if the following patch works?
> >>
> >> No, I can't test it, sorry. This happened once on bots. And bots
> >> currently test only upstream versions.
> >
> > Which trees are tested? Will linux-next help?
> 
> Linus tree, linux-next and mmotm at the moment.

OK that works, I'll add the fix to my tree includes in linux-next.
Jason Wang Feb. 10, 2017, 5:13 a.m. UTC | #6
On 2017年02月09日 18:49, Dmitry Vyukov wrote:
> On Thu, Feb 9, 2017 at 11:02 AM, Jason Wang<jasowang@redhat.com>  wrote:
>> ----- Original Message -----
>>> Hello,
>>>
>>> I've got the following report while running syzkaller fuzzer on mmotm
>>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>>>
>> [...]
>>
>>> other info that might help us debug this:
>>>
>>>   Possible interrupt unsafe locking scenario:
>>>
>>>         CPU0                    CPU1
>>>         ----                    ----
>>>    lock(&(&r->consumer_lock)->rlock);
>>>                                 local_irq_disable();
>>>                                 lock(&(&r->producer_lock)->rlock);
>>>                                 lock(&(&r->consumer_lock)->rlock);
>>>    <Interrupt>
>>>      lock(&(&r->producer_lock)->rlock);
>>>
>> Thanks a lot for the testing.
>>
>> Looks like we could address this by using skb_array_consume_bh() instead.
>>
>> Could you pls verify if the following patch works?
> No, I can't test it, sorry. This happened once on bots. And bots
> currently test only upstream versions.
>
>

No problem, will try to test my self.

Thanks
Jason Wang Feb. 10, 2017, 5:17 a.m. UTC | #7
On 2017年02月10日 02:10, Michael S. Tsirkin wrote:
> On Thu, Feb 09, 2017 at 05:02:31AM -0500, Jason Wang wrote:
>> ----- Original Message -----
>>> Hello,
>>>
>>> I've got the following report while running syzkaller fuzzer on mmotm
>>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>>>
>> [...]
>>
>>> other info that might help us debug this:
>>>
>>>   Possible interrupt unsafe locking scenario:
>>>
>>>         CPU0                    CPU1
>>>         ----                    ----
>>>    lock(&(&r->consumer_lock)->rlock);
>>>                                 local_irq_disable();
>>>                                 lock(&(&r->producer_lock)->rlock);
>>>                                 lock(&(&r->consumer_lock)->rlock);
>>>    <Interrupt>
>>>      lock(&(&r->producer_lock)->rlock);
>>>
>> Thanks a lot for the testing.
>>
>> Looks like we could address this by using skb_array_consume_bh() instead.
>>
>> Could you pls verify if the following patch works?
> I think we should use _bh for the produce call as well,
> since resizing takes the producer lock.

Looks not since irq was disabled during resizing?

Thanks
Dmitry Vyukov Feb. 18, 2017, 5:28 p.m. UTC | #8
On Fri, Feb 10, 2017 at 6:17 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017年02月10日 02:10, Michael S. Tsirkin wrote:
>>
>> On Thu, Feb 09, 2017 at 05:02:31AM -0500, Jason Wang wrote:
>>>
>>> ----- Original Message -----
>>>>
>>>> Hello,
>>>>
>>>> I've got the following report while running syzkaller fuzzer on mmotm
>>>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>>>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>>>>
>>> [...]
>>>
>>>> other info that might help us debug this:
>>>>
>>>>   Possible interrupt unsafe locking scenario:
>>>>
>>>>         CPU0                    CPU1
>>>>         ----                    ----
>>>>    lock(&(&r->consumer_lock)->rlock);
>>>>                                 local_irq_disable();
>>>>                                 lock(&(&r->producer_lock)->rlock);
>>>>                                 lock(&(&r->consumer_lock)->rlock);
>>>>    <Interrupt>
>>>>      lock(&(&r->producer_lock)->rlock);
>>>>
>>> Thanks a lot for the testing.
>>>
>>> Looks like we could address this by using skb_array_consume_bh() instead.
>>>
>>> Could you pls verify if the following patch works?
>>
>> I think we should use _bh for the produce call as well,
>> since resizing takes the producer lock.
>
> Looks not since irq was disabled during resizing?


Hello,

Is there a fix for this that we can pick up?
This killed 10'000 VMs on our testing infra over the last day. Still
happening on linux-next.

Thanks
Dmitry Vyukov Feb. 18, 2017, 5:35 p.m. UTC | #9
On Sat, Feb 18, 2017 at 6:28 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Fri, Feb 10, 2017 at 6:17 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>> On 2017年02月10日 02:10, Michael S. Tsirkin wrote:
>>>
>>> On Thu, Feb 09, 2017 at 05:02:31AM -0500, Jason Wang wrote:
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've got the following report while running syzkaller fuzzer on mmotm
>>>>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
>>>>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
>>>>>
>>>> [...]
>>>>
>>>>> other info that might help us debug this:
>>>>>
>>>>>   Possible interrupt unsafe locking scenario:
>>>>>
>>>>>         CPU0                    CPU1
>>>>>         ----                    ----
>>>>>    lock(&(&r->consumer_lock)->rlock);
>>>>>                                 local_irq_disable();
>>>>>                                 lock(&(&r->producer_lock)->rlock);
>>>>>                                 lock(&(&r->consumer_lock)->rlock);
>>>>>    <Interrupt>
>>>>>      lock(&(&r->producer_lock)->rlock);
>>>>>
>>>> Thanks a lot for the testing.
>>>>
>>>> Looks like we could address this by using skb_array_consume_bh() instead.
>>>>
>>>> Could you pls verify if the following patch works?
>>>
>>> I think we should use _bh for the produce call as well,
>>> since resizing takes the producer lock.
>>
>> Looks not since irq was disabled during resizing?
>
>
> Hello,
>
> Is there a fix for this that we can pick up?
> This killed 10'000 VMs on our testing infra over the last day. Still
> happening on linux-next.


Ah, sorry, I see the patch above with skb_array_consume_bh. It's just
that it's not in linux-next. Will manually apply it now then.
Should we also do something with produce_skb?
Michael S. Tsirkin Feb. 19, 2017, 5:18 a.m. UTC | #10
On Sat, Feb 18, 2017 at 06:28:39PM +0100, Dmitry Vyukov wrote:
> On Fri, Feb 10, 2017 at 6:17 AM, Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2017年02月10日 02:10, Michael S. Tsirkin wrote:
> >>
> >> On Thu, Feb 09, 2017 at 05:02:31AM -0500, Jason Wang wrote:
> >>>
> >>> ----- Original Message -----
> >>>>
> >>>> Hello,
> >>>>
> >>>> I've got the following report while running syzkaller fuzzer on mmotm
> >>>> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> >>>> remotes/mmotm/auto-latest ee4ba7533626ba7bf2f8b992266467ac9fdc045e:
> >>>>
> >>> [...]
> >>>
> >>>> other info that might help us debug this:
> >>>>
> >>>>   Possible interrupt unsafe locking scenario:
> >>>>
> >>>>         CPU0                    CPU1
> >>>>         ----                    ----
> >>>>    lock(&(&r->consumer_lock)->rlock);
> >>>>                                 local_irq_disable();
> >>>>                                 lock(&(&r->producer_lock)->rlock);
> >>>>                                 lock(&(&r->consumer_lock)->rlock);
> >>>>    <Interrupt>
> >>>>      lock(&(&r->producer_lock)->rlock);
> >>>>
> >>> Thanks a lot for the testing.
> >>>
> >>> Looks like we could address this by using skb_array_consume_bh() instead.
> >>>
> >>> Could you pls verify if the following patch works?
> >>
> >> I think we should use _bh for the produce call as well,
> >> since resizing takes the producer lock.
> >
> > Looks not since irq was disabled during resizing?
> 
> 
> Hello,
> 
> Is there a fix for this that we can pick up?
> This killed 10'000 VMs on our testing infra over the last day. Still
> happening on linux-next.
> 
> Thanks

I posted a fix.  ptr_ring: fix race conditions when resizing
Just reposted.  I'll push into linux-next ASAP.
diff mbox

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8a7d6b9..a97c00d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -520,7 +520,7 @@  static void tun_queue_purge(struct tun_file *tfile)
 {
        struct sk_buff *skb;
 
-       while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
+       while ((skb = skb_array_consume_bh(&tfile->tx_array)) != NULL)
                kfree_skb(skb);
 
        skb_queue_purge(&tfile->sk.sk_write_queue);
@@ -1458,7 +1458,7 @@  static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
        struct sk_buff *skb = NULL;
        int error = 0;
 
-       skb = skb_array_consume(&tfile->tx_array);
+       skb = skb_array_consume_bh(&tfile->tx_array);
        if (skb)
                goto out;
        if (noblock) {
@@ -1470,7 +1470,7 @@  static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
        current->state = TASK_INTERRUPTIBLE;
 
        while (1) {
-               skb = skb_array_consume(&tfile->tx_array);
+               skb = skb_array_consume_bh(&tfile->tx_array);
                if (skb)
                        break;
                if (signal_pending(current)) {