Message ID | 1451921108.8255.74.camel@edumazet-glaptop2.roam.corp.google.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Eric Dumazet <eric.dumazet@gmail.com> writes: > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: [...] >> I believe the crash occurred between these two actions. I just saw >> that there are some interesting events in the log prior to the crash: >> kernel: Bluetooth: Unable to push skb to HCI core(-6) >> kernel: (stc): proto stack 4's ->recv failed >> kernel: (stc): remove_channel_from_table: id 3 >> kernel: (stc): remove_channel_from_table: id 2 >> kernel: (stc): remove_channel_from_table: id 4 >> kernel: (stc): all chnl_ids unregistered >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close >> >> The first print is from btwilink.c. However, I can't see the >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using >> 6LoWPAN or anything similar). >> >> Thanks, Jacob > > Definitely these details are useful ;) > > Could you try : > > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c > index 6e3af8b42cdd..0c99a74fb895 100644 > --- a/drivers/misc/ti-st/st_core.c > +++ b/drivers/misc/ti-st/st_core.c > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) > skb_queue_purge(&st_gdata->txq); > skb_queue_purge(&st_gdata->tx_waitq); > kfree_skb(st_gdata->rx_skb); > + st_gdata->rx_skb = NULL; > kfree_skb(st_gdata->tx_skb); > + st_gdata->tx_skb = NULL; > /* TTY ldisc cleanup */ > err = tty_unregister_ldisc(N_TI_WL); > if (err) Hmm ... the code continues with err = tty_unregister_ldisc(N_TI_WL); if (err) pr_err("unable to un-register ldisc %ld", err); /* free the global data pointer */ kfree(st_gdata); So who would ever see that the rx_skb and tx_skb pointers were cleared prior to freeing the data structure containing them? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2016-01-04 at 16:14 +0000, Rainer Weikusat wrote: > Eric Dumazet <eric.dumazet@gmail.com> writes: > > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: > > [...] > > >> I believe the crash occurred between these two actions. I just saw > >> that there are some interesting events in the log prior to the crash: > >> kernel: Bluetooth: Unable to push skb to HCI core(-6) > >> kernel: (stc): proto stack 4's ->recv failed > >> kernel: (stc): remove_channel_from_table: id 3 > >> kernel: (stc): remove_channel_from_table: id 2 > >> kernel: (stc): remove_channel_from_table: id 4 > >> kernel: (stc): all chnl_ids unregistered > >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close > >> > >> The first print is from btwilink.c. However, I can't see the > >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using > >> 6LoWPAN or anything similar). > >> > >> Thanks, Jacob > > > > Definitely these details are useful ;) > > > > Could you try : > > > > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c > > index 6e3af8b42cdd..0c99a74fb895 100644 > > --- a/drivers/misc/ti-st/st_core.c > > +++ b/drivers/misc/ti-st/st_core.c > > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) > > skb_queue_purge(&st_gdata->txq); > > skb_queue_purge(&st_gdata->tx_waitq); > > kfree_skb(st_gdata->rx_skb); > > + st_gdata->rx_skb = NULL; > > kfree_skb(st_gdata->tx_skb); > > + st_gdata->tx_skb = NULL; > > /* TTY ldisc cleanup */ > > err = tty_unregister_ldisc(N_TI_WL); > > if (err) > > Hmm ... the code continues with > > err = tty_unregister_ldisc(N_TI_WL); > if (err) > pr_err("unable to un-register ldisc %ld", err); > /* free the global data pointer */ > kfree(st_gdata); > > So who would ever see that the rx_skb and tx_skb pointers were cleared > prior to freeing the data structure containing them? This is the theory, but I suspect a use after free. kfree(st_gdata) does not clear all content with 0, unless you use special SLUB/SLAB debugging features. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog >> > <jacob@teenage.engineering> wrote: >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <edumazet@google.com> wrote: >> >>> How often can you trigger this bug ? >> >> >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a >> >> few times when bringing up/down network interfaces. Does the trace >> >> give any clue? >> >> >> > >> > A little bit. You need to help people to narrow down the problem >> > because there are too many places using skb->next and skb->prev. >> > >> > Since you mentioned it seems related to network interface flip, >> > what network interfaces are you using? What's is your TC setup? >> > >> > Thanks. >> >> The system contains only one physical network interface (TI WL1837, >> wl18xx module). >> The state prior to the crash was as follows: >> - One virtual network interface active (as STA, associated with access point) >> - Bluetooth (BLE only) active (same physical chip, co-existence, >> btwilink/st_drv modules) >> >> Actions made around the time of the crash: >> - Bluetooth disabled >> - One additional virtual network interface brought up (also as STA) >> >> I believe the crash occurred between these two actions. I just saw >> that there are some interesting events in the log prior to the crash: >> kernel: Bluetooth: Unable to push skb to HCI core(-6) >> kernel: (stc): proto stack 4's ->recv failed >> kernel: (stc): remove_channel_from_table: id 3 >> kernel: (stc): remove_channel_from_table: id 2 >> kernel: (stc): remove_channel_from_table: id 4 >> kernel: (stc): all chnl_ids unregistered >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close >> >> The first print is from btwilink.c. However, I can't see the >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using >> 6LoWPAN or anything similar). >> >> Thanks, Jacob > > Definitely these details are useful ;) > > Could you try : > > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c > index 6e3af8b42cdd..0c99a74fb895 100644 > --- a/drivers/misc/ti-st/st_core.c > +++ b/drivers/misc/ti-st/st_core.c > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) > skb_queue_purge(&st_gdata->txq); > skb_queue_purge(&st_gdata->tx_waitq); > kfree_skb(st_gdata->rx_skb); > + st_gdata->rx_skb = NULL; > kfree_skb(st_gdata->tx_skb); > + st_gdata->tx_skb = NULL; > /* TTY ldisc cleanup */ > err = tty_unregister_ldisc(N_TI_WL); > if (err) > > Sure. Since I don't have a good way to trigger the initial issue, I can't really know if there is a difference with your patch. However, normal usage seems to work as expected with your patch. I've tried to reproduce the initial issue with and without your patch repeatedly for hours and have not seen any crash in any of the runs so far. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote: > On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: > >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote: > >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog > >> > <jacob@teenage.engineering> wrote: > >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <edumazet@google.com> wrote: > >> >>> How often can you trigger this bug ? > >> >> > >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a > >> >> few times when bringing up/down network interfaces. Does the trace > >> >> give any clue? > >> >> > >> > > >> > A little bit. You need to help people to narrow down the problem > >> > because there are too many places using skb->next and skb->prev. > >> > > >> > Since you mentioned it seems related to network interface flip, > >> > what network interfaces are you using? What's is your TC setup? > >> > > >> > Thanks. > >> > >> The system contains only one physical network interface (TI WL1837, > >> wl18xx module). > >> The state prior to the crash was as follows: > >> - One virtual network interface active (as STA, associated with access point) > >> - Bluetooth (BLE only) active (same physical chip, co-existence, > >> btwilink/st_drv modules) > >> > >> Actions made around the time of the crash: > >> - Bluetooth disabled > >> - One additional virtual network interface brought up (also as STA) > >> > >> I believe the crash occurred between these two actions. I just saw > >> that there are some interesting events in the log prior to the crash: > >> kernel: Bluetooth: Unable to push skb to HCI core(-6) > >> kernel: (stc): proto stack 4's ->recv failed > >> kernel: (stc): remove_channel_from_table: id 3 > >> kernel: (stc): remove_channel_from_table: id 2 > >> kernel: (stc): remove_channel_from_table: id 4 > >> kernel: (stc): all chnl_ids unregistered > >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close > >> > >> The first print is from btwilink.c. However, I can't see the > >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using > >> 6LoWPAN or anything similar). > >> > >> Thanks, Jacob > > > > Definitely these details are useful ;) > > > > Could you try : > > > > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c > > index 6e3af8b42cdd..0c99a74fb895 100644 > > --- a/drivers/misc/ti-st/st_core.c > > +++ b/drivers/misc/ti-st/st_core.c > > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) > > skb_queue_purge(&st_gdata->txq); > > skb_queue_purge(&st_gdata->tx_waitq); > > kfree_skb(st_gdata->rx_skb); > > + st_gdata->rx_skb = NULL; > > kfree_skb(st_gdata->tx_skb); > > + st_gdata->tx_skb = NULL; > > /* TTY ldisc cleanup */ > > err = tty_unregister_ldisc(N_TI_WL); > > if (err) > > > > > > Sure. Since I don't have a good way to trigger the initial issue, I > can't really know if there is a difference with your patch. However, > normal usage seems to work as expected with your patch. I've tried to > reproduce the initial issue with and without your patch repeatedly for > hours and have not seen any crash in any of the runs so far. > -- You might build a kernel with KASAN support to get maybe more chances to trigger the bug. ( https://www.kernel.org/doc/Documentation/kasan.txt ) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote: >> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: >> >> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote: >> >> > On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog >> >> > <jacob@teenage.engineering> wrote: >> >> >> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <edumazet@google.com> wrote: >> >> >>> How often can you trigger this bug ? >> >> >> >> >> >> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a >> >> >> few times when bringing up/down network interfaces. Does the trace >> >> >> give any clue? >> >> >> >> >> > >> >> > A little bit. You need to help people to narrow down the problem >> >> > because there are too many places using skb->next and skb->prev. >> >> > >> >> > Since you mentioned it seems related to network interface flip, >> >> > what network interfaces are you using? What's is your TC setup? >> >> > >> >> > Thanks. >> >> >> >> The system contains only one physical network interface (TI WL1837, >> >> wl18xx module). >> >> The state prior to the crash was as follows: >> >> - One virtual network interface active (as STA, associated with access point) >> >> - Bluetooth (BLE only) active (same physical chip, co-existence, >> >> btwilink/st_drv modules) >> >> >> >> Actions made around the time of the crash: >> >> - Bluetooth disabled >> >> - One additional virtual network interface brought up (also as STA) >> >> >> >> I believe the crash occurred between these two actions. I just saw >> >> that there are some interesting events in the log prior to the crash: >> >> kernel: Bluetooth: Unable to push skb to HCI core(-6) >> >> kernel: (stc): proto stack 4's ->recv failed >> >> kernel: (stc): remove_channel_from_table: id 3 >> >> kernel: (stc): remove_channel_from_table: id 2 >> >> kernel: (stc): remove_channel_from_table: id 4 >> >> kernel: (stc): all chnl_ids unregistered >> >> kernel: (stk) :ldisc_install = 0(stc): st_tty_close >> >> >> >> The first print is from btwilink.c. However, I can't see the >> >> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using >> >> 6LoWPAN or anything similar). >> >> >> >> Thanks, Jacob >> > >> > Definitely these details are useful ;) >> > >> > Could you try : >> > >> > diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c >> > index 6e3af8b42cdd..0c99a74fb895 100644 >> > --- a/drivers/misc/ti-st/st_core.c >> > +++ b/drivers/misc/ti-st/st_core.c >> > @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) >> > skb_queue_purge(&st_gdata->txq); >> > skb_queue_purge(&st_gdata->tx_waitq); >> > kfree_skb(st_gdata->rx_skb); >> > + st_gdata->rx_skb = NULL; >> > kfree_skb(st_gdata->tx_skb); >> > + st_gdata->tx_skb = NULL; >> > /* TTY ldisc cleanup */ >> > err = tty_unregister_ldisc(N_TI_WL); >> > if (err) >> > >> > >> >> Sure. Since I don't have a good way to trigger the initial issue, I >> can't really know if there is a difference with your patch. However, >> normal usage seems to work as expected with your patch. I've tried to >> reproduce the initial issue with and without your patch repeatedly for >> hours and have not seen any crash in any of the runs so far. >> -- > > You might build a kernel with KASAN support to get maybe more chances to > trigger the bug. > > ( https://www.kernel.org/doc/Documentation/kasan.txt ) > Ah. Doesn't seem to be supported on arm(32) unfortunately. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote: > On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > You might build a kernel with KASAN support to get maybe more chances to > > trigger the bug. > > > > ( https://www.kernel.org/doc/Documentation/kasan.txt ) > > > > Ah. Doesn't seem to be supported on arm(32) unfortunately. Then you could at least use standard debugging features : CONFIG_SLAB=y CONFIG_SLABINFO=y CONFIG_DEBUG_SLAB=y CONFIG_DEBUG_SLAB_LEAK=y (Or equivalent SLUB options) and CONFIG_DEBUG_PAGEALLOC=y (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 5, 2016 at 3:39 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-01-05 at 15:34 +0100, Jacob Siverskog wrote: >> On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> > >> > You might build a kernel with KASAN support to get maybe more chances to >> > trigger the bug. >> > >> > ( https://www.kernel.org/doc/Documentation/kasan.txt ) >> > >> >> Ah. Doesn't seem to be supported on arm(32) unfortunately. > > Then you could at least use standard debugging features : > > CONFIG_SLAB=y > CONFIG_SLABINFO=y > CONFIG_DEBUG_SLAB=y > CONFIG_DEBUG_SLAB_LEAK=y > > (Or equivalent SLUB options) > > and > > CONFIG_DEBUG_PAGEALLOC=y > > (If arm(32) has CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y) I tried with those enabled and while toggling power on the Bluetooth interface I usually get this after a few iterations: kernel: Bluetooth: Unable to push skb to HCI core(-6) kernel: (stc): proto stack 4's ->recv failed kernel: Slab corruption (Not tainted): skbuff_head_cache start=c08a8a00, len=176 kernel: 0a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5 kkkkkkkkkkkkjkk. kernel: Prev obj: start=c08a8940, len=176 kernel: 000: 00 00 00 00 00 00 00 00 31 73 52 00 43 17 2b 14 ........1sR.C.+. kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00 ................ kernel: Next obj: start=c08a8ac0, len=176 kernel: 000: 00 00 00 00 00 00 00 00 01 42 f6 50 36 17 2b 14 .........B.P6.+. kernel: 010: 00 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00 ................ The "Unable to push skb" and "recv failed" lines always appear before the corruption. Unfortunately, the corruptions occur also with your patch.
Hi Jacob, On 01/05/2016 06:34 AM, Jacob Siverskog wrote: > On Tue, Jan 5, 2016 at 3:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On Tue, 2016-01-05 at 12:07 +0100, Jacob Siverskog wrote: >>> On Mon, Jan 4, 2016 at 4:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>>> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote: >>>>> On Wed, Dec 30, 2015 at 11:30 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote: >>>>>> On Wed, Dec 30, 2015 at 6:30 AM, Jacob Siverskog >>>>>> <jacob@teenage.engineering> wrote: >>>>>>> On Wed, Dec 30, 2015 at 2:26 PM, Eric Dumazet <edumazet@google.com> wrote: >>>>>>>> How often can you trigger this bug ? >>>>>>> >>>>>>> Ok. I don't have a good repro to trigger it unfortunately, I've seen it just a >>>>>>> few times when bringing up/down network interfaces. Does the trace >>>>>>> give any clue? >>>>>>> >>>>>> >>>>>> A little bit. You need to help people to narrow down the problem >>>>>> because there are too many places using skb->next and skb->prev. >>>>>> >>>>>> Since you mentioned it seems related to network interface flip, >>>>>> what network interfaces are you using? What's is your TC setup? >>>>>> >>>>>> Thanks. >>>>> >>>>> The system contains only one physical network interface (TI WL1837, >>>>> wl18xx module). >>>>> The state prior to the crash was as follows: >>>>> - One virtual network interface active (as STA, associated with access point) >>>>> - Bluetooth (BLE only) active (same physical chip, co-existence, >>>>> btwilink/st_drv modules) >>>>> >>>>> Actions made around the time of the crash: >>>>> - Bluetooth disabled >>>>> - One additional virtual network interface brought up (also as STA) >>>>> >>>>> I believe the crash occurred between these two actions. I just saw >>>>> that there are some interesting events in the log prior to the crash: >>>>> kernel: Bluetooth: Unable to push skb to HCI core(-6) >>>>> kernel: (stc): proto stack 4's ->recv failed >>>>> kernel: (stc): remove_channel_from_table: id 3 >>>>> kernel: (stc): remove_channel_from_table: id 2 >>>>> kernel: (stc): remove_channel_from_table: id 4 >>>>> kernel: (stc): all chnl_ids unregistered >>>>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close >>>>> >>>>> The first print is from btwilink.c. However, I can't see the >>>>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using >>>>> 6LoWPAN or anything similar). >>>>> >>>>> Thanks, Jacob >>>> >>>> Definitely these details are useful ;) >>>> >>>> Could you try : >>>> >>>> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c >>>> index 6e3af8b42cdd..0c99a74fb895 100644 >>>> --- a/drivers/misc/ti-st/st_core.c >>>> +++ b/drivers/misc/ti-st/st_core.c >>>> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) >>>> skb_queue_purge(&st_gdata->txq); >>>> skb_queue_purge(&st_gdata->tx_waitq); >>>> kfree_skb(st_gdata->rx_skb); >>>> + st_gdata->rx_skb = NULL; >>>> kfree_skb(st_gdata->tx_skb); >>>> + st_gdata->tx_skb = NULL; >>>> /* TTY ldisc cleanup */ >>>> err = tty_unregister_ldisc(N_TI_WL); >>>> if (err) FWIW, You don't need that ti-st junk to get the WL1837 working; the WL1837 only has BT channels. Unfortunately, that's really all I can say about it; sorry. Regards, Peter Hurley
diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c index 6e3af8b42cdd..0c99a74fb895 100644 --- a/drivers/misc/ti-st/st_core.c +++ b/drivers/misc/ti-st/st_core.c @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata) skb_queue_purge(&st_gdata->txq); skb_queue_purge(&st_gdata->tx_waitq); kfree_skb(st_gdata->rx_skb); + st_gdata->rx_skb = NULL; kfree_skb(st_gdata->tx_skb); + st_gdata->tx_skb = NULL; /* TTY ldisc cleanup */ err = tty_unregister_ldisc(N_TI_WL); if (err)