diff mbox

NULL deref in bnx2 / crashes ? ( was: netconsole leads to stalled CPU task )

Message ID 1345640757.5158.1321.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Aug. 22, 2012, 1:05 p.m. UTC
On Wed, 2012-08-22 at 14:17 +0200, Sylvain Munaut wrote:
> Hi,
> 
> > Could be the infamous slave_dev_queue_mapping striking again.
> >
> > Could you please try :
> >
> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > index 346b1eb..df731a0 100644
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -335,8 +335,11 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
> >         /* don't get messages out of order, and no recursion */
> >         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
> >                 struct netdev_queue *txq;
> > +               int queue_index = skb_get_queue_mapping(skb);
> >
> > -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
> > +               if (queue_index >= dev->real_num_tx_queues)
> > +                       queue_index = 0;
> > +               txq = netdev_get_tx_queue(dev, queue_index);
> >
> >                 /* try until next clock tick */
> >                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
> 
> 
> Well, it doesn't solve the problem :(
> 
> It does have an effect though. Now even on the machine with the
> broadcom card, it just freeze the machine ...
> On the machine with intel card, it actually does get a couple of
> netconsole packet out and then freeze as well.
> 

my patch was incomplete, sorry :



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sylvain Munaut Aug. 22, 2012, 2:29 p.m. UTC | #1
Hi,

> my patch was incomplete, sorry :
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 346b1eb..ddc453b 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -335,8 +335,13 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
>         /* don't get messages out of order, and no recursion */
>         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
>                 struct netdev_queue *txq;
> +               int queue_index = skb_get_queue_mapping(skb);
>
> -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
> +               if (queue_index >= dev->real_num_tx_queues) {
> +                       queue_index = 0;
> +                       skb_set_queue_mapping(skb, 0);
> +               }
> +               txq = netdev_get_tx_queue(dev, queue_index);
>
>                 /* try until next clock tick */
>                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;

Ok, I tried this.

The machine with the intel card still hard freeze (no output / no nothing ...)
The machine with the bnx2 don't crash anymore and no NULL deref, but
the modprobe still hangs and I get this every 180 sec or so :

[  775.956926] INFO: rcu_preempt self-detected stall on CPU { 14}
(t=600009 jiffies)
[  775.956927] Pid: 3154, comm: modprobe Not tainted
3.6.0-rc2-00092-g9040592-dirty #8
[  775.956928] Call Trace:
[  775.956931]  <IRQ>  [<ffffffff81094a58>] ? rcu_pending+0xee/0x490
[  775.956933]  [<ffffffff810571da>] ?
irqtime_account_process_tick.isra.73+0x134/0x23a
[  775.956934]  [<ffffffff81095406>] ? rcu_check_callbacks+0x79/0x85
[  775.956936]  [<ffffffff81041609>] ? update_process_times+0x30/0x62
[  775.956938]  [<ffffffff8106f204>] ? tick_sched_timer+0x75/0x9e
[  775.956939]  [<ffffffff8106f18f>] ? tick_nohz_handler+0xd0/0xd0
[  775.956941]  [<ffffffff810501cf>] ? __run_hrtimer.isra.26+0x75/0xcd
[  775.956943]  [<ffffffff81050873>] ? hrtimer_interrupt+0xe2/0x1f0
[  775.956948]  [<ffffffffa000941d>] ? bnx2_poll_work+0x2d3/0xa52 [bnx2]
[  775.956950]  [<ffffffff8100fa5c>] ? sched_clock+0x5/0x8
[  775.956952]  [<ffffffff81023d93>] ? smp_apic_timer_interrupt+0x6d/0x7f
[  775.956954]  [<ffffffff8133244a>] ? apic_timer_interrupt+0x6a/0x70
[  775.956956]  [<ffffffff8126d407>] ? __napi_complete+0x1c/0x23
[  775.956958]  [<ffffffff81074249>] ? do_raw_spin_lock+0x18/0x1b
[  775.956960]  [<ffffffff8126e596>] ? net_rx_action+0x7f/0x185
[  775.956962]  [<ffffffff8100f80e>] ? __cycles_2_ns+0x9/0x45
[  775.956964]  [<ffffffff8103caae>] ? __do_softirq+0x9c/0x14b
[  775.956966]  [<ffffffff81332b3c>] ? call_softirq+0x1c/0x30
[  775.956968]  <EOI>  [<ffffffff8100aea6>] ? do_softirq+0x3c/0x7a
[  775.956970]  [<ffffffff8103c9db>] ? _local_bh_enable_ip.isra.7+0x76/0xa3
[  775.956972]  [<ffffffff812804b7>] ? netpoll_poll_dev+0xfe/0x4bc
[  775.956974]  [<ffffffff81280b02>] ? netpoll_send_skb_on_dev+0x28d/0x33b
[  775.956978]  [<ffffffffa0ff2c4c>] ? bond_dev_queue_xmit+0x62/0x7f [bonding]
[  775.956982]  [<ffffffffa0ff7588>] ? bond_3ad_xmit_xor+0xe7/0x10c [bonding]
[  775.956984]  [<ffffffffa0ff2ffd>] ? bond_start_xmit+0x394/0x3ff [bonding]
[  775.956987]  [<ffffffff81280ac1>] ? netpoll_send_skb_on_dev+0x24c/0x33b
[  775.956990]  [<ffffffffa0079fd5>] ?
vlan_dev_hard_start_xmit+0xab/0xf6 [8021q]
[  775.956992]  [<ffffffff81280ac1>] ? netpoll_send_skb_on_dev+0x24c/0x33b
[  775.956996]  [<ffffffffa01998e8>] ? __br_deliver+0x93/0xbe [bridge]
[  775.956998]  [<ffffffffa019837d>] ? br_dev_xmit+0x14a/0x16b [bridge]
[  775.957001]  [<ffffffff81280ac1>] ? netpoll_send_skb_on_dev+0x24c/0x33b
[  775.957003]  [<ffffffff81280372>] ? find_skb.isra.24+0x31/0x78
[  775.957005]  [<ffffffff81280bdc>] ? netpoll_send_skb+0x2c/0x39
[  775.957007]  [<ffffffffa00c422a>] ? write_msg+0x98/0xf3 [netconsole]
[  775.957009]  [<ffffffff81037db2>] ?
call_console_drivers.constprop.17+0x6e/0x7d
[  775.957011]  [<ffffffff81038248>] ? console_unlock+0x2ab/0x351
[  775.957012]  [<ffffffff81039112>] ? register_console+0x273/0x303
[  775.957014]  [<ffffffffa0103182>] ? init_netconsole+0x182/0x210 [netconsole]
[  775.957016]  [<ffffffffa0103000>] ? 0xffffffffa0102fff
[  775.957018]  [<ffffffff81002085>] ? do_one_initcall+0x75/0x12c
[  775.957019]  [<ffffffff81077b35>] ? sys_init_module+0x80/0x1c5
[  775.957020]  [<ffffffff813319b9>] ? system_call_fastpath+0x16/0x1b


Cheers,

    Sylvain
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cong Wang Aug. 22, 2012, 3:40 p.m. UTC | #2
On Wed, 22 Aug 2012 at 14:29 GMT, Sylvain Munaut <s.munaut@whatever-company.com> wrote:
> Hi,
>
>> my patch was incomplete, sorry :
>>
>> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
>> index 346b1eb..ddc453b 100644
>> --- a/net/core/netpoll.c
>> +++ b/net/core/netpoll.c
>> @@ -335,8 +335,13 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
>>         /* don't get messages out of order, and no recursion */
>>         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
>>                 struct netdev_queue *txq;
>> +               int queue_index = skb_get_queue_mapping(skb);
>>
>> -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
>> +               if (queue_index >= dev->real_num_tx_queues) {
>> +                       queue_index = 0;
>> +                       skb_set_queue_mapping(skb, 0);
>> +               }
>> +               txq = netdev_get_tx_queue(dev, queue_index);
>>
>>                 /* try until next clock tick */
>>                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
>
> Ok, I tried this.
>
> The machine with the intel card still hard freeze (no output / no nothing ...)
> The machine with the bnx2 don't crash anymore and no NULL deref, but
> the modprobe still hangs and I get this every 180 sec or so :
>


Thanks for reporting this!

I will try to see if I can reproduce it on KVM guest tomorrow.
Need to go to sleep now.

To be honest, I never try to setup netconsole on such a complex NIC,
bridge on vlan tagged bonding.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cong Wang Aug. 23, 2012, 7:57 a.m. UTC | #3
On Wed, 22 Aug 2012 at 14:29 GMT, Sylvain Munaut <s.munaut@whatever-company.com> wrote:
> Hi,
>
>
> The machine with the intel card still hard freeze (no output / no nothing ...)
> The machine with the bnx2 don't crash anymore and no NULL deref, but
> the modprobe still hangs and I get this every 180 sec or so :

The NULL-deref can be reproduced easily, and Eric's patch could fix it.
So, Eric, can you resend your patch with your SOB?

I can't reproduce the hang as it is net driver specific, it is
probably related with my patch:

commit 6bdb7fe31046ac50b47e83c35cd6c6b6160a475d
Author: Amerigo Wang <amwang@redhat.com>
Date:   Fri Aug 10 01:24:50 2012 +0000

    netpoll: re-enable irq in poll_napi()


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lin Ming Aug. 25, 2012, 2:20 a.m. UTC | #4
On Wed, Aug 22, 2012 at 10:29 PM, Sylvain Munaut
<s.munaut@whatever-company.com> wrote:
> Hi,
>
>> my patch was incomplete, sorry :
>>
>> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
>> index 346b1eb..ddc453b 100644
>> --- a/net/core/netpoll.c
>> +++ b/net/core/netpoll.c
>> @@ -335,8 +335,13 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
>>         /* don't get messages out of order, and no recursion */
>>         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
>>                 struct netdev_queue *txq;
>> +               int queue_index = skb_get_queue_mapping(skb);
>>
>> -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
>> +               if (queue_index >= dev->real_num_tx_queues) {
>> +                       queue_index = 0;
>> +                       skb_set_queue_mapping(skb, 0);
>> +               }
>> +               txq = netdev_get_tx_queue(dev, queue_index);
>>
>>                 /* try until next clock tick */
>>                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
>
> Ok, I tried this.
>
> The machine with the intel card still hard freeze (no output / no nothing ...)

Did you enable hard lockup detector?

CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y

> The machine with the bnx2 don't crash anymore and no NULL deref, but
> the modprobe still hangs and I get this every 180 sec or so :
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sylvain Munaut Sept. 12, 2012, 11:53 a.m. UTC | #5
Hi Eric,

AFAICT, the following patch was never merged but I just confirmed that
I do need it over 3.6-rc5 for things to work properly.

>> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
>> > index 346b1eb..df731a0 100644
>> > --- a/net/core/netpoll.c
>> > +++ b/net/core/netpoll.c
>> > @@ -335,8 +335,11 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
>> >         /* don't get messages out of order, and no recursion */
>> >         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
>> >                 struct netdev_queue *txq;
>> > +               int queue_index = skb_get_queue_mapping(skb);
>> >
>> > -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
>> > +               if (queue_index >= dev->real_num_tx_queues)
>> > +                       queue_index = 0;
>> > +               txq = netdev_get_tx_queue(dev, queue_index);
>> >
>> >                 /* try until next clock tick */
>> >                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
>>
>>
>> Well, it doesn't solve the problem :(
>>
>> It does have an effect though. Now even on the machine with the
>> broadcom card, it just freeze the machine ...
>> On the machine with intel card, it actually does get a couple of
>> netconsole packet out and then freeze as well.
>>
>
> my patch was incomplete, sorry :
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 346b1eb..ddc453b 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -335,8 +335,13 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
>         /* don't get messages out of order, and no recursion */
>         if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
>                 struct netdev_queue *txq;
> +               int queue_index = skb_get_queue_mapping(skb);
>
> -               txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
> +               if (queue_index >= dev->real_num_tx_queues) {
> +                       queue_index = 0;
> +                       skb_set_queue_mapping(skb, 0);
> +               }
> +               txq = netdev_get_tx_queue(dev, queue_index);
>
>                 /* try until next clock tick */
>                 for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;

Cheers,

    Sylvain
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cong Wang Sept. 12, 2012, 12:49 p.m. UTC | #6
On Wed, 12 Sep 2012 at 11:53 GMT, Sylvain Munaut <s.munaut@whatever-company.com> wrote:
> Hi Eric,
>
> AFAICT, the following patch was never merged but I just confirmed that
> I do need it over 3.6-rc5 for things to work properly.
>

Yes, indeed.

Eric, please resend your patch?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 12, 2012, 1:05 p.m. UTC | #7
On Wed, 2012-09-12 at 12:49 +0000, Cong Wang wrote:
> On Wed, 12 Sep 2012 at 11:53 GMT, Sylvain Munaut <s.munaut@whatever-company.com> wrote:
> > Hi Eric,
> >
> > AFAICT, the following patch was never merged but I just confirmed that
> > I do need it over 3.6-rc5 for things to work properly.
> >
> 
> Yes, indeed.
> 
> Eric, please resend your patch?
> 

Yes, but I have some worries of why it is needed.

Isnt it covering a bug elsewhere ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sylvain Munaut Sept. 13, 2012, 5:35 p.m. UTC | #8
Hi,

> Yes, but I have some worries of why it is needed.
>
> Isnt it covering a bug elsewhere ?

That may very well be.

Of the few test servers I have running the same kernel, I just found
the one with netconsole active to be "stuck".

Not frozen, but all user process are hanged up and it's spitting
message about processes and CPU being "stuck". The trace is different
in each case depending on what the process was actually doing at the
time it got stuck.

No message sent to the netconsole with the root cause and nothing was
written in the logs ...

Cheers,

    Sylvain
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 346b1eb..ddc453b 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -335,8 +335,13 @@  void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 	/* don't get messages out of order, and no recursion */
 	if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
 		struct netdev_queue *txq;
+		int queue_index = skb_get_queue_mapping(skb);
 
-		txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
+		if (queue_index >= dev->real_num_tx_queues) {
+			queue_index = 0;
+			skb_set_queue_mapping(skb, 0);
+		}
+		txq = netdev_get_tx_queue(dev, queue_index);
 
 		/* try until next clock tick */
 		for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;