
dst: call cond_resched() in dst_gc_task()

Message ID 1265657560.4236.80.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 8, 2010, 7:32 p.m. UTC
On Monday, February 8, 2010 at 15:32 +0100, Eric Dumazet wrote:
> On Monday, February 8, 2010 at 15:16 +0100, Paweł Staszewski wrote:
> 
> > A few days ago, after the route cache info, I also got this:
> 
> > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48
> > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f<c3>  55 48 89 f5 53 48
> > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
> > Call Trace:
> >   <IRQ>   [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
> >   [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
> >   [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
> >   [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
> >   [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
> >   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >   <EOI>   [<ffffffff81004599>] ? do_softirq+0x31/0x63
> >   [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
> >   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >   [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
> >   [<ffffffff8136b08c>] ? schedule+0x82c/0x906
> >   [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
> >   [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
> >   [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
> >   [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
> >   [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
> >   [<ffffffff810479bd>] ? kthread+0x79/0x81
> >   [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
> >   [<ffffffff81047944>] ? kthread+0x0/0x81
> >   [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
> 
> This trace is indeed very interesting, since dst_gc_task() is run from a
> work queue, and there is no scheduling point in it.
> 
> We might need to add a scheduling point in dst_gc_task() in case a huge
> number of entries were flushed.
> 

David, here is the patch I sent to Pawel to solve this problem.
It is probably a stable candidate.

Thanks

[PATCH] dst: call cond_resched() in dst_gc_task()

On some workloads, it is quite possible to get a huge dst list to
process in dst_gc_task(), and trigger soft lockup detection.

Fix is to call cond_resched(), as we run in process context.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---





Comments

David Miller Feb. 8, 2010, 11:01 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Feb 2010 20:32:40 +0100

> [PATCH] dst: call cond_resched() in dst_gc_task()
> 
> On some workloads, it is quite possible to get a huge dst list to
> process in dst_gc_task(), and trigger soft lockup detection.
> 
> Fix is to call cond_resched(), as we run in process context.
> 
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up to -stable.

When fixing bugs with kernel bugzilla entries, please
mention them in the commit message.  I fixed this up for
you but please take care of it next time.

Thanks!
Andrew Morton Feb. 8, 2010, 11:26 p.m. UTC | #2
On Mon, 08 Feb 2010 20:32:40 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [PATCH] dst: call cond_resched() in dst_gc_task()
> 
> On some workloads, it is quite possible to get a huge dst list to
> process in dst_gc_task(), and trigger soft lockup detection.
> 
> Fix is to call cond_resched(), as we run in process context.
> 
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> 
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 57bc4d5..cb1b348 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -17,6 +17,7 @@
>  #include <linux/string.h>
>  #include <linux/types.h>
>  #include <net/net_namespace.h>
> +#include <linux/sched.h>
>  
>  #include <net/dst.h>
>  
> @@ -79,6 +80,7 @@ loop:
>  	while ((dst = next) != NULL) {
>  		next = dst->next;
>  		prefetch(&next->next);
> +		cond_resched();
>  		if (likely(atomic_read(&dst->__refcnt))) {
>  			last->next = dst;
>  			last = dst;

Gad.  Am I understanding this right?  The softlockup threshold is sixty
seconds!

I assume that this function spends most of its time walking over busy
entries?  Is a more powerful data structure needed?

David Miller Feb. 8, 2010, 11:34 p.m. UTC | #3
From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 8 Feb 2010 15:26:06 -0800

> I assume that this function spends most of its time walking over busy
> entries?  Is a more powerful data structure needed?

When you're getting pounded with millions of packets per second,
all mostly to different destinations (and thus resolving to
different routing cache entries), this is what happens.

For a busy router, really, this is normal behavior.

Andrew Morton Feb. 8, 2010, 11:37 p.m. UTC | #4
On Mon, 08 Feb 2010 15:34:06 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Mon, 8 Feb 2010 15:26:06 -0800
> 
> > I assume that this function spends most of its time walking over busy
> > entries?  Is a more powerful data structure needed?
> 
> When you're getting pounded with millions of packets per second,
> all mostly to different destinations (and thus resolving to
> different routing cache entries), this is what happens.
> 
> For a busy router, really, this is normal behavior.

Is the cache a net win in that scenario?
David Miller Feb. 8, 2010, 11:50 p.m. UTC | #5
From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 8 Feb 2010 15:37:44 -0800

> On Mon, 08 Feb 2010 15:34:06 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
>> For a busy router, really, this is normal behavior.
> 
> Is the cache a net win in that scenario?

Absolutely.
stephen hemminger Feb. 8, 2010, 11:50 p.m. UTC | #6
On Mon, 8 Feb 2010 15:37:44 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 08 Feb 2010 15:34:06 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Date: Mon, 8 Feb 2010 15:26:06 -0800
> > 
> > > I assume that this function spends most of its time walking over busy
> > > entries?  Is a more powerful data structure needed?
> > 
> > When you're getting pounded with millions of packets per second,
> > all mostly to different destinations (and thus resolving to
> > different routing cache entries), this is what happens.
> > 
> > For a busy router, really, this is normal behavior.
> 
> Is the cache a net win in that scenario?

No, cache doesn't help.

Robert, who is the expert in this area, runs with FIB TRIE and
no routing cache.
Eric Dumazet Feb. 9, 2010, 6:06 a.m. UTC | #7
On Monday, February 8, 2010 at 15:50 -0800, Stephen Hemminger wrote:

> No, cache doesn't help.
> 
> Robert who is the expert in this area, runs with FIB TRIE and
> no routing cache.

Who knows; it probably depends on many factors. I always run with the
cache enabled, because it saves cycles under moderate load.

FIB_TRIE is unrelated here; if the routing table is very small, it fits
either HASH or TRIE.

Pawel hit the bug with tunables that basically enabled the cache but in
an unhelpful way (filling the list of busy dst entries). User error
combined with a lazy kernel function :)

Please note that the conversion from softirq to workqueue, without a
scheduling point, might well end up using the same CPU for handling
network IRQs and for running dst_gc_task():

On big routers, admins usually use IRQ affinities, so there can be very
little CPU time available to run other tasks on those CPUs.

After this patch, I believe the scheduler is allowed to migrate
dst_gc_task() to an idle CPU.

Another point (for 2.6.34) to address is the dst_gc_mutex that can delay
NETDEV_UNREGISTER/NETDEV_DOWN events for a long period.



Eric Dumazet Feb. 9, 2010, 6:07 a.m. UTC | #8
On Monday, February 8, 2010 at 15:01 -0800, David Miller wrote:

> 
> When fixing bugs with kernel bugzilla entries, please
> mention them in the commit message.  I fixed this up for
> you but please take care of it next time.
> 
> Thanks!

Sorry Dave, I was not aware of the bugzilla entry.


Andrew Morton Feb. 9, 2010, 6:35 a.m. UTC | #9
On Tue, 09 Feb 2010 07:06:38 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> After this patch, I believe the scheduler is allowed to migrate
> dst_gc_task() to an idle CPU.

No, keventd threads are each pinned to a single CPU (kthread_bind() in
start_workqueue_thread()), so dst_gc_task() gets run on the CPU which
ran schedule_delayed_work() and no other.

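(For illustration: the pinning mechanism Andrew refers to boils down to
kthread_bind(). The sketch below is hypothetical code in the spirit of
start_workqueue_thread(), not the actual workqueue implementation;
my_worker_fn and start_pinned_worker are made-up names.)

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* Hypothetical worker body: a real keventd thread runs queued work items. */
static int my_worker_fn(void *data)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}
	return 0;
}

/* Create a worker and pin it to one CPU, as start_workqueue_thread()
 * does for keventd via kthread_bind(). */
static struct task_struct *start_pinned_worker(int cpu)
{
	struct task_struct *p;

	p = kthread_create(my_worker_fn, NULL, "my_worker/%d", cpu);
	if (!IS_ERR(p)) {
		kthread_bind(p, cpu);	/* bind before the first wakeup */
		wake_up_process(p);
	}
	return p;
}

kthread_bind() expects a freshly created, not yet running thread, which is
why the bind happens before wake_up_process().
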
Eric Dumazet Feb. 9, 2010, 7:20 a.m. UTC | #10
On Monday, February 8, 2010 at 22:35 -0800, Andrew Morton wrote:
> On Tue, 09 Feb 2010 07:06:38 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > After this patch, I believe the scheduler is allowed to migrate
> > dst_gc_task() to an idle CPU.
> 
> No, keventd threads are each pinned to a single CPU (kthread_bind() in
> start_workqueue_thread()), so dst_gc_task() gets run on the CPU which
> ran schedule_delayed_work() and no other.
> 

Ah OK, thanks Andrew for this clarification.

I suppose offlining a CPU migrates its work items to another (online) CPU?



Andrew Morton Feb. 9, 2010, 7:31 a.m. UTC | #11
On Tue, 09 Feb 2010 08:20:36 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I suppose offlining a CPU migrates its work items to another (online) CPU?

Sort of, effectively.  The workqueue code runs all the pending work
items on the to-be-offlined CPU and then it's done.

schedule_delayed_work() starts out with a timer, and the timer code
_does_ perform migration off the going-away CPU.

Patch

diff --git a/net/core/dst.c b/net/core/dst.c
index 57bc4d5..cb1b348 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -17,6 +17,7 @@ 
 #include <linux/string.h>
 #include <linux/types.h>
 #include <net/net_namespace.h>
+#include <linux/sched.h>
 
 #include <net/dst.h>
 
@@ -79,6 +80,7 @@  loop:
 	while ((dst = next) != NULL) {
 		next = dst->next;
 		prefetch(&next->next);
+		cond_resched();
 		if (likely(atomic_read(&dst->__refcnt))) {
 			last->next = dst;
 			last = dst;
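
For readers skimming the hunks above: a simplified sketch of the list walk
they modify, with the new scheduling point in place. This is not the
upstream function verbatim; the real dst_gc_task() also takes
dst_gc_mutex, handles the entry returned by dst_destroy() and reschedules
the work, all of which is elided here, and dst_gc_walk() is a made-up name.

#include <linux/prefetch.h>
#include <linux/sched.h>	/* cond_resched() */
#include <net/dst.h>		/* struct dst_entry, dst_destroy() */

/* Simplified sketch of the garbage-collection walk, not net/core/dst.c as-is. */
static void dst_gc_walk(struct dst_entry *first)
{
	struct dst_entry *dst, *next = first;
	struct dst_entry head, *last = &head;

	while ((dst = next) != NULL) {
		next = dst->next;
		prefetch(&next->next);
		cond_resched();		/* the fix: yield if needed, we run from a workqueue */
		if (likely(atomic_read(&dst->__refcnt))) {
			/* still referenced: keep it on the busy list for a later pass */
			last->next = dst;
			last = dst;
			continue;
		}
		/* no longer referenced: release it (child handling elided) */
		dst_destroy(dst);
	}
	last->next = NULL;
}

Calling cond_resched() once per entry is cheap when no reschedule is
pending, so it does not noticeably slow the walk while bounding how long
the worker runs without yielding.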