diff mbox

net: allow netdev_wait_allrefs() to run faster

Message ID 4ADF2B57.4030708@gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 21, 2009, 3:40 p.m. UTC
Octavian Purdila a écrit :
> On Sunday 18 October 2009 21:21:44 you wrote:
>>> The msleep(250) should be tuned first. Then if this is really necessary
>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>> more. 
>>> Just try msleep(1 or 2), it should work quite well.
>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>  system can handle that.
>>
> 
> I would also like to see this patch in, we are running into scalability issues 
> with creating/deleting lots of interfaces as well.

Ben patch only address interface deletion, and one part of the problem,
maybe the more visible one for the current kernel.

Adding lots of interfaces only needs several threads to run concurently.

Before applying/examining his patch I suggest identifying all dev_put() spots than
can be deleted and replaced by something more scalable. I began this job
but others can help me.

RTNL and rcu grace periods are going to hurt anyway, so you probably need
to use many tasks to be able to delete lots of interfaces in parallel.

netdev_run_todo() should also use a better algorithm to allow parallelism.

Following patch doesnt slow down dev_put() users and real scalability
problems will surface and might be addressed.

[PATCH] net: allow netdev_wait_allrefs() to run faster

netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users report that no more than 4 devices can be dismantled per second,
this is a pretty serious problem for extreme setups.

Most likely, references only wait for a rcu grace period that should come
fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Oct. 21, 2009, 4:09 p.m. UTC | #1
Eric Dumazet a écrit :
> Octavian Purdila a écrit :
>> On Sunday 18 October 2009 21:21:44 you wrote:
>>>> The msleep(250) should be tuned first. Then if this is really necessary
>>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>>> more. 
>>>> Just try msleep(1 or 2), it should work quite well.
>>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>>  system can handle that.
>>>
>> I would also like to see this patch in, we are running into scalability issues 
>> with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 

Here are typical timings (on current kernel, but on following example
netdev_wait_allrefs() doesnt wait at all, because my netdevice has no refs)

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 down

real    0m0.021s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.022s
user    0m0.000s
sys     0m0.001s

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.001s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.036s
user    0m0.000s
sys     0m0.001s

22 ms (or 36 ms) delay are also problematic if you want to dismantle 1.000.000 netdevices at once.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin LaHaise Oct. 21, 2009, 4:51 p.m. UTC | #2
On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.

The first part I've been tackling has been the overhead in procfs, sysctl 
and sysfs.  I've got patches for some of the issues, hacks for others, and 
should have something to post in a few days.  Getting rid of those overheads 
is enough to get to decent interface creation times for the first 20 or 30k 
interfaces.

On the interface deletion side of things, within the network code, fib_hash 
has a few linear scans that really start hurting.  trie is a bit better, 
but I haven't started digging too deeply into its flush/remove overhead yet, 
aside from noticing that trie has a linear scan if 
CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
fn_trie_flush() is currently the worst offender during deletion.

		-ben
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Octavian Purdila Oct. 21, 2009, 4:55 p.m. UTC | #3
On Wednesday 21 October 2009 18:40:07 you wrote:

> >
> > I would also like to see this patch in, we are running into scalability
> > issues with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put()
>  spots than can be deleted and replaced by something more scalable. I began
>  this job but others can help me.
> 

Yes, I agree with you, there are multiple places which needs to be touched to 
allow for better scaling with regard to the number of interfaces. We do have 
patches that addresses some of these issues, but unfortunately they are based 
on 2.6.7 and some of them are quite ugly hacks :) 

However, we are in the process of switching to 2.6.31 so I hope we will be 
able to contribute on this effort.

> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 

Hmm, how would multiple tasks help here? Isn't the RTNL mutex global?

> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 

Thanks, I am going to test it on our platform and send back the results.

tavi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 21, 2009, 7:54 p.m. UTC | #4
Benjamin LaHaise a écrit :
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben patch only address interface deletion, and one part of the problem,
>> maybe the more visible one for the current kernel.
> 
> The first part I've been tackling has been the overhead in procfs, sysctl 
> and sysfs.  I've got patches for some of the issues, hacks for others, and 
> should have something to post in a few days.  Getting rid of those overheads 
> is enough to get to decent interface creation times for the first 20 or 30k 
> interfaces.
> 
> On the interface deletion side of things, within the network code, fib_hash 
> has a few linear scans that really start hurting.  trie is a bit better, 
> but I haven't started digging too deeply into its flush/remove overhead yet, 
> aside from noticing that trie has a linear scan if 
> CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
> fn_trie_flush() is currently the worst offender during deletion.

Well, there are many things to change...

# ip -o link | wc -l
13097
# time ip -o link show mv22248
13045: mv22248@eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN \    link/ether 00:1e:0b:8e:c8:08 brd ff:ff:ff:ff:ff:ff

real    0m0.840s
user    0m0.473s
sys     0m0.368s

almost one second to get link status of one particular interface :(
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney Oct. 23, 2009, 9:13 p.m. UTC | #5
On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Octavian Purdila a écrit :
> > On Sunday 18 October 2009 21:21:44 you wrote:
> >>> The msleep(250) should be tuned first. Then if this is really necessary
> >>> to dismantle 100.000 netdevices per second, we might have to think a bit
> >>> more. 
> >>> Just try msleep(1 or 2), it should work quite well.
> >> My goal is tearing down 100,000 interfaces in a few seconds, which really
> >>  is  necessary.  Right now we're running about 40,000 interfaces on a not
> >>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
> >>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
> >>  system can handle that.
> >>
> > 
> > I would also like to see this patch in, we are running into scalability issues 
> > with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 
> netdev_wait_allrefs() waits that all references to a device vanishes.
> 
> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> Some users report that no more than 4 devices can be dismantled per second,
> this is a pretty serious problem for extreme setups.
> 
> Most likely, references only wait for a rcu grace period that should come
> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Is this a place where synchronize_rcu_expedited() is appropriate?
(It went in to 2.6.32-rc1.)

							Thanx, Paul

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/core/dev.c |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 28b0b9e..fca2e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
> 
> -		msleep(250);
> +		schedule_timeout_uninterruptible(1);
> 
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Oct. 29, 2009, 11:07 p.m. UTC | #6
Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben patch only address interface deletion, and one part of the problem,
>> maybe the more visible one for the current kernel.
>
> The first part I've been tackling has been the overhead in procfs, sysctl 
> and sysfs.  

Could you keep me in the loop with that.  I have some pending cleanups for
all of those pieces of code and may be able to help/advice/review.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fca2e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4983,7 +4983,7 @@  static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		schedule_timeout_uninterruptible(1);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "