diff mbox

[RFC] iproute: Faster ip link add, set and delete

Message ID 87k3ozm5v1.fsf@xmission.com
State RFC, archived
Delegated to: stephen hemminger
Headers show

Commit Message

Eric W. Biederman March 22, 2013, 10:23 p.m. UTC
Because ip link add, set, and delete map the interface name to the
interface index by dumping all of the interfaces before performing
their respective commands.  Operations that should be constant time
slow down when lots of network interfaces are in use.  Resulting
in O(N^2) time to work with O(N) devices.

Make the work that iproute does constant time by passing the interface
name to the kernel instead.

In small scale testing on my system this shows dramatic performance
increases of ip link add from 120s to just 11s to add 5000 network
devices.  And from longer than I cared to wait to just 58s to delete
all of those interfaces again.

Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Reported-by: Benoit Lourdelet <blourdel@juniper.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

I think I am bungling the case where people specify an ifindex as ifNNNN
but does anyone care?

 ip/iplink.c |   19 +------------------
 1 files changed, 1 insertions(+), 18 deletions(-)

Comments

Stephen Hemminger March 22, 2013, 10:27 p.m. UTC | #1
The whole ifindex map is a design mistake at this point.
Better off to do a lazy cache or something like that.


On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Because ip link add, set, and delete map the interface name to the
> interface index by dumping all of the interfaces before performing
> their respective commands.  Operations that should be constant time
> slow down when lots of network interfaces are in use.  Resulting
> in O(N^2) time to work with O(N) devices.
>
> Make the work that iproute does constant time by passing the interface
> name to the kernel instead.
>
> In small scale testing on my system this shows dramatic performance
> increases of ip link add from 120s to just 11s to add 5000 network
> devices.  And from longer than I cared to wait to just 58s to delete
> all of those interfaces again.
>
> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> I think I am bungling the case where people specify an ifindex as ifNNNN
> but does anyone care?
>
>  ip/iplink.c |   19 +------------------
>  1 files changed, 1 insertions(+), 18 deletions(-)
>
> diff --git a/ip/iplink.c b/ip/iplink.c
> index ad33611..6dffbf0 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
>                 }
>         }
>
> -       ll_init_map(&rth);
> -
>         if (!(flags & NLM_F_CREATE)) {
>                 if (!dev) {
>                         fprintf(stderr, "Not enough information: \"dev\" "
> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
>                         exit(-1);
>                 }
>
> -               req.i.ifi_index = ll_name_to_index(dev);
> -               if (req.i.ifi_index == 0) {
> -                       fprintf(stderr, "Cannot find device \"%s\"\n", dev);
> -                       return -1;
> -               }
> +               name = dev;
>         } else {
>                 /* Allow "ip link add dev" and "ip link add name" */
>                 if (!name)
>                         name = dev;
>
> -               if (link) {
> -                       int ifindex;
> -
> -                       ifindex = ll_name_to_index(link);
> -                       if (ifindex == 0) {
> -                               fprintf(stderr, "Cannot find device \"%s\"\n",
> -                                       link);
> -                               return -1;
> -                       }
> -                       addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4);
> -               }
>         }
>
>         if (name) {
> --
> 1.7.5.4
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benoit Lourdelet March 26, 2013, 11:51 a.m. UTC | #2
Hello,

I re-tested with the patch and got the following results on a 32x 2Ghz
core system.

# veth 	add 	delete
1000 	36 	34
3000 	259 	137
4000 	462 	195
5000 	729     N/A

The script to create is the following :
for i in `seq 1 5000`; do
	sudo ip link add type veth
Done


The script to delete:
for d in /sys/class/net/veth*; do
	ip link del `basename $d` 2>/dev/null || true
Done

There is a very good improvement in deletion.



iproute2 does not seems to be well multithread as I get time divided by a
factor of 2 with a 8x  3.2 Ghz core system.

I don¹t know if that is the improvement you expected ?

Would the iproute2 redesign you mentioned help improve performance even
further ?


As a reference : Iproute2 baseline w/o patch:

# veth 	add 	delete

1000 	57 	70
2000 	193 	250
3000 	435 	510
4000 	752 	824
5000 	1123 	1185

Regards

Benoit




On 22/03/2013 23:27, "Stephen Hemminger" <stephen@networkplumber.org>
wrote:

>The whole ifindex map is a design mistake at this point.
>Better off to do a lazy cache or something like that.
>
>
>On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
><ebiederm@xmission.com> wrote:
>>
>> Because ip link add, set, and delete map the interface name to the
>> interface index by dumping all of the interfaces before performing
>> their respective commands.  Operations that should be constant time
>> slow down when lots of network interfaces are in use.  Resulting
>> in O(N^2) time to work with O(N) devices.
>>
>> Make the work that iproute does constant time by passing the interface
>> name to the kernel instead.
>>
>> In small scale testing on my system this shows dramatic performance
>> increases of ip link add from 120s to just 11s to add 5000 network
>> devices.  And from longer than I cared to wait to just 58s to delete
>> all of those interfaces again.
>>
>> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
>> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> I think I am bungling the case where people specify an ifindex as ifNNNN
>> but does anyone care?
>>
>>  ip/iplink.c |   19 +------------------
>>  1 files changed, 1 insertions(+), 18 deletions(-)
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index ad33611..6dffbf0 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int
>>flags, int argc, char **argv)
>>                 }
>>         }
>>
>> -       ll_init_map(&rth);
>> -
>>         if (!(flags & NLM_F_CREATE)) {
>>                 if (!dev) {
>>                         fprintf(stderr, "Not enough information:
>>\"dev\" "
>> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int
>>flags, int argc, char **argv)
>>                         exit(-1);
>>                 }
>>
>> -               req.i.ifi_index = ll_name_to_index(dev);
>> -               if (req.i.ifi_index == 0) {
>> -                       fprintf(stderr, "Cannot find device \"%s\"\n",
>>dev);
>> -                       return -1;
>> -               }
>> +               name = dev;
>>         } else {
>>                 /* Allow "ip link add dev" and "ip link add name" */
>>                 if (!name)
>>                         name = dev;
>>
>> -               if (link) {
>> -                       int ifindex;
>> -
>> -                       ifindex = ll_name_to_index(link);
>> -                       if (ifindex == 0) {
>> -                               fprintf(stderr, "Cannot find device
>>\"%s\"\n",
>> -                                       link);
>> -                               return -1;
>> -                       }
>> -                       addattr_l(&req.n, sizeof(req), IFLA_LINK,
>>&ifindex, 4);
>> -               }
>>         }
>>
>>         if (name) {
>> --
>> 1.7.5.4
>>
>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 26, 2013, 12:40 p.m. UTC | #3
Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> I re-tested with the patch and got the following results on a 32x 2Ghz
> core system.
>
> # veth 	add 	delete
> 1000 	36 	34
> 3000 	259 	137
> 4000 	462 	195
> 5000 	729     N/A
>
> The script to create is the following :
> for i in `seq 1 5000`; do
> 	sudo ip link add type veth
> Done

Which performs horribly as I mentioned earlier because you are asking
the kernel to create the names.  If you want performance you need to
specify the names of the network devices you are creating.

aka ip link add a$i type veth name b$i

> The script to delete:
> for d in /sys/class/net/veth*; do
> 	ip link del `basename $d` 2>/dev/null || true
> Done
>
> There is a very good improvement in deletion.
>
>
>
> iproute2 does not seems to be well multithread as I get time divided by a
> factor of 2 with a 8x  3.2 Ghz core system.

All netlink traffic and all network stack configuration is serialized by
the rtnl_lock in the kernel.  This is the slow path in the kernel, not
the fast path.

> I don¹t know if that is the improvement you expected ?
>
> Would the iproute2 redesign you mentioned help improve performance even
> further ?

Specifing the names would dramatically improve your creation
performance.  It should only take you about 10s for 5000 veth pairs.
But you have to specify the names.

Anyway I have exhausted my time, and inclination in this matter.  Good
luck with whatever your problem is.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn March 26, 2013, 2:17 p.m. UTC | #4
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Specifing the names would dramatically improve your creation
> performance.  It should only take you about 10s for 5000 veth pairs.
> But you have to specify the names.

Thanks, Eric.  I'm going to update lxc to always specify names for
the veth pairs, rather than only when they are requested by the
user's configuration file.

-serge
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn March 26, 2013, 2:33 p.m. UTC | #5
Actually, lxc is using random names now, so it's ok.

Benoit, can you use the patches from Eric with lxc (or use the script
you were using before but specify names as he said)?

-serge
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 26, 2013, 3:31 p.m. UTC | #6
On Tue, 2013-03-26 at 11:51 +0000, Benoit Lourdelet wrote:

> The script to delete:
> for d in /sys/class/net/veth*; do
> 	ip link del `basename $d` 2>/dev/null || true
> Done
> 
> There is a very good improvement in deletion.

I can do better ;)

If you are really doing this kind of things, you could use :

rmmod veth

Note that "ip" command supports a batch mode

ip -batch filename

In this case, the caching is done only once.

Eric, Stephen, one possibility would be to use the cache only in batch
mode.

Anyway caching is wrong because several users can use ip command at the
same time.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benoit Lourdelet March 27, 2013, 1:37 p.m. UTC | #7
Hello Serge,

I am indeed using Eric patch with lxc.

It solves the initial problem of slowness to start around 1600 containers.

I am now able to start more than  2000 without having new containers
slower and slower to start.

thanks

Benoit

On 26/03/2013 15:33, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:

>Actually, lxc is using random names now, so it's ok.
>
>Benoit, can you use the patches from Eric with lxc (or use the script
>you were using before but specify names as he said)?
>
>-serge
>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 27, 2013, 3:11 p.m. UTC | #8
Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello Serge,
>
> I am indeed using Eric patch with lxc.
>
> It solves the initial problem of slowness to start around 1600
> containers.

Good so now we just need a production ready patch for iproute.

> I am now able to start more than  2000 without having new containers
> slower and slower to start.

May I ask how large a box you are running and how complex your
containers are.  I am trying to get a feel for how common it is likely
to be to find people running thousands of containers on a single
machine.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benoit Lourdelet March 28, 2013, 8:27 p.m. UTC | #9
Hello Eric,

I am running simple containers (2 network interfaces, 10MB of RAM, default
routing) and want to test scalability.

Out test platform is a 32x 2Ghz cores x86.

Regards

Benoit

On 27/03/2013 16:11, "Eric W. Biederman" <ebiederm@xmission.com> wrote:

>Benoit Lourdelet <blourdel@juniper.net> writes:
>
>> Hello Serge,
>>
>> I am indeed using Eric patch with lxc.
>>
>> It solves the initial problem of slowness to start around 1600
>> containers.
>
>Good so now we just need a production ready patch for iproute.
>
>> I am now able to start more than  2000 without having new containers
>> slower and slower to start.
>
>May I ask how large a box you are running and how complex your
>containers are.  I am trying to get a feel for how common it is likely
>to be to find people running thousands of containers on a single
>machine.
>
>Eric
>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/ip/iplink.c b/ip/iplink.c
index ad33611..6dffbf0 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -533,8 +533,6 @@  static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 		}
 	}
 
-	ll_init_map(&rth);
-
 	if (!(flags & NLM_F_CREATE)) {
 		if (!dev) {
 			fprintf(stderr, "Not enough information: \"dev\" "
@@ -542,27 +540,12 @@  static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 			exit(-1);
 		}
 
-		req.i.ifi_index = ll_name_to_index(dev);
-		if (req.i.ifi_index == 0) {
-			fprintf(stderr, "Cannot find device \"%s\"\n", dev);
-			return -1;
-		}
+		name = dev;
 	} else {
 		/* Allow "ip link add dev" and "ip link add name" */
 		if (!name)
 			name = dev;
 
-		if (link) {
-			int ifindex;
-
-			ifindex = ll_name_to_index(link);
-			if (ifindex == 0) {
-				fprintf(stderr, "Cannot find device \"%s\"\n",
-					link);
-				return -1;
-			}
-			addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4);
-		}
 	}
 
 	if (name) {