
Scalability of interface creation and deletion

Message ID C449131127D58077CB25C9D8@Ximines.local
State RFC, archived
Delegated to: David Miller

Commit Message

Alex Bligh May 8, 2011, 2:27 p.m. UTC
Paul,

>> Yes, really 20-49us and 50-99us, not ms. Raw data attached :-)
>>
>> I'm guessing there are circumstances where there is an early exit.
>
> Well, if you were onlining and offlining CPUs, then if there was only
> one CPU online, this could happen.

No, I wasn't doing that.

>  And there really is only one CPU
> online during boot, so if your measurements included early boot time,
> this could easily explain these very short timings.

No, I waited a few minutes after boot for the system to stabilize, and
all CPUs were definitely online.

The patch to the kernel I am running is below.

>> There is nothing much going on these systems (idle, no other users,
>> just normal system daemons).
>
> And normal system daemons might cause this, right?

Yes. Everything is normal, except I did
 service udev stop
 unshare -n bash
which together stop the system running interface scripts when
interfaces are created (as upstart and upstart-udev-bridge are
now integrated, you can't kill upstart, so you have to rely on
unshare -n to stop the events being propagated). That's just
to avoid measuring the time it takes to execute the scripts.

Comments

Paul E. McKenney May 8, 2011, 2:47 p.m. UTC | #1
On Sun, May 08, 2011 at 03:27:07PM +0100, Alex Bligh wrote:
> Paul,
> 
> >>Yes, really 20-49us and 50-99us, not ms. Raw data attached :-)
> >>
> >>I'm guessing there are circumstances where there is an early exit.
> >
> >Well, if you were onlining and offlining CPUs, then if there was only
> >one CPU online, this could happen.
> 
> No, I wasn't doing that.

OK.

> > And there really is only one CPU
> >online during boot, so if your measurements included early boot time,
> >this could easily explain these very short timings.
> 
> No, I waited a few minutes after boot for the system to stabilize, and
> all CPUs were definitely online.
> 
> The patch to the kernel I am running is below.

OK, interesting...

My guess is that you need to be using ktime_get_ts().  Isn't ktime_get()
subject to various sorts of adjustment?

> >>There is nothing much going on these systems (idle, no other users,
> >>just normal system daemons).
> >
> >And normal system daemons might cause this, right?
> 
> Yes. Everything is normal, except I did
> service udev stop
> unshare -n bash
> which together stop the system running interface scripts when
> interfaces are created (as upstart and upstart-udev-bridge are
> now integrated, you can't kill upstart, so you have to rely on
> unshare -n to stop the events being propagated). That's just
> to avoid measuring the time it takes to execute the scripts.

OK, so you really could be seeing grace periods started by these system
daemons.

							Thanx, Paul

> -- 
> Alex Bligh
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index dd4aea8..e401018 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1518,6 +1518,7 @@ EXPORT_SYMBOL_GPL(call_rcu_bh);
> void synchronize_sched(void)
> {
>        struct rcu_synchronize rcu;
> +       ktime_t time_start = ktime_get();
> 
>        if (rcu_blocking_is_gp())
>                return;
> @@ -1529,6 +1530,7 @@ void synchronize_sched(void)
>        /* Wait for it. */
>        wait_for_completion(&rcu.completion);
>        destroy_rcu_head_on_stack(&rcu.head);
> +       pr_err("synchronize_sched() in %lld us\n", ktime_us_delta(ktime_get(), time_start));
> }
> EXPORT_SYMBOL_GPL(synchronize_sched);
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 856b6ee..013f627 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5164,7 +5164,9 @@ static void rollback_registered_many(struct list_head *head)
>        dev = list_first_entry(head, struct net_device, unreg_list);
>        call_netdevice_notifiers(NETDEV_UNREGISTER_BATCH, dev);
> 
> +       pr_err("begin rcu_barrier()\n");
>        rcu_barrier();
> +       pr_err("end rcu_barrier()\n");
> 
>        list_for_each_entry(dev, head, unreg_list)
>                dev_put(dev);
> @@ -5915,8 +5917,10 @@ EXPORT_SYMBOL(free_netdev);
>  */
> void synchronize_net(void)
> {
> +       pr_err("begin synchronize_net()\n");
>        might_sleep();
>        synchronize_rcu();
> +       pr_err("end synchronize_net()\n");
> }
> EXPORT_SYMBOL(synchronize_net);
> 
> 
Alex Bligh May 8, 2011, 3:17 p.m. UTC | #2
Paul,

>> No, I waited a few minutes after boot for the system to stabilize, and
>> all CPUs were definitely online.
>>
>> The patch to the kernel I am running is below.
>
> OK, interesting...
>
> My guess is that you need to be using ktime_get_ts().  Isn't ktime_get()
> subject to various sorts of adjustment?

It's Eric's code, not mine, but:

kernel/time/timekeeping.c suggests they do the same thing
(adjust xtime by wall_to_monotonic), just one returns a
struct timespec and the other returns a ktime_t.
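
For completeness, here is a minimal sketch of the same instrumentation
done with ktime_get_ts(), in case the timespec-based path is preferred
(illustrative only, not part of the patch I posted):

struct timespec ts_start, ts_end, ts_delta;

ktime_get_ts(&ts_start);
/* ... the code being timed, e.g. wait_for_completion(&rcu.completion) ... */
ktime_get_ts(&ts_end);
ts_delta = timespec_sub(ts_end, ts_start);
pr_err("synchronize_sched() in %lld us\n",
       (long long)(timespec_to_ns(&ts_delta) / NSEC_PER_USEC));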

>> >> There is nothing much going on these systems (idle, no other users,
>> >> just normal system daemons).
>> >
>> > And normal system daemons might cause this, right?
>>
>> Yes. Everything is normal, except I did
>> service udev stop
>> unshare -n bash
>> which together stop the system running interface scripts when
>> interfaces are created (as upstart and upstart-udev-bridge are
>> now integrated, you can't kill upstart, so you have to rely on
>> unshare -n to stop the events being propagated). That's just
>> to avoid measuring the time it takes to execute the scripts.
>
> OK, so you really could be seeing grace periods started by these system
> daemons.

In 50% of 200 calls? That seems pretty unlikely. I think it's more
likely to be the 6 jiffies per call to ensure cpus are idle,
plus the 3 calls per interface destroy.

If 6 jiffies per call to ensure cpus are idle is a fact of life,
then the question goes back to why interface removal is waiting
for rcu readers to be released synchronously, as opposed to
doing the update bits synchronously, then doing the reclaim
element (freeing the memory) afterwards using call_rcu.
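
To make the suggestion concrete, a minimal sketch of the
update-then-reclaim pattern follows; struct foo and its helpers are
hypothetical placeholders, not the actual netdev teardown code:

struct foo {
	struct list_head list;
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	/* Reclaim runs after a grace period, off the caller's path. */
	kfree(f);
}

static void foo_del(struct foo *f)
{
	list_del_rcu(&f->list);          /* update step, done synchronously */
	call_rcu(&f->rcu, foo_free_rcu); /* reclaim step, no synchronize_rcu() */
}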
Paul E. McKenney May 8, 2011, 3:48 p.m. UTC | #3
On Sun, May 08, 2011 at 04:17:42PM +0100, Alex Bligh wrote:
> Paul,
> 
> >>No, I waited a few minutes after boot for the system to stabilize, and
> >>all CPUs were definitely online.
> >>
> >>The patch to the kernel I am running is below.
> >
> >OK, interesting...
> >
> >My guess is that you need to be using ktime_get_ts().  Isn't ktime_get()
> >subject to various sorts of adjustment?
> 
> It's Eric's code, not mine, but:
> 
> kernel/time/timekeeping.c suggests they do the same thing
> (adjust xtime by wall_to_monotonic), just one returns a
> struct timespec and the other returns a ktime_t.
> 
> >>>> There is nothing much going on these systems (idle, no other users,
> >>>> just normal system daemons).
> >>>
> >>> And normal system daemons might cause this, right?
> >>
> >>Yes. Everything is normal, except I did
> >>service udev stop
> >>unshare -n bash
> >>which together stop the system running interface scripts when
> >>interfaces are created (as upstart and upstart-udev-bridge are
> >>now integrated, you can't kill upstart, so you have to rely on
> >>unshare -n to stop the events being propagated). That's just
> >>to avoid measuring the time it takes to execute the scripts.
> >
> >OK, so you really could be seeing grace periods started by these system
> >daemons.
> 
> In 50% of 200 calls? That seems pretty unlikely. I think it's more
> likely to be the 6 jiffies per call to ensure cpus are idle,
> plus the 3 calls per interface destroy.
> 
> If 6 jiffies per call to ensure cpus are idle is a fact of life,
> then the question goes back to why interface removal is waiting
> for rcu readers to be released synchronously, as opposed to
> doing the update bits synchronously, then doing the reclaim
> element (freeing the memory) afterwards using call_rcu.

This would speed things up considerably, assuming that there is no
other reason to block for an RCU grace period.

							Thanx, Paul
Eric Dumazet May 8, 2011, 9 p.m. UTC | #4
Le dimanche 08 mai 2011 à 08:48 -0700, Paul E. McKenney a écrit :
> On Sun, May 08, 2011 at 04:17:42PM +0100, Alex Bligh wrote:
> > 
> > If 6 jiffies per call to ensure cpus are idle is a fact of life,
> > then the question goes back to why interface removal is waiting
> > for rcu readers to be released synchronously, as opposed to
> > doing the update bits synchronously, then doing the reclaim
> > element (freeing the memory) afterwards using call_rcu.
> 
> This would speed things up considerably, assuming that there is no
> other reason to block for an RCU grace period.
> 

That's not so simple... Things are modular, and it is better to be safe
than to crash on a very rare event (device dismantles are not something
we expect to do very often; only special workloads might need to perform
hundreds of them per minute...)

For example, in the VLAN dismantle phase (ip link del eth0.103)
we have 3 calls to synchronize_rcu() and one call to rcu_barrier()

[ the 'extra' synchronize_rcu() call comes from unregister_vlan_dev() ]

Maybe with the new VLAN model we could now remove this synchronize_net()
call from vlan code. Jesse, what do you think?
Once vlan_group_set_device(grp, vlan_id, NULL) has been called, why
should we wait for an rcu grace period at all, given dev is queued to
unregister_netdevice_queue() [ which has its own pair of
synchronize_net() / rcu_barrier() calls ]?
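
As a rough sketch of the ordering in question (this is not the literal
unregister_vlan_dev() body; how grp and vlan_id are looked up, and all
error handling, are elided):

static void vlan_dismantle_sketch(struct vlan_group *grp, u16 vlan_id,
				  struct net_device *dev,
				  struct list_head *unreg_list)
{
	/* Unpublish the VLAN device so no new reader can find it. */
	vlan_group_set_device(grp, vlan_id, NULL);

	/*
	 * Candidate for removal: unregister_netdevice_queue() below already
	 * ends up waiting for a grace period (and an rcu_barrier()) before
	 * the device memory is freed, so this extra wait may be redundant.
	 */
	/* synchronize_net(); */

	unregister_netdevice_queue(dev, unreg_list);
}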


The real scalability problem with device dismantles comes from the fact
that all these waits are done under the RTNL mutex. That is the real
killer, because you cannot use your eight CPUs even if you are willing to.

We can probably speed things up, but we should consider the following
user actions:

ip link add link eth0 vlan103 type vlan id 103
ip link del vlan103
ip link add link eth1 vlan103 type vlan id 103

The "link del" command should return to user only if the minimum things
had been done, to make sure the following "link add" wont fail
mysteriously.



Alex Bligh May 9, 2011, 5:37 a.m. UTC | #5
--On 8 May 2011 23:00:47 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We can probably speed things up, but we should consider the following
> user actions:

How about

> ip link add link eth0 vlan103 type vlan id 103
> ip link del vlan103

Removes and unlinks structures, including making name available, sending
out netlink messages, but doesn't free things

> ip link add link eth1 vlan103 type vlan id 103

creates new interface

[some time later] original zombie i/f freed

> The "link del" command should return to user only if the minimum things
> had been done, to make sure the following "link add" wont fail
> mysteriously.

Are you worried about failure through name collision (already
dealt with), vlan tag collision (ditto) or what?
Eric Dumazet May 9, 2011, 6:37 a.m. UTC | #6
Le lundi 09 mai 2011 à 06:37 +0100, Alex Bligh a écrit :
> 
> --On 8 May 2011 23:00:47 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > We can probably speed things up, but we should consider the following
> > user actions:
> 
> How about
> 
> > ip link add link eth0 vlan103 type vlan id 103
> > ip link del vlan103
> 
> Removes and unlinks structures, including making name available, sending
> out netlink messages, but doesn't free things

Most of the cleanup work has to be done with RTNL held, and this might
be because of transaction atomicity requirements.

In your test you dismantle idle devices. Now think a bit about what
happens when you have both traffic in and out, sockets with destinations
still pointing to the device, in-flight ARP requests, all of this using
RCU of course.

When you dismantle one device (or several, in the case of a module
unload), this can have implications for other devices (see the veth case
for an obvious example: this automatically removes the peer device), but
also for routes, neighbours, cached routes, various protocol
cleanups, ... and so on. Few people even on netdev understand the whole
picture.

Given that 99.99% of machines set up netdevices at boot time only, and
hardly ever dismantle them, we netdev guys were pragmatic and safe. Two
or three synchronize_rcu() calls were considered a non-issue.

It seems there is interest in improving things now.

One way is to allow more batching and delegation, and I am working on
that right now, using a kthread, so that we don't block the requester
for the whole device dismantle.

This kthread might use a call_rcu()-driven state machine, but that is an
implementation detail, since only the kthread would be affected.
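
Purely as an illustration of the delegation idea (all names below are
made up, and this is not the actual implementation), the consumer side
might look roughly like this:

static LIST_HEAD(dismantle_queue);	/* net_devices linked via unreg_list */
static DEFINE_SPINLOCK(dismantle_lock);
static DECLARE_WAIT_QUEUE_HEAD(dismantle_wait);

static int dismantle_kthread(void *unused)
{
	while (!kthread_should_stop()) {
		LIST_HEAD(todo);

		wait_event_interruptible(dismantle_wait,
				!list_empty(&dismantle_queue) ||
				kthread_should_stop());

		spin_lock(&dismantle_lock);
		list_splice_init(&dismantle_queue, &todo);
		spin_unlock(&dismantle_lock);

		if (list_empty(&todo))
			continue;

		/*
		 * The slow part -- notifiers, synchronize_net(),
		 * rcu_barrier(), freeing -- runs here, so the task that
		 * requested the dismantle is never blocked on it, and one
		 * set of grace-period waits covers the whole batch.
		 */
		rtnl_lock();
		unregister_netdevice_many(&todo);
		rtnl_unlock();
	}
	return 0;
}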

I am pretty busy at work these days, so don't expect patches for some
time :)



Paul E. McKenney May 9, 2011, 7:11 a.m. UTC | #7
On Sun, May 08, 2011 at 11:00:47PM +0200, Eric Dumazet wrote:
> Le dimanche 08 mai 2011 à 08:48 -0700, Paul E. McKenney a écrit :
> > On Sun, May 08, 2011 at 04:17:42PM +0100, Alex Bligh wrote:
> > > 
> > > If 6 jiffies per call to ensure cpus are idle is a fact of life,
> > > then the question goes back to why interface removal is waiting
> > > for rcu readers to be released synchronously, as opposed to
> > > doing the update bits synchronously, then doing the reclaim
> > > element (freeing the memory) afterwards using call_rcu.
> > 
> > This would speed things up considerably, assuming that there is no
> > other reason to block for an RCU grace period.
> 
> That's not so simple... Things are modular, and it is better to be safe
> than to crash on a very rare event (device dismantles are not something
> we expect to do very often; only special workloads might need to perform
> hundreds of them per minute...)

I was afraid of that, but had to ask...

> For example, in the VLAN dismantle phase (ip link del eth0.103)
> we have 3 calls to synchronize_rcu() and one call to rcu_barrier()
> 
> [ the 'extra' synchronize_rcu() call comes from unregister_vlan_dev() ]
> 
> Maybe with the new VLAN model we could now remove this synchronize_net()
> call from vlan code. Jesse, what do you think?
> Once vlan_group_set_device(grp, vlan_id, NULL) has been called, why
> should we wait for an rcu grace period at all, given dev is queued to
> unregister_netdevice_queue() [ which has its own pair of
> synchronize_net() / rcu_barrier() calls ]?
> 
> 
> The real scalability problem with device dismantles comes from the fact
> that all these waits are done under the RTNL mutex. That is the real
> killer, because you cannot use your eight CPUs even if you are willing to.
> 
> We can probably speed things up, but we should consider the following
> user actions:
> 
> ip link add link eth0 vlan103 type vlan id 103
> ip link del vlan103
> ip link add link eth1 vlan103 type vlan id 103
> 
> The "link del" command should return to user only if the minimum things
> had been done, to make sure the following "link add" wont fail
> mysteriously.

Hmmm...  One approach would be to use synchronize_rcu_expedited(), though
that is a bit of a big hammer.
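
A sketch of what that substitution could look like (just an illustration
of the idea, not a tested or agreed-upon change):

void synchronize_net(void)
{
	might_sleep();
	/*
	 * Expedited grace periods finish much faster but hammer all CPUs,
	 * so this trades system-wide overhead for lower dismantle latency.
	 */
	synchronize_rcu_expedited();
}

One could also restrict the expedited variant to the paths that hold
RTNL, so that ordinary callers keep the cheaper non-expedited behaviour.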

							Thanx, Paul
Jesse Gross May 9, 2011, 5:30 p.m. UTC | #8
On Sun, May 8, 2011 at 2:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> For example, in the VLAN dismantle phase (ip link del eth0.103)
> we have 3 calls to synchronize_rcu() and one call to rcu_barrier()
>
> [ the 'extra' synchronize_rcu() call comes from unregister_vlan_dev() ]
>
> Maybe with the new VLAN model we could now remove this synchronize_net()
> call from vlan code. Jesse, what do you think?
> Once vlan_group_set_device(grp, vlan_id, NULL) has been called, why
> should we wait for an rcu grace period at all, given dev is queued to
> unregister_netdevice_queue() [ which has its own pair of
> synchronize_net() / rcu_barrier() calls ]?

Yes, I agree that the extra call to synchronize_net() provides no
value, though I think that's actually been true for a while.

Patch

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index dd4aea8..e401018 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1518,6 +1518,7 @@  EXPORT_SYMBOL_GPL(call_rcu_bh);
 void synchronize_sched(void)
 {
        struct rcu_synchronize rcu;
+       ktime_t time_start = ktime_get();

        if (rcu_blocking_is_gp())
                return;
@@ -1529,6 +1530,7 @@  void synchronize_sched(void)
        /* Wait for it. */
        wait_for_completion(&rcu.completion);
        destroy_rcu_head_on_stack(&rcu.head);
+       pr_err("synchronize_sched() in %lld us\n", ktime_us_delta(ktime_get(), time_start));
 }
 EXPORT_SYMBOL_GPL(synchronize_sched);

diff --git a/net/core/dev.c b/net/core/dev.c
index 856b6ee..013f627 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5164,7 +5164,9 @@  static void rollback_registered_many(struct list_head *head)
        dev = list_first_entry(head, struct net_device, unreg_list);
        call_netdevice_notifiers(NETDEV_UNREGISTER_BATCH, dev);

+       pr_err("begin rcu_barrier()\n");
        rcu_barrier();
+       pr_err("end rcu_barrier()\n");

        list_for_each_entry(dev, head, unreg_list)