net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

Message ID 151066759055.14465.9783879083192000862.stgit@localhost.localdomain
State Changes Requested
Delegated to: David Miller
Series
  • net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

Commit Message

Kirill Tkhai Nov. 14, 2017, 1:53 p.m.
Currently a mutex is used to protect the pernet operations list. It makes
cleanup_net() execute the ->exit methods of the same operations set
that was used at the time of ->init, even after the net namespace is
unlinked from net_namespace_list.

But the problem is that synchronize_rcu() is needed after the net is
removed from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors
and/or when preemptible RCU is enabled in the config. So, for the whole
time cleanup_net() is waiting for the RCU grace period, creation of new
net namespaces is not possible, and the tasks that attempt it sleep on
the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

The solution is to convert net_mutex into an rw_semaphore. Then the
pernet_operations::init/::exit methods, which modify the net-related data,
will require only down_read() locking, while down_write() will be used
for changing pernet_list.

This gives a significant performance increase, as you may see below.
Measured is sequential net namespace creation in a loop, in a single
thread, without other tasks (single user mode):

1) A loop of unshare(CLONE_NEWNET) calls:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        unsigned nr;

        if (argc < 2) {
                fprintf(stderr, "Provide nr iterations arg\n");
                return 1;
        }
        nr = atoi(argv[1]);
        while (nr-- > 0) {
                if (unshare(CLONE_NEWNET)) {
                        perror("Can't unshare");
                        return 1;
                }
        }
        return 0;
}

Origin, 100000 unshare():
0.03user 23.14system 1:39.85elapsed 23%CPU

Patched, 100000 unshare():
0.03user 67.49system 1:08.34elapsed 98%CPU

2) for i in {1..10000}; do unshare -n bash -c exit; done

Origin:
real 1m24,190s
user 0m6,225s
sys 0m15,132s

Patched:
real 0m18,235s   (4.6 times faster)
user 0m4,544s
sys 0m13,796s

This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
from Linus tree (not in net-next yet).

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/rtnetlink.h   |    2 +-
 include/net/net_namespace.h |    9 +++++++--
 net/core/net_namespace.c    |   40 ++++++++++++++++++++--------------------
 net/core/rtnetlink.c        |    4 ++--
 4 files changed, 30 insertions(+), 25 deletions(-)

Comments

Eric Dumazet Nov. 14, 2017, 5:07 p.m. | #1
On Tue, 2017-11-14 at 16:53 +0300, Kirill Tkhai wrote:
> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
> 
...

> The solution is to convert net_mutex to the rw_semaphore. Then,
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
> 
...

> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
> from Linus tree (not in net-next yet).


Looks great, thanks for doing this.

I wonder if the down_read_killable() is really needed; maybe a
down_read() would not block the thread for too long after this change.
Kirill Tkhai Nov. 14, 2017, 5:25 p.m. | #2
On 14.11.2017 20:07, Eric Dumazet wrote:
> On Tue, 2017-11-14 at 16:53 +0300, Kirill Tkhai wrote:
>> Curently mutex is used to protect pernet operations list. It makes
>> cleanup_net() to execute ->exit methods of the same operations set,
>> which was used on the time of ->init, even after net namespace is
>> unlinked from net_namespace_list.
>>
> ...
> 
>> The solution is to convert net_mutex to the rw_semaphore. Then,
>> pernet_operations::init/::exit methods, modifying the net-related data,
>> will require down_read() locking only, while down_write() will be used
>> for changing pernet_list.
>>
> ...
> 
>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>> from Linus tree (not in net-next yet).
> 
> 
> Looks great, thanks for doing this.
> 
> I wonder if the down_read_killable() is really needed, maybe a
> down_read() would not be blocking the thread too long after this change.

I'm afraid there are spin_unlock() or similar actions leading to preempt_enable()
that are made under the mutex/new rw_sem. They may be preempted and leave the
CPU for some time. So the possibility to break the wait may be useful in some way...
Andrei Vagin Nov. 14, 2017, 5:44 p.m. | #3
On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
> 
> ...
> 
> 2)for i in {1..10000}; do unshare -n bash -c exit; done

Hi Kirill,

This mutex has another role. You know that net namespaces are destroyed
asynchronously, and the net mutex guarantees that the backlog will not
get big. If we have something in the backlog, we know that it will be
handled before a new net ns is created.

As far as I remember, net namespaces are created much faster than
they are destroyed, so with these changes we can create a really big
backlog, can't we?

There was a discussion a few months ago:
https://lists.onap.org/pipermail/containers/2016-October/037509.html


> 
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s

Here you measure the time of creating and destroying net namespaces.

> 
> Patched:
> real 0m18,235s   (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s

But here you measure only the time of creating namespaces, and you know
nothing about when they will be destroyed.

Thanks,
Andrei
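Andrei's concern can be made concrete with simple rate arithmetic (a
sketch; the per-namespace costs below are made-up illustrative numbers,
not measurements): once creation no longer waits behind cleanup_net(),
a creation rate higher than the destruction rate accumulates a backlog
proportional to elapsed time:

```c
/* Backlog of not-yet-destroyed namespaces after `seconds` of
 * sustained load, given hypothetical per-namespace costs in
 * milliseconds.  With the old net_mutex, creators slept behind
 * cleanup_net(), which bounded this difference. */
static long backlog_after(long seconds, long create_ms, long destroy_ms)
{
	long created   = seconds * 1000 / create_ms;
	long destroyed = seconds * 1000 / destroy_ms;

	return created > destroyed ? created - destroyed : 0;
}
```

For example, with a hypothetical 10 ms per creation and 50 ms per
destruction, a minute of sustained load leaves thousands of namespaces
queued for cleanup.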
Eric Dumazet Nov. 14, 2017, 6 p.m. | #4
On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> > Curently mutex is used to protect pernet operations list. It makes
> > cleanup_net() to execute ->exit methods of the same operations set,
> > which was used on the time of ->init, even after net namespace is
> > unlinked from net_namespace_list.
> > 
> > ...
> 
> Hi Kirill,
> 
> This mutex has another role. You know that net namespaces are destroyed
> asynchronously, and the net mutex gurantees that a backlog will be not
> big. If we have something in backlog, we know that it will be handled
> before creating a new net ns.
> 
> As far as I remember net namespaces are created much faster than
> they are destroyed, so with this changes we can create a really big
> backlog, can't we?

Please take a look at the recent patches I did:

8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 'net-speedup-netns-create-delete-time'
64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in kobject_uevent_net_broadcast()
4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add kobject_uevent_net_broadcast()
Kirill Tkhai Nov. 14, 2017, 6:04 p.m. | #5
On 14.11.2017 20:44, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
>> Curently mutex is used to protect pernet operations list. It makes
>> cleanup_net() to execute ->exit methods of the same operations set,
>> which was used on the time of ->init, even after net namespace is
>> unlinked from net_namespace_list.
>>
>> ...
>>
>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
> 
> Hi Kirill,
> 
> This mutex has another role. You know that net namespaces are destroyed
> asynchronously, and the net mutex gurantees that a backlog will be not
> big. If we have something in backlog, we know that it will be handled
> before creating a new net ns.
> 
> As far as I remember net namespaces are created much faster than
> they are destroyed, so with this changes we can create a really big
> backlog, can't we?

I don't think limitation is a good goal, or a goal for the mutex at all,
because it's very easy to create many net namespaces even while the
mutex exists. You may open /proc/[pid]/ns/net like a file, and the
net_ns counter will be incremented. Then do unshare(), and the mutex
has no way to protect against that. Anyway, a mutex can't limit the
number of something in general; I've never seen a (good) example of
that in the kernel.

As I see it, the real limitation happens in inc_net_namespaces(); its
count is decremented after the RCU grace period in cleanup_net(), and
that has not changed.

> There was a discussion a few month ago:
> https://lists.onap.org/pipermail/containers/2016-October/037509.html
> 
> 
>>
>> Origin:
>> real 1m24,190s
>> user 0m6,225s
>> sys 0m15,132s
> 
> Here you measure time of creating and destroying net namespaces.
> 
>>
>> Patched:
>> real 0m18,235s   (4.6 times faster)
>> user 0m4,544s
>> sys 0m13,796s
> 
> But here you measure time of crearing namespaces and you know nothing
> when they will be destroyed.

You're right, and I predict the total time spent on CPU will remain the same,
but the thing is that now creation and destruction may be executed in parallel.
Stephen Hemminger Nov. 14, 2017, 6:11 p.m. | #6
On Tue, 14 Nov 2017 16:53:33 +0300
Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> +	/*
> +	 * RCU-protected list, modifiable by pernet-init and -exit methods.
> +	 * When net namespace is alive (net::count > 0), all the changes
> +	 * are made under rw_sem held on write.
> +	 */
> +	struct list_head	fib_notifier_ops;
>  

If you use __rcu annotation then it could be checked (and the comment is not needed).
Andrei Vagin Nov. 14, 2017, 6:38 p.m. | #7
On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
> On 14.11.2017 20:44, Andrei Vagin wrote:
> > On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> >> Curently mutex is used to protect pernet operations list. It makes
> >> cleanup_net() to execute ->exit methods of the same operations set,
> >> which was used on the time of ->init, even after net namespace is
> >> unlinked from net_namespace_list.
> >>
> >> ...
> >>
> >> 2)for i in {1..10000}; do unshare -n bash -c exit; done
> > 
> > Hi Kirill,
> > 
> > This mutex has another role. You know that net namespaces are destroyed
> > asynchronously, and the net mutex gurantees that a backlog will be not
> > big. If we have something in backlog, we know that it will be handled
> > before creating a new net ns.
> > 
> > As far as I remember net namespaces are created much faster than
> > they are destroyed, so with this changes we can create a really big
> > backlog, can't we?
> 
> I don't think limitation is a good goal or a gool for the mutex,
> because it's very easy to create many net namespaces in case of
> the mutex exists. You may open /proc/[pid]/ns/net like a file,
> and net_ns counter will increment. Then, do unshare(), and
> the mutex has no a way to protect against that.

You are right, but with the mutex a user cannot sustain a big backlog
for a long time; it is shrunk to zero periodically. With these changes
he can sustain a big backlog for a long time.

A big backlog affects other users. If someone creates namespaces, he
probably expects that they will be destroyed in a reasonable time.

But currently someone else can increase the destroy time to really big
values. This problem existed before your patches, but they may make it
worse. The question here is: should we think about this problem in the
context of these patches?


> Anyway, mutex
> can't limit a number of something in general, I've never seen
> a (good) example in kernel.

I agree with you here.


> 
> As I see, the real limitation happen in inc_net_namespaces(),
> which is decremented after RCU grace period in cleanup_net(),
> and it has not changed.

The ucount limits are too big to handle this problem.


Cong Wang Nov. 14, 2017, 6:39 p.m. | #8
On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>
>         get_user_ns(user_ns);
>
> -       rv = mutex_lock_killable(&net_mutex);
> +       rv = down_read_killable(&net_sem);
>         if (rv < 0) {
>                 net_free(net);
>                 dec_net_namespaces(ucounts);
> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>                 list_add_tail_rcu(&net->list, &net_namespace_list);
>                 rtnl_unlock();
>         }
> -       mutex_unlock(&net_mutex);
> +       up_read(&net_sem);
>         if (rv < 0) {
>                 dec_net_namespaces(ucounts);
>                 put_user_ns(user_ns);
> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>         list_replace_init(&cleanup_list, &net_kill_list);
>         spin_unlock_irq(&cleanup_list_lock);
>
> -       mutex_lock(&net_mutex);
> +       down_read(&net_sem);
>
>         /* Don't let anyone else find us. */
>         rtnl_lock();
> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>         list_for_each_entry_reverse(ops, &pernet_list, list)
>                 ops_free_list(ops, &net_exit_list);
>
> -       mutex_unlock(&net_mutex);
> +       up_read(&net_sem);

After your patch setup_net() could run concurrently with cleanup_net().
Given that ops_exit_list() is called on the error path of setup_net() too,
it means ops->exit() could now run concurrently if it doesn't have its
own lock. Not sure if this breaks any existing users.
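The situation Cong Wang describes can be sketched in userspace
(hypothetical ops; a pthread mutex stands in for a lock the operations
would have to provide themselves): once ->init and ->exit may run
concurrently for different namespaces, any state one pernet_operations
shares across namespaces needs its own serialization:

```c
#include <pthread.h>

/* Hypothetical pernet ops keeping a count of live namespaces in
 * state shared by all namespaces.  Under the old net_mutex the
 * serialization below came for free; with concurrent setup_net()
 * and cleanup_net(), the ops must provide it themselves. */
static pthread_mutex_t ops_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_nets;

static void my_ops_init(void)		/* runs from setup_net() */
{
	pthread_mutex_lock(&ops_lock);
	live_nets++;
	pthread_mutex_unlock(&ops_lock);
}

static void my_ops_exit(void)		/* runs from cleanup_net() */
{
	pthread_mutex_lock(&ops_lock);
	live_nets--;
	pthread_mutex_unlock(&ops_lock);
}
```

Without ops_lock, the increment and decrement could interleave and
corrupt the shared counter once the paths run in parallel.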
Eric Dumazet Nov. 14, 2017, 7:07 p.m. | #9
On Tue, 2017-11-14 at 10:11 -0800, Stephen Hemminger wrote:
> On Tue, 14 Nov 2017 16:53:33 +0300
> Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> > +	/*
> > +	 * RCU-protected list, modifiable by pernet-init and -exit methods.
> > +	 * When net namespace is alive (net::count > 0), all the changes
> > +	 * are made under rw_sem held on write.
> > +	 */
> > +	struct list_head	fib_notifier_ops;
> >  
> 
> If you use __rcu annotation then it could be checked (and the comment is not needed).

I do not think we can use __rcu annotation yet on a struct list_head ?

(The annotation would be needed on the members)
Kirill Tkhai Nov. 14, 2017, 7:58 p.m. | #10
On 14.11.2017 21:39, Cong Wang wrote:
> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> ...
> 
> After your patch setup_net() could run concurrently with cleanup_net(),
> given that ops_exit_list() is called on error path of setup_net() too,
> it means ops->exit() now could run concurrently if it doesn't have its
> own lock. Not sure if this breaks any existing user.

Yes, a concurrent ops->init() for one net namespace and ops->exit() for
another will be possible. I haven't found pernet operations that have
a problem with that. If they exist, they are hidden and not clearly seen.
Pernet operations in general do not touch someone else's memory; if
suddenly one does, KASAN should show it after a while.
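Kirill's isolation argument can be illustrated with a minimal userspace
sketch (the struct and function names are hypothetical; `priv` stands in
for data reached via net_generic()): as long as an operations set's
->init/->exit touch only the namespace passed to them, two namespaces
share no state and cannot race:

```c
#include <stdlib.h>

/* Hypothetical per-namespace state: each namespace owns its copy,
 * analogous to data reached via net_generic(net, id). */
struct net_private {
	int counter;
};

struct net {
	struct net_private *priv;
};

/* ->init analogue: allocates state reachable only via this net. */
static int example_ops_init(struct net *net)
{
	net->priv = calloc(1, sizeof(*net->priv));
	return net->priv ? 0 : -1;
}

/* ->exit analogue: frees only this net's state, so it can safely
 * run concurrently with another namespace's ->init. */
static void example_ops_exit(struct net *net)
{
	free(net->priv);
	net->priv = NULL;
}
```

Operations shaped like this need no extra lock under the new scheme;
only ops that keep cross-namespace state (as in Cong Wang's concern) do.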
Kirill Tkhai Nov. 14, 2017, 8:43 p.m. | #11
On 14.11.2017 21:38, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
>> On 14.11.2017 20:44, Andrei Vagin wrote:
>>> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
>>>> Curently mutex is used to protect pernet operations list. It makes
>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>> which was used on the time of ->init, even after net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> ...
>>>>
>>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Hi Kirill,
>>>
>>> This mutex has another role. You know that net namespaces are destroyed
>>> asynchronously, and the net mutex gurantees that a backlog will be not
>>> big. If we have something in backlog, we know that it will be handled
>>> before creating a new net ns.
>>>
>>> As far as I remember net namespaces are created much faster than
>>> they are destroyed, so with this changes we can create a really big
>>> backlog, can't we?
>>
>> I don't think limitation is a good goal or a gool for the mutex,
>> because it's very easy to create many net namespaces in case of
>> the mutex exists. You may open /proc/[pid]/ns/net like a file,
>> and net_ns counter will increment. Then, do unshare(), and
>> the mutex has no a way to protect against that.
> 
> You are right, but with the mutex a user can not support a big backlog
> for a long time, it is shrunk to zero periodically. With these changes
> he can support a big backlog for a long time.
> 
> A big backlog affects other users. If someone creates namespaces, he
> probably expects that they will be destroyed for a reasonable time.

Here are two different topics:
1) Limitation.
This problem exists in the current code, and the patch does not make it worse.
If someone wants to limit the number of net namespaces per user_ns/cgroup,
he should solve it in another way. The mutex does not help there.

2) Speeding up struct net destruction.
Many structures in the kernel are released via RCU. The only difference
with struct net is that it has more delayed actions than, say, task_struct.
The user is not interested in the time when a task_struct is completely
destroyed. But I can imagine it's more important for struct net, if (for
example) some delayed connections are destroyed from ops->exit(). Are they?!
But as in the first paragraph, I think this is not related to the patch,
because in the current code you may generate dozens of thousands of net
namespaces and then destroy them at once. They will be destroyed over the
same long time, and there is an additional, worse thing: it won't be
possible to create a new one for a long period. So, it's completely
unrelated to the patch.
 
> But currently someone else can increase a destroy time to a really big
> values. This problem was before your patches, but they may do this
> problem worse. The question here is: Should we think about this problem
> in the context of these patches?

I think we shouldn't, because the main thing in this patch is the parallel
execution of ops->init() and ->exit() and their safety. I hope people
will look at this and nothing will remain hidden.

The problem of long net namespace destruction times may be solved as a
next step. As we will be able to destroy them in parallel, we'll implement
per-cpu destroy works and destroy them much faster than now.
 
>> Anyway, mutex
>> can't limit a number of something in general, I've never seen
>> a (good) example in kernel.
> 
> I'm agree with you here.
> 
> 
>>
>> As I see, the real limitation happen in inc_net_namespaces(),
>> which is decremented after RCU grace period in cleanup_net(),
>> and it has not changed.
> 
> ucount limits are to big to handle this problem.
> 
> 
>>
>>> There was a discussion a few month ago:
>>> https://lists.onap.org/pipermail/containers/2016-October/037509.html
>>>
>>>
>>>>
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>
>>> Here you measure time of creating and destroying net namespaces.
>>>
>>>>
>>>> Patched:
>>>> real 0m18,235s   (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>
>>> But here you measure the time of creating namespaces, and you know
>>> nothing about when they will be destroyed.
>>
>> You're right, and I predict the total time spent on CPU will remain the
>> same, but the thing is that now creation and destruction may be executed
>> in parallel.
Andrei Vagin Nov. 14, 2017, 10:15 p.m. | #12
On Tue, Nov 14, 2017 at 10:00:59AM -0800, Eric Dumazet wrote:
> On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> > On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> > > Currently a mutex is used to protect the pernet operations list. It makes
> > > cleanup_net() execute the ->exit methods of the same operations set
> > > that was used at the time of ->init, even after the net namespace is
> > > unlinked from net_namespace_list.
> > > 
> > > But the problem is that synchronize_rcu() is needed after the net is
> > > removed from net_namespace_list():
> > > 
> > > Destroy net_ns:
> > > cleanup_net()
> > >   mutex_lock(&net_mutex)
> > >   list_del_rcu(&net->list)
> > >   synchronize_rcu()                                  <--- Sleep there for ages
> > >   list_for_each_entry_reverse(ops, &pernet_list, list)
> > >     ops_exit_list(ops, &net_exit_list)
> > >   list_for_each_entry_reverse(ops, &pernet_list, list)
> > >     ops_free_list(ops, &net_exit_list)
> > >   mutex_unlock(&net_mutex)
> > > 
> > > This primitive is not fast, especially on systems with many processors
> > > and/or when preemptible RCU is enabled in the config. So, all the time
> > > while cleanup_net() is waiting for the RCU grace period, creation of new
> > > net namespaces is not possible, and the tasks that attempt it are
> > > sleeping on the same mutex:
> > > 
> > > Create net_ns:
> > > copy_net_ns()
> > >   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
> > > 
> > > The solution is to convert net_mutex to an rw_semaphore. Then the
> > > pernet_operations::init/::exit methods, which modify only net-related
> > > data, will require down_read() locking only, while down_write() will be
> > > used for changing pernet_list.
> > > 
> > > This gives a significant performance increase, as you may see below.
> > > Measured is sequential net namespace creation in a cycle, in a single
> > > thread, without other tasks (single user mode):
> > > 
> > > 1)int main(int argc, char *argv[])
> > > {
> > >         unsigned nr;
> > >         if (argc < 2) {
> > >                 fprintf(stderr, "Provide nr iterations arg\n");
> > >                 return 1;
> > >         }
> > >         nr = atoi(argv[1]);
> > >         while (nr-- > 0) {
> > >                 if (unshare(CLONE_NEWNET)) {
> > >                         perror("Can't unshare");
> > >                         return 1;
> > >                 }
> > >         }
> > >         return 0;
> > > }
> > > 
> > > Origin, 100000 unshare():
> > > 0.03user 23.14system 1:39.85elapsed 23%CPU
> > > 
> > > Patched, 100000 unshare():
> > > 0.03user 67.49system 1:08.34elapsed 98%CPU
> > > 
> > > 2)for i in {1..10000}; do unshare -n bash -c exit; done
> > 
> > Hi Kirill,
> > 
> > This mutex has another role. You know that net namespaces are destroyed
> > asynchronously, and the net mutex guarantees that the backlog will not
> > grow big. If we have something in the backlog, we know that it will be
> > handled before a new net ns is created.
> > 
> > As far as I remember, net namespaces are created much faster than
> > they are destroyed, so with these changes we can create a really big
> > backlog, can't we?
> 
> Please take a look at the recent patches I did :
> 
> 8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 'net-speedup-netns-create-delete-time'
> 64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
> bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
> 789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
> a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
> d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in kobject_uevent_net_broadcast()
> 4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
> 16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add kobject_uevent_net_broadcast()
> 

Good job! Now it really works much faster. I tested these patches together
with Kirill's one and everything works well. I could not reproduce a
situation where the backlog starts growing.

Thanks Kirill and Eric.
Eric W. Biederman Nov. 15, 2017, 3:19 a.m. | #13
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 14.11.2017 21:39, Cong Wang wrote:
>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>
>>>         get_user_ns(user_ns);
>>>
>>> -       rv = mutex_lock_killable(&net_mutex);
>>> +       rv = down_read_killable(&net_sem);
>>>         if (rv < 0) {
>>>                 net_free(net);
>>>                 dec_net_namespaces(ucounts);
>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>                 list_add_tail_rcu(&net->list, &net_namespace_list);
>>>                 rtnl_unlock();
>>>         }
>>> -       mutex_unlock(&net_mutex);
>>> +       up_read(&net_sem);
>>>         if (rv < 0) {
>>>                 dec_net_namespaces(ucounts);
>>>                 put_user_ns(user_ns);
>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>         list_replace_init(&cleanup_list, &net_kill_list);
>>>         spin_unlock_irq(&cleanup_list_lock);
>>>
>>> -       mutex_lock(&net_mutex);
>>> +       down_read(&net_sem);
>>>
>>>         /* Don't let anyone else find us. */
>>>         rtnl_lock();
>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>                 ops_free_list(ops, &net_exit_list);
>>>
>>> -       mutex_unlock(&net_mutex);
>>> +       up_read(&net_sem);
>> 
>> After your patch setup_net() could run concurrently with cleanup_net(),
>> given that ops_exit_list() is called on error path of setup_net() too,
>> it means ops->exit() now could run concurrently if it doesn't have its
>> own lock. Not sure if this breaks any existing user.
>
> Yes, there will be possible concurrent ops->init() for a net namespace,
> and ops->exit() for another one. I hadn't found pernet operations, which
> have a problem with that. If they exist, they are hidden and not clear seen.
> The pernet operations in general do not touch someone else's memory.
> If suddenly there is one, KASAN should show it after a while.

Certainly the use of hash tables shared between multiple network
namespaces would count.  I don't remember how many of these we have, but
there used to be quite a few.

Eric
Eric W. Biederman Nov. 15, 2017, 6:25 a.m. | #14
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> Currently a mutex is used to protect the pernet operations list. It makes
> cleanup_net() execute the ->exit methods of the same operations set
> that was used at the time of ->init, even after the net namespace is
> unlinked from net_namespace_list.
>
> But the problem is that synchronize_rcu() is needed after the net is
> removed from net_namespace_list():
>
> Destroy net_ns:
> cleanup_net()
>   mutex_lock(&net_mutex)
>   list_del_rcu(&net->list)
>   synchronize_rcu()                                  <--- Sleep there for ages
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_exit_list(ops, &net_exit_list)
>   list_for_each_entry_reverse(ops, &pernet_list, list)
>     ops_free_list(ops, &net_exit_list)
>   mutex_unlock(&net_mutex)
>
> This primitive is not fast, especially on systems with many processors
> and/or when preemptible RCU is enabled in the config. So, all the time
> while cleanup_net() is waiting for the RCU grace period, creation of new
> net namespaces is not possible, and the tasks that attempt it are
> sleeping on the same mutex:
>
> Create net_ns:
> copy_net_ns()
>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>
> The solution is to convert net_mutex to an rw_semaphore. Then the
> pernet_operations::init/::exit methods, which modify only net-related
> data, will require down_read() locking only, while down_write() will be
> used for changing pernet_list.
>
> This gives a significant performance increase, as you may see below.
> Measured is sequential net namespace creation in a cycle, in a single
> thread, without other tasks (single user mode):
>
> 1)int main(int argc, char *argv[])
> {
>         unsigned nr;
>         if (argc < 2) {
>                 fprintf(stderr, "Provide nr iterations arg\n");
>                 return 1;
>         }
>         nr = atoi(argv[1]);
>         while (nr-- > 0) {
>                 if (unshare(CLONE_NEWNET)) {
>                         perror("Can't unshare");
>                         return 1;
>                 }
>         }
>         return 0;
> }
>
> Origin, 100000 unshare():
> 0.03user 23.14system 1:39.85elapsed 23%CPU
>
> Patched, 100000 unshare():
> 0.03user 67.49system 1:08.34elapsed 98%CPU
>
> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s
>
> Patched:
> real 0m18,235s   (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s
>
> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
> from Linus tree (not in net-next yet).

Using a rwsem to protect the list of operations makes sense.

That should allow removing the single-threaded workqueue.

I am not wild about taking the rwsem down_write in
rtnl_link_unregister and net_ns_barrier.  I think that works, but it
goes from being a mild hack to being a pretty bad hack, and something
else that can kill the parallelism you are seeking to add.

There are about 204 instances of struct pernet_operations.  That is a
lot of code to carefully audit to ensure it can all run in parallel
at once.  The existence of the exit_batch method, net_ns_barrier,
for_each_net and the taking of net_mutex in rtnl_link_unregister all
testify to the fact that there are data structures accessed by multiple
network namespaces.

My preference would be to:

- Add the net_sem in addition to net_mutex with down_write only held in
  register and unregister, and maybe net_ns_barrier and
  rtnl_link_unregister.

- Factor struct pernet_ops out of struct pernet_operations, with
  struct pernet_ops not having the exit_batch method, and with pernet_ops
  embedded as an anonymous member of the old struct pernet_operations.

- Add [un]register_pernet_{sys,dev} functions that take a struct
  pernet_ops, that don't take net_mutex.  Have them order the
  pernet_list as:

  pernet_sys
  pernet_subsys
  pernet_device
  pernet_dev

  With the chunk in the middle taking the net_mutex.

  I think I would enforce the ordering with a failure to register
  if a subsystem or a device tried to register out of order.  

- Disable use of the single threaded workqueue if nothing needs the
  net_mutex.

- Add a test mode that deliberately spawns threads on multiple
  processors and deliberately creates multiple network namespaces
  at the same time.

- Add a test mode that deliberately spawns threads on multiple
  processors and deliberately destroys multiple network namespaces
  at the same time.
  
- Convert the code to unlocked operation one pernet_operations at a
  time, being careful with the loopback device; its order in the list
  strongly matters.

- Finally remove the unnecessary code.


At the end of the day, because all of the operations for one network
namespace will run in parallel with all of the operations for another
network namespace, all of the sophistication that goes into batching the
cleanup of multiple network namespaces can be removed, as different
tasks (not sharing a lock) can wait in synchronize_rcu at the same time
without slowing each other down.

I think we can remove the batching but I am afraid that will lead us into
rtnl_lock contention.

I am definitely not comfortable with changing the locking on so much
code without any explanation in the commit comments of why it is safe
in all 204 instances, which combined equal most of the well maintained
and regularly used part of the networking stack.

Eric

> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  include/linux/rtnetlink.h   |    2 +-
>  include/net/net_namespace.h |    9 +++++++--
>  net/core/net_namespace.c    |   40 ++++++++++++++++++++--------------------
>  net/core/rtnetlink.c        |    4 ++--
>  4 files changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 54bcd970bfd3..36cc009f4580 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -32,7 +32,7 @@ extern int rtnl_trylock(void);
>  extern int rtnl_is_locked(void);
>  
>  extern wait_queue_head_t netdev_unregistering_wq;
> -extern struct mutex net_mutex;
> +extern struct rw_semaphore net_sem;
>  
>  #ifdef CONFIG_PROVE_LOCKING
>  extern bool lockdep_rtnl_is_held(void);
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 10f99dafd5ac..aaed826ccbe7 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -60,7 +60,7 @@ struct net {
>  
>  	struct list_head	list;		/* list of network namespaces */
>  	struct list_head	cleanup_list;	/* namespaces on death row */
> -	struct list_head	exit_list;	/* Use only net_mutex */
> +	struct list_head	exit_list;	/* Use only net_sem */
>  
>  	struct user_namespace   *user_ns;	/* Owning user namespace */
>  	struct ucounts		*ucounts;
> @@ -89,7 +89,12 @@ struct net {
>  	/* core fib_rules */
>  	struct list_head	rules_ops;
>  
> -	struct list_head	fib_notifier_ops;  /* protected by net_mutex */
> +	/*
> +	 * RCU-protected list, modifiable by pernet-init and -exit methods.
> +	 * When net namespace is alive (net::count > 0), all the changes
> +	 * are made under rw_sem held on write.
> +	 */
> +	struct list_head	fib_notifier_ops;
>  
>  	struct net_device       *loopback_dev;          /* The loopback */
>  	struct netns_core	core;
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 6cfdc7c84c48..f502b11b507e 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -29,7 +29,7 @@
>  
>  static LIST_HEAD(pernet_list);
>  static struct list_head *first_device = &pernet_list;
> -DEFINE_MUTEX(net_mutex);
> +DECLARE_RWSEM(net_sem);
>  
>  LIST_HEAD(net_namespace_list);
>  EXPORT_SYMBOL_GPL(net_namespace_list);
> @@ -65,11 +65,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
>  {
>  	struct net_generic *ng, *old_ng;
>  
> -	BUG_ON(!mutex_is_locked(&net_mutex));
> +	BUG_ON(!rwsem_is_locked(&net_sem));
>  	BUG_ON(id < MIN_PERNET_OPS_ID);
>  
>  	old_ng = rcu_dereference_protected(net->gen,
> -					   lockdep_is_held(&net_mutex));
> +					   lockdep_is_held(&net_sem));
>  	if (old_ng->s.len > id) {
>  		old_ng->ptr[id] = data;
>  		return 0;
> @@ -278,7 +278,7 @@ struct net *get_net_ns_by_id(struct net *net, int id)
>   */
>  static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
>  {
> -	/* Must be called with net_mutex held */
> +	/* Must be called with net_sem held */
>  	const struct pernet_operations *ops, *saved_ops;
>  	int error = 0;
>  	LIST_HEAD(net_exit_list);
> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>  
>  	get_user_ns(user_ns);
>  
> -	rv = mutex_lock_killable(&net_mutex);
> +	rv = down_read_killable(&net_sem);
>  	if (rv < 0) {
>  		net_free(net);
>  		dec_net_namespaces(ucounts);
> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>  		list_add_tail_rcu(&net->list, &net_namespace_list);
>  		rtnl_unlock();
>  	}
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  	if (rv < 0) {
>  		dec_net_namespaces(ucounts);
>  		put_user_ns(user_ns);
> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>  	list_replace_init(&cleanup_list, &net_kill_list);
>  	spin_unlock_irq(&cleanup_list_lock);
>  
> -	mutex_lock(&net_mutex);
> +	down_read(&net_sem);
>  
>  	/* Don't let anyone else find us. */
>  	rtnl_lock();
> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
>  		ops_free_list(ops, &net_exit_list);
>  
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  
>  	/* Ensure there are no outstanding rcu callbacks using this
>  	 * network namespace.
> @@ -513,8 +513,8 @@ static void cleanup_net(struct work_struct *work)
>   */
>  void net_ns_barrier(void)
>  {
> -	mutex_lock(&net_mutex);
> -	mutex_unlock(&net_mutex);
> +	down_write(&net_sem);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL(net_ns_barrier);
>  
> @@ -841,7 +841,7 @@ static int __init net_ns_init(void)
>  
>  	rcu_assign_pointer(init_net.gen, ng);
>  
> -	mutex_lock(&net_mutex);
> +	down_read(&net_sem);
>  	if (setup_net(&init_net, &init_user_ns))
>  		panic("Could not setup the initial network namespace");
>  
> @@ -851,7 +851,7 @@ static int __init net_ns_init(void)
>  	list_add_tail_rcu(&init_net.list, &net_namespace_list);
>  	rtnl_unlock();
>  
> -	mutex_unlock(&net_mutex);
> +	up_read(&net_sem);
>  
>  	register_pernet_subsys(&net_ns_ops);
>  
> @@ -991,9 +991,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
>  int register_pernet_subsys(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	error =  register_pernet_operations(first_device, ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_subsys);
> @@ -1009,9 +1009,9 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
>   */
>  void unregister_pernet_subsys(struct pernet_operations *ops)
>  {
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	unregister_pernet_operations(ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>  
> @@ -1037,11 +1037,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
>  int register_pernet_device(struct pernet_operations *ops)
>  {
>  	int error;
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	error = register_pernet_operations(&pernet_list, ops);
>  	if (!error && (first_device == &pernet_list))
>  		first_device = &ops->list;
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(register_pernet_device);
> @@ -1057,11 +1057,11 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
>   */
>  void unregister_pernet_device(struct pernet_operations *ops)
>  {
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	if (&ops->list == first_device)
>  		first_device = first_device->next;
>  	unregister_pernet_operations(ops);
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(unregister_pernet_device);
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 5ace48926b19..caa215fd170b 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -390,11 +390,11 @@ static void rtnl_lock_unregistering_all(void)
>  void rtnl_link_unregister(struct rtnl_link_ops *ops)
>  {
>  	/* Close the race with cleanup_net() */
> -	mutex_lock(&net_mutex);
> +	down_write(&net_sem);
>  	rtnl_lock_unregistering_all();
>  	__rtnl_link_unregister(ops);
>  	rtnl_unlock();
> -	mutex_unlock(&net_mutex);
> +	up_write(&net_sem);
>  }
>  EXPORT_SYMBOL_GPL(rtnl_link_unregister);
>
Kirill Tkhai Nov. 15, 2017, 9:49 a.m. | #15
On 15.11.2017 09:25, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> Currently a mutex is used to protect the pernet operations list. It makes
>> cleanup_net() execute the ->exit methods of the same operations set
>> that was used at the time of ->init, even after the net namespace is
>> unlinked from net_namespace_list.
>>
>> But the problem is that synchronize_rcu() is needed after the net is
>> removed from net_namespace_list():
>>
>> Destroy net_ns:
>> cleanup_net()
>>   mutex_lock(&net_mutex)
>>   list_del_rcu(&net->list)
>>   synchronize_rcu()                                  <--- Sleep there for ages
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>     ops_exit_list(ops, &net_exit_list)
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>     ops_free_list(ops, &net_exit_list)
>>   mutex_unlock(&net_mutex)
>>
>> This primitive is not fast, especially on systems with many processors
>> and/or when preemptible RCU is enabled in the config. So, all the time
>> while cleanup_net() is waiting for the RCU grace period, creation of new
>> net namespaces is not possible, and the tasks that attempt it are
>> sleeping on the same mutex:
>>
>> Create net_ns:
>> copy_net_ns()
>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>
>> The solution is to convert net_mutex to an rw_semaphore. Then the
>> pernet_operations::init/::exit methods, which modify only net-related
>> data, will require down_read() locking only, while down_write() will be
>> used for changing pernet_list.
>>
>> This gives a significant performance increase, as you may see below.
>> Measured is sequential net namespace creation in a cycle, in a single
>> thread, without other tasks (single user mode):
>>
>> 1)int main(int argc, char *argv[])
>> {
>>         unsigned nr;
>>         if (argc < 2) {
>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>                 return 1;
>>         }
>>         nr = atoi(argv[1]);
>>         while (nr-- > 0) {
>>                 if (unshare(CLONE_NEWNET)) {
>>                         perror("Can't unshare");
>>                         return 1;
>>                 }
>>         }
>>         return 0;
>> }
>>
>> Origin, 100000 unshare():
>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>
>> Patched, 100000 unshare():
>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>
>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>
>> Origin:
>> real 1m24,190s
>> user 0m6,225s
>> sys 0m15,132s
>>
>> Patched:
>> real 0m18,235s   (4.6 times faster)
>> user 0m4,544s
>> sys 0m13,796s
>>
>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>> from Linus tree (not in net-next yet).
> 
> Using a rwsem to protect the list of operations makes sense.
> 
> That should allow removing the single-threaded workqueue.
> 
> I am not wild about taking the rwsem down_write in
> rtnl_link_unregister and net_ns_barrier.  I think that works, but it
> goes from being a mild hack to being a pretty bad hack, and something
> else that can kill the parallelism you are seeking to add.
> 
> There are about 204 instances of struct pernet_operations.  That is a
> lot of code to carefully audit to ensure it can all run in parallel
> at once.  The existence of the exit_batch method, net_ns_barrier,
> for_each_net and the taking of net_mutex in rtnl_link_unregister all
> testify to the fact that there are data structures accessed by multiple
> network namespaces.
> 
> My preference would be to:
> 
> - Add the net_sem in addition to net_mutex with down_write only held in
>   register and unregister, and maybe net_ns_barrier and
>   rtnl_link_unregister.
> 
> - Factor out struct pernet_ops out of struct pernet_operations.  With
>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>   being embedded an anonymous member of the old struct pernet_operations.
> 
> - Add [un]register_pernet_{sys,dev} functions that take a struct
>   pernet_ops, that don't take net_mutex.  Have them order the
>   pernet_list as:
> 
>   pernet_sys
>   pernet_subsys
>   pernet_device
>   pernet_dev
> 
>   With the chunk in the middle taking the net_mutex.

I think this approach will work. Thanks for the suggestion. Some more
thoughts on the plan below.

The only difficult thing there will be to choose the right order
to move ops from pernet_subsys to pernet_sys and from pernet_device
to pernet_dev one by one.

This is rather easy in the case of tristate drivers, as modules may be
loaded at any time, and the only important order is the dependencies
between them. So, it's possible to start from a module that has no
dependencies and move it to pernet_sys, and then continue with modules
that have no remaining dependencies in pernet_subsys. For pernet_device
it's vice versa.

In the case of bool drivers, the situation may be worse, because
the order is not so clear there. Initcalls of the same priority
(for example, initcalls registered via core_initcall()) may require
a certain order, driven by linking order. I know one example from
the device mapper code, which lives in drivers/md/Makefile.
This problem is also solvable, even if such places do not contain
comments about linking order. We just need to respect the Makefile
order when choosing a new candidate to move.
 
>   I think I would enforce the ordering with a failure to register
>   if a subsystem or a device tried to register out of order.  
> 
> - Disable use of the single threaded workqueue if nothing needs the
>   net_mutex.

We may use per-cpu workqueues in the future. The idea of refusing to use
a workqueue doesn't seem good to me, because asynchronous net destruction
looks very useful.

> - Add a test mode that deliberately spawns threads on multiple
>   processors and deliberately creates multiple network namespaces
>   at the same time.
> 
> - Add a test mode that deliberately spawns threads on multiple
>   processors and deliberately destroys multiple network namespaces
>   at the same time.
> 
> - Convert the code to unlocked operation one pernet_operations at a
>   time, being careful with the loopback device; its order in the list
>   strongly matters.
> 
> - Finally remove the unnecessary code.
> 
> 
> At the end of the day, because all of the operations for one network
> namespace will run in parallel with all of the operations for another
> network namespace, all of the sophistication that goes into batching the
> cleanup of multiple network namespaces can be removed, as different
> tasks (not sharing a lock) can wait in synchronize_rcu at the same time
> without slowing each other down.
> 
> I think we can remove the batching but I am afraid that will lead us into
> rtnl_lock contention.

I've looked into this lock. It's used in many places for many reasons.
It's a little strange that nobody has tried to break it up into several
small locks.
 
> I am definitely not comfortable with changing the locking on so much
> code without any explanation in the commit comments of why it is safe
> in all 204 instances, which combined equal most of the well maintained
> and regularly used part of the networking stack.

David,

could you please check the plan above and say whether 1) it's OK for you,
and 2) if so, whether you expect all the changes in one big 200+ patch set
or we may go sequentially (50 patches at a time, for example)?

>>  
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 5ace48926b19..caa215fd170b 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -390,11 +390,11 @@ static void rtnl_lock_unregistering_all(void)
>>  void rtnl_link_unregister(struct rtnl_link_ops *ops)
>>  {
>>  	/* Close the race with cleanup_net() */
>> -	mutex_lock(&net_mutex);
>> +	down_write(&net_sem);
>>  	rtnl_lock_unregistering_all();
>>  	__rtnl_link_unregister(ops);
>>  	rtnl_unlock();
>> -	mutex_unlock(&net_mutex);
>> +	up_write(&net_sem);
>>  }
>>  EXPORT_SYMBOL_GPL(rtnl_link_unregister);
>>
Kirill Tkhai Nov. 15, 2017, 9:51 a.m. | #16
On 15.11.2017 06:19, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 14.11.2017 21:39, Cong Wang wrote:
>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> [... copy_net_ns() hunks trimmed; identical to the diff quoted above ...]
>>>>
>>>> -       mutex_lock(&net_mutex);
>>>> +       down_read(&net_sem);
>>>>
>>>>         /* Don't let anyone else find us. */
>>>>         rtnl_lock();
>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>                 ops_free_list(ops, &net_exit_list);
>>>>
>>>> -       mutex_unlock(&net_mutex);
>>>> +       up_read(&net_sem);
>>>
>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>> given that ops_exit_list() is called on the error path of setup_net() too,
>>> which means ops->exit() could now run concurrently if it doesn't have its
>>> own lock. Not sure if this breaks any existing user.
>>
>> Yes, concurrent ops->init() for one net namespace and ops->exit() for
>> another will be possible. I haven't found pernet operations that have
>> a problem with that. If they exist, they are hidden and not clearly seen.
>> Pernet operations in general do not touch someone else's memory.
>> If one suddenly does, KASAN should show it after a while.
> 
> Certainly the use of hash tables shared between multiple network
> namespaces would count.  I don't remember how many of these we have but
> there used to be quite a few.

Could you please provide an example of the hash tables you mean?
Kirill Tkhai Nov. 15, 2017, 12:36 p.m. | #17
On 15.11.2017 12:51, Kirill Tkhai wrote:
> On 15.11.2017 06:19, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>
>>> On 14.11.2017 21:39, Cong Wang wrote:
>>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>> [... copy_net_ns() hunks trimmed; identical to the diff quoted above ...]
>>>>>
>>>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>         list_replace_init(&cleanup_list, &net_kill_list);
>>>>>         spin_unlock_irq(&cleanup_list_lock);
>>>>>
>>>>> -       mutex_lock(&net_mutex);
>>>>> +       down_read(&net_sem);
>>>>>
>>>>>         /* Don't let anyone else find us. */
>>>>>         rtnl_lock();
>>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>                 ops_free_list(ops, &net_exit_list);
>>>>>
>>>>> -       mutex_unlock(&net_mutex);
>>>>> +       up_read(&net_sem);
>>>>
>>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>>> given that ops_exit_list() is called on the error path of setup_net() too,
>>>> which means ops->exit() could now run concurrently if it doesn't have its
>>>> own lock. Not sure if this breaks any existing user.
>>>
>>> Yes, concurrent ops->init() for one net namespace and ops->exit() for
>>> another will be possible. I haven't found pernet operations that have
>>> a problem with that. If they exist, they are hidden and not clearly seen.
>>> Pernet operations in general do not touch someone else's memory.
>>> If one suddenly does, KASAN should show it after a while.
>>
>> Certainly the use of hash tables shared between multiple network
>> namespaces would count.  I don't remember how many of these we have but
>> there used to be quite a few.
> 
> Could you please provide an example of the hash tables you mean?

Ah, I see, it's dccp_hashinfo etc.
Eric W. Biederman Nov. 15, 2017, 4:29 p.m. | #18
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 15.11.2017 09:25, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>> 
>>> Currently a mutex is used to protect the pernet operations list. It makes
>>> cleanup_net() execute the ->exit methods of the same operations set
>>> that was used at the time of ->init, even after the net namespace is
>>> unlinked from net_namespace_list.
>>>
>>> But the problem is that we need to synchronize_rcu() after the net is removed
>>> from net_namespace_list():
>>>
>>> Destroy net_ns:
>>> cleanup_net()
>>>   mutex_lock(&net_mutex)
>>>   list_del_rcu(&net->list)
>>>   synchronize_rcu()                                  <--- Sleep there for ages
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_exit_list(ops, &net_exit_list)
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_free_list(ops, &net_exit_list)
>>>   mutex_unlock(&net_mutex)
>>>
>>> This primitive is not fast, especially on systems with many processors
>>> and/or when preemptible RCU is enabled in the config. So, all the time
>>> cleanup_net() is waiting for the RCU grace period, creation of new net
>>> namespaces is not possible, and the tasks that create them sleep on the same mutex:
>>>
>>> Create net_ns:
>>> copy_net_ns()
>>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>>
>>> The solution is to convert net_mutex to an rw_semaphore. Then
>>> pernet_operations::init/::exit methods, which modify the net-related data,
>>> will require down_read() locking only, while down_write() will be used
>>> for changing pernet_list.
>>>
>>> This gives a significant performance increase, as you may see below.
>>> The benchmark measures sequential net namespace creation in a cycle, in a
>>> single thread, without other tasks (single user mode):
>>>
>>> 1)int main(int argc, char *argv[])
>>> {
>>>         unsigned nr;
>>>         if (argc < 2) {
>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>                 return 1;
>>>         }
>>>         nr = atoi(argv[1]);
>>>         while (nr-- > 0) {
>>>                 if (unshare(CLONE_NEWNET)) {
>>>                         perror("Can't unshare");
>>>                         return 1;
>>>                 }
>>>         }
>>>         return 0;
>>> }
>>>
>>> Origin, 100000 unshare():
>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>
>>> Patched, 100000 unshare():
>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>
>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Origin:
>>> real 1m24,190s
>>> user 0m6,225s
>>> sys 0m15,132s
>>>
>>> Patched:
>>> real 0m18,235s   (4.6 times faster)
>>> user 0m4,544s
>>> sys 0m13,796s
>>>
>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>> from Linus tree (not in net-next yet).
>> 
>> Using a rwsem to protect the list of operations makes sense.
>> 
>> That should allow removing the single threaded workqueue.
>> 
>> I am not wild about taking the rwsem down_write in
>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>> goes from being a mild hack to being a pretty bad hack and something
>> else that can kill the parallelism you are seeking to add.
>> 
>> There are about 204 instances of struct pernet_operations.  That is a
>> lot of code to have carefully audited to ensure it can run in parallel all
>> at once.  The existence of the exit_batch method, net_ns_barrier,
>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>> to the fact that there are data structures accessed by multiple network
>> namespaces.
>> 
>> My preference would be to:
>> 
>> - Add the net_sem in addition to net_mutex with down_write only held in
>>   register and unregister, and maybe net_ns_barrier and
>>   rtnl_link_unregister.
>> 
>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>   being embedded as an anonymous member of the old struct pernet_operations.
>> 
>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>   pernet_ops, that don't take net_mutex.  Have them order the
>>   pernet_list as:
>> 
>>   pernet_sys
>>   pernet_subsys
>>   pernet_device
>>   pernet_dev
>> 
>>   With the chunk in the middle taking the net_mutex.
>
> I think this approach will work. Thanks for the suggestion. Some more
> thoughts to the plan below.
>
> The only difficult thing there will be to choose the right order
> to move ops from pernet_subsys to pernet_sys and from pernet_device
> to pernet_dev one by one.
>
> This is rather easy in the case of tristate drivers, as modules may be loaded
> at any time, and the only important ordering is the dependencies between them.
> So, it's possible to start with a module that has no dependencies,
> move it to pernet_sys, and then continue with modules
> that have no more dependencies in pernet_subsys. For pernet_device
> it's vice versa.
>
> In the case of bool drivers, the situation may be worse, because
> the ordering is not so clear there. Initcalls of the same priority
> (for example, initcalls registered via core_initcall()) may require
> a certain order, driven by linking order. I know one example from
> the device mapper code, which lives here: drivers/md/Makefile.
> This problem is also solvable, even if such places do not contain
> comments about linking order. We just need to respect the Makefile
> order when choosing a new candidate to move.
>  
>>   I think I would enforce the ordering with a failure to register
>>   if a subsystem or a device tried to register out of order.  
>> 
>> - Disable use of the single threaded workqueue if nothing needs the
>>   net_mutex.
>
> We may use per-cpu workqueues in the future. The idea of refusing to use
> a workqueue doesn't seem good to me, because asynchronous net destruction
> looks very useful.

per-cpu workqueues are fine, and definitely what I am expecting.  If we
are doing this I want to get us off the single threaded workqueue that
serializes all of the cleanup.  That has a huge potential for
simplifying things and reducing maintenance if running everything in
parallel is actually safe.

I forget how the modern per-cpu workqueues work with respect to sleeps
and locking (I don't remember whether, when a piece of work sleeps for a long
time, we spin up another per-cpu workqueue thread, and thus avoid
priority inversion problems).

If locks between workqueues are not a problem we could start the
transition off of the single-threaded serializing workqueue sooner
rather than later.

>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately creates multiple network namespaces
>>   at the same time.
>>
>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately destroys multiple network namespaces
>>   at the same time.
>>
>> - Convert the code to unlocked operation one pernet_operations at a
>>   time.  Be careful with the loopback device; its order in the list
>>   strongly matters.
>> 
>> - Finally remove the unnecessary code.
>> 
>> 
>> At the end of the day because all of the operations for one network
>> namespace will run in parallel with all of the operations for another
>> network namespace all of the sophistication that goes into batching the
>> cleanup of multiple network namespaces can be removed.  As different
>> tasks (not sharing a lock) can wait in synchronize_rcu at the same time
>> without slowing each other down.
>> 
>> I think we can remove the batching but I am afraid that will lead us into
>> rtnl_lock contention.
>
> I've looked into this lock. It is used in many places for many reasons.
> It's a little strange that nobody has tried to break it up into several smaller locks.

It gets tricky.  At this point getting net_mutex is enough to start
with. Mostly the rtnl_lock covers the slow path which tends to keep it
from rising to the top of the priority list.

>> I am definitely not comfortable with changing the locking on so much
>> code without an explanation at all why it is safe in the commit comments
>> in all 204 instances.  Which combined equal most of the well maintained
>> and regularly used part of the networking stack.
>
> David,
>
> could you please check the plan above and say whether 1) it's OK for you,
> and 2) if so, whether you expect all the changes in one big 200+ patch set,
> or whether we may go sequentially (50 patches at a time, for example)?

I think I would start with something covering the core networking
pieces so the improvements can be seen.


Eric
Eric W. Biederman Nov. 15, 2017, 4:31 p.m. | #19
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 15.11.2017 12:51, Kirill Tkhai wrote:
>> On 15.11.2017 06:19, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> On 14.11.2017 21:39, Cong Wang wrote:
>>>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>> [... copy_net_ns() hunks trimmed; identical to the diff quoted above ...]
>>>>>>
>>>>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>         list_replace_init(&cleanup_list, &net_kill_list);
>>>>>>         spin_unlock_irq(&cleanup_list_lock);
>>>>>>
>>>>>> -       mutex_lock(&net_mutex);
>>>>>> +       down_read(&net_sem);
>>>>>>
>>>>>>         /* Don't let anyone else find us. */
>>>>>>         rtnl_lock();
>>>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>>                 ops_free_list(ops, &net_exit_list);
>>>>>>
>>>>>> -       mutex_unlock(&net_mutex);
>>>>>> +       up_read(&net_sem);
>>>>>
>>>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>>>> given that ops_exit_list() is called on the error path of setup_net() too,
>>>>> which means ops->exit() could now run concurrently if it doesn't have its
>>>>> own lock. Not sure if this breaks any existing user.
>>>>
>>>> Yes, concurrent ops->init() for one net namespace and ops->exit() for
>>>> another will be possible. I haven't found pernet operations that have
>>>> a problem with that. If they exist, they are hidden and not clearly seen.
>>>> Pernet operations in general do not touch someone else's memory.
>>>> If one suddenly does, KASAN should show it after a while.
>>>
>>> Certainly the use of hash tables shared between multiple network
>>> namespaces would count.  I don't remember how many of these we have but
>>> there used to be quite a few.
>> 
>> Could you please provide an example of the hash tables you mean?
>
> Ah, I see, it's dccp_hashinfo etc.

The big one used to be the route cache.  With resizable hash tables
things may be getting better in that regard.

Eric
Kirill Tkhai Nov. 16, 2017, 9:13 a.m. | #20
On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>> [... quoted commit message and earlier discussion trimmed ...]
>>>   I think I would enforce the ordering with a failure to register
>>>   if a subsystem or a device tried to register out of order.  
>>>
>>> - Disable use of the single threaded workqueue if nothing needs the
>>>   net_mutex.
>>
>> We may use per-cpu workqueues in the future. The idea of refusing to use
>> a workqueue doesn't seem good to me, because asynchronous net destruction
>> looks very useful.
> 
> per-cpu workqueues are fine, and definitely what I am expecting.  If we
> are doing this I want to get us off the single threaded workqueue that
> serializes all of the cleanup.  That has a huge potential for
> simplifying things and reducing maintenance if running everything in
> parallel is actually safe.
> 
> I forget how the modern per-cpu workqueues work with respect to sleeps
> and locking (I don't remember whether, when a piece of work sleeps for a long
> time, we spin up another per-cpu workqueue thread, and thus avoid
> priority inversion problems).

As far as I know, there is no problem with a worker thread sleeping, as there
is a pool. And right after one leaves the cpu runqueue, another worker wakes up.

> [... remainder of quoted message trimmed ...]
Kirill Tkhai Nov. 17, 2017, 4:46 p.m. | #21
On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> [... quoted commit message trimmed ...]
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the single threaded workqueue.
>>>
>>> I am not wild about taking the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking it add.
>>>
>>> There are about 204 instances of struct pernet_operations.  That is a
>>> lot of code to have carefully audited to ensure it can run in parallel all
>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor struct pernet_ops out of struct pernet_operations, with
>>>   struct pernet_ops not having the exit_batch method, and with pernet_ops
>>>   embedded as an anonymous member of the old struct pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>   pernet_list as:
>>>
>>>   pernet_sys
>>>   pernet_subsys
>>>   pernet_device
>>>   pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>>
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>>
>> This is rather easy in the case of tristate drivers, as modules may be loaded
>> at any time, and the only important ordering is the dependencies between them.
>> So, it's possible to start from a module which has no dependencies,
>> move it to pernet_sys, and then continue with modules
>> which have no more dependencies in pernet_subsys. For pernet_device
>> it's vice versa.
>>
>> In the case of bool drivers, the situation may be worse, because
>> the order is not so clear there. Initcalls of the same priority
>> (for example, initcalls registered via core_initcall()) may require
>> a certain order, driven by linking order. I know one example from
>> the device mapper code, which lives in drivers/md/Makefile.
>> This problem is also solvable, even if such places do not contain
>> comments about linking order: one just needs to respect the Makefile
>> order when choosing a new candidate to move.
>>  
>>>   I think I would enforce the ordering with a failure to register
>>>   if a subsystem or a device tried to register out of order.  
>>>
>>> - Disable use of the single threaded workqueue if nothing needs the
>>>   net_mutex.
>>
>> We may use per-cpu workqueues in the future. The idea of refusing to use
>> a workqueue doesn't seem good to me, because asynchronous net destruction
>> looks very useful.
> 
> per-cpu workqueues are fine, and definitely what I am expecting.  If we
> are doing this I want to get us off the single threaded workqueue that
> serializes all of the cleanup.  That has a huge potential for
> simplifying things and reducing maintenance if running everything in
> parallel is actually safe.
> 
> I forget how the modern per-cpu workqueues work with respect to sleeps
> and locking (I don't remember if, when a piece of work sleeps for a long
> time, we spin up another per-cpu workqueue thread, and thus avoid
> priority inversion problems).
> 
> If locks between workqueues are not a problem we could start the
> transition off of the single-threaded serializing workqueue sooner
> rather than later.
> 
>>> - Add a test mode that deliberately spawns threads on multiple
>>>   processors and deliberately creates multiple network namespaces
>>>   at the same time.
>>>
>>> - Add a test mode that deliberately spawns threads on multiple
>>>   processors and deliberately destroys multiple network namespaces
>>>   at the same time.
>>>
>>> - Convert the code to unlocked operation one pernet_operations at a
>>>   time.  Being careful with the loopback device: its order in the list
>>>   strongly matters.
>>>
>>> - Finally remove the unnecessary code.
>>>
>>>
>>> At the end of the day, because all of the operations for one network
>>> namespace will run in parallel with all of the operations for another
>>> network namespace, all of the sophistication that goes into batching the
>>> cleanup of multiple network namespaces can be removed.  Different
>>> tasks (not sharing a lock) can wait in synchronize_rcu at the same time
>>> without slowing each other down.
>>>
>>> I think we can remove the batching but I am afraid that will lead us into
>>> rtnl_lock contention.
>>
>> I've looked into this lock. It is used in many places for many reasons.
>> It's a little strange that nobody has tried to break it up into several smaller locks.
> 
> It gets tricky.  At this point getting net_mutex is enough to start
> with. Mostly the rtnl_lock covers the slow path which tends to keep it
> from rising to the top of the priority list.
> 
>>> I am definitely not comfortable with changing the locking on so much
>>> code without any explanation of why it is safe in the commit comments
>>> of all 204 instances.  Combined, these equal most of the well-maintained
>>> and regularly used parts of the networking stack.
>>
>> David,
>>
>> could you please check the plan above and say whether 1) it's OK with you,
>> and 2) if so, whether you expect all the changes to be made in one big 200+ patch set,
>> or whether we may go sequentially (50 patches at a time, for example)?
> 
> I think I would start with something covering the core networking
> pieces so the improvements can be seen.

Since the main problem is the mutex held during synchronize_rcu(), the performance
improvements may only become visible after it's completely removed.
Kirill Tkhai Nov. 17, 2017, 6:36 p.m. | #22
On 15.11.2017 19:31, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>>
>>>>> On 14.11.2017 21:39, Cong Wang wrote:
>>>>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>
>>>>>>>         get_user_ns(user_ns);
>>>>>>>
>>>>>>> -       rv = mutex_lock_killable(&net_mutex);
>>>>>>> +       rv = down_read_killable(&net_sem);
>>>>>>>         if (rv < 0) {
>>>>>>>                 net_free(net);
>>>>>>>                 dec_net_namespaces(ucounts);
>>>>>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>                 list_add_tail_rcu(&net->list, &net_namespace_list);
>>>>>>>                 rtnl_unlock();
>>>>>>>         }
>>>>>>> -       mutex_unlock(&net_mutex);
>>>>>>> +       up_read(&net_sem);
>>>>>>>         if (rv < 0) {
>>>>>>>                 dec_net_namespaces(ucounts);
>>>>>>>                 put_user_ns(user_ns);
>>>>>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>         list_replace_init(&cleanup_list, &net_kill_list);
>>>>>>>         spin_unlock_irq(&cleanup_list_lock);
>>>>>>>
>>>>>>> -       mutex_lock(&net_mutex);
>>>>>>> +       down_read(&net_sem);
>>>>>>>
>>>>>>>         /* Don't let anyone else find us. */
>>>>>>>         rtnl_lock();
>>>>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>>>                 ops_free_list(ops, &net_exit_list);
>>>>>>>
>>>>>>> -       mutex_unlock(&net_mutex);
>>>>>>> +       up_read(&net_sem);
>>>>>>
>>>>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>>>>> given that ops_exit_list() is called on error path of setup_net() too,
>>>>>> it means ops->exit() now could run concurrently if it doesn't have its
>>>>>> own lock. Not sure if this breaks any existing user.
>>>>>
>>>>> Yes, there will be possible concurrent ops->init() for a net namespace,
>>>>> and ops->exit() for another one. I haven't found pernet operations which
>>>>> have a problem with that. If they exist, they are hidden and not clearly seen.
>>>>> The pernet operations in general do not touch someone else's memory.
>>>>> If suddenly there is one, KASAN should show it after a while.
>>>>
>>>> Certainly the use of hash tables shared between multiple network
>>>> namespaces would count.  I don't remember how many of these we have, but
>>>> there used to be quite a few.
>>>
>>> Could you please provide an example of the hash tables you mean?
>>
>> Ah, I see, it's dccp_hashinfo etc.

JFI, I've checked dccp_hashinfo, and it seems to be safe.

> 
> The big one used to be the route cache.  With resizable hash tables
> things may be getting better in that regard.

I've checked some fib-related things and wasn't able to find that.
Excuse me, could you please clarify whether this is an assumption, or
whether there is a specific problematic hash table that you know of?
Could you please point me to it more exactly, if so.
Eric W. Biederman Nov. 17, 2017, 6:52 p.m. | #23
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 15.11.2017 19:31, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>> 
>>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>>>
>>>>> Certainly the use of hash tables shared between multiple network
>>>>> namespaces would count.  I don't remember how many of these we have, but
>>>>> there used to be quite a few.
>>>>
>>>> Could you please provide an example of the hash tables you mean?
>>>
>>> Ah, I see, it's dccp_hashinfo etc.
>
> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>
>> 
>> The big one used to be the route cache.  With resizable hash tables
>> things may be getting better in that regard.
>
> I've checked some fib-related things, and wasn't able to find that.
> Excuse me, could you please clarify, if it's an assumption, or
> there is exactly a problem hash table, you know? Could you please
> point it me more exactly, if it's so.

Two things.
1) Hash tables are one case I know of where we access data from multiple
   network namespaces.  As such, it cannot be asserted that there is no
   possibility of problems.

2) The responsible way to handle this is one patch for each set of
   methods explaining why those methods are safe to run in parallel.

   That ensures there is opportunity for review and people are going
   slowly enough that they actually look at these issues.

The reason I want to see this broken up is that at 200ish sets of
methods it is too much to review all at once.

I completely agree that the odds are that this can be made safe and that it
is most likely already safe in practically every instance.  My guess
would be that if there are problems that need to be addressed they
happen in one or two places and we need to find them.  If possible I
don't want to find them after the code has shipped in a stable release.

Eric
Eric W. Biederman Nov. 17, 2017, 6:54 p.m. | #24
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> On 15.11.2017 19:29, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>> 
>> I think I would start with something covering the core networking
>> pieces so the improvements can be seen.
>
> Since the main problem is the mutex held during synchronize_rcu(), the performance
> improvements may only become visible after it's completely removed.

In a distro kernel build, sure.  But I think with just a few conversions
covering some built-in network devices, the core networking stack, and a few
ipv4 pieces, you should be able to have a minimal but usable configuration
that does not need the lock at all.

Further dropping the lock earlier means the code can be actively
exercised running in parallel in case something is missed in review.

Eric
Kirill Tkhai Nov. 17, 2017, 8:12 p.m. | #25
On 17.11.2017 21:54, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 15.11.2017 19:29, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>> I think I would start with something covering the core networking
>>> pieces so the improvements can be seen.
>>
>> Since the main problem is the mutex held during synchronize_rcu(), the performance
>> improvements may only become visible after it's completely removed.
> 
> In a distro kernel build, sure.  But I think with just a few conversions
> (some built-in network devices, the core networking stack and a few ipv4
> pieces) you should be able to have a minimal but usable configuration
> that does not need the lock at all.
> 
> Further dropping the lock earlier means the code can be actively
> exercised running in parallel in case something is missed in review.

There is an issue. Look at the way setup_net() looks at the moment:

mutex_lock(&net_mutex);
list_for_each_entry(ops, &pernet_list, list)
     ops_init(ops, net);
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list); <--- Here
rtnl_unlock();
mutex_unlock(&net_mutex);

There is linking net to net_namespace_list under net_mutex.
Can list_add_tail_rcu() be moved out of net_mutex like:

mutex_lock(&net_mutex);
list_for_each_entry(ops, &pernet_list, list)
     ops_init(ops, net);
mutex_unlock(&net_mutex);
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();

I assume it can't.

This means it's impossible to separate the lists like below,
where sys and dev don't require the mutex:

sys, subsys, device, dev,

because we can't release it until the net::list is linked to net_namespace_list.

So, the only way is to go sequentially:
1)Create sys ahead of subsys, device
2)Move all subsys operations to sys
3)Remove subsys. Now we have sys and device, where sys doesn't require mutex, while device requires.
4)Create dev ahead of device: sys, dev, device
5)Move all device operations to dev
6)Remove device. Now we have sys and dev
7)Move the above list_add_tail_rcu() out of net_mutex.

This keeps net_mutex until every subsys and device is removed,
and it's impossible to find a .config where the mutex won't be held.
So there will still be a synchronize_rcu() with the mutex held.

Or maybe there is no problem with moving list_add_tail_rcu()
out of the lock right now? What do you think about this?
Kirill Tkhai Nov. 17, 2017, 8:16 p.m. | #26
On 17.11.2017 21:52, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
> 
>> On 15.11.2017 19:31, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>
>>>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>>>>
>>>>>>> On 14.11.2017 21:39, Cong Wang wrote:
>>>>>>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>>>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>>>
>>>>>>>>>         get_user_ns(user_ns);
>>>>>>>>>
>>>>>>>>> -       rv = mutex_lock_killable(&net_mutex);
>>>>>>>>> +       rv = down_read_killable(&net_sem);
>>>>>>>>>         if (rv < 0) {
>>>>>>>>>                 net_free(net);
>>>>>>>>>                 dec_net_namespaces(ucounts);
>>>>>>>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>>>>>>>                 list_add_tail_rcu(&net->list, &net_namespace_list);
>>>>>>>>>                 rtnl_unlock();
>>>>>>>>>         }
>>>>>>>>> -       mutex_unlock(&net_mutex);
>>>>>>>>> +       up_read(&net_sem);
>>>>>>>>>         if (rv < 0) {
>>>>>>>>>                 dec_net_namespaces(ucounts);
>>>>>>>>>                 put_user_ns(user_ns);
>>>>>>>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>>>         list_replace_init(&cleanup_list, &net_kill_list);
>>>>>>>>>         spin_unlock_irq(&cleanup_list_lock);
>>>>>>>>>
>>>>>>>>> -       mutex_lock(&net_mutex);
>>>>>>>>> +       down_read(&net_sem);
>>>>>>>>>
>>>>>>>>>         /* Don't let anyone else find us. */
>>>>>>>>>         rtnl_lock();
>>>>>>>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>>>>>>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>>>>>>                 ops_free_list(ops, &net_exit_list);
>>>>>>>>>
>>>>>>>>> -       mutex_unlock(&net_mutex);
>>>>>>>>> +       up_read(&net_sem);
>>>>>>>>
>>>>>>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>>>>>>> given that ops_exit_list() is called on error path of setup_net() too,
>>>>>>>> it means ops->exit() now could run concurrently if it doesn't have its
>>>>>>>> own lock. Not sure if this breaks any existing user.
>>>>>>>
>>>>>>> Yes, there will be possible concurrent ops->init() for one net namespace
>>>>>>> and ops->exit() for another. I haven't found pernet operations which
>>>>>>> have a problem with that. If they exist, they are hidden and not clearly
>>>>>>> visible. The pernet operations in general do not touch someone else's
>>>>>>> memory. If there suddenly is one, KASAN should show it after a while.
>>>>>>
>>>>>> Certainly the use of hash tables shared between multiple network
>>>>>> namespaces would count.  I don't remember how many of these we have,
>>>>>> but there used to be quite a few.
>>>>>
>>>>> Could you please provide an example of hash tables, you mean?
>>>>
>>>> Ah, I see, it's dccp_hashinfo etc.
>>
>> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>>
>>>
>>> The big one used to be the route cache.  With resizable hash tables
>>> things may be getting better in that regard.
>>
>> I've checked some fib-related things, and wasn't able to find that.
>> Excuse me, could you please clarify whether this is an assumption, or
>> whether there is a specific problematic hash table you know of? Could
>> you please point me to it more exactly, if so.
> 
> Two things.
> 1) Hash tables are one case I know of where we access data from multiple
>    network namespaces.  As such it can not be asserted that there is no
>    possibility of problems.
> 
> 2) The responsible way to handle this is one patch for each set of
>    methods explaining why those methods are safe to run in parallel.
> 
>    That ensures there is opportunity for review and people are going
>    slowly enough that they actually look at these issues.
> 
> The reason I want to see this broken up is that at 200ish sets of
> methods it is too much to review all at once.

Ok, it's possible to split the changes into 400 patches, but there is
a problem with three-state (no compile, module, built-in) drivers.
Git bisect won't work anyway. Please see the description of the problem
in the cover message "[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore"
I sent today.
 
> I completely agree that odds are that this can be made safe and that it
> is most likely already safe in practically every instance.  My guess
> would be that if there are problems that need to be addressed, they
> happen in one or two places, and we need to find them.  If possible I
> don't want to find them after the code has shipped in a stable release.

Kirill

Patch

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 54bcd970bfd3..36cc009f4580 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -32,7 +32,7 @@  extern int rtnl_trylock(void);
 extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
-extern struct mutex net_mutex;
+extern struct rw_semaphore net_sem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 10f99dafd5ac..aaed826ccbe7 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -60,7 +60,7 @@  struct net {
 
 	struct list_head	list;		/* list of network namespaces */
 	struct list_head	cleanup_list;	/* namespaces on death row */
-	struct list_head	exit_list;	/* Use only net_mutex */
+	struct list_head	exit_list;	/* Use only net_sem */
 
 	struct user_namespace   *user_ns;	/* Owning user namespace */
 	struct ucounts		*ucounts;
@@ -89,7 +89,12 @@  struct net {
 	/* core fib_rules */
 	struct list_head	rules_ops;
 
-	struct list_head	fib_notifier_ops;  /* protected by net_mutex */
+	/*
+	 * RCU-protected list, modifiable by pernet-init and -exit methods.
+	 * When net namespace is alive (net::count > 0), all the changes
+	 * are made under rw_sem held on write.
+	 */
+	struct list_head	fib_notifier_ops;
 
 	struct net_device       *loopback_dev;          /* The loopback */
 	struct netns_core	core;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6cfdc7c84c48..f502b11b507e 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -29,7 +29,7 @@ 
 
 static LIST_HEAD(pernet_list);
 static struct list_head *first_device = &pernet_list;
-DEFINE_MUTEX(net_mutex);
+DECLARE_RWSEM(net_sem);
 
 LIST_HEAD(net_namespace_list);
 EXPORT_SYMBOL_GPL(net_namespace_list);
@@ -65,11 +65,11 @@  static int net_assign_generic(struct net *net, unsigned int id, void *data)
 {
 	struct net_generic *ng, *old_ng;
 
-	BUG_ON(!mutex_is_locked(&net_mutex));
+	BUG_ON(!rwsem_is_locked(&net_sem));
 	BUG_ON(id < MIN_PERNET_OPS_ID);
 
 	old_ng = rcu_dereference_protected(net->gen,
-					   lockdep_is_held(&net_mutex));
+					   lockdep_is_held(&net_sem));
 	if (old_ng->s.len > id) {
 		old_ng->ptr[id] = data;
 		return 0;
@@ -278,7 +278,7 @@  struct net *get_net_ns_by_id(struct net *net, int id)
  */
 static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 {
-	/* Must be called with net_mutex held */
+	/* Must be called with net_sem held */
 	const struct pernet_operations *ops, *saved_ops;
 	int error = 0;
 	LIST_HEAD(net_exit_list);
@@ -406,7 +406,7 @@  struct net *copy_net_ns(unsigned long flags,
 
 	get_user_ns(user_ns);
 
-	rv = mutex_lock_killable(&net_mutex);
+	rv = down_read_killable(&net_sem);
 	if (rv < 0) {
 		net_free(net);
 		dec_net_namespaces(ucounts);
@@ -421,7 +421,7 @@  struct net *copy_net_ns(unsigned long flags,
 		list_add_tail_rcu(&net->list, &net_namespace_list);
 		rtnl_unlock();
 	}
-	mutex_unlock(&net_mutex);
+	up_read(&net_sem);
 	if (rv < 0) {
 		dec_net_namespaces(ucounts);
 		put_user_ns(user_ns);
@@ -446,7 +446,7 @@  static void cleanup_net(struct work_struct *work)
 	list_replace_init(&cleanup_list, &net_kill_list);
 	spin_unlock_irq(&cleanup_list_lock);
 
-	mutex_lock(&net_mutex);
+	down_read(&net_sem);
 
 	/* Don't let anyone else find us. */
 	rtnl_lock();
@@ -486,7 +486,7 @@  static void cleanup_net(struct work_struct *work)
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_free_list(ops, &net_exit_list);
 
-	mutex_unlock(&net_mutex);
+	up_read(&net_sem);
 
 	/* Ensure there are no outstanding rcu callbacks using this
 	 * network namespace.
@@ -513,8 +513,8 @@  static void cleanup_net(struct work_struct *work)
  */
 void net_ns_barrier(void)
 {
-	mutex_lock(&net_mutex);
-	mutex_unlock(&net_mutex);
+	down_write(&net_sem);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL(net_ns_barrier);
 
@@ -841,7 +841,7 @@  static int __init net_ns_init(void)
 
 	rcu_assign_pointer(init_net.gen, ng);
 
-	mutex_lock(&net_mutex);
+	down_read(&net_sem);
 	if (setup_net(&init_net, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
@@ -851,7 +851,7 @@  static int __init net_ns_init(void)
 	list_add_tail_rcu(&init_net.list, &net_namespace_list);
 	rtnl_unlock();
 
-	mutex_unlock(&net_mutex);
+	up_read(&net_sem);
 
 	register_pernet_subsys(&net_ns_ops);
 
@@ -991,9 +991,9 @@  static void unregister_pernet_operations(struct pernet_operations *ops)
 int register_pernet_subsys(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	error =  register_pernet_operations(first_device, ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -1009,9 +1009,9 @@  EXPORT_SYMBOL_GPL(register_pernet_subsys);
  */
 void unregister_pernet_subsys(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 
@@ -1037,11 +1037,11 @@  EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 int register_pernet_device(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	error = register_pernet_operations(&pernet_list, ops);
 	if (!error && (first_device == &pernet_list))
 		first_device = &ops->list;
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_device);
@@ -1057,11 +1057,11 @@  EXPORT_SYMBOL_GPL(register_pernet_device);
  */
 void unregister_pernet_device(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	if (&ops->list == first_device)
 		first_device = first_device->next;
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_device);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5ace48926b19..caa215fd170b 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -390,11 +390,11 @@  static void rtnl_lock_unregistering_all(void)
 void rtnl_link_unregister(struct rtnl_link_ops *ops)
 {
 	/* Close the race with cleanup_net() */
-	mutex_lock(&net_mutex);
+	down_write(&net_sem);
 	rtnl_lock_unregistering_all();
 	__rtnl_link_unregister(ops);
 	rtnl_unlock();
-	mutex_unlock(&net_mutex);
+	up_write(&net_sem);
 }
 EXPORT_SYMBOL_GPL(rtnl_link_unregister);