Message ID:   1512665148-2413-2-git-send-email-xiangxia.m.yue@gmail.com
State:        Rejected, archived
Delegated to: David Miller
Series:       [v5,1/2] sock: Change the netns_core member name.
On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
> In some case, we want to know how many sockets are in use in
> different _net_ namespaces. It's a key resource metric.
>
> ...
>
> +static void sock_inuse_add(struct net *net, int val)
> +{
> +	if (net->core.prot_inuse)
> +		this_cpu_add(*net->core.sock_inuse, val);
> +}

This is very confusing.

Why testing net->core.prot_inuse for NULL is needed at all ?

Why not testing net->core.sock_inuse instead ?
On Thu, Dec 7, 2017 at 9:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
>> In some case, we want to know how many sockets are in use in
>> different _net_ namespaces. It's a key resource metric.
>>
>
> ...
>
>> +static void sock_inuse_add(struct net *net, int val)
>> +{
>> +	if (net->core.prot_inuse)
>> +		this_cpu_add(*net->core.sock_inuse, val);
>> +}
>
> This is very confusing.
>
> Why testing net->core.prot_inuse for NULL is needed at all ?
>
> Why not testing net->core.sock_inuse instead ?

I bet that is a copy-n-paste error, given that sock_inuse_exit_net() has
a similar typo.
On Fri, Dec 8, 2017 at 1:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
>> In some case, we want to know how many sockets are in use in
>> different _net_ namespaces. It's a key resource metric.
>>
>
> ...
>
>> +static void sock_inuse_add(struct net *net, int val)
>> +{
>> +	if (net->core.prot_inuse)
>> +		this_cpu_add(*net->core.sock_inuse, val);
>> +}
>
> This is very confusing.
>
> Why testing net->core.prot_inuse for NULL is needed at all ?
>
> Why not testing net->core.sock_inuse instead ?

Hi Eric and Cong, oh, it's a typo; it should be net->core.sock_inuse
there. As for why we should check net->core.sock_inuse at all, let me
show you the code.

cleanup_net() will call all of the network namespace exit methods, then
rcu_barrier(), and then remove the _net_ namespace:

cleanup_net:
	list_for_each_entry_reverse(ops, &pernet_list, list)
		ops_exit_list(ops, &net_exit_list);

	rcu_barrier();	/* for a netlink sock, 'deferred_put_nlk_sk' will
			 * be called here, but sock_inuse has already been
			 * released. */

	/* Finally it is safe to free my network namespace structure */
	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {}

Releasing a netlink sock created in the kernel (which does not hold the
_net_ namespace):

netlink_release
	call_rcu(&nlk->rcu, deferred_put_nlk_sk);

deferred_put_nlk_sk
	sk_free(sk);

I may add a comment for sock_inuse_add() in v6.
On Fri, 2017-12-08 at 13:28 +0800, Tonghao Zhang wrote:
> On Fri, Dec 8, 2017 at 1:20 AM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
> > On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
> > > In some case, we want to know how many sockets are in use in
> > > different _net_ namespaces. It's a key resource metric.
> > >
> >
> > ...
> >
> > > +static void sock_inuse_add(struct net *net, int val)
> > > +{
> > > +	if (net->core.prot_inuse)
> > > +		this_cpu_add(*net->core.sock_inuse, val);
> > > +}
> >
> > This is very confusing.
> >
> > Why testing net->core.prot_inuse for NULL is needed at all ?
> >
> > Why not testing net->core.sock_inuse instead ?
> >
>
> Hi Eric and Cong, oh it's a typo. it's net->core.sock_inuse there.
> Why we should check the net->core.sock_inuse
> Now show you the code:
>
> cleanup_net will call all of the network namespace exit methods,
> rcu_barrier, and then remove the _net_ namespace.
>
> cleanup_net:
>         list_for_each_entry_reverse(ops, &pernet_list, list)
>                 ops_exit_list(ops, &net_exit_list);
>
>         rcu_barrier();  /* for netlink sock, the 'deferred_put_nlk_sk' will
> be called. But sock_inuse has been released. */

That would be a bug.

Please find another way, but we want ultimately to check that before
net->core.sock_inuse is freed, folding the inuse count on all cpus is
0, to make sure we do not have a bug somewhere.

We should not have to test if net->core.sock_inuse is NULL or not from
sock_inuse_add(). The pointer must be there all the time.

The freeing should only happen once we are sure sock_inuse_add() can
not be called anymore.

>
> /* Finally it is safe to free my network namespace structure */
> list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {}
>
>
> Release the netlink sock created in kernel(not hold the _net_
> namespace):
>
> netlink_release
>     call_rcu(&nlk->rcu, deferred_put_nlk_sk);
>
> deferred_put_nlk_sk
>     sk_free(sk);
>
> I may add a comment for sock_inuse_add in v6.
On Fri, Dec 8, 2017 at 1:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2017-12-08 at 13:28 +0800, Tonghao Zhang wrote:
>> On Fri, Dec 8, 2017 at 1:20 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>> > On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
>> > > In some case, we want to know how many sockets are in use in
>> > > different _net_ namespaces. It's a key resource metric.
>> > >
>> >
>> > ...
>> >
>> > > +static void sock_inuse_add(struct net *net, int val)
>> > > +{
>> > > +	if (net->core.prot_inuse)
>> > > +		this_cpu_add(*net->core.sock_inuse, val);
>> > > +}
>> >
>> > This is very confusing.
>> >
>> > Why testing net->core.prot_inuse for NULL is needed at all ?
>> >
>> > Why not testing net->core.sock_inuse instead ?
>> >
>>
>> Hi Eric and Cong, oh it's a typo. it's net->core.sock_inuse there.
>> Why we should check the net->core.sock_inuse
>> Now show you the code:
>>
>> cleanup_net will call all of the network namespace exit methods,
>> rcu_barrier, and then remove the _net_ namespace.
>>
>> cleanup_net:
>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>                 ops_exit_list(ops, &net_exit_list);
>>
>>         rcu_barrier();  /* for netlink sock, the 'deferred_put_nlk_sk' will
>> be called. But sock_inuse has been released. */
>
> Thats would be a bug.
>
> Please find another way, but we want ultimately to check that before
> net->core.sock_inuse is freed, folding the inuse count on all cpus is
> 0, to make sure we do not have a bug somewhere.

Yes, I am aware of this issue, even though we will destroy the network
namespace. By the way, we could instead count the sockets in use in
sock_alloc() and sock_release(). In that way, we would have to hold the
network namespace again (via get_net()) while the sock may hold it.

What do you think of this idea?

> We should not have to test if net->core.sock_inuse is NULL or not from
> sock_inuse_add(). Pointer must be there all the time.
>
> The freeing should only happen once we are sure sock_inuse_add() can
> not be called anymore.
>
>>
>> /* Finally it is safe to free my network namespace structure */
>> list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {}
>>
>>
>> Release the netlink sock created in kernel(not hold the _net_
>> namespace):
>>
>> netlink_release
>>     call_rcu(&nlk->rcu, deferred_put_nlk_sk);
>>
>> deferred_put_nlk_sk
>>     sk_free(sk);
>>
>> I may add a comment for sock_inuse_add in v6.
Hi all. We can add synchronize_rcu() and rcu_barrier() in
sock_inuse_exit_net() to ensure there are no outstanding rcu callbacks
using this network namespace. Then we will not have to test whether
net->core.sock_inuse is NULL in sock_inuse_add(). :)

static void __net_exit sock_inuse_exit_net(struct net *net)
{
	free_percpu(net->core.prot_inuse);
+
+	synchronize_rcu();
+	rcu_barrier();
+
+	free_percpu(net->core.sock_inuse);
}

On Fri, Dec 8, 2017 at 5:52 PM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Fri, 2017-12-08 at 13:28 +0800, Tonghao Zhang wrote:
>>> On Fri, Dec 8, 2017 at 1:20 AM, Eric Dumazet <eric.dumazet@gmail.com>
>>> wrote:
>>> > On Thu, 2017-12-07 at 08:45 -0800, Tonghao Zhang wrote:
>>> > > In some case, we want to know how many sockets are in use in
>>> > > different _net_ namespaces. It's a key resource metric.
>>> > >
>>> >
>>> > ...
>>> >
>>> > > +static void sock_inuse_add(struct net *net, int val)
>>> > > +{
>>> > > +	if (net->core.prot_inuse)
>>> > > +		this_cpu_add(*net->core.sock_inuse, val);
>>> > > +}
>>> >
>>> > This is very confusing.
>>> >
>>> > Why testing net->core.prot_inuse for NULL is needed at all ?
>>> >
>>> > Why not testing net->core.sock_inuse instead ?
>>> >
>>>
>>> Hi Eric and Cong, oh it's a typo. it's net->core.sock_inuse there.
>>> Why we should check the net->core.sock_inuse
>>> Now show you the code:
>>>
>>> cleanup_net will call all of the network namespace exit methods,
>>> rcu_barrier, and then remove the _net_ namespace.
>>>
>>> cleanup_net:
>>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>>                 ops_exit_list(ops, &net_exit_list);
>>>
>>>         rcu_barrier();  /* for netlink sock, the 'deferred_put_nlk_sk' will
>>> be called. But sock_inuse has been released. */
>>
>> Thats would be a bug.
>>
>> Please find another way, but we want ultimately to check that before
>> net->core.sock_inuse is freed, folding the inuse count on all cpus is
>> 0, to make sure we do not have a bug somewhere.
>
> Yes, I am aware of this issue even we will destroy the network namespace.
> By the way, we can counter the socket-inuse in sock_alloc or sock_release.
> In this way, we have to hold the network namespace again(via
> get_net()) while sock may hold it.
>
> what do you think of this idea?
>
>> We should not have to test if net->core.sock_inuse is NULL or not from
>> sock_inuse_add(). Pointer must be there all the time.
>>
>> The freeing should only happen once we are sure sock_inuse_add() can
>> not be called anymore.
>>
>>>
>>> /* Finally it is safe to free my network namespace structure */
>>> list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {}
>>>
>>>
>>> Release the netlink sock created in kernel(not hold the _net_
>>> namespace):
>>>
>>> netlink_release
>>>     call_rcu(&nlk->rcu, deferred_put_nlk_sk);
>>>
>>> deferred_put_nlk_sk
>>>     sk_free(sk);
>>>
>>> I may add a comment for sock_inuse_add in v6.
On Fri, 2017-12-08 at 19:29 +0800, Tonghao Zhang wrote:
> hi all. we can add synchronize_rcu and rcu_barrier in
> sock_inuse_exit_net to
> ensure there are no outstanding rcu callbacks using this network
> namespace.
> we will not have to test if net->core.sock_inuse is NULL or not from
> sock_inuse_add(). :)
>
> static void __net_exit sock_inuse_exit_net(struct net *net)
> {
>         free_percpu(net->core.prot_inuse);
> +
> +       synchronize_rcu();
> +       rcu_barrier();
> +
> +       free_percpu(net->core.sock_inuse);
> }

Oh well. Do you have any idea of the major problem this would add?

Try the following, before and after your patches:

for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait

(Check commit 8ca712c373a462cfa1b62272870b6c2c74aa83f9)

This is a complex problem; we won't accept patches that kill network
namespace dismantling performance by adding brute force
synchronize_rcu() or rcu_barrier() calls.

Why not free net->core.sock_inuse right before freeing net itself in
net_free()? You do not have to hijack sock_inuse_exit_net() just
because it has a misleading name.
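[Editor's note: if I read Eric's suggestion right, the eventual shape would be roughly the sketch below in net/core/net_namespace.c. This is hypothetical, not the posted patch: net_free() runs only after the last put of the namespace and after cleanup_net()'s rcu_barrier(), so no deferred sk_free() can still reach sock_inuse_add() on this net.]

```c
/* Sketch only: free the sock_inuse counter at the very last moment,
 * in net_free(), instead of in sock_inuse_exit_net().  By this point
 * no sock_inuse_add() caller can race with the free_percpu(), so the
 * NULL check in the fast path becomes unnecessary.
 */
static void net_free(struct net *net)
{
	kfree(rcu_access_pointer(net->gen));
	free_percpu(net->core.sock_inuse);	/* moved here (hypothetical) */
	kmem_cache_free(net_cachep, net);
}
```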
On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>
> Release the netlink sock created in kernel(not hold the _net_ namespace):
>

You can avoid counting kernel sock by testing 'kern' in sk_alloc()
and testing 'sk->sk_net_refcnt' in __sk_free().
On Fri, Dec 8, 2017 at 9:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2017-12-08 at 19:29 +0800, Tonghao Zhang wrote:
>> hi all. we can add synchronize_rcu and rcu_barrier in
>> sock_inuse_exit_net to
>> ensure there are no outstanding rcu callbacks using this network
>> namespace.
>> we will not have to test if net->core.sock_inuse is NULL or not from
>> sock_inuse_add(). :)
>>
>> static void __net_exit sock_inuse_exit_net(struct net *net)
>> {
>>         free_percpu(net->core.prot_inuse);
>> +
>> +       synchronize_rcu();
>> +       rcu_barrier();
>> +
>> +       free_percpu(net->core.sock_inuse);
>> }
>
> Oh well. Do you have any idea of the major problem this would add ?
>
> Try the following, before and after your patches :
>
> for i in `seq 1 40`
> do
>  (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
> done
> wait
>
> ( Check commit 8ca712c373a462cfa1b62272870b6c2c74aa83f9 )

Yes, I did the test; the patches drop the performance.

Before the patch:
# time ./add_del_unshare.sh
net_namespace   97  125  6016  5  8 : tunables  0  0  0 : slabdata  25  25  0

real    8m19.665s
user    0m4.268s
sys     0m6.477s

After:
# time ./add_del_unshare.sh
net_namespace  102  130  6016  5  8 : tunables  0  0  0 : slabdata  26  26  0

real    8m52.563s
user    0m4.040s
sys     0m7.558s

> This is a complex problem, we wont accept patches that kill network
> namespaces dismantling performance by adding brute force
> synchronize_rcu() or rcu_barrier() calls.
>
> Why not freeing net->core.sock_inuse right before feeing net itself in
> net_free() ?

I tried this way: alloc core.sock_inuse in net_alloc(), free it in
net_free(). It does not drop the performance, and we will not always
have to check core.sock_inuse in sock_inuse_add().

After:
# time ./add_del_unshare.sh
net_namespace  109  135  6016  5  8 : tunables  0  0  0 : slabdata  27  27  0

real    8m19.265s
user    0m4.090s
sys     0m8.185s

> You do not have to hijack sock_inuse_exit_net() just because it has a
> misleading name.
On Sat, Dec 9, 2017 at 6:09 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>>
>> Release the netlink sock created in kernel(not hold the _net_ namespace):
>>
>
> You can avoid counting kernel sock by testing 'kern' in sk_alloc()
> and testing 'sk->sk_net_refcnt' in __sk_free().

Hi Cong, if we do it in this way, we will not count the socks created
in the kernel, right?
On Fri, Dec 8, 2017 at 9:27 PM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
> On Sat, Dec 9, 2017 at 6:09 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Thu, Dec 7, 2017 at 9:28 PM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>>>
>>> Release the netlink sock created in kernel(not hold the _net_ namespace):
>>>
>>
>> You can avoid counting kernel sock by testing 'kern' in sk_alloc()
>> and testing 'sk->sk_net_refcnt' in __sk_free().
> Hi cong, if we do it in this way, we will not counter the sock created
> in kernel, right ?

Yes, it is not very useful for user-space to know how many kernel
sockets we create, IMHO, so not counting kernel sockets seems fine.
diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 45cfb5d..d1b4748f 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -11,6 +11,7 @@ struct netns_core {
 
 	int	sysctl_somaxconn;
 
+	int __percpu *sock_inuse;
 	struct prot_inuse __percpu *prot_inuse;
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 79e1a2c..0809b31 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1266,6 +1266,7 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
 /* Called with local bh disabled */
 void sock_prot_inuse_add(struct net *net, struct proto *prot, int inc);
 int sock_prot_inuse_get(struct net *net, struct proto *proto);
+int sock_inuse_get(struct net *net);
 #else
 static inline void sock_prot_inuse_add(struct net *net, struct proto *prot,
 		int inc)
diff --git a/net/core/sock.c b/net/core/sock.c
index c2dd2d3..a11680a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -145,6 +145,8 @@
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
 
+static void sock_inuse_add(struct net *net, int val);
+
 /**
  * sk_ns_capable - General socket capability test
  * @sk: Socket to use a capability on or through
@@ -1534,6 +1536,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		if (likely(sk->sk_net_refcnt))
 			get_net(net);
 		sock_net_set(sk, net);
+		sock_inuse_add(net, 1);
 		refcount_set(&sk->sk_wmem_alloc, 1);
 
 		mem_cgroup_sk_alloc(sk);
@@ -1595,6 +1598,8 @@ void sk_destruct(struct sock *sk)
 
 static void __sk_free(struct sock *sk)
 {
+	sock_inuse_add(sock_net(sk), -1);
+
 	if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt))
 		sock_diag_broadcast_destroy(sk);
 	else
@@ -1716,6 +1721,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		newsk->sk_priority = 0;
 		newsk->sk_incoming_cpu = raw_smp_processor_id();
 		atomic64_set(&newsk->sk_cookie, 0);
+		sock_inuse_add(sock_net(newsk), 1);
 
 		/*
 		 * Before updating sk_refcnt, we must commit prior changes to memory
@@ -3061,15 +3067,53 @@ int sock_prot_inuse_get(struct net *net, struct proto *prot)
 }
 EXPORT_SYMBOL_GPL(sock_prot_inuse_get);
 
+static void sock_inuse_add(struct net *net, int val)
+{
+	if (net->core.prot_inuse)
+		this_cpu_add(*net->core.sock_inuse, val);
+}
+
+int sock_inuse_get(struct net *net)
+{
+	int cpu, res = 0;
+
+	if (!net->core.prot_inuse)
+		return 0;
+
+	for_each_possible_cpu(cpu)
+		res += *per_cpu_ptr(net->core.sock_inuse, cpu);
+
+	return res >= 0 ? res : 0;
+}
+EXPORT_SYMBOL_GPL(sock_inuse_get);
+
 static int __net_init sock_inuse_init_net(struct net *net)
 {
 	net->core.prot_inuse = alloc_percpu(struct prot_inuse);
-	return net->core.prot_inuse ? 0 : -ENOMEM;
+	if (!net->core.prot_inuse)
+		return -ENOMEM;
+
+	net->core.sock_inuse = alloc_percpu(int);
+	if (!net->core.sock_inuse)
+		goto out;
+
+	return 0;
+out:
+	free_percpu(net->core.prot_inuse);
+	return -ENOMEM;
 }
 
 static void __net_exit sock_inuse_exit_net(struct net *net)
 {
-	free_percpu(net->core.prot_inuse);
+	if (net->core.prot_inuse) {
+		free_percpu(net->core.prot_inuse);
+		net->core.prot_inuse = NULL;
+	}
+
+	if (net->core.sock_inuse) {
+		free_percpu(net->core.sock_inuse);
+		net->core.prot_inuse = NULL;
+	}
 }
 
 static struct pernet_operations net_inuse_ops = {
@@ -3112,6 +3156,10 @@ static inline void assign_proto_idx(struct proto *prot)
 static inline void release_proto_idx(struct proto *prot)
 {
 }
+
+static void sock_inuse_add(struct net *net, int val)
+{
+}
 #endif
 
 static void req_prot_cleanup(struct request_sock_ops *rsk_prot)
diff --git a/net/socket.c b/net/socket.c
index 42d8e9c..183de8f01 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -163,12 +163,6 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;
 
 /*
- * Statistics counters of the socket lists
- */
-
-static DEFINE_PER_CPU(int, sockets_in_use);
-
-/*
  * Support routines.
  * Move socket addresses back and forth across the kernel/user
  * divide and look after the messy bits.
@@ -574,7 +568,6 @@ struct socket *sock_alloc(void)
 	inode->i_gid = current_fsgid();
 	inode->i_op = &sockfs_inode_ops;
 
-	this_cpu_add(sockets_in_use, 1);
 	return sock;
 }
 EXPORT_SYMBOL(sock_alloc);
@@ -601,7 +594,6 @@ void sock_release(struct socket *sock)
 	if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
 		pr_err("%s: fasync list not empty!\n", __func__);
 
-	this_cpu_sub(sockets_in_use, 1);
 	if (!sock->file) {
 		iput(SOCK_INODE(sock));
 		return;
@@ -2644,17 +2636,8 @@ static int __init sock_init(void)
 #ifdef CONFIG_PROC_FS
 void socket_seq_show(struct seq_file *seq)
 {
-	int cpu;
-	int counter = 0;
-
-	for_each_possible_cpu(cpu)
-		counter += per_cpu(sockets_in_use, cpu);
-
-	/* It can be negative, by the way. 8) */
-	if (counter < 0)
-		counter = 0;
-
-	seq_printf(seq, "sockets: used %d\n", counter);
+	seq_printf(seq, "sockets: used %d\n",
+		   sock_inuse_get(seq->private));
 }
 #endif /* CONFIG_PROC_FS */