diff mbox series

[net-next] net: do not create fallback tunnels for non-default namespaces

Message ID 20180308205141.77868-1-edumazet@google.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series [net-next] net: do not create fallback tunnels for non-default namespaces | expand

Commit Message

Eric Dumazet March 8, 2018, 8:51 p.m. UTC
fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
ip6tnl0, ip6gre0) are automatically created when the corresponding
module is loaded.

These tunnels are also automatically created when a new network
namespace is created, at a great cost.

In many cases, netns are used for isolation purposes, and these
extra network devices are a waste of resources. We are using
thousands of netns per host, and hit the netns creation/delete
bottleneck a lot. (Many thanks to Kirill for recent work on this)

Add a new sysctl so that we can opt-out from this automatic creation.

Note that these tunnels are still created for the initial namespace,
to be the least intrusive for typical setups.

Tested:
lpk43:~# cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
done
wait

lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh

real	0m37.521s
user	0m0.886s
sys	7m7.084s
lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh

real	0m4.761s
user	0m0.851s
sys	1m8.343s
lpk43:~#

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/sysctl/net.txt | 12 ++++++++++++
 include/linux/netdevice.h    |  7 +++++++
 include/net/ip_tunnels.h     |  2 ++
 net/core/sysctl_net_core.c   | 12 ++++++++++++
 net/ipv4/ip_tunnel.c         | 20 ++++++++++++--------
 net/ipv6/ip6_gre.c           |  4 +++-
 net/ipv6/ip6_tunnel.c        |  2 ++
 net/ipv6/sit.c               |  5 ++++-
 8 files changed, 54 insertions(+), 10 deletions(-)

Comments

Jiri Benc March 9, 2018, 11:06 a.m. UTC | #1
On Thu,  8 Mar 2018 12:51:41 -0800, Eric Dumazet wrote:
> Note that these tunnels are still created for the initial namespace,
> to be the least intrusive for typical setups.

Since this is a knob and must be turned on explicitly, why we don't get
rid of the automatic interfaces even for the initial name space? It
causes only problems nowadays, such as

ip link add name gre0 type gre <tunnel_options...>

failing with "File exists" even if there was no gre0 interface before.
And of course, even with the error, the interface with the name "gre0"
appears in the system. And of course, it does not have any of the
options specified. This is highly confusing. Not to mention the
autocreated gre0 interface is basically useless.

I'd like to switch the knob on by default on my systems and have the
kernel behave sane, finally, even without name spaces.

Thanks!

 Jiri
Eric Dumazet March 9, 2018, 12:53 p.m. UTC | #2
Hi Jiri

On Fri, Mar 9, 2018 at 3:06 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu,  8 Mar 2018 12:51:41 -0800, Eric Dumazet wrote:
>> Note that these tunnels are still created for the initial namespace,
>> to be the least intrusive for typical setups.
>
> Since this is a knob and must be turned on explicitly, why we don't get
> rid of the automatic interfaces even for the initial name space? It
> causes only problems nowadays, such as
>
> ip link add name gre0 type gre <tunnel_options...>
>
> failing with "File exists" even if there was no gre0 interface before.
> And of course, even with the error, the interface with the name "gre0"
> appears in the system. And of course, it does not have any of the
> options specified. This is highly confusing. Not to mention the
> autocreated gre0 interface is basically useless.

Unless you bring it up ;)

>
> I'd like to switch the knob on by default on my systems and have the
> kernel behave sane, finally, even without name spaces.

Compatibility problems,  mostly.
Some users might depend on existing behavior.

You and me would not care of breaking our setups, but maybe not
unaware people out there.

Since init_ns is created at boot time before the sysctl can be
changed, we rather should not change the default behavior for init_ns.

Thanks.
Jiri Benc March 9, 2018, 1:08 p.m. UTC | #3
On Fri, 9 Mar 2018 04:53:07 -0800, Eric Dumazet wrote:
> Unless you bring it up ;)

Often, you need some options to be specified, thus you end up creating
another interface anyway.

> Compatibility problems,  mostly.
> Some users might depend on existing behavior.

How is this a compatibility problem, when the user has to opt in?
Without the default value, the kernel behaves as before. I'm sorry,
I don't see the problem, what am I missing?

> You and me would not care of breaking our setups, but maybe not
> unaware people out there.

Sure, that's why this is an option and it defaults to off.

> Since init_ns is created at boot time before the sysctl can be
> changed, we rather should not change the default behavior for init_ns.

This could be a problem for built in drivers. With modules, I'm
perfectly happy doing this in initrd :-)

To behave consistently in all cases, we might consider adding a boot
option for this, too, though.

 Jiri
Eric Dumazet March 9, 2018, 1:18 p.m. UTC | #4
On Fri, Mar 9, 2018 at 5:08 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Fri, 9 Mar 2018 04:53:07 -0800, Eric Dumazet wrote:
>> Unless you bring it up ;)
>
> Often, you need some options to be specified, thus you end up creating
> another interface anyway.
>
>> Compatibility problems,  mostly.
>> Some users might depend on existing behavior.
>
> How is this a compatibility problem, when the user has to opt in?
> Without the default value, the kernel behaves as before. I'm sorry,
> I don't see the problem, what am I missing?

How do you opt in exactly ?

Once it is too late and gre0 is already there ?

ip_gre can be statically built in vmlinux

Are you suggesting to add a module parameter ?

I wont do it but feel free to send a patch and wait for David flames :)

>
>> You and me would not care of breaking our setups, but maybe not
>> unaware people out there.
>
> Sure, that's why this is an option and it defaults to off.
>
>> Since init_ns is created at boot time before the sysctl can be
>> changed, we rather should not change the default behavior for init_ns.
>
> This could be a problem for built in drivers. With modules, I'm
> perfectly happy doing this in initrd :-)
>
> To behave consistently in all cases, we might consider adding a boot
> option for this, too, though.

Feel free to send a patch.

And do not forget we have many types of tunnels. So do not only patch gre.
David Miller March 9, 2018, 4:23 p.m. UTC | #5
From: Eric Dumazet <edumazet@google.com>
Date: Thu,  8 Mar 2018 12:51:41 -0800

> fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
> ip6tnl0, ip6gre0) are automatically created when the corresponding
> module is loaded.
> 
> These tunnels are also automatically created when a new network
> namespace is created, at a great cost.
> 
> In many cases, netns are used for isolation purposes, and these
> extra network devices are a waste of resources. We are using
> thousands of netns per host, and hit the netns creation/delete
> bottleneck a lot. (Many thanks to Kirill for recent work on this)
> 
> Add a new sysctl so that we can opt-out from this automatic creation.
> 
> Note that these tunnels are still created for the initial namespace,
> to be the least intrusive for typical setups.
> 
> Tested:
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
diff mbox series

Patch

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 35c62f522754a79f8ecaba74dd8c84a038a8928f..5992602469d894d594b829690b9b8caf7fda57f4 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -270,6 +270,18 @@  optmem_max
 Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
 of struct cmsghdr structures with appended data.
 
+fb_tunnels_only_for_init_net
+----------------------------
+
+Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0,
+sit0, ip6tnl0, ip6gre0) are automatically created when a new
+network namespace is created, if corresponding tunnel is present
+in initial network namespace.
+If set to 1, these devices are not automatically created, and
+user space is responsible for creating them if needed.
+
+Default : 0  (for compatibility reasons)
+
 2. /proc/sys/net/unix - Parameters for Unix domain sockets
 -------------------------------------------------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 95a613a7cc1c8fdcea1678825e8535b21f124222..9711108c39163cc58cfda08c129de169a0c86e82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -585,6 +585,13 @@  struct netdev_queue {
 #endif
 } ____cacheline_aligned_in_smp;
 
+extern int sysctl_fb_tunnels_only_for_init_net;
+
+static inline bool net_has_fallback_tunnels(const struct net *net)
+{
+	return net == &init_net || !sysctl_fb_tunnels_only_for_init_net;
+}
+
 static inline int netdev_queue_numa_node_read(const struct netdev_queue *q)
 {
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index cbe5addb92933474aeeee3aaee51685a57769a80..540a4b4417bfbe4654a6da34ed2ee6d7543dbc62 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -180,8 +180,10 @@  struct tnl_ptk_info {
 
 struct ip_tunnel_net {
 	struct net_device *fb_tunnel_dev;
+	struct rtnl_link_ops *rtnl_link_ops;
 	struct hlist_head tunnels[IP_TNL_HASH_SIZE];
 	struct ip_tunnel __rcu *collect_md_tun;
+	int type;
 };
 
 static inline void ip_tunnel_key_init(struct ip_tunnel_key *key,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d714f65782b7c3dc6d944ca572882b59b24f96f8..4f47f92459cc31675d4c8ec20dd3baea1b62cac1 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -32,6 +32,9 @@  static int max_skb_frags = MAX_SKB_FRAGS;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
+int sysctl_fb_tunnels_only_for_init_net __read_mostly = 0;
+EXPORT_SYMBOL(sysctl_fb_tunnels_only_for_init_net);
+
 #ifdef CONFIG_RPS
 static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 				void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -513,6 +516,15 @@  static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "fb_tunnels_only_for_init_net",
+		.data		= &sysctl_fb_tunnels_only_for_init_net,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 602597dfc395a85dc7d73150dbd36b53c18a7031..5fcb17cb426bfcce40add19bdf664aec66775b1e 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -347,8 +347,7 @@  static struct ip_tunnel *ip_tunnel_create(struct net *net,
 	struct net_device *dev;
 	int t_hlen;
 
-	BUG_ON(!itn->fb_tunnel_dev);
-	dev = __ip_tunnel_create(net, itn->fb_tunnel_dev->rtnl_link_ops, parms);
+	dev = __ip_tunnel_create(net, itn->rtnl_link_ops, parms);
 	if (IS_ERR(dev))
 		return ERR_CAST(dev);
 
@@ -822,7 +821,6 @@  int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd)
 	struct net *net = t->net;
 	struct ip_tunnel_net *itn = net_generic(net, t->ip_tnl_net_id);
 
-	BUG_ON(!itn->fb_tunnel_dev);
 	switch (cmd) {
 	case SIOCGETTUNNEL:
 		if (dev == itn->fb_tunnel_dev) {
@@ -847,7 +845,7 @@  int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd)
 				p->o_key = 0;
 		}
 
-		t = ip_tunnel_find(itn, p, itn->fb_tunnel_dev->type);
+		t = ip_tunnel_find(itn, p, itn->type);
 
 		if (cmd == SIOCADDTUNNEL) {
 			if (!t) {
@@ -991,10 +989,15 @@  int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id,
 	struct ip_tunnel_parm parms;
 	unsigned int i;
 
+	itn->rtnl_link_ops = ops;
 	for (i = 0; i < IP_TNL_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&itn->tunnels[i]);
 
-	if (!ops) {
+	if (!ops || !net_has_fallback_tunnels(net)) {
+		struct ip_tunnel_net *it_init_net;
+
+		it_init_net = net_generic(&init_net, ip_tnl_net_id);
+		itn->type = it_init_net->type;
 		itn->fb_tunnel_dev = NULL;
 		return 0;
 	}
@@ -1012,6 +1015,7 @@  int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id,
 		itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 		itn->fb_tunnel_dev->mtu = ip_tunnel_bind_dev(itn->fb_tunnel_dev);
 		ip_tunnel_add(itn, netdev_priv(itn->fb_tunnel_dev));
+		itn->type = itn->fb_tunnel_dev->type;
 	}
 	rtnl_unlock();
 
@@ -1019,10 +1023,10 @@  int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id,
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_init_net);
 
-static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head,
+static void ip_tunnel_destroy(struct net *net, struct ip_tunnel_net *itn,
+			      struct list_head *head,
 			      struct rtnl_link_ops *ops)
 {
-	struct net *net = dev_net(itn->fb_tunnel_dev);
 	struct net_device *dev, *aux;
 	int h;
 
@@ -1054,7 +1058,7 @@  void ip_tunnel_delete_nets(struct list_head *net_list, unsigned int id,
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		itn = net_generic(net, id);
-		ip_tunnel_destroy(itn, &list, ops);
+		ip_tunnel_destroy(net, itn, &list, ops);
 	}
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 18a3dfbd0300dae769fb67aabe8ff0ff3a831e93..7d8775c9570d858904faa6951e76f7bffc3f950a 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -236,7 +236,7 @@  static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev,
 		return t;
 
 	dev = ign->fb_tunnel_dev;
-	if (dev->flags & IFF_UP)
+	if (dev && dev->flags & IFF_UP)
 		return netdev_priv(dev);
 
 	return NULL;
@@ -1472,6 +1472,8 @@  static int __net_init ip6gre_init_net(struct net *net)
 	struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
 	int err;
 
+	if (!net_has_fallback_tunnels(net))
+		return 0;
 	ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6gre0",
 					  NET_NAME_UNKNOWN,
 					  ip6gre_tunnel_setup);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 56c4967f1868e5d250e7eaa63650377ff426ff94..5c045fa407da105b1bf874c984e04c643c07dffd 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -2205,6 +2205,8 @@  static int __net_init ip6_tnl_init_net(struct net *net)
 	ip6n->tnls[0] = ip6n->tnls_wc;
 	ip6n->tnls[1] = ip6n->tnls_r_l;
 
+	if (!net_has_fallback_tunnels(net))
+		return 0;
 	err = -ENOMEM;
 	ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6tnl0",
 					NET_NAME_UNKNOWN, ip6_tnl_dev_setup);
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index a9c4ac6efe22c968537bcb4bf6d01ef7d03d63aa..8a4f8fddd8121a31416103551f915ad610ac6f94 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -182,7 +182,7 @@  static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
 #ifdef CONFIG_IPV6_SIT_6RD
 	struct ip_tunnel *t = netdev_priv(dev);
 
-	if (dev == sitn->fb_tunnel_dev) {
+	if (dev == sitn->fb_tunnel_dev || !sitn->fb_tunnel_dev) {
 		ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0);
 		t->ip6rd.relay_prefix = 0;
 		t->ip6rd.prefixlen = 16;
@@ -1835,6 +1835,9 @@  static int __net_init sit_init_net(struct net *net)
 	sitn->tunnels[2] = sitn->tunnels_r;
 	sitn->tunnels[3] = sitn->tunnels_r_l;
 
+	if (!net_has_fallback_tunnels(net))
+		return 0;
+
 	sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "sit0",
 					   NET_NAME_UNKNOWN,
 					   ipip6_tunnel_setup);