From patchwork Thu Mar 8 20:51:41 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Dumazet X-Patchwork-Id: 883341 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=google.com header.i=@google.com header.b="H3uYNX4z"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3zy2kN0D8Cz9shp for ; Fri, 9 Mar 2018 07:51:56 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751212AbeCHUvy (ORCPT ); Thu, 8 Mar 2018 15:51:54 -0500 Received: from mail-pl0-f65.google.com ([209.85.160.65]:45574 "EHLO mail-pl0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750787AbeCHUvv (ORCPT ); Thu, 8 Mar 2018 15:51:51 -0500 Received: by mail-pl0-f65.google.com with SMTP id v9-v6so3975477plp.12 for ; Thu, 08 Mar 2018 12:51:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=from:to:cc:subject:date:message-id; bh=dI6+yHlCrLKsmVrKPFYVGMLKjVFJHEzVqq8h40FuMt0=; b=H3uYNX4zKH92sE589Ksi+ckkTHDzegG1k4T0Y59JPBcZY6vBs7dF83ha1UKmm7LGqM ilSznbd2U2PhlprUOPjXFL94eE+RSRHteIq0sLNIxVxO8RlMesedkb1eikTsCRKIN/Nx PPVrrN1fNVWE+LPw37D1ubmXuryJ2IHSuCIQc7iD8F+zi3EQyWGJWJH83IJJsOP1VOSz pHrKtwY7TaZpg5yUaMBX9R8aHeSG2+RRzEtNWh9PjqDEfl7g1v4nfqqxmXfoNffrcBXd GlwXIzOcSKQ+04SAQSvRs3DrMruwlBMRRCXEFDAHoqzx5JHHxTMK9dgNDl8JZ9JcXHrR NGDA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=dI6+yHlCrLKsmVrKPFYVGMLKjVFJHEzVqq8h40FuMt0=; b=FhWoK3ImXAUjWb2NII3PkVqS6TKtajltk3yupjNkFiAJKwuhiXzfQXnKa4T3NQU+kT inL15avWCUqUv/RvUATx79ua9Fma0gKOExTIsn5B8w6SEDipmDTTxWPmxlYAZY3UUxrr WdbZJkD8Db9vBexdn6J2m/vrHe1azWW1DeWZGDbMXE0VTVqQQFUY/CvH7z7HcxX9BBnC rauUnN+gKt/bUyWlBRV/7UL4PxyAsLRqLzLs5e6MvmRERG6pfmGGZ9igjStZmm4OZqs1 M4wD4v298YmXdpZrUPRnE6LThgxxWTwIWzYVr5ohaeT+RL+SZ3zh1jtwZMNpr0DZnFxM nyiQ== X-Gm-Message-State: APf1xPBEqvVKOcByEwJ0Pj4DG/mjDLupTVDTQFmRh9a+9rtvLpyAtGCA HUIyEnl0JYnm00GcipHNCGmAAw== X-Google-Smtp-Source: AG47ELse9+XOSZoadWsMllHfnVHPP0w/DskyQS0k1cysq7fHBdJ4CKXAC4vUIpjFt0pwtaYMe/9eEQ== X-Received: by 2002:a17:902:166:: with SMTP id 93-v6mr24720692plb.419.1520542310807; Thu, 08 Mar 2018 12:51:50 -0800 (PST) Received: from localhost ([2620:15c:2c4:201:f5a:7eca:440a:3ead]) by smtp.gmail.com with ESMTPSA id u10sm38016481pfh.57.2018.03.08.12.51.49 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 08 Mar 2018 12:51:49 -0800 (PST) From: Eric Dumazet To: "David S . Miller" Cc: netdev , Eric Dumazet , Eric Dumazet Subject: [PATCH net-next] net: do not create fallback tunnels for non-default namespaces Date: Thu, 8 Mar 2018 12:51:41 -0800 Message-Id: <20180308205141.77868-1-edumazet@google.com> X-Mailer: git-send-email 2.16.2.395.g2e18187dfd-goog Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0, ip6tnl0, ip6gre0) are automatically created when the corresponding module is loaded. These tunnels are also automatically created when a new network namespace is created, at a great cost. In many cases, netns are used for isolation purposes, and these extra network devices are a waste of resources. We are using thousands of netns per host, and hit the netns creation/delete bottleneck a lot. (Many thanks to Kirill for recent work on this) Add a new sysctl so that we can opt-out from this automatic creation. Note that these tunnels are still created for the initial namespace, to be the least intrusive for typical setups. Tested: lpk43:~# cat add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m37.521s user 0m0.886s sys 7m7.084s lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m4.761s user 0m0.851s sys 1m8.343s lpk43:~# Signed-off-by: Eric Dumazet --- Documentation/sysctl/net.txt | 12 ++++++++++++ include/linux/netdevice.h | 7 +++++++ include/net/ip_tunnels.h | 2 ++ net/core/sysctl_net_core.c | 12 ++++++++++++ net/ipv4/ip_tunnel.c | 20 ++++++++++++-------- net/ipv6/ip6_gre.c | 4 +++- net/ipv6/ip6_tunnel.c | 2 ++ net/ipv6/sit.c | 5 ++++- 8 files changed, 54 insertions(+), 10 deletions(-) diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt index 35c62f522754a79f8ecaba74dd8c84a038a8928f..5992602469d894d594b829690b9b8caf7fda57f4 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.txt @@ -270,6 +270,18 @@ optmem_max Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence of struct cmsghdr structures with appended data. +fb_tunnels_only_for_init_net +---------------------------- + +Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, +sit0, ip6tnl0, ip6gre0) are automatically created when a new +network namespace is created, if corresponding tunnel is present +in initial network namespace. +If set to 1, these devices are not automatically created, and +user space is responsible for creating them if needed. + +Default : 0 (for compatibility reasons) + 2. /proc/sys/net/unix - Parameters for Unix domain sockets ------------------------------------------------------- diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 95a613a7cc1c8fdcea1678825e8535b21f124222..9711108c39163cc58cfda08c129de169a0c86e82 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -585,6 +585,13 @@ struct netdev_queue { #endif } ____cacheline_aligned_in_smp; +extern int sysctl_fb_tunnels_only_for_init_net; + +static inline bool net_has_fallback_tunnels(const struct net *net) +{ + return net == &init_net || !sysctl_fb_tunnels_only_for_init_net; +} + static inline int netdev_queue_numa_node_read(const struct netdev_queue *q) { #if defined(CONFIG_XPS) && defined(CONFIG_NUMA) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index cbe5addb92933474aeeee3aaee51685a57769a80..540a4b4417bfbe4654a6da34ed2ee6d7543dbc62 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -180,8 +180,10 @@ struct tnl_ptk_info { struct ip_tunnel_net { struct net_device *fb_tunnel_dev; + struct rtnl_link_ops *rtnl_link_ops; struct hlist_head tunnels[IP_TNL_HASH_SIZE]; struct ip_tunnel __rcu *collect_md_tun; + int type; }; static inline void ip_tunnel_key_init(struct ip_tunnel_key *key, diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index d714f65782b7c3dc6d944ca572882b59b24f96f8..4f47f92459cc31675d4c8ec20dd3baea1b62cac1 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -32,6 +32,9 @@ static int max_skb_frags = MAX_SKB_FRAGS; static int net_msg_warn; /* Unused, but still a sysctl */ +int sysctl_fb_tunnels_only_for_init_net __read_mostly = 0; +EXPORT_SYMBOL(sysctl_fb_tunnels_only_for_init_net); + #ifdef CONFIG_RPS static int rps_sock_flow_sysctl(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -513,6 +516,15 @@ static struct ctl_table net_core_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = &zero, }, + { + .procname = "fb_tunnels_only_for_init_net", + .data = &sysctl_fb_tunnels_only_for_init_net, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, { } }; diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 602597dfc395a85dc7d73150dbd36b53c18a7031..5fcb17cb426bfcce40add19bdf664aec66775b1e 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -347,8 +347,7 @@ static struct ip_tunnel *ip_tunnel_create(struct net *net, struct net_device *dev; int t_hlen; - BUG_ON(!itn->fb_tunnel_dev); - dev = __ip_tunnel_create(net, itn->fb_tunnel_dev->rtnl_link_ops, parms); + dev = __ip_tunnel_create(net, itn->rtnl_link_ops, parms); if (IS_ERR(dev)) return ERR_CAST(dev); @@ -822,7 +821,6 @@ int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd) struct net *net = t->net; struct ip_tunnel_net *itn = net_generic(net, t->ip_tnl_net_id); - BUG_ON(!itn->fb_tunnel_dev); switch (cmd) { case SIOCGETTUNNEL: if (dev == itn->fb_tunnel_dev) { @@ -847,7 +845,7 @@ int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd) p->o_key = 0; } - t = ip_tunnel_find(itn, p, itn->fb_tunnel_dev->type); + t = ip_tunnel_find(itn, p, itn->type); if (cmd == SIOCADDTUNNEL) { if (!t) { @@ -991,10 +989,15 @@ int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id, struct ip_tunnel_parm parms; unsigned int i; + itn->rtnl_link_ops = ops; for (i = 0; i < IP_TNL_HASH_SIZE; i++) INIT_HLIST_HEAD(&itn->tunnels[i]); - if (!ops) { + if (!ops || !net_has_fallback_tunnels(net)) { + struct ip_tunnel_net *it_init_net; + + it_init_net = net_generic(&init_net, ip_tnl_net_id); + itn->type = it_init_net->type; itn->fb_tunnel_dev = NULL; return 0; } @@ -1012,6 +1015,7 @@ int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id, itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; itn->fb_tunnel_dev->mtu = ip_tunnel_bind_dev(itn->fb_tunnel_dev); ip_tunnel_add(itn, netdev_priv(itn->fb_tunnel_dev)); + itn->type = itn->fb_tunnel_dev->type; } rtnl_unlock(); @@ -1019,10 +1023,10 @@ int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id, } EXPORT_SYMBOL_GPL(ip_tunnel_init_net); -static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head, +static void ip_tunnel_destroy(struct net *net, struct ip_tunnel_net *itn, + struct list_head *head, struct rtnl_link_ops *ops) { - struct net *net = dev_net(itn->fb_tunnel_dev); struct net_device *dev, *aux; int h; @@ -1054,7 +1058,7 @@ void ip_tunnel_delete_nets(struct list_head *net_list, unsigned int id, rtnl_lock(); list_for_each_entry(net, net_list, exit_list) { itn = net_generic(net, id); - ip_tunnel_destroy(itn, &list, ops); + ip_tunnel_destroy(net, itn, &list, ops); } unregister_netdevice_many(&list); rtnl_unlock(); diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 18a3dfbd0300dae769fb67aabe8ff0ff3a831e93..7d8775c9570d858904faa6951e76f7bffc3f950a 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -236,7 +236,7 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev, return t; dev = ign->fb_tunnel_dev; - if (dev->flags & IFF_UP) + if (dev && dev->flags & IFF_UP) return netdev_priv(dev); return NULL; @@ -1472,6 +1472,8 @@ static int __net_init ip6gre_init_net(struct net *net) struct ip6gre_net *ign = net_generic(net, ip6gre_net_id); int err; + if (!net_has_fallback_tunnels(net)) + return 0; ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6gre0", NET_NAME_UNKNOWN, ip6gre_tunnel_setup); diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 56c4967f1868e5d250e7eaa63650377ff426ff94..5c045fa407da105b1bf874c984e04c643c07dffd 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -2205,6 +2205,8 @@ static int __net_init ip6_tnl_init_net(struct net *net) ip6n->tnls[0] = ip6n->tnls_wc; ip6n->tnls[1] = ip6n->tnls_r_l; + if (!net_has_fallback_tunnels(net)) + return 0; err = -ENOMEM; ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6tnl0", NET_NAME_UNKNOWN, ip6_tnl_dev_setup); diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index a9c4ac6efe22c968537bcb4bf6d01ef7d03d63aa..8a4f8fddd8121a31416103551f915ad610ac6f94 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -182,7 +182,7 @@ static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn) #ifdef CONFIG_IPV6_SIT_6RD struct ip_tunnel *t = netdev_priv(dev); - if (dev == sitn->fb_tunnel_dev) { + if (dev == sitn->fb_tunnel_dev || !sitn->fb_tunnel_dev) { ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0); t->ip6rd.relay_prefix = 0; t->ip6rd.prefixlen = 16; @@ -1835,6 +1835,9 @@ static int __net_init sit_init_net(struct net *net) sitn->tunnels[2] = sitn->tunnels_r; sitn->tunnels[3] = sitn->tunnels_r_l; + if (!net_has_fallback_tunnels(net)) + return 0; + sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "sit0", NET_NAME_UNKNOWN, ipip6_tunnel_setup);