diff mbox

[3.16.y-ckt,009/126] sit: fix sit0 percpu double allocations

Message ID 1449653896-5236-10-git-send-email-luis.henriques@canonical.com
State New
Headers show

Commit Message

Luis Henriques Dec. 9, 2015, 9:36 a.m. UTC
3.16.7-ckt21 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <edumazet@google.com>

commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.

sit0 device allocates its percpu storage twice :
- One time in ipip6_tunnel_init()
- One time in ipip6_fb_tunnel_init()

Thus we leak 48 bytes per possible cpu per network namespace dismantle.

ipip6_fb_tunnel_init() can be much simpler and does not
return an error, and should be called after register_netdev()

Note that ipip6_tunnel_clone_6rd() also needs to be called
after register_netdev() (calling ipip6_tunnel_init())

Fixes: ebe084aafb7e ("sit: Use ipip6_tunnel_init as the ndo_init function.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
---
 net/ipv6/sit.c | 27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

Comments

Ben Hutchings Dec. 12, 2015, 4:18 a.m. UTC | #1
On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> please let me know.
> 
> ------------------
> 
> From: Eric Dumazet <edumazet@google.com>
> 
> commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> 
> sit0 device allocates its percpu storage twice :
> - One time in ipip6_tunnel_init()
> - One time in ipip6_fb_tunnel_init()
> 
> Thus we leak 48 bytes per possible cpu per network namespace
> dismantle.
> 
> ipip6_fb_tunnel_init() can be much simpler and does not
> return an error, and should be called after register_netdev()
[...]

Doesn't this introduce a race condition when sit is a module?  There
seems to be nothing to prevent access to the partially initialised
device after calling register_netdev(), if sit_init_net() is called
during module loading rather than during namespace creation.

Ben.
Luis Henriques Dec. 13, 2015, 6:54 p.m. UTC | #2
On Sat, Dec 12, 2015 at 04:18:26AM +0000, Ben Hutchings wrote:
> On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> > 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> > please let me know.
> > 
> > ------------------
> > 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> > 
> > sit0 device allocates its percpu storage twice :
> > - One time in ipip6_tunnel_init()
> > - One time in ipip6_fb_tunnel_init()
> > 
> > Thus we leak 48 bytes per possible cpu per network namespace
> > dismantle.
> > 
> > ipip6_fb_tunnel_init() can be much simpler and does not
> > return an error, and should be called after register_netdev()
> [...]
> 
> Doesn't this introduce a race condition when sit is a module?  There
> seems to be nothing to prevent access to the partially initialised
> device after calling register_netdev(), if sit_init_net() is called
> during module loading rather than during namespace creation.
> 

This seems to be an upstream issue, not specific to the 3.16.y-ckt
stable kernel.  If that is the case, I guess I'll just keep this patch
and later apply the fix.  Or do you think this race is really likely
to be a worst problem than then issue the patch is trying to fix?

Cheers,
--
Luís

> Ben.
> 
> -- 
> Ben Hutchings
> Knowledge is power.  France is bacon.
Ben Hutchings Dec. 13, 2015, 8:20 p.m. UTC | #3
On Sun, 2015-12-13 at 18:54 +0000, Luis Henriques wrote:
> On Sat, Dec 12, 2015 at 04:18:26AM +0000, Ben Hutchings wrote:
> > On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> > > 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> > > please let me know.
> > > 
> > > ------------------
> > > 
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> > > 
> > > sit0 device allocates its percpu storage twice :
> > > - One time in ipip6_tunnel_init()
> > > - One time in ipip6_fb_tunnel_init()
> > > 
> > > Thus we leak 48 bytes per possible cpu per network namespace
> > > dismantle.
> > > 
> > > ipip6_fb_tunnel_init() can be much simpler and does not
> > > return an error, and should be called after register_netdev()
> > [...]
> > 
> > Doesn't this introduce a race condition when sit is a module?  There
> > seems to be nothing to prevent access to the partially initialised
> > device after calling register_netdev(), if sit_init_net() is called
> > during module loading rather than during namespace creation.
> > 
> 
> This seems to be an upstream issue, not specific to the 3.16.y-ckt
> stable kernel.  If that is the case, I guess I'll just keep this patch
> and later apply the fix.  Or do you think this race is really likely
> to be a worst problem than then issue the patch is trying to fix?

It seems worse than the problem being fixed.

Ben.
Eric Dumazet Dec. 13, 2015, 8:43 p.m. UTC | #4
On Sun, 2015-12-13 at 20:20 +0000, Ben Hutchings wrote:
> On Sun, 2015-12-13 at 18:54 +0000, Luis Henriques wrote:
> > On Sat, Dec 12, 2015 at 04:18:26AM +0000, Ben Hutchings wrote:
> > > On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> > > > 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> > > > please let me know.
> > > > 
> > > > ------------------
> > > > 
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > 
> > > > commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> > > > 
> > > > sit0 device allocates its percpu storage twice :
> > > > - One time in ipip6_tunnel_init()
> > > > - One time in ipip6_fb_tunnel_init()
> > > > 
> > > > Thus we leak 48 bytes per possible cpu per network namespace
> > > > dismantle.
> > > > 
> > > > ipip6_fb_tunnel_init() can be much simpler and does not
> > > > return an error, and should be called after register_netdev()
> > > [...]
> > > 
> > > Doesn't this introduce a race condition when sit is a module?  There
> > > seems to be nothing to prevent access to the partially initialised
> > > device after calling register_netdev(), if sit_init_net() is called
> > > during module loading rather than during namespace creation.
> > > 
> > 
> > This seems to be an upstream issue, not specific to the 3.16.y-ckt
> > stable kernel.  If that is the case, I guess I'll just keep this patch
> > and later apply the fix.  Or do you think this race is really likely
> > to be a worst problem than then issue the patch is trying to fix?
> 
> It seems worse than the problem being fixed.

1) Sorry Ben, I do not understand the problem you mention.
   What is a partially initialized device exactly ?

2) I have no idea why this patch is even backported to 3.16,
 since it is fixing a problem added in 3.18 :

# git describe --contains ebe084aafb7e
v3.18-rc5~22^2~42^2~1

If your 3.16 kernel survives this loop without consuming memory like
crazy, then the backport is not needed.

modprobe sit
while :
do
 ip netns add ns1
 ip netns delete ns1
done
Ben Hutchings Dec. 13, 2015, 9:22 p.m. UTC | #5
On Sun, 2015-12-13 at 12:43 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-13 at 20:20 +0000, Ben Hutchings wrote:
> > On Sun, 2015-12-13 at 18:54 +0000, Luis Henriques wrote:
> > > On Sat, Dec 12, 2015 at 04:18:26AM +0000, Ben Hutchings wrote:
> > > > On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> > > > > 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> > > > > please let me know.
> > > > > 
> > > > > ------------------
> > > > > 
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > 
> > > > > commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> > > > > 
> > > > > sit0 device allocates its percpu storage twice :
> > > > > - One time in ipip6_tunnel_init()
> > > > > - One time in ipip6_fb_tunnel_init()
> > > > > 
> > > > > Thus we leak 48 bytes per possible cpu per network namespace
> > > > > dismantle.
> > > > > 
> > > > > ipip6_fb_tunnel_init() can be much simpler and does not
> > > > > return an error, and should be called after register_netdev()
> > > > [...]
> > > > 
> > > > Doesn't this introduce a race condition when sit is a module?  There
> > > > seems to be nothing to prevent access to the partially initialised
> > > > device after calling register_netdev(), if sit_init_net() is called
> > > > during module loading rather than during namespace creation.
> > > > 
> > > 
> > > This seems to be an upstream issue, not specific to the 3.16.y-ckt
> > > stable kernel.  If that is the case, I guess I'll just keep this patch
> > > and later apply the fix.  Or do you think this race is really likely
> > > to be a worst problem than then issue the patch is trying to fix?
> > 
> > It seems worse than the problem being fixed.
> 
> 1) Sorry Ben, I do not understand the problem you mention.
>    What is a partially initialized device exactly ?

A tunnel device which is registered but hasn't had its private
structure fully initialised yet.

> 2) I have no idea why this patch is even backported to 3.16,
>  since it is fixing a problem added in 3.18 :
> 
> # git describe --contains ebe084aafb7e
> v3.18-rc5~22^2~42^2~1
> 
> If your 3.16 kernel survives this loop without consuming memory like
> crazy, then the backport is not needed.
> 
> modprobe sit
> while :
> do
>  ip netns add ns1
>  ip netns delete ns1
> done

I can't detect a memory leak when doing this.

Ben.
Luis Henriques Dec. 13, 2015, 9:32 p.m. UTC | #6
On Sun, Dec 13, 2015 at 12:43:10PM -0800, Eric Dumazet wrote:
> On Sun, 2015-12-13 at 20:20 +0000, Ben Hutchings wrote:
> > On Sun, 2015-12-13 at 18:54 +0000, Luis Henriques wrote:
> > > On Sat, Dec 12, 2015 at 04:18:26AM +0000, Ben Hutchings wrote:
> > > > On Wed, 2015-12-09 at 09:36 +0000, Luis Henriques wrote:
> > > > > 3.16.7-ckt21 -stable review patch.  If anyone has any objections,
> > > > > please let me know.
> > > > > 
> > > > > ------------------
> > > > > 
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > 
> > > > > commit 4ece9009774596ee3df0acba65a324b7ea79387c upstream.
> > > > > 
> > > > > sit0 device allocates its percpu storage twice :
> > > > > - One time in ipip6_tunnel_init()
> > > > > - One time in ipip6_fb_tunnel_init()
> > > > > 
> > > > > Thus we leak 48 bytes per possible cpu per network namespace
> > > > > dismantle.
> > > > > 
> > > > > ipip6_fb_tunnel_init() can be much simpler and does not
> > > > > return an error, and should be called after register_netdev()
> > > > [...]
> > > > 
> > > > Doesn't this introduce a race condition when sit is a module?  There
> > > > seems to be nothing to prevent access to the partially initialised
> > > > device after calling register_netdev(), if sit_init_net() is called
> > > > during module loading rather than during namespace creation.
> > > > 
> > > 
> > > This seems to be an upstream issue, not specific to the 3.16.y-ckt
> > > stable kernel.  If that is the case, I guess I'll just keep this patch
> > > and later apply the fix.  Or do you think this race is really likely
> > > to be a worst problem than then issue the patch is trying to fix?
> > 
> > It seems worse than the problem being fixed.
> 
> 1) Sorry Ben, I do not understand the problem you mention.
>    What is a partially initialized device exactly ?
> 
> 2) I have no idea why this patch is even backported to 3.16,
>  since it is fixing a problem added in 3.18 :
> 
> # git describe --contains ebe084aafb7e
> v3.18-rc5~22^2~42^2~1
>

Right, but this commit has been included in several stable git trees
(I can see it at least in 3.14, 3.13 and 3.10, but it's probably in
some more).

Cheers,
--
Luís


> If your 3.16 kernel survives this loop without consuming memory like
> crazy, then the backport is not needed.
> 
> modprobe sit
> while :
> do
>  ip netns add ns1
>  ip netns delete ns1
> done
> 
> 
> 
>
Eric Dumazet Dec. 13, 2015, 9:44 p.m. UTC | #7
On Sun, 2015-12-13 at 21:22 +0000, Ben Hutchings wrote:
> On Sun, 2015-12-13 at 12:43 -0800, Eric Dumazet wrote:

> > 1) Sorry Ben, I do not understand the problem you mention.
> >    What is a partially initialized device exactly ?
> 
> A tunnel device which is registered but hasn't had its private
> structure fully initialised yet.

And you see this happening after my patch ? I am blind.

I am referring to current linux kernel, not to a backport to pre 3.18
kernels, that was not considered when I wrote this patch.

By the time ipip6_fb_tunnel_init() is called, dev->tstats had been
already allocated in ipip6_tunnel_init(), so what is missing ?

Thanks.
Ben Hutchings Dec. 13, 2015, 9:49 p.m. UTC | #8
On Sun, 2015-12-13 at 13:44 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-13 at 21:22 +0000, Ben Hutchings wrote:
> > On Sun, 2015-12-13 at 12:43 -0800, Eric Dumazet wrote:
> 
> > > 1) Sorry Ben, I do not understand the problem you mention.
> > >    What is a partially initialized device exactly ?
> > 
> > A tunnel device which is registered but hasn't had its private
> > structure fully initialised yet.
> 
> And you see this happening after my patch ? I am blind.
> 
> I am referring to current linux kernel, not to a backport to pre 3.18
> kernels, that was not considered when I wrote this patch.
> 
> By the time ipip6_fb_tunnel_init() is called, dev->tstats had been
> already allocated in ipip6_tunnel_init(), so what is missing ?

You moved this initialisation below the registration:

> 	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
> 	ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);

Ben.
David Miller Dec. 13, 2015, 10:14 p.m. UTC | #9
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 13 Dec 2015 12:43:10 -0800

> 2) I have no idea why this patch is even backported to 3.16,
>  since it is fixing a problem added in 3.18 :

Because someone on the path to some of the -stable trees aren't even
checking the Fixes: tag in the patches they are backporting.

And that's really bad...
Luis Henriques Dec. 13, 2015, 10:22 p.m. UTC | #10
On Sun, Dec 13, 2015 at 05:14:06PM -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 13 Dec 2015 12:43:10 -0800
> 
> > 2) I have no idea why this patch is even backported to 3.16,
> >  since it is fixing a problem added in 3.18 :
> 
> Because someone on the path to some of the -stable trees aren't even
> checking the Fixes: tag in the patches they are backporting.
> 
> And that's really bad...

As I already said before, this patch has been included in 3.16 because
the SHA1 in the "Fixes:" *is* in 3.16 (as it is in other even older
stable trees such as 3.12 and 3.14).

Cheers,
--
Luís
Eric Dumazet Dec. 13, 2015, 10:45 p.m. UTC | #11
On Sun, 2015-12-13 at 21:49 +0000, Ben Hutchings wrote:
> On Sun, 2015-12-13 at 13:44 -0800, Eric Dumazet wrote:
> > On Sun, 2015-12-13 at 21:22 +0000, Ben Hutchings wrote:
> > > On Sun, 2015-12-13 at 12:43 -0800, Eric Dumazet wrote:
> > 
> > > > 1) Sorry Ben, I do not understand the problem you mention.
> > > >    What is a partially initialized device exactly ?
> > > 
> > > A tunnel device which is registered but hasn't had its private
> > > structure fully initialised yet.
> > 
> > And you see this happening after my patch ? I am blind.
> > 
> > I am referring to current linux kernel, not to a backport to pre 3.18
> > kernels, that was not considered when I wrote this patch.
> > 
> > By the time ipip6_fb_tunnel_init() is called, dev->tstats had been
> > already allocated in ipip6_tunnel_init(), so what is missing ?
> 
> You moved this initialisation below the registration:
> 
> > 	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
> > 	ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);

Okay, so what is the exact problem you are seeing Ben ?

ipip6_tunnel_clone_6rd() looks to not contain a fatal race or mem leak.

Note that ipip6_tunnel_clone_6rd() can be called later from ioctl()
path.

ipip6_fb_tunnel_init() must be done once device is ready, as it
publishes state for packet processing.

rcu_assign_pointer(sitn->tunnels_wc[0], tunnel);

Looks like a rather correct way to register a device : init all fields,
then publish the RCU protected pointer for packets to catch it.

Really, I do not see a problem, please elaborate.
Ben Hutchings Dec. 13, 2015, 11:03 p.m. UTC | #12
On Sun, 2015-12-13 at 14:45 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-13 at 21:49 +0000, Ben Hutchings wrote:
> > On Sun, 2015-12-13 at 13:44 -0800, Eric Dumazet wrote:
> > > On Sun, 2015-12-13 at 21:22 +0000, Ben Hutchings wrote:
> > > > On Sun, 2015-12-13 at 12:43 -0800, Eric Dumazet wrote:
> > > 
> > > > > 1) Sorry Ben, I do not understand the problem you mention.
> > > > >    What is a partially initialized device exactly ?
> > > > 
> > > > A tunnel device which is registered but hasn't had its private
> > > > structure fully initialised yet.
> > > 
> > > And you see this happening after my patch ? I am blind.
> > > 
> > > I am referring to current linux kernel, not to a backport to pre 3.18
> > > kernels, that was not considered when I wrote this patch.
> > > 
> > > By the time ipip6_fb_tunnel_init() is called, dev->tstats had been
> > > already allocated in ipip6_tunnel_init(), so what is missing ?
> > 
> > You moved this initialisation below the registration:
> > 
> > > 	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
> > > 	ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
> 
> Okay, so what is the exact problem you are seeing Ben ?
> 
> ipip6_tunnel_clone_6rd() looks to not contain a fatal race or mem leak.
> 
> Note that ipip6_tunnel_clone_6rd() can be called later from ioctl()
> path.

That holds the rtnl_lock, though.

> ipip6_fb_tunnel_init() must be done once device is ready, as it
> publishes state for packet processing.

OK.

> rcu_assign_pointer(sitn->tunnels_wc[0], tunnel);
> 
> Looks like a rather correct way to register a device : init all fields,
> then publish the RCU protected pointer for packets to catch it.
> 
> Really, I do not see a problem, please elaborate.

Maybe there isn't one in this case.

Ben.
diff mbox

Patch

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 051e9c508933..b5bdd2aeb2f8 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1374,34 +1374,20 @@  static int ipip6_tunnel_init(struct net_device *dev)
 	return 0;
 }
 
-static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
+static void __net_init ipip6_fb_tunnel_init(struct net_device *dev)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 	struct iphdr *iph = &tunnel->parms.iph;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
-	tunnel->dev = dev;
-	tunnel->net = dev_net(dev);
-
 	iph->version		= 4;
 	iph->protocol		= IPPROTO_IPV6;
 	iph->ihl		= 5;
 	iph->ttl		= 64;
 
-	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-	if (!dev->tstats)
-		return -ENOMEM;
-
-	tunnel->dst_cache = alloc_percpu(struct ip_tunnel_dst);
-	if (!tunnel->dst_cache) {
-		free_percpu(dev->tstats);
-		return -ENOMEM;
-	}
-
 	dev_hold(dev);
 	rcu_assign_pointer(sitn->tunnels_wc[0], tunnel);
-	return 0;
 }
 
 static int ipip6_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -1738,23 +1724,18 @@  static int __net_init sit_init_net(struct net *net)
 	 */
 	sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 
-	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
-	if (err)
-		goto err_dev_free;
-
-	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
-
 	if ((err = register_netdev(sitn->fb_tunnel_dev)))
 		goto err_reg_dev;
 
+	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
+	ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
+
 	t = netdev_priv(sitn->fb_tunnel_dev);
 
 	strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
 	return 0;
 
 err_reg_dev:
-	dev_put(sitn->fb_tunnel_dev);
-err_dev_free:
 	ipip6_dev_free(sitn->fb_tunnel_dev);
 err_alloc_dev:
 	return err;