diff mbox

IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?

Message ID 20100422.185400.71096585.davem@davemloft.net
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller April 23, 2010, 1:54 a.m. UTC
From: Jiri Bohac <jbohac@suse.cz>
Date: Thu, 22 Apr 2010 17:49:08 +0200

> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

It likely wants to do both.

Just doing dst_release() doesn't mark the 'dst' object as obsolete,
and therefore it won't get force garbage collected.

That's why the dst_free() is necessary, to really get rid of it when
the refcount does hit zero.

Actually, what's really interesting is that at the top of the
linux-2.6-history tree this code reads:

		dst_hold(&ifp->rt->u.dst);
		if (ip6_del_rt(ifp->rt, NULL, NULL))
			dst_free(&ifp->rt->u.dst);
		else
			dst_release(&ifp->rt->u.dst);

and in Linus's initial GIT import, it reads this way too.

Where did it change to the current form that lacks the
else block?

Aha!  Here it is:

commit 4641e7a334adf6856300a98e7296dfc886c446af
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 2 16:55:45 2006 -0800

    [IPV6]: Don't hold extra ref count in ipv6_ifa_notify
    
    Currently the logic in ipv6_ifa_notify is to hold an extra reference
    count for addrconf dst's that get added to the routing table.  Thus,
    when addrconf dst entries are taken out of the routing table, we need
    to drop that dst.  However, addrconf dst entries may be removed from
    the routing table by means other than __ipv6_ifa_notify.
    
    So we're faced with the choice of either fixing up all places where
    addrconf dst entries are removed, or dropping the extra reference count
    altogether.
    
    I chose the latter because the ifp itself always holds a dst reference
    count of 1 while it's alive.  This is dropped just before we kfree the
    ifp object.  Therefore we know that in __ipv6_ifa_notify we will always
    hold that count.
    
    This bug was found by Eric W. Biederman.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Herbert Xu April 23, 2010, 2:10 a.m. UTC | #1
On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.

Actually I don't think the problem is in __ipv6_ifa_notify.  The
fact is none of this stuff is meant to be idempotent.  So it's
up to the entity that is requesting the deletion to make sure that
a single object is not deleted more than once.

Yes the original symptom was in __ipv6_ifa_notify, but it is
merely pointing out that we have a problem further up.

My patch is indeed not sufficient as Jiri pointed out, because
I didn't deal with the case of an administrative deletion of
autmatically generated IPv6 addresses.

I will post an updated patch later today to deal with that.

Cheers,
Herbert Xu April 23, 2010, 3:05 p.m. UTC | #2
On Fri, Apr 23, 2010 at 10:10:00AM +0800, Herbert Xu wrote:
> 
> I will post an updated patch later today to deal with that.

Just got back from a business trip.

This stuff is more broken than I thought.  For example, we perform
a number of actions when DAD succeeds, e.g., joining an anycast
group.  However, this is not synchronised with respect to address
deletion at all, so if DAD succeeds just as someone deletes the
address, you can easily get stuck on that anycast group.

I will try to untangle this mess tomorrow.

Cheers,
Jiri Bohac April 27, 2010, 3:50 p.m. UTC | #3
On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.
> 
> Just doing dst_release() doesn't mark the 'dst' object as obsolete,
> and therefore it won't get force garbage collected.

Sure. So If I understand it correctly, there are two problems:

- the reference taken by dst_hold() just above the ip6_del_rt()
  is never dropped if ip6_del_rt() fails; so shouldn't the code
  be like this?:

	dst_hold(&ifp->rt->u.dst);
	if (ip6_del_rt(ifp->rt)) {
		dst_release(&ifp->rt->u.dst);
		dst_free(&ifp->rt->u.dst);
	}

- if ip6_del_rt() fails because it races with something else
  deleting the address, dst_free() will be called twice. This is
  what Herbert is fixing with additional locking. However -- even
  when he fixes that -- how can ip6_del_rt() fail with the
  ifp->rt still needing a dst_free()?
  AFAICS, it can fail by:
  	- __ip6_del_rt() finding that (rt ==
	  net->ipv6.ip6_null_entry); we don't want to call
	  dst_free() on net->ipv6.ip6_null_entry, do we?

	- fib6_del() returning -ENOENT for multiple reasons;
	  but doesn't that mean that something else has removed
	  the route already and called dst_free on it?

  In either case, the dst_free() looks like not being needed. And it only
  does no harm in most cases, because these events are rare and
  it usually finds (obsolete > 2) and does nothing.
  
  I think that what Herbert is doing is only going to enforce that
  the ip6_del_rt() is never going to fail, so the dst_free()
  won't ever be called anyway, right?

Thanks,
Herbert Xu April 27, 2010, 3:55 p.m. UTC | #4
On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
>
>   I think that what Herbert is doing is only going to enforce that
>   the ip6_del_rt() is never going to fail, so the dst_free()
>   won't ever be called anyway, right?

Yes I think you're right.  But let's fix the bigger problem first,
and then we can audit this and possibly turn it into a WARN_ON.

Thanks,
Herbert Xu May 9, 2010, 6:48 a.m. UTC | #5
On Tue, Apr 27, 2010 at 11:55:33PM +0800, Herbert Xu wrote:
> On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
> >
> >   I think that what Herbert is doing is only going to enforce that
> >   the ip6_del_rt() is never going to fail, so the dst_free()
> >   won't ever be called anyway, right?
> 
> Yes I think you're right.  But let's fix the bigger problem first,
> and then we can audit this and possibly turn it into a WARN_ON.

Sorry for not fixing this sooner, but I'm back on the case.

Also, the dst_free really is needed.  It's there for the case
where we delete the address before DAD completion.  In that
case the route object is allocated, but not inserted in the
routing table.  So we must manually dst_free it.

Cheers,
Herbert Xu May 18, 2010, 11:02 a.m. UTC | #6
On Fri, Apr 23, 2010 at 11:05:40PM +0800, Herbert Xu wrote:
>
> This stuff is more broken than I thought.  For example, we perform
> a number of actions when DAD succeeds, e.g., joining an anycast
> group.  However, this is not synchronised with respect to address
> deletion at all, so if DAD succeeds just as someone deletes the
> address, you can easily get stuck on that anycast group.
> 
> I will try to untangle this mess tomorrow.

Tomorrow took a while to arrive :)

Here is a first batch of patches.  Note that this is by no means
a comprehensive fix for all the ndisc/addrconf race conditions.
It is just a first step in trying to address the problems.

The patchset revolves around a new lock, ifp->state_lock.  I
added it instead of trying to reuse the existing ifp->lock because
the latter has serious nesting issues that prevent it from easily
being used.  My long term plan is to restructure the locking and
eventually phase out ifp->lock in favour of ifp->state_lock.

Cheers,
stephen hemminger May 18, 2010, 5:25 p.m. UTC | #7
On Tue, 18 May 2010 21:02:43 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Apr 23, 2010 at 11:05:40PM +0800, Herbert Xu wrote:
> >
> > This stuff is more broken than I thought.  For example, we perform
> > a number of actions when DAD succeeds, e.g., joining an anycast
> > group.  However, this is not synchronised with respect to address
> > deletion at all, so if DAD succeeds just as someone deletes the
> > address, you can easily get stuck on that anycast group.
> > 
> > I will try to untangle this mess tomorrow.
> 
> Tomorrow took a while to arrive :)
> 
> Here is a first batch of patches.  Note that this is by no means
> a comprehensive fix for all the ndisc/addrconf race conditions.
> It is just a first step in trying to address the problems.
> 
> The patchset revolves around a new lock, ifp->state_lock.  I
> added it instead of trying to reuse the existing ifp->lock because
> the latter has serious nesting issues that prevent it from easily
> being used.  My long term plan is to restructure the locking and
> eventually phase out ifp->lock in favour of ifp->state_lock.

I wonder if so many fine grained locks are really necessary at
all.  Everything but timers looks like it is under RTNL mutex
already.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 18, 2010, 10:27 p.m. UTC | #8
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 18 May 2010 10:25:50 -0700

> I wonder if so many fine grained locks are really necessary at
> all.  Everything but timers looks like it is under RTNL mutex
> already.

I can't see any reasonable alternative, and it took weeks to get
a fix at all.

So I'm going to apply Herbert's fixes with the __KERNEL__ header
bit fixed up.

Thanks Herbert!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu May 18, 2010, 10:35 p.m. UTC | #9
On Tue, May 18, 2010 at 03:27:59PM -0700, David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 18 May 2010 10:25:50 -0700
> 
> > I wonder if so many fine grained locks are really necessary at
> > all.  Everything but timers looks like it is under RTNL mutex
> > already.
> 
> I can't see any reasonable alternative, and it took weeks to get
> a fix at all.

Right, the issue is with forcing an immediate effect after an
NDISC packet triggers an address addition/deletion.

From what I can gather, the hard case was with NDISC address
addition with DAD disabled.  That is, you have an NDISC packet
that causes an addition followed immediately by a packet destined
(or otherwise relying on) that new address.  In that case, we
simply have no chance of using the RTNL.

Cheers,
diff mbox

Patch

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d328d59..1db5048 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3321,9 +3321,7 @@  static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 
 	switch (event) {
 	case RTM_NEWADDR:
-		dst_hold(&ifp->rt->u.dst);
-		if (ip6_ins_rt(ifp->rt, NULL, NULL, NULL))
-			dst_release(&ifp->rt->u.dst);
+		ip6_ins_rt(ifp->rt, NULL, NULL, NULL);
 		if (ifp->idev->cnf.forwarding)
 			addrconf_join_anycast(ifp);
 		break;
@@ -3334,8 +3332,6 @@  static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 		dst_hold(&ifp->rt->u.dst);
 		if (ip6_del_rt(ifp->rt, NULL, NULL, NULL))
 			dst_free(&ifp->rt->u.dst);
-		else
-			dst_release(&ifp->rt->u.dst);
 		break;
 	}
 }