diff mbox

IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?

Message ID 20100420174401.GB1334@midget.suse.cz
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jiri Bohac April 20, 2010, 5:44 p.m. UTC
Hi,

I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
to dst_free().

__ipv6_ifa_notify() contains:

        case RTM_DELADDR:
                if (ifp->idev->cnf.forwarding)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
                if (ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;

AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
deletes the route: 
	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
	-> rt6_release() -> dst_free()

If it fails (like when it races with another invocation of ip6_del_rt()), it
will return nonzero and this will cause the above code to call dst_free() on its own.

dst_free() has no protection against concurrent invocation and if
two invocations make it through the "if (dst->obsolete > 1)"
check before one of them calls __dst_free(), the same dst_entry
may end up either:
	1) dst_destroy()ed and put on the dst_garbage.list, or
	2) put on the dst_garbage.list twice
both resulting in trouble once the GC is run.

One possible code path leading to two invocations of __ipv6_ifa_notify() seems
to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
the bonding master is in the DAD phase with a tentative address:

netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
 -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
 -> __ipv6_ifa_notify


What is the reason __ipv6_ifa_notify() calls dst_free() when
ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
with the dst still needing to be freed.

I am just testing whether the following will help:


Thanks,

Comments

Eric Dumazet April 20, 2010, 5:57 p.m. UTC | #1
Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> Hi,
> 
> I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
> to dst_free().
> 
> __ipv6_ifa_notify() contains:
> 
>         case RTM_DELADDR:
>                 if (ifp->idev->cnf.forwarding)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
>                 if (ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;
> 
> AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
> deletes the route: 
> 	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
> 	-> rt6_release() -> dst_free()
> 
> If it fails (like when it races with another invocation of ip6_del_rt()), it
> will return nonzero and this will cause the above code to call dst_free() on its own.
> 
> dst_free() has no protection against concurrent invocation and if

Sorry ? of course dst_free() has a protection...

By definition, the dst_destroy() is called only by the last thread with
the final refcount on object.

> two invocations make it through the "if (dst->obsolete > 1)"
> check before one of them calls __dst_free(), the same dst_entry
> may end up either:
> 	1) dst_destroy()ed and put on the dst_garbage.list, or
> 	2) put on the dst_garbage.list twice
> both resulting in trouble once the GC is run.
> 
> One possible code path leading to two invocations of __ipv6_ifa_notify() seems
> to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
> the bonding master is in the DAD phase with a tentative address:
> 
> netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
> ... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
>  -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
>  -> __ipv6_ifa_notify
> 
> 
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.
> 
> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }
> 


I dont understand the problem Jiri.

We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
dst_free(), or we leak a refcount.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Bohac April 20, 2010, 8:49 p.m. UTC | #2
On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> >  			addrconf_leave_anycast(ifp);
> >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> >  		dst_hold(&ifp->rt->u.dst);
> > -		if (ip6_del_rt(ifp->rt))
> > -			dst_free(&ifp->rt->u.dst);
> > +		ip6_del_rt(ifp->rt);
> >  		break;
> >  	}
> >  }
> > 
> 
> 
> I dont understand the problem Jiri.
> 
> We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> dst_free(), or we leak a refcount.

Well, no ... dst_free() does not decrement the refcount.
The "opposite" of dst_hold() is dst_release(). And it does not
automatically call dst_free() when the refcount drops to 0.
dst_free() needs to be called explicitly and it apparently
expects the caller to ensure that two dst_free()s won't be called
simultaneously. But my bonding example shows this is not the
case.
Jiri Bohac April 21, 2010, 9:34 p.m. UTC | #3
Hi,

On Tue, Apr 20, 2010 at 07:44:01PM +0200, Jiri Bohac wrote:
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.

checked again and I still think that if ip6_del_rt() fails,
ifp->rt must have been freed already. Anybody with a
counterexample?

> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }

not surprisingly, it helps my case -- the race condition does not
happen and the resulting oopses disappears. Of course, this does
not prove the patch is correct.

Could anybody with more insight into the dst refcounting please
take a look? The code originally came from:

	Author: Hideaki Yoshifuji <yoshfuji@linux-ipv6.org>
	Date:   Wed Aug 18 05:25:16 2004 +0900

	    [IPV6] refer inet6 device via corresponding local route from address structure.

And has been modified later by:
	commit 4641e7a334adf6856300a98e7296dfc886c446af
	Author: Herbert Xu <herbert@gondor.apana.org.au>
	Date:   Thu Feb 2 16:55:45 2006 -0800

	    [IPV6]: Don't hold extra ref count in ipv6_ifa_notify

Thanks,
Herbert Xu April 22, 2010, 2:32 a.m. UTC | #4
On Wed, Apr 21, 2010 at 11:34:29PM +0200, Jiri Bohac wrote:
> Hi,
> 
> On Tue, Apr 20, 2010 at 07:44:01PM +0200, Jiri Bohac wrote:
> > What is the reason __ipv6_ifa_notify() calls dst_free() when
> > ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> > with the dst still needing to be freed.
> 
> checked again and I still think that if ip6_del_rt() fails,
> ifp->rt must have been freed already. Anybody with a
> counterexample?

I agree with your diagnosis and the two duplicate NDISC messages
scenario sounds plausible.

Anyway, I think the root of the issue is the fact that NDISC is
calling addrconf_dad_failure with no locking whatsoever.  The
latter is not idempotent so some form of locking is needed.

This bug appears to have been around since the very start.

I'll dig deeper to see where we might be able to add some locks.

Cheers,
David Miller April 22, 2010, 7:43 a.m. UTC | #5
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 22 Apr 2010 10:32:11 +0800

> Anyway, I think the root of the issue is the fact that NDISC is
> calling addrconf_dad_failure with no locking whatsoever.  The
> latter is not idempotent so some form of locking is needed.
> 
> This bug appears to have been around since the very start.
> 
> I'll dig deeper to see where we might be able to add some locks.

Thanks Herbert.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
+++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
@@ -3974,8 +3974,7 @@  static void __ipv6_ifa_notify(int event,
 			addrconf_leave_anycast(ifp);
 		addrconf_leave_solict(ifp->idev, &ifp->addr);
 		dst_hold(&ifp->rt->u.dst);
-		if (ip6_del_rt(ifp->rt))
-			dst_free(&ifp->rt->u.dst);
+		ip6_del_rt(ifp->rt);
 		break;
 	}
 }