ipv4: remove all rt cache entries on UNREGISTER event

Message ID 4CA208C9.1020800@6wind.com
State RFC, archived
Delegated to: David Miller

Commit Message

Nicolas Dichtel Sept. 28, 2010, 3:24 p.m. UTC
Hi,

I face a problem when I try to remove an interface:
netdev_wait_allrefs() complains about the refcount.

Here is a trivial scenario to reproduce the problem:
# ip tunnel add mode ipip remote 10.16.0.164 local 10.16.0.72 dev eth0
# ./a.out tunl1
# ip tunnel del tunl1

Note: the a.out binary creates an IPv4 raw socket, attaches it to tunl1
(SO_BINDTODEVICE), enables multicast loopback (IP_MULTICAST_LOOP), sets the
multicast interface to tunl1 (IP_MULTICAST_IF), builds the IP header itself
(IP_HDRINCL) and then sends a single packet (192.168.6.1 -> 224.0.0.18).

Note 2: when a.out is executed, tunl1 has no IP address and is down.
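
For readers without the original a.out, the following is a minimal sketch of what such a reproducer might look like, based only on the description above. The function names build_ip_header() and send_probe() are hypothetical, the IP protocol number (112, VRRP, the usual user of 224.0.0.18) and the TTL are assumptions, and the original program may well differ. Opening the raw socket requires CAP_NET_RAW.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a bare 20-byte IPv4 header (no options, no payload).
 * Returns 0 on success, -1 if an address does not parse. */
int build_ip_header(struct iphdr *iph, const char *src, const char *dst)
{
	memset(iph, 0, sizeof(*iph));
	iph->version = 4;
	iph->ihl = 5;                       /* 5 * 4 = 20 bytes */
	iph->ttl = 1;                       /* assumption: link-local scope */
	iph->protocol = 112;                /* assumption: VRRP */
	iph->tot_len = htons(sizeof(*iph));
	if (inet_pton(AF_INET, src, &iph->saddr) != 1 ||
	    inet_pton(AF_INET, dst, &iph->daddr) != 1)
		return -1;
	return 0;
}

/* Send one hand-built multicast packet through ifname.
 * Returns 0 on success, -1 on any failure. */
int send_probe(const char *ifname)
{
	struct iphdr iph;
	struct sockaddr_in dst = { .sin_family = AF_INET };
	struct ip_mreqn mreq;
	int one = 1, ret = -1;
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

	if (fd < 0)
		return -1;

	memset(&mreq, 0, sizeof(mreq));
	mreq.imr_ifindex = if_nametoindex(ifname);
	if (mreq.imr_ifindex == 0 ||
	    build_ip_header(&iph, "192.168.6.1", "224.0.0.18") < 0)
		goto out;
	dst.sin_addr.s_addr = iph.daddr;

	/* Same socket options, in the same order, as listed in the note. */
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
		       ifname, strlen(ifname) + 1) == 0 &&
	    setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP,
		       &one, sizeof(one)) == 0 &&
	    setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF,
		       &mreq, sizeof(mreq)) == 0 &&
	    setsockopt(fd, IPPROTO_IP, IP_HDRINCL,
		       &one, sizeof(one)) == 0 &&
	    sendto(fd, &iph, sizeof(iph), 0,
		   (struct sockaddr *)&dst, sizeof(dst)) >= 0)
		ret = 0;
out:
	close(fd);
	return ret;
}
```

Calling send_probe("tunl1") between the "ip tunnel add" and "ip tunnel del" commands would correspond to the "./a.out tunl1" step.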

Then I get a series of "kernel:[1206699.728010] unregister_netdevice:
waiting for tunl1 to become free. Usage count = 3" messages and, after some
time, the interface is removed.

The problem is that route cache entries are only invalidated on the
UNREGISTER event, not removed (behaviour introduced by commit
e2ce146848c81af2f6d42e67990191c284bf0c33). We must wait for
rt_check_expire() to remove the remaining route cache entries.
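
The difference between invalidating and removing cache entries can be illustrated with a small user-space toy model. This is not kernel code: every toy_* name is hypothetical, and the model only assumes what the patch in this thread implies, namely that a negative delay marks entries stale without freeing them, that a delay of 0 frees them immediately, and that each cached entry pins its device with a reference.

```c
#include <stddef.h>

#define TOY_CACHE_SIZE 8

struct toy_dev {
	int refcnt;              /* references held on the device */
};

struct toy_rt {
	struct toy_dev *dev;     /* NULL when the slot is free */
	int genid;               /* generation the entry was created in */
};

struct toy_cache {
	struct toy_rt slot[TOY_CACHE_SIZE];
	int genid;               /* current generation */
};

/* Cache a route through dev; the entry takes a device reference. */
int toy_add(struct toy_cache *c, struct toy_dev *dev)
{
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
		if (!c->slot[i].dev) {
			c->slot[i].dev = dev;
			c->slot[i].genid = c->genid;
			dev->refcnt++;
			return 0;
		}
	}
	return -1;               /* cache full */
}

/* Free every stale entry and release its device reference.
 * Stands in for both an immediate flush and a later periodic pass. */
void toy_purge_stale(struct toy_cache *c)
{
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
		struct toy_rt *rt = &c->slot[i];

		if (rt->dev && rt->genid != c->genid) {
			rt->dev->refcnt--;
			rt->dev = NULL;
		}
	}
}

/* delay < 0: invalidate only. delay >= 0: invalidate and free now. */
void toy_cache_flush(struct toy_cache *c, int delay)
{
	c->genid++;              /* all existing entries become stale */
	if (delay >= 0)
		toy_purge_stale(c);
}
```

In this model, toy_cache_flush(&c, -1) leaves the device refcount above zero until a later toy_purge_stale() pass, which mirrors the wait for rt_check_expire(); toy_cache_flush(&c, 0) drops the reference right away.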

To fix the problem, I propose reverting part of that commit.

Regards,
Nicolas

Comments

Eric Dumazet Sept. 28, 2010, 4:33 p.m. UTC | #1
On Tuesday, 28 September 2010 at 17:24 +0200, Nicolas Dichtel wrote:
> Hi,
> 
> I face a problem when I try to remove an interface:
> netdev_wait_allrefs() complains about the refcount.
> 
> Here is a trivial scenario to reproduce the problem:
> # ip tunnel add mode ipip remote 10.16.0.164 local 10.16.0.72 dev eth0
> # ./a.out tunl1
> # ip tunnel del tunl1
> 
> Note: the a.out binary creates an IPv4 raw socket, attaches it to tunl1
> (SO_BINDTODEVICE), enables multicast loopback (IP_MULTICAST_LOOP), sets the
> multicast interface to tunl1 (IP_MULTICAST_IF), builds the IP header itself
> (IP_HDRINCL) and then sends a single packet (192.168.6.1 -> 224.0.0.18).
> 
> Note 2: when a.out is executed, tunl1 has no IP address and is down.
> 

CC Octavian Purdila, the patch author.

I am just wondering why this route is created in the first place.

Maybe a fix would be to forbid this?

Some machines have a giant route cache, so it's very important to avoid
expensive scans.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas Dichtel Sept. 28, 2010, 4:45 p.m. UTC | #2
Eric Dumazet wrote:
> On Tuesday, 28 September 2010 at 17:24 +0200, Nicolas Dichtel wrote:
>> Hi,
>>
>> I face a problem when I try to remove an interface:
>> netdev_wait_allrefs() complains about the refcount.
>>
>> Here is a trivial scenario to reproduce the problem:
>> # ip tunnel add mode ipip remote 10.16.0.164 local 10.16.0.72 dev eth0
>> # ./a.out tunl1
>> # ip tunnel del tunl1
>>
>> Note: the a.out binary creates an IPv4 raw socket, attaches it to tunl1
>> (SO_BINDTODEVICE), enables multicast loopback (IP_MULTICAST_LOOP), sets the
>> multicast interface to tunl1 (IP_MULTICAST_IF), builds the IP header itself
>> (IP_HDRINCL) and then sends a single packet (192.168.6.1 -> 224.0.0.18).
>>
>> Note 2: when a.out is executed, tunl1 has no IP address and is down.
>>
> 
> CC Octavian Purdila, the patch author.
> 
> I am just wondering why this route is created in the first place.
At first I asked myself the same question, but it seems that sending a
packet through this kind of socket is allowed even if the interface is
down. The packet will be dropped by the noop qdisc.
But I agree that it is strange to perform a route lookup and everything
else only to drop the packet at the end ...
Maybe raw_sendmsg() could drop it directly ;-) ... or maybe
ip_route_output_flow().
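
As a rough illustration of the "refuse early when the interface is down" idea, here is a user-space analogue that checks the interface flags before sending. It is only a sketch: the thread does not settle on this approach, the real change would live in the kernel send path, and the helper names check_flags_for_send() and check_dev_for_send() as well as the choice of -ENETDOWN are assumptions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Decide from the interface flags whether sending makes sense at all.
 * Returns 0 if the interface is up, -ENETDOWN otherwise. */
int check_flags_for_send(short flags)
{
	return (flags & IFF_UP) ? 0 : -ENETDOWN;
}

/* Query the flags of ifname through any open socket fd and apply the
 * check above. Returns 0, -ENETDOWN, or a negative errno from ioctl(). */
int check_dev_for_send(int fd, const char *ifname)
{
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0)
		return -errno;
	return check_flags_for_send(ifr.ifr_flags);
}
```

A sender would call check_dev_for_send(fd, "tunl1") before sendto() and skip the send on a non-zero result, so no route lookup is triggered for a packet that would be dropped anyway.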

Any suggestions welcome.

Regards,
Nicolas

Octavian Purdila Sept. 28, 2010, 5:35 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tuesday 28 September 2010, 19:33:49

> On Tuesday, 28 September 2010 at 17:24 +0200, Nicolas Dichtel wrote:
> > Hi,
> > 
> > I face a problem when I try to remove an interface:
> > netdev_wait_allrefs() complains about the refcount.
> > 
> > Here is a trivial scenario to reproduce the problem:
> > # ip tunnel add mode ipip remote 10.16.0.164 local 10.16.0.72 dev eth0
> > # ./a.out tunl1
> > # ip tunnel del tunl1
> > 
> > Note: the a.out binary creates an IPv4 raw socket, attaches it to tunl1
> > (SO_BINDTODEVICE), enables multicast loopback (IP_MULTICAST_LOOP), sets the
> > multicast interface to tunl1 (IP_MULTICAST_IF), builds the IP header itself
> > (IP_HDRINCL) and then sends a single packet (192.168.6.1 -> 224.0.0.18).
> > 
> > Note 2: when a.out is executed, tunl1 has no IP address and is down.
> 
> CC Octavian Purdila, the patch author.
> 
> I am just wondering why this route is created in the first place.
> 
> Maybe a fix would be to forbid this?
> 
> Some machines have a giant route cache, so it's very important to avoid
> expensive scans.
> 
> > Then I get a series of "kernel:[1206699.728010] unregister_netdevice:
> > waiting for tunl1 to become free. Usage count = 3" messages and, after some
> > time, the interface is removed.
> > 
> > The problem is that route cache entries are only invalidated on the
> > UNREGISTER event, not removed (behaviour introduced by commit
> > e2ce146848c81af2f6d42e67990191c284bf0c33). We must wait for
> > rt_check_expire() to remove the remaining route cache entries.
> > 
> > To fix the problem, I propose reverting part of that commit.
> > 


Hi Nicolas,

The purpose of my original patch was to speed up interface deregistration
even more after Eric's batch work. Reverting it might slow things down again,
but since it is breaking things we should probably revert it and think about a
proper optimization afterward. I know that Eric B has done some more work in
this area, for batch namespace cleanup, so maybe the issue is not even there
anymore.
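
To make the speed concern concrete, here is a toy cost model comparing one full cache scan per unregistered device with a single scan per batch. It is only an illustration: the function names are hypothetical, and it assumes that invalidation is O(1) per device while a flush visits every cache entry, which is a simplification of whatever the kernel actually does.

```c
/* Cost of one full walk over a cache holding cache_entries entries. */
static unsigned long full_scan_cost(unsigned long cache_entries)
{
	return cache_entries;
}

/* Every device in the batch triggers its own full flush. */
unsigned long cost_flush_each(unsigned long ndevs,
			      unsigned long cache_entries)
{
	unsigned long visited = 0;

	for (unsigned long i = 0; i < ndevs; i++)
		visited += full_scan_cost(cache_entries);
	return visited;
}

/* Every device only invalidates (modelled as free); the batch then
 * pays for a single full flush at the end. */
unsigned long cost_flush_once(unsigned long ndevs,
			      unsigned long cache_entries)
{
	(void)ndevs;             /* per-device invalidation costs nothing here */
	return full_scan_cost(cache_entries);
}
```

With 1000 devices and a cache of one million entries, the model gives 1000 times more entries visited for the per-device flush than for the batched one, which is the kind of gap the original optimization was after.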

So, Ack from me.

We might even fully revert the patch, since the bit that is left doesn't have 
any value anymore.

Thanks,
tavi

Patch

From 3344e2e0431fe803c4dac8757a8746908357d780 Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 28 Sep 2010 16:38:19 +0200
Subject: [PATCH] ipv4: remove all rt cache entries on UNREGISTER event

Commit e2ce146848c81af2f6d42e67990191c284bf0c33 (ipv4: factorize cache clearing
for batched unregister operations) added a new parameter to fib_disable_ip() to
only invalidate route cache entries on the unregister event.
This is wrong: we must ensure that all cache entries are removed on the
unregister event, otherwise netdev_wait_allrefs() may complain. A cache entry
can be created between the DOWN and UNREGISTER events.

So, revert that part of the patch.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv4/fib_frontend.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 7d02a9f..377e815 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -917,11 +917,11 @@  static void nl_fib_lookup_exit(struct net *net)
 	net->ipv4.fibnl = NULL;
 }
 
-static void fib_disable_ip(struct net_device *dev, int force, int delay)
+static void fib_disable_ip(struct net_device *dev, int force)
 {
 	if (fib_sync_down_dev(dev, force))
 		fib_flush(dev_net(dev));
-	rt_cache_flush(dev_net(dev), delay);
+	rt_cache_flush(dev_net(dev), 0);
 	arp_ifdown(dev);
 }
 
@@ -944,7 +944,7 @@  static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 			/* Last address was deleted from this interface.
 			   Disable IP.
 			 */
-			fib_disable_ip(dev, 1, 0);
+			fib_disable_ip(dev, 1);
 		} else {
 			rt_cache_flush(dev_net(dev), -1);
 		}
@@ -959,7 +959,7 @@  static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct in_device *in_dev = __in_dev_get_rtnl(dev);
 
 	if (event == NETDEV_UNREGISTER) {
-		fib_disable_ip(dev, 2, -1);
+		fib_disable_ip(dev, 2);
 		return NOTIFY_DONE;
 	}
 
@@ -977,7 +977,7 @@  static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(dev_net(dev), -1);
 		break;
 	case NETDEV_DOWN:
-		fib_disable_ip(dev, 0, 0);
+		fib_disable_ip(dev, 0);
 		break;
 	case NETDEV_CHANGEMTU:
 	case NETDEV_CHANGE:
-- 
1.5.6.5