[1/2] net: Allow to create links with given ifindex

Message ID 50160EEF.6050406@parallels.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Pavel Emelyanov July 30, 2012, 4:34 a.m. UTC
Currently RTM_NEWLINK fails with -EOPNOTSUPP if ifinfomsg->ifi_index
is not zero. I propose to allow requesting ifindices on link creation. This
is required by checkpoint-restore to correctly restore a net namespace
(i.e. a container). The question of what to do with pre-created devices such
as lo or the sit fallback device is open, but for manually created devices
this can be solved by this patch.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
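
For illustration, a userspace request that would exercise this change could be built roughly as follows. This is a minimal sketch and not part of the patch: `struct newlink_req`, `add_attr()` and `build_newlink()` are hypothetical helper names, the 128-byte attribute buffer is an arbitrary size, and the device name and kind are examples only. The code just builds the RTM_NEWLINK message with a nonzero `ifi_index`; sending it over a NETLINK_ROUTE socket (which needs CAP_NET_ADMIN) is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

/* Hypothetical request layout: header, ifinfomsg, then room for attributes. */
struct newlink_req {
    struct nlmsghdr  nh;
    struct ifinfomsg ifm;
    char             attrs[128];
};

/* Append one rtattr at the current end of the message; NULL if it does not fit. */
static struct rtattr *add_attr(struct newlink_req *req, unsigned short type,
                               const void *data, size_t len)
{
    size_t off = NLMSG_ALIGN(req->nh.nlmsg_len);
    struct rtattr *rta;

    if (off + RTA_ALIGN(RTA_LENGTH(len)) > sizeof(*req))
        return NULL;
    rta = (struct rtattr *)((char *)req + off);
    rta->rta_type = type;
    rta->rta_len = (unsigned short)RTA_LENGTH(len);
    if (len)
        memcpy(RTA_DATA(rta), data, len);
    req->nh.nlmsg_len = (uint32_t)(off + RTA_ALIGN(rta->rta_len));
    return rta;
}

/* Build an RTM_NEWLINK request that asks for a specific ifindex. */
static int build_newlink(struct newlink_req *req, const char *name,
                         const char *kind, int ifindex)
{
    struct rtattr *linkinfo;

    assert(req && name && kind);
    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req->nh.nlmsg_type = RTM_NEWLINK;
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK;
    req->ifm.ifi_family = AF_UNSPEC;
    req->ifm.ifi_index = ifindex;   /* nonzero: rejected before this patch */

    if (!add_attr(req, IFLA_IFNAME, name, strlen(name) + 1))
        return -1;
    linkinfo = add_attr(req, IFLA_LINKINFO, NULL, 0);
    if (!linkinfo)
        return -1;
    if (!add_attr(req, IFLA_INFO_KIND, kind, strlen(kind)))
        return -1;
    /* The nested attribute's length covers everything appended after it. */
    linkinfo->rta_len = (unsigned short)((char *)req + req->nh.nlmsg_len -
                                         (char *)linkinfo);
    return 0;
}
```

A caller would fill the request with something like `build_newlink(&req, "dummy9", "dummy", 9)` and send `req.nh.nlmsg_len` bytes. Per the diff in this patch, the kernel would then either register the device with index 9 or fail with -EBUSY if that index is already taken in the namespace.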

Comments

Eric W. Biederman July 30, 2012, 10:49 a.m. UTC | #1
Pavel Emelyanov <xemul@parallels.com> writes:

> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> is not zero. I propose to allow requesting ifindices on link creation. This
> is required by the checkpoint-restore to correctly restore a net namespace
> (i.e. -- a container). The question what to do with pre-created devices such
> as lo or sit fbdev is open, but for manually created devices this can be 
> solved by this patch.

Have you walked through and found the locations where we still rely on
ifindex being globally unique?

Last time I was working in this area there were several places where
things were indexed by just the interface index.

I suspect it might be easier to generate hotplug events at restart time
saying someone removed and added an identical set of network devices.
Certainly for physical hardware that needs to happen, because things
like mac addresses will change.

Eric


> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>
> ---
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ebaea1..5966e2f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5533,7 +5533,12 @@ int register_netdevice(struct net_device *dev)
>  		}
>  	}
>  
> -	dev->ifindex = dev_new_index(net);
> +	ret = -EBUSY;
> +	if (!dev->ifindex)
> +		dev->ifindex = dev_new_index(net);
> +	else if (__dev_get_by_index(net, dev->ifindex))
> +		goto err_uninit;
> +
>  	if (dev->iflink == -1)
>  		dev->iflink = dev->ifindex;
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 334b930..76e19aa 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1801,8 +1801,6 @@ replay:
>  			return -ENODEV;
>  		}
>  
> -		if (ifm->ifi_index)
> -			return -EOPNOTSUPP;
>  		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
>  			return -EOPNOTSUPP;
>  
> @@ -1828,10 +1826,14 @@ replay:
>  			return PTR_ERR(dest_net);
>  
>  		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
> -
> -		if (IS_ERR(dev))
> +		if (IS_ERR(dev)) {
>  			err = PTR_ERR(dev);
> -		else if (ops->newlink)
> +			goto out;
> +		}
> +
> +		dev->ifindex = ifm->ifi_index;
> +
> +		if (ops->newlink)
>  			err = ops->newlink(net, dev, tb, data);
>  		else
>  			err = register_netdevice(dev);
Eric W. Biederman July 30, 2012, 10:56 a.m. UTC | #2
ebiederm@xmission.com (Eric W. Biederman) writes:

> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> is not zero. I propose to allow requesting ifindices on link creation. This
>> is required by the checkpoint-restore to correctly restore a net namespace
>> (i.e. -- a container). The question what to do with pre-created devices such
>> as lo or sit fbdev is open, but for manually created devices this can be 
>> solved by this patch.
>
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
>
> Last time I was working in this area there were several places where
> things were indexed by just the interface index.

If it is really safe to make ifindex per network namespace at this
point, you can make dev_new_index have a per network namespace base
counter, and that will fix your problems with the loopback device.

Unless you have done the work to root out the last of dependencies on
ifindex being globally unique I think you will run into some operational
problems.

Eric

Eric Dumazet July 30, 2012, 11:51 a.m. UTC | #3
On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> > is not zero. I propose to allow requesting ifindices on link creation. This
> > is required by the checkpoint-restore to correctly restore a net namespace
> > (i.e. -- a container). The question what to do with pre-created devices such
> > as lo or sit fbdev is open, but for manually created devices this can be 
> > solved by this patch.
> 
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
> 
> Last time I was working in this area there were several places where
> things were indexed by just the interface index.

Really ? This would be very strange.

AFAIK dev_new_index() is always called, even in the
dev_change_net_namespace() case if there is a conflict.

And dev_new_index() could use a pernet net->ifindex instead of a
shared/static one.



Eric W. Biederman July 30, 2012, 12:33 p.m. UTC | #4
Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> > is not zero. I propose to allow requesting ifindices on link creation. This
>> > is required by the checkpoint-restore to correctly restore a net namespace
>> > (i.e. -- a container). The question what to do with pre-created devices such
>> > as lo or sit fbdev is open, but for manually created devices this can be 
>> > solved by this patch.
>> 
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>> 
>> Last time I was working in this area there were several places where
>> things were indexed by just the interface index.
>
> Really ? This would be very strange.

There at least were places that used oif or iff without being pernet
last time I was working on this.

It was never code that I understood particularly well so my memory of
what that code is, is unfortunately fuzzy.

> AFAIK dev_new_index() is always called, even in the
> dev_change_net_namespace() case if there is a conflict.

Except we never have a conflict because it takes an absurd number of
network devices to cause a 32bit counter to wrap.

> And dev_new_index() could use a pernet net->ifindex instead of a
> shared/static one.

Yes.  I made all of the core changes, and held back on making
dev_new_index() use a pernet net->ifindex because of a couple of problem
cases.

It has been a long time and those cases might have been fixed.

I'm not seeing anything obvious in the network stack with a quick skim,
but before we start relying on the property that interface indices are
not globally unique I expect a good hard look at the networking stack
to see if any of those cases where there were problems still exist.

Eric

Pavel Emelyanov July 31, 2012, 9:03 a.m. UTC | #5
On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>> is required by the checkpoint-restore to correctly restore a net namespace
>>> (i.e. -- a container). The question what to do with pre-created devices such
>>> as lo or sit fbdev is open, but for manually created devices this can be 
>>> solved by this patch.
>>
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>>
>> Last time I was working in this area there were several places where
>> things were indexed by just the interface index.
> 
> If it is really safe to make ifindex per network namespace at this
> point you can make dev_new_index have a per network namespace base
> counter, and that will fix your problems with the loopback device.

No, it's not so, unfortunately :(

First, let's imagine that on host A the loopback device got registered as
the first device, but on host B for some reason some other device got registered
first. In that case, after migration from A to B, the lo on B will have an index
equal to 2. And there is no strict requirement that lo's per-net operations
are registered first. Please correct me if I'm wrong.

Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
fallback devices. Both get created on netns creation and obtain whatever
ifindices are generated for them. Even if we make ifindex per netns, the chances
that sit gets registered _strictly_ before ipgre are zero, since they are
both modules.
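
The ordering problem described above can be shown with a small model. This is only a sketch under stated assumptions: `struct toy_ns`, `toy_add_auto()` and `toy_index_of()` are hypothetical names, the device names `sit0` and `gre0` are examples, and the per-namespace counter is the scheme under discussion rather than existing kernel behaviour.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_DEVS 8

struct toy_dev {
    const char *name;
    int ifindex;
};

/* Toy namespace with a per-namespace counter for automatic indices. */
struct toy_ns {
    struct toy_dev devs[TOY_MAX_DEVS];
    int ndevs;
    int counter;
};

/* Register a device the way auto-created devices are: next index in line. */
static int toy_add_auto(struct toy_ns *ns, const char *name)
{
    assert(ns && name);
    if (ns->ndevs == TOY_MAX_DEVS)
        return -1;
    ns->devs[ns->ndevs].name = name;
    ns->devs[ns->ndevs].ifindex = ++ns->counter;
    ns->ndevs++;
    return ns->counter;
}

/* Look up a device's index by name; 0 means "no such device". */
static int toy_index_of(const struct toy_ns *ns, const char *name)
{
    for (int i = 0; i < ns->ndevs; i++)
        if (strcmp(ns->devs[i].name, name) == 0)
            return ns->devs[i].ifindex;
    return 0;
}
```

If one namespace registers `lo`, `sit0`, `gre0` in that order and another registers `lo`, `gre0`, `sit0`, then `lo` gets index 1 in both, but `sit0` ends up with index 2 in the first and 3 in the second. A restore that relies on automatic assignment therefore cannot reproduce the original indices unless the registration order happens to match.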

> Unless you have done the work to root out the last of dependencies on
> ifindex being globally unique I think you will run into some operational
> problems.

I totally agree with that. Before doing this patch I revisited the ancient
attempt to make ifindices per netns and checked the issues Dave and you
discussed then -- I have looked through how the ifindices are used in the
networking code and found no places where the system-wide uniqueness is still
required. That's why I proposed this patch for inclusion. If you know the 
places I've missed, please let me know, I will work on it.

> Eric
> 
> .
> 

Thanks,
Pavel
Pavel Emelyanov July 31, 2012, 9:06 a.m. UTC | #6
> I'm not seeing anything obvious in the network stack with a quick skim,
> but before we start relying on the property that interface indices are
> not globally unique I expect a good hard look at the networking stack
> to see if any of those cases where there were problems still exist.

Just an idea -- is it worth moving the possibility to have ifindices intersect
under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience check
the code in real life?

> Eric
> 
> .
> 

Thanks,
Pavel
Eric W. Biederman July 31, 2012, 11:58 a.m. UTC | #7
Pavel Emelyanov <xemul@parallels.com> writes:

> On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>>> is required by the checkpoint-restore to correctly restore a net namespace
>>>> (i.e. -- a container). The question what to do with pre-created devices such
>>>> as lo or sit fbdev is open, but for manually created devices this can be 
>>>> solved by this patch.
>>>
>>> Have you walked through and found the locations where we still rely on
>>> ifindex being globally unique?
>>>
>>> Last time I was working in this area there were several places where
>>> things were indexed by just the interface index.
>> 
>> If it is really safe to make ifindex per network namespace at this
>> point you can make dev_new_index have a per network namespace base
>> counter, and that will fix your problems with the loopback device.
>
> No, it's not so, unfortunately :(
>
> First, let's imagine that on host A the loopback device got registered as
> the first device, but on host B for some reason some other device got registered
> first. In that case, after migration from A to B, the lo on B will have an index
> equal to 2. And there is no strict requirement that lo's per-net operations
> are registered first. Please correct me if I'm wrong.

Actually there is a hard requirement that the loopback device be the
last device in a network namespace to be unregistered.  We meet that
requirement by registering the loopback device first
"net/core/dev.c:net_dev_init()".

> Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
> fallback devices. Both get created on netns creation and obtain whatever
> ifindices are generated for them. Even if we make ifindex per netns, the chances
> that sit gets registered _strictly_ before ipgre are zero, since they are
> both modules.

True.  However those fallback devices should no longer be needed,
and even if they are I think you can delete and recreate them.

Making lo the particularly interesting case.

>> Unless you have done the work to root out the last of dependencies on
>> ifindex being globally unique I think you will run into some operational
>> problems.
>
> I totally agree with that. Before doing this patch I revisited the ancient
> attempt to make ifindices per netns and checked the issues Dave and you
> discussed then -- I have looked through how the ifindices are used in the
> networking code and found no places where the system-wide uniqueness is still
> required. That's why I proposed this patch for inclusion. If you know the 
> places I've missed, please let me know, I will work on it.

I took a quick look and I did not see anything.  I saw places under
net/sched/ that looked a bit suspicious, and of course there are places
where we use oif and iff in some of the routing code that make me wonder
a bit.  But if you have looked and if I have looked I think we are ok.

> Just an idea -- is it worth moving the possibility to have ifindices intersect
> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience check
> the code in real life?

I think the best testing we are going to get, diversity-wise, is to add
a per netns counter to dev_new_index when net-next opens up.

Having an ifindex that we can only set at netdevice creation time seems
reasonable.  

Eric
Pavel Emelyanov July 31, 2012, 1:30 p.m. UTC | #8
>> First, let's imagine that on host A the loopback device got registered as
>> the first device, but on host B for some reason some other device got registered
>> first. In that case, after migration from A to B, the lo on B will have an index
>> equal to 2. And there is no strict requirement that lo's per-net operations
>> are registered first. Please correct me if I'm wrong.
> 
> Actually there is a hard requirement that the loopback device be the
> last device in a network namespace to be unregistered.  We meet that
> requirement by registering the loopback device first
> "net/core/dev.c:net_dev_init()".

Hm... Indeed, and this is good news!

>> Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
>> fallback devices. Both get created on netns creation and obtain whatever
>> ifindices are generated for them. Even if we make ifindex per netns, the chances
>> that sit gets registered _strictly_ before ipgre are zero, since they are
>> both modules.
> 
> True.  However those fallback devices should no longer be needed,
> and even if they are I think you can delete and recreate them.

Good idea! I will look at that direction.

> Making lo the particularly interesting case.

Yup, provided we can manually recreate those auto-created devices this solves
the issue.

>> Just an idea -- is it worth moving the possibility to have ifindices intersect
>> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience check
>> the code in real life?
> 
> I think the best testing we are going to get, diversity-wise, is to add
> a per netns counter to dev_new_index when net-next opens up.
> 
> Having an ifindex that we can only set at netdevice creation time seems
> reasonable.  

OK, thank you, Eric.

> Eric
> .
> 

Thanks,
Pavel
Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ebaea1..5966e2f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5533,7 +5533,12 @@  int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	dev->ifindex = dev_new_index(net);
+	ret = -EBUSY;
+	if (!dev->ifindex)
+		dev->ifindex = dev_new_index(net);
+	else if (__dev_get_by_index(net, dev->ifindex))
+		goto err_uninit;
+
 	if (dev->iflink == -1)
 		dev->iflink = dev->ifindex;
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 334b930..76e19aa 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1801,8 +1801,6 @@  replay:
 			return -ENODEV;
 		}
 
-		if (ifm->ifi_index)
-			return -EOPNOTSUPP;
 		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
 			return -EOPNOTSUPP;
 
@@ -1828,10 +1826,14 @@  replay:
 			return PTR_ERR(dest_net);
 
 		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
-
-		if (IS_ERR(dev))
+		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
-		else if (ops->newlink)
+			goto out;
+		}
+
+		dev->ifindex = ifm->ifi_index;
+
+		if (ops->newlink)
 			err = ops->newlink(net, dev, tb, data);
 		else
 			err = register_netdevice(dev);