diff mbox series

[RFC,2/3] netdev: kernel-only IFF_HIDDEN netdevice

Message ID 1522573990-5242-3-git-send-email-si-wei.liu@oracle.com
State RFC, archived
Delegated to: David Miller
Headers show
Series Userspace compatible driver model for virtio_bypass | expand

Commit Message

Si-Wei Liu April 1, 2018, 9:13 a.m. UTC
Hidden netdevice is not visible to userspace such that
typical network utilites e.g. ip, ifconfig and et al,
cannot sense its existence or configure it. Internally
hidden netdev may associate with an upper level netdev
that userspace has access to. Although userspace cannot
manipulate the lower netdev directly, user may control
or configure the underlying hidden device through the
upper-level netdev. For identification purpose, the
kobject for hidden netdev still presents in the sysfs
hierarchy, however, no uevent message will be generated
when the sysfs entry is created, modified or destroyed.

For that end, a separate namescope needs to be carved
out for IFF_HIDDEN netdevs. As of now netdev name that
starts with colon i.e. ':' is invalid in userspace,
since socket ioctls such as SIOCGIFCONF use ':' as the
separator for ifname. The absence of namescope started
with ':' can rightly be used as the namescope for
the kernel-only IFF_HIDDEN netdevs.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 include/linux/netdevice.h   |  12 ++
 include/net/net_namespace.h |   2 +
 net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
 net/core/net_namespace.c    |   1 +
 4 files changed, 263 insertions(+), 33 deletions(-)

Comments

David Ahern April 1, 2018, 4:11 p.m. UTC | #1
On 4/1/18 3:13 AM, Si-Wei Liu wrote:
> Hidden netdevice is not visible to userspace such that
> typical network utilites e.g. ip, ifconfig and et al,
> cannot sense its existence or configure it. Internally
> hidden netdev may associate with an upper level netdev
> that userspace has access to. Although userspace cannot
> manipulate the lower netdev directly, user may control
> or configure the underlying hidden device through the
> upper-level netdev. For identification purpose, the
> kobject for hidden netdev still presents in the sysfs
> hierarchy, however, no uevent message will be generated
> when the sysfs entry is created, modified or destroyed.
> 
> For that end, a separate namescope needs to be carved
> out for IFF_HIDDEN netdevs. As of now netdev name that
> starts with colon i.e. ':' is invalid in userspace,
> since socket ioctls such as SIOCGIFCONF use ':' as the
> separator for ifname. The absence of namescope started
> with ':' can rightly be used as the namescope for
> the kernel-only IFF_HIDDEN netdevs.
> 
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---
>  include/linux/netdevice.h   |  12 ++
>  include/net/net_namespace.h |   2 +
>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>  net/core/net_namespace.c    |   1 +
>  4 files changed, 263 insertions(+), 33 deletions(-)
> 

There are other use cases that want to hide a device from userspace. I
would prefer a better solution than playing games with name prefixes and
one that includes an API for users to list all devices -- even ones
hidden by default.

https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00

https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed

Also, why are you suggesting that the device should still be visible via
/sysfs? That leads to inconsistent views of networking state - /sys
shows a device but a link dump does not.
Siwei Liu April 3, 2018, 7:40 a.m. UTC | #2
On Sun, Apr 1, 2018 at 9:11 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/net_namespace.c    |   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate your case in more details? Looking at the links
below I realize that the purpose of hiding devices in your case is
quite different from the our migration case. Particularly, I don't
like the part of elaborately allowing user to manipulate the link's
visibility - things fall apart easily while live migration is on
going. And, why doing additional check for invisible links in every
for_each_netdev() and its friends. This is effectively changing
semantics of internal APIs that exist for decades.

> I would prefer a better solution than playing games with name prefixes and

This part is intentionally left to be that way and I would anticipate
feedback before going too far. The idea in my mind was that I need a
completely new device namespace underneath (or namescope, which is !=
netns) for all netdevs: kernel-only IFF_HIDDEN network devices and
those not. The current namespace for devname is already exposed to
userspace and visible in the sysfs hierarchy, but for backwards
compatibility reasons it's necessary to keep the old udevd still able
to reference the entry of IFF_HIDDEN netdev under the /sys/net
directory. By using the ':' prefix it has the benefit of minimal
changes without introducing another namespace or the accompanied
complexity in managing these two separate namespaces.

Technically, I can create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which includes all netdev entries like
':eth0' and 'ens3', and which can be referenced from /sys/net. It
would make the /sys/net consistent with the view of userspace
utilities, but I am not even sure if that's an overkill, and would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API you would like to query for hidden devices?
rtnetlink? a private socket API? or something else?

For our case, the sysfs interface is what we need and is sufficient,
since udev is the main target we'd like to support to make the naming
of virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that way we get a good trade-off between
backwards compatibility and visibility. There's still kobject created
there right. Bottom line is that all kernel devices and its life-cycle
uevents are made invisible to userpace network utilities, and I think
it simply gets to the goal of not breaking existing apps while being
able to add new features.

-Siwei
David Ahern April 3, 2018, 2:57 p.m. UTC | #3
On 4/3/18 1:40 AM, Siwei Liu wrote:
>> There are other use cases that want to hide a device from userspace.
> 
> Can you elaborate your case in more details? Looking at the links
> below I realize that the purpose of hiding devices in your case is
> quite different from the our migration case. Particularly, I don't

some kernel drivers create "control" netdev's. They are not intended for
users to manipulate and doing so may actually break networking.

> like the part of elaborately allowing user to manipulate the link's
> visibility - things fall apart easily while live migration is on
> going. And, why doing additional check for invisible links in every
> for_each_netdev() and its friends. This is effectively changing
> semantics of internal APIs that exist for decades.

Read the patch again: there are 40 references to for_each_netdev and
that patch touches 2 of them -- link dumps via rtnetlink and link dumps
via ioctl.

>> one that includes an API for users to list all devices -- even ones
> 
> What kind of API you would like to query for hidden devices?
> rtnetlink? a private socket API? or something else?

There are existing, established APIs for dumping links. No new API is
needed. As suggested in the 2 patches I referenced the hidden /
invisibility cloak is an attribute of the device. When a link dump is
requested if the attribute is set, the device is skipped and not
included in the dump. However, if the user knows the device name the
GETLINK / SETLINK / DELLINK apis all work as normal. This allows the
device to be hidden from apps like snmpd, lldpd, etc, yet still usable.

> 
> For our case, the sysfs interface is what we need and is sufficient,
> since udev is the main target we'd like to support to make the naming
> of virtio_bypass consistent and compatible.

You are not hiding a device if it is visible in 1 API (/sysfs) and not
visible by another API (rtnetlink). That only creates confusion.

> 
>> hidden by default.
>>
>> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>> Also, why are you suggesting that the device should still be visible via
>> /sysfs? That leads to inconsistent views of networking state - /sys
>> shows a device but a link dump does not.
> 
> See my clarifications above. I don't mind kernel-only netdevs being
> visible via sysfs, as that way we get a good trade-off between
> backwards compatibility and visibility. There's still kobject created
> there right. Bottom line is that all kernel devices and its life-cycle
> uevents are made invisible to userpace network utilities, and I think
> it simply gets to the goal of not breaking existing apps while being
> able to add new features.
Jiri Pirko April 3, 2018, 3:42 p.m. UTC | #4
Sun, Apr 01, 2018 at 06:11:29PM CEST, dsahern@gmail.com wrote:
>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilites e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>> 
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>> 
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/net_namespace.c    |   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>> 
>
>There are other use cases that want to hide a device from userspace. I

What usecases do you have in mind?

>would prefer a better solution than playing games with name prefixes and
>one that includes an API for users to list all devices -- even ones
>hidden by default.

Netdevice hiding feels a bit scarry for me. This smells like a workaround
for userspace issues. Why can't the netdevice be visible always and
userspace would know what is it and what should it do with it?

Once we start with hiding, there are other things related to that which
appear. Like who can see what, levels of visibility etc...


>
>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
>Also, why are you suggesting that the device should still be visible via
>/sysfs? That leads to inconsistent views of networking state - /sys
>shows a device but a link dump does not.
Stephen Hemminger April 3, 2018, 5:35 p.m. UTC | #5
On Sun,  1 Apr 2018 05:13:09 -0400
Si-Wei Liu <si-wei.liu@oracle.com> wrote:

> Hidden netdevice is not visible to userspace such that
> typical network utilites e.g. ip, ifconfig and et al,
> cannot sense its existence or configure it. Internally
> hidden netdev may associate with an upper level netdev
> that userspace has access to. Although userspace cannot
> manipulate the lower netdev directly, user may control
> or configure the underlying hidden device through the
> upper-level netdev. For identification purpose, the
> kobject for hidden netdev still presents in the sysfs
> hierarchy, however, no uevent message will be generated
> when the sysfs entry is created, modified or destroyed.
> 
> For that end, a separate namescope needs to be carved
> out for IFF_HIDDEN netdevs. As of now netdev name that
> starts with colon i.e. ':' is invalid in userspace,
> since socket ioctls such as SIOCGIFCONF use ':' as the
> separator for ifname. The absence of namescope started
> with ':' can rightly be used as the namescope for
> the kernel-only IFF_HIDDEN netdevs.
> 
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---

I understand the use case. I proposed using . as a prefix before
but that ran into resistance. Using colon seems worse.

Rather than playing with names and all the issues that can cause,
why not make it an attribute flag of the device in netlink.
Siwei Liu April 3, 2018, 7:23 p.m. UTC | #6
On Tue, Apr 3, 2018 at 8:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Apr 01, 2018 at 06:11:29PM CEST, dsahern@gmail.com wrote:
>>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>>> Hidden netdevice is not visible to userspace such that
>>> typical network utilites e.g. ip, ifconfig and et al,
>>> cannot sense its existence or configure it. Internally
>>> hidden netdev may associate with an upper level netdev
>>> that userspace has access to. Although userspace cannot
>>> manipulate the lower netdev directly, user may control
>>> or configure the underlying hidden device through the
>>> upper-level netdev. For identification purpose, the
>>> kobject for hidden netdev still presents in the sysfs
>>> hierarchy, however, no uevent message will be generated
>>> when the sysfs entry is created, modified or destroyed.
>>>
>>> For that end, a separate namescope needs to be carved
>>> out for IFF_HIDDEN netdevs. As of now netdev name that
>>> starts with colon i.e. ':' is invalid in userspace,
>>> since socket ioctls such as SIOCGIFCONF use ':' as the
>>> separator for ifname. The absence of namescope started
>>> with ':' can rightly be used as the namescope for
>>> the kernel-only IFF_HIDDEN netdevs.
>>>
>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>> ---
>>>  include/linux/netdevice.h   |  12 ++
>>>  include/net/net_namespace.h |   2 +
>>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>>  net/core/net_namespace.c    |   1 +
>>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>>
>>
>>There are other use cases that want to hide a device from userspace. I
>
> What usecases do you have in mind?

Hope you're not staring at me and shouting. :)

I think we had discussed a lot, and if the common goal is to merge two
drivers rather than diverge, there's no better way than to hide the
lower devices from all existing userspace management utiliies
(NetworManager, cloud-init). This does not mean loss of visibility as
we can add new API or CLI later on to get those missing ones exposed
as needed, in a way existing userspace apps don't break while new apps
aware of the feature know where to get it. This requirement is
critical to cloud providers, which I wouldn't repeat enough why it
drove me crazy if not seeing this resolved.

Thanks,
-Siwei

>
>>would prefer a better solution than playing games with name prefixes and
>>one that includes an API for users to list all devices -- even ones
>>hidden by default.
>
> Netdevice hiding feels a bit scarry for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
>
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
>
>
>>
>>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>>Also, why are you suggesting that the device should still be visible via
>>/sysfs? That leads to inconsistent views of networking state - /sys
>>shows a device but a link dump does not.
David Ahern April 4, 2018, 1:04 a.m. UTC | #7
On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>
>> There are other use cases that want to hide a device from userspace. I
> 
> What usecases do you have in mind?

As mentioned in a previous response some kernel drivers create control
netdevs. Just as in this case users should not be mucking with it, and
S/W like lldpd should ignore it.

> 
>> would prefer a better solution than playing games with name prefixes and
>> one that includes an API for users to list all devices -- even ones
>> hidden by default.
> 
> Netdevice hiding feels a bit scarry for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
> 
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
> 

I would not advocate for any API that does not allow users to have full
introspection. The intent is to hide the netdev by default but have an
option to see it.
Jiri Pirko April 4, 2018, 6:19 a.m. UTC | #8
Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>> 
>> What usecases do you have in mind?
>
>As mentioned in a previous response some kernel drivers create control
>netdevs. Just as in this case users should not be mucking with it, and

virtio_net. Any other drivers?


>S/W like lldpd should ignore it.

It's just a matter of identification of the netdevs, so the user knows
what to do.


>
>> 
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>> 
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>> 
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>> 
>
>I would not advocate for any API that does not allow users to have full
>introspection. The intent is to hide the netdev by default but have an
>option to see it.

As an administrator, I want to see all by default. I think it is
reasonable requirements. Again, this awfully smells like a workaround...
Siwei Liu April 4, 2018, 7:36 a.m. UTC | #9
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.
>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

I'm fine with having a link dump API to inspect the hidden netdev. As
said, the name for hidden netdevs should be in a separate device
namespace, and we did not even get closer to what it should look like
as I don't want to make it just an option for ip link. Perhaps a new
set of sub-commands of, say, 'ip device'.

-Siwei
Siwei Liu April 4, 2018, 8:01 a.m. UTC | #10
On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc if factoring out virtio_bypass to a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just for dumping the link info i.e. perform
read-only operation on the hidden netdev, it's completely fine.
However, I am not a big fan of creating a weird mechanism to allow
user deliberately manipulate the visibility (hide/unhide) of a netdev
in any case at any time. This is subject to becoming a slippery slope
to work around any software issue that should get fixed in the right
place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just the name is misleading, I can get it renamed to something
like IFF_AUTO_MANAGED which might reflect its nature more properly.

Thanks,
-Siwei
Siwei Liu April 4, 2018, 8:28 a.m. UTC | #11
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.

I'm still not sure I understand your case: why you want to hide the
control netdev, as I assume those devices could choose either to
silently ignore the request, or fail loudly against user operations?
Is it creating issues already, or what problem you want to solve if
not making the netdev invisible. Why couldn't lldpd check some
specific flag and ignore the control netdevice (can you please give an
example of a concrete driver for control netdevice *in tree*).

And I'm completely lost why you want an API to make a hidden netdev
visible again for these control devices.

Thanks,
-Siwei


>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.
David Ahern April 4, 2018, 5:21 p.m. UTC | #12
On 4/4/18 1:36 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>> I would not advocate for any API that does not allow users to have full
>> introspection. The intent is to hide the netdev by default but have an
>> option to see it.
> 
> I'm fine with having a link dump API to inspect the hidden netdev. As
> said, the name for hidden netdevs should be in a separate device
> namespace, and we did not even get closer to what it should look like
> as I don't want to make it just an option for ip link. Perhaps a new
> set of sub-commands of, say, 'ip device'.

It is a netdev so there is no reason to have a separate ip command to
inspect it. 'ip link' is the right place.
David Miller April 4, 2018, 5:37 p.m. UTC | #13
From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:21:54 -0600

> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

I agree on this.

What I really don't understand still is the use case... really.

So there are control netdevs, what exactly is the problem with that?

Are we not exporting enough information for applications to handle
these devices sanely?  If so, then's let add that information.

We can set netdev->type to ETH_P_LINUXCONTROL or something like that.

Another alternative is to add an interface flag like IFF_CONTROL or
similar, and that probably is much nicer.

Hiding the devices means that we acknowledge that applications are
currently broken with control netdevs... and we want them to stay
broken!

That doesn't sound like a good plan to me.

So let's fix handling of control netdevs instead of hiding them.

Thanks.
David Ahern April 4, 2018, 5:37 p.m. UTC | #14
[ dropping virtio-dev@lists.oasis-open.org since it is a closed list and
I am tired of deleting bounces ]

On 4/4/18 2:28 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
> 
> I'm still not sure I understand your case: why you want to hide the
> control netdev, as I assume those devices could choose either to
> silently ignore the request, or fail loudly against user operations?
> Is it creating issues already, or what problem you want to solve if
> not making the netdev invisible. Why couldn't lldpd check some
> specific flag and ignore the control netdevice (can you please give an
> example of a concrete driver for control netdevice *in tree*).
> 
> And I'm completely lost why you want an API to make a hidden netdev
> visible again for these control devices.

Networking vendors have out of tree kernel modules. Those modules use a
netdev (call it a master netdev, a control netdev, cpu port, whatever)
to pull packets from the ASIC and deliver to virtual netdevices
representing physical ports. The master netdev should not be mucked with
by a user. It should be ignored by certain s/w with lldpd as just an
*example*.

The short of it is that you have your reasons for wanting to hide the
virtio bypass device; other users have other arguments for wanting a
similar capability.

--

From there I think you are confusing my intentions: I fundamentally do
not believe the kernel should be hiding anything from an admin. Not
showing data by default is completely different than not showing that
data at all.

The intention of my patch with the IFF_HIDDEN attribute is:
1. it is a netdev attribute

2. that attribute can be used by userpsace to indicate to the kernel I
want all or I want the default

3. that attribute can be controlled by an admin.

The patches go beyond my specific use case (preventing a user from
modifying a netdev it should not be touching) but to defining the
semantics of a generic capability which is what the kernel should have.
David Miller April 4, 2018, 5:42 p.m. UTC | #15
From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:37:52 -0600

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Two approaches:

1) Add an IFF_CONTROL and make userspace understand this.  It is probably
   long overdue.

2) Design the driver properly.  Have a non-netdev master device like
   mlxsw does, and control it using devlink or similar.  This is exactly
   how this stuff was meant to be architected.

> From there I think you are confusing my intentions: I fundamentally do
> not believe the kernel should be hiding anything from an admin. Not
> showing data by default is completely different than not showing that
> data at all.

It is the same David.

It measn we have no intention of fixing applications to properly know
what to do with and how to handle these devices.

If you hide these objects, we are basically giving up on fixing the
tools and or the drivers themselves to be architected differently
(see #2 above).

That really isn't acceptable in my opinion.

> The intention of my patch with the IFF_HIDDEN attribute is:
> 1. it is a netdev attribute
> 
> 2. that attribute can be used by userpsace to indicate to the kernel I
> want all or I want the default
> 
> 3. that attribute can be controlled by an admin.
> 
> The patches go beyond my specific use case (preventing a user from
> modifying a netdev it should not be touching) but to defining the
> semantics of a generic capability which is what the kernel should have.

"Teach, do not hide!" -Yoda
Stephen Hemminger April 4, 2018, 5:44 p.m. UTC | #16
On Wed, 4 Apr 2018 11:37:52 -0600
David Ahern <dsahern@gmail.com> wrote:

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Sorry, the linux kernel maintainers have a clear well defined attitude
about out of tree kernel modules...
Siwei Liu April 4, 2018, 6:02 p.m. UTC | #17
On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what is it and what should it do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

If you're still thinking the visibility is part of link attribute
rather than a separate namespace, I'd say we are trying to solve
essentially different problems, and I really don't understand your
proposal in solving that problem to be honest.

So, let's step back on studying your case if that's the right thing
and let's talk about concrete examples.

-Siwei
Jiri Pirko April 4, 2018, 6:20 p.m. UTC | #18
Wed, Apr 04, 2018 at 07:37:49PM CEST, davem@davemloft.net wrote:
>From: David Ahern <dsahern@gmail.com>
>Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
>I agree on this.
>
>What I really don't understand still is the use case... really.
>
>So there are control netdevs, what exactly is the problem with that?
>
>Are we not exporting enough information for applications to handle
>these devices sanely?  If so, then's let add that information.
>
>We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
>Another alternative is to add an interface flag like IFF_CONTROL or
>similar, and that probably is much nicer.
>
>Hiding the devices means that we acknowledge that applications are
>currently broken with control netdevs... and we want them to stay
>broken!
>
>That doesn't sound like a good plan to me.
>
>So let's fix handling of control netdevs instead of hiding them.

Exactly. Don't workaround userspace issues by kernel patches.
Andrew Lunn April 4, 2018, 8:08 p.m. UTC | #19
> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports.

Sounds a lot like DSA. Please ask the vendor to contribute the drivers
:-)

> The master netdev should not be mucked with by a user. It should be
> ignored by certain s/w with lldpd as just an *example*.

I have come across occasional problems with the master device in DSA.
But nothing too serious. Generally the switch will just toss frames it
gets which don't have the needed header, when they come direct from
the master device, rather than via the slave devices.

    Andrew
Siwei Liu April 6, 2018, 9:29 p.m. UTC | #20
(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separeate namespace
>> instead than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Atrribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security aspect of feature, but
rather a way to make old userspace unmodified  to work with a new
feature. And, we're going to add API to expose the netdev info for the
invisible IFF_AUTO_MANAGED links anyway. We don't need to make it
secure and all hidden under the dark to be honest.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei
Siwei Liu April 7, 2018, 2:32 a.m. UTC | #21
On Wed, Apr 4, 2018 at 10:37 AM, David Miller <davem@davemloft.net> wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
> I agree on this.

I'm completely fine of having an API for inspection purpose. The thing
is that we'd perhaps need to go for the namespace approach, for which
I think everyone seems to agree not to fiddle with the ":" prefix, but
rather have a new class of network subsystem under /sys/class thus a
separate device namespace e.g. /sys/class/net-kernel for those
auto-managed lower netdevs is needed.

And I assume everyone here understands the use case for live migration
(in the context of providing cloud service) is very different, and we
have to hide the netdevs. If not, I'm more than happy to clarify.

With that in mind, if having a new class of net-kernel namespace, we
can name the kernel device elaborately which is not neccessarily equal
to the device name exposed to userspace. For example, we can use
driver name as the prefix as opposed to "eth" or ":eth". And we don't
need to have auto-managed netdevs locked into the ":" prefix at all (I
intentionally left it out in the this RFC patch to ask for comments on
the namespace solution which is much cleaner). That said, an userpsace
named device through udev may call something like ens3 and
switch1-port2, but in the kernel-net namespace, it may look like
ixgbevf0 and mlxsw1p2.

So if we all agree introducing a new namespace is the rigth thing to
do, `ip link' will no longer serve the purpose of displaying the
information for kernel-net devnames for the sake of avoiding ambiguity
and namespace collision: it's entirely possible the ip link name could
collide with a kernel-net devname, it's become unclear which name of a
netdev object the command is expected to operate on. That's why I
thought showing the kernel-only netdevs using a separate subcommand
makes more sense.

Thoughts and comments? Please let me know.

Thanks,
-Siwei

>
> What I really don't understand still is the use case... really.
>
> So there are control netdevs, what exactly is the problem with that?
>
> Are we not exporting enough information for applications to handle
> these devices sanely?  If so, then's let add that information.
>
> We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
> Another alternative is to add an interface flag like IFF_CONTROL or
> similar, and that probably is much nicer.
>
> Hiding the devices means that we acknowledge that applications are
> currently broken with control netdevs... and we want them to stay
> broken!
>
> That doesn't sound like a good plan to me.
>
> So let's fix handling of control netdevs instead of hiding them.
>
> Thanks.
Andrew Lunn April 7, 2018, 3:19 a.m. UTC | #22
Hi Siwei

> I think everyone seems to agree not to fiddle with the ":" prefix, but
> rather have a new class of network subsystem under /sys/class thus a
> separate device namespace e.g. /sys/class/net-kernel for those
> auto-managed lower netdevs is needed.
 
How do you get a device into this new class? I don't know the Linux
driver model too well, but to get a device out of one class and into
another, i think you need to device_del(dev). modify dev->class and
then device_add(dev). However, device_add() says you are not allowed
to do this.

And i don't even see how this helps. Are you also not going to call
list_netdevice()? Are you going to add some other list for these
devices in a different class?

   Andrew
David Miller April 8, 2018, 4:32 p.m. UTC | #23
From: Siwei Liu <loseweigh@gmail.com>
Date: Fri, 6 Apr 2018 19:32:05 -0700

> And I assume everyone here understands the use case for live
> migration (in the context of providing cloud service) is very
> different, and we have to hide the netdevs. If not, I'm more than
> happy to clarify.

I think you still need to clarify.

netdevs are netdevs.  If they have special attributes, mark them as
such and the tools base their actions upon that.

"Hiding", or changing classes, doesn't make any sense to me still.
Siwei Liu April 9, 2018, 10:07 p.m. UTC | #24
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referecing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of driver model core it's a
rather big hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd more than welcome to hear about.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. I think with the current RFC patch,
I've added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace
netdevs get registered will be added to two types of lists: the
userspace list for e.g. dev_list, and also the kernelspace list e.g.
dev_cmpl_list (I can rename it to something more accurate). The
IFF_AUTO_MANAGED device will be only added to kernelspace list e.g.
dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei


>
>    Andrew
Andrew Lunn April 9, 2018, 10:15 p.m. UTC | #25
> No, implementation wise I'd avoid changing the class on the fly. What
> I'm looking to is a means to add a secondary class or class aliasing
> mechanism for netdevs that allows mapping for a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace visible netdevs today will have both a kernel name and a
> userspace visible name, having one (/class/net) referecing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.

My gut feeling is this whole scheme will not fly. You really should be
talking to GregKH.

Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
A device can start out as a normal device, and will change to being
automatic later, when the user on top of it probes.

	Andrew
Siwei Liu April 9, 2018, 10:30 p.m. UTC | #26
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referecing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it out loudly I'd run it within netdev to
clarify the need for why not exposing the lower netdevs is critical
for cloud service providers in the face of introducing a new feature,
and we are not hiding anything but exposing it in a way that don't
break existing userspace applications while introducing feature is
possible with the limitation of keeping old userspace still.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
>         Andrew
Stephen Hemminger April 9, 2018, 11:03 p.m. UTC | #27
On Mon, 9 Apr 2018 15:30:42 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >> No, implementation wise I'd avoid changing the class on the fly. What
> >> I'm looking to is a means to add a secondary class or class aliasing
> >> mechanism for netdevs that allows mapping for a kernel device
> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> >> creating symlinks between these two namespaces as an analogy. All
> >> userspace visible netdevs today will have both a kernel name and a
> >> userspace visible name, having one (/class/net) referecing the other
> >> (/class/net-kernel) in its own namespace. The newly introduced
> >> IFF_AUTO_MANAGED device will have a kernel name only
> >> (/class/net-kernel). As a result, the existing applications using
> >> /class/net don't break, while we're adding the kernel namespace that
> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> >> at all.  
> >
> > My gut feeling is this whole scheme will not fly. You really should be
> > talking to GregKH.  
> 
> Will do. Before spreading it out loudly I'd run it within netdev to
> clarify the need for why not exposing the lower netdevs is critical
> for cloud service providers in the face of introducing a new feature,
> and we are not hiding anything but exposing it in a way that don't
> break existing userspace applications while introducing feature is
> possible with the limitation of keeping old userspace still.
> 
> >
> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> > A device can start out as a normal device, and will change to being
> > automatic later, when the user on top of it probes.  
> 
> Sure. In whatever form it's still a netdev, and changing the namespace
> should be more dynamic than changing the class.
> 
> Thanks a lot,
> -Siwei
> 
> >
> >         Andrew  

Also, remember for netdev's /sys is really a third class API.
The primary API's are netlink and ioctl. Also why not use existing
network namespaces rather than inventing a new abstraction?
Siwei Liu April 9, 2018, 11:31 p.m. UTC | #28
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referecing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> >         Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip through uevents for the lower netdevs, while also making new
udevd able to name the bypass_master interface by referencing the pci
slot information which is only present in the sysfs entry for
IFF_AUTO_MANAGED net device.

The problem of using network namespace is that, no sysfs entry will be
around for IFF_AUTO_MANAGED netdev if we isolate it out to a separate
netns, unless we generalize the scope for what netns is designed for
(isolation I mean). For auto-managed netdevs we don't neccessarily
wants strict isolation, but rather a way of sticking to 1-netdev model
for strict backward compatibility requiement of the old userspace,
while exposing the information in a way new userspace understands.

Thanks,
-Siwei
Siwei Liu April 10, 2018, 6:48 a.m. UTC | #29
On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
> From: Siwei Liu <loseweigh@gmail.com>
> Date: Fri, 6 Apr 2018 19:32:05 -0700
>
>> And I assume everyone here understands the use case for live
>> migration (in the context of providing cloud service) is very
>> different, and we have to hide the netdevs. If not, I'm more than
>> happy to clarify.
>
> I think you still need to clarify.

OK. The short answer is cloud users really want *transparent* live migration.

By being transparent it means they don't and shouldn't care about the
existence and the occurence of live migration, but they do if
userspace toolstack and libraries have to be updated or modified,
which means potential dependency brokeness of their applications. They
don't like any change to the userspace envinroment (existing apps
lift-and-shift, no recompilation, no re-packaging, no re-certification
needed), while no one barely cares about ABI or API compatibility in
the kernel level, as long as their applications don't break.

I agree the current bypass solution for SR-IOV live migration requires
guest cooperation. Though it doesn't mean guest *userspace*
cooperation. As a matter of fact, techinically it shouldn't invovle
userspace at all to get SR-IOV migration working. It's the kernel that
does the real work. If I understand the goal of this in-kernel
approach correctly, it was meant to save userspace from modification
or corresponding toolstack support, as those additional 2 interfaces
is more a side product of this approach, rather than being neccessary
for users to be aware of. All what the user needs to deal with is one
single interface, and that's what they care about. It's more a trouble
than help when they see 2 extra interfaces are present. Management
tools in the old distros don't recoginze them and try to bring up
those extra interfaces for its own. Various odd warnings start to spew
out, and there's a lot of caveats for the users to get around...

On the other hand, if we "teach" those cloud users to update the
userspace toolstack just for trading a feature they don't need, no one
is likely going to embrace the change. As such there's just no real
value of adopting this in-kernel bypass facility for any cloud service
provider. It does not look more appealing than just configure generic
bonding using its own set of daemons or scripts. But again, cloud
users don't welcome that facility. And basically it would get to
nearly the same set of problems if leaving userspace alone.

IMHO we're not hiding the devices, think it the way we're adding a
feature transparent to user. Those auto-managed slaves are ones users
don't care about much. And user is still able to see and configure the
lower netdevs if they really desires to do so. But generally the
target user for this feature won't need to know that. Why they care
how many interfaces a VM virtually has rather than how many interfaces
are actually _useable_ to them??

Thanks,
-Siwei


>
> netdevs are netdevs.  If they have special attributes, mark them as
> such and the tools base their actions upon that.
>
> "Hiding", or changing classes, doesn't make any sense to me still.
Siwei Liu April 18, 2018, 12:26 a.m. UTC | #30
I ran this with a few folks offline and gathered some good feedbacks
that I'd like to share thus revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If it's
because this use case is not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementation) to do so to benefit any cloud service providers.

If it's just the implementation of hiding netdev itself needs to be
improved, such as implementing it as attribute flag or adding linkdump
API, that's completely fine and we can look into that. However, the
specific issue needs to be undestood beforehand is to make transparent
SR-IOV to be able to take over the name (so inherit all the configs)
from the lower netdev, which needs some games with uevents and name
space reservation. So far I don't think it's been well discussed.

One thing in particular I'd like to point out is that the 3-netdev
model currently missed to address the core problem of live migration:
migration of hardware specific feature/state, for e.g. ethtool configs
and hardware offloading states. Only general network state (IP
address, gateway, for eg.) associated with the bypass interface can be
migrated. As a follow-up work, bypass driver can/should be enhanced to
save and apply those hardware specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware specific configurations go through the central
bypass driver, such that hardware configurations can be replayed when
new VF or passthrough gets plugged back in. Although that
corresponding function hasn't been implemented today, I'd like to
refresh everyone's mind that is the core problem any live migration
proposal should have addressed.

If it would make things more clear to defer netdev hiding until all
functionalities regarding centralizing and replay are implemented,
we'd take advices like that and move on to implementing those features
as follow-up patches. Once all needed features get done, we'd resume
the work for hiding lower netdev at that point. Think it would be the
best to make everyone understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>> From: Siwei Liu <loseweigh@gmail.com>
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency brokeness of their applications. They
> don't like any change to the userspace envinroment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while no one barely cares about ABI or API compatibility in
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation. Though it doesn't mean guest *userspace*
> cooperation. As a matter of fact, techinically it shouldn't invovle
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those additional 2 interfaces
> is more a side product of this approach, rather than being neccessary
> for users to be aware of. All what the user needs to deal with is one
> single interface, and that's what they care about. It's more a trouble
> than help when they see 2 extra interfaces are present. Management
> tools in the old distros don't recoginze them and try to bring up
> those extra interfaces for its own. Various odd warnings start to spew
> out, and there's a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just for trading a feature they don't need, no one
> is likely going to embrace the change. As such there's just no real
> value of adopting this in-kernel bypass facility for any cloud service
> provider. It does not look more appealing than just configure generic
> bonding using its own set of daemons or scripts. But again, cloud
> users don't welcome that facility. And basically it would get to
> nearly the same set of problems if leaving userspace alone.
>
> IMHO we're not hiding the devices, think it the way we're adding a
> feature transparent to user. Those auto-managed slaves are ones users
> don't care about much. And user is still able to see and configure the
> lower netdevs if they really desires to do so. But generally the
> target user for this feature won't need to know that. Why they care
> how many interfaces a VM virtually has rather than how many interfaces
> are actually _useable_ to them??
>
> Thanks,
> -Siwei
>
>
>>
>> netdevs are netdevs.  If they have special attributes, mark them as
>> such and the tools base their actions upon that.
>>
>> "Hiding", or changing classes, doesn't make any sense to me still.
Samudrala, Sridhar April 18, 2018, 11:33 p.m. UTC | #31
On 4/17/2018 5:26 PM, Siwei Liu wrote:
> I ran this with a few folks offline and gathered some good feedbacks
> that I'd like to share thus revive the discussion.
>
> First of all, as illustrated in the reply below, cloud service
> providers require transparent live migration. Specifically, the main
> target of our case is to support SR-IOV live migration via kernel
> upgrade while keeping the userspace of old distros unmodified. If it's
> because this use case is not appealing enough for the mainline to
> adopt, I will shut up and not continue discussing, although
> technically it's entirely possible (and there's precedent in other
> implementation) to do so to benefit any cloud service providers.
>
> If it's just the implementation of hiding netdev itself needs to be
> improved, such as implementing it as attribute flag or adding linkdump
> API, that's completely fine and we can look into that. However, the
> specific issue needs to be undestood beforehand is to make transparent
> SR-IOV to be able to take over the name (so inherit all the configs)
> from the lower netdev, which needs some games with uevents and name
> space reservation. So far I don't think it's been well discussed.
>
> One thing in particular I'd like to point out is that the 3-netdev
> model currently missed to address the core problem of live migration:
> migration of hardware specific feature/state, for e.g. ethtool configs
> and hardware offloading states. Only general network state (IP
> address, gateway, for eg.) associated with the bypass interface can be
> migrated. As a follow-up work, bypass driver can/should be enhanced to
> save and apply those hardware specific configs before or after
> migration as needed. The transparent 1-netdev model being proposed as
> part of this patch series will be able to solve that problem naturally
> by making all hardware specific configurations go through the central
> bypass driver, such that hardware configurations can be replayed when
> new VF or passthrough gets plugged back in. Although that
> corresponding function hasn't been implemented today, I'd like to
> refresh everyone's mind that is the core problem any live migration
> proposal should have addressed.
>
> If it would make things more clear to defer netdev hiding until all
> functionalities regarding centralizing and replay are implemented,
> we'd take advices like that and move on to implementing those features
> as follow-up patches. Once all needed features get done, we'd resume
> the work for hiding lower netdev at that point. Think it would be the
> best to make everyone understand the big picture in advance before
> going too far.

I think we should get the 3-netdev model integrated and add any additional
ndo_ops/ethool ops that we would like to support/migrate before looking into
hiding the lower netdevs.


>
> Thanks, comments welcome.
>
> -Siwei
>
>
> On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Siwei Liu <loseweigh@gmail.com>
>>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>>
>>>> And I assume everyone here understands the use case for live
>>>> migration (in the context of providing cloud service) is very
>>>> different, and we have to hide the netdevs. If not, I'm more than
>>>> happy to clarify.
>>> I think you still need to clarify.
>> OK. The short answer is cloud users really want *transparent* live migration.
>>
>> By being transparent it means they don't and shouldn't care about the
>> existence and the occurence of live migration, but they do if
>> userspace toolstack and libraries have to be updated or modified,
>> which means potential dependency brokeness of their applications. They
>> don't like any change to the userspace envinroment (existing apps
>> lift-and-shift, no recompilation, no re-packaging, no re-certification
>> needed), while no one barely cares about ABI or API compatibility in
>> the kernel level, as long as their applications don't break.
>>
>> I agree the current bypass solution for SR-IOV live migration requires
>> guest cooperation. Though it doesn't mean guest *userspace*
>> cooperation. As a matter of fact, techinically it shouldn't invovle
>> userspace at all to get SR-IOV migration working. It's the kernel that
>> does the real work. If I understand the goal of this in-kernel
>> approach correctly, it was meant to save userspace from modification
>> or corresponding toolstack support, as those additional 2 interfaces
>> is more a side product of this approach, rather than being neccessary
>> for users to be aware of. All what the user needs to deal with is one
>> single interface, and that's what they care about. It's more a trouble
>> than help when they see 2 extra interfaces are present. Management
>> tools in the old distros don't recoginze them and try to bring up
>> those extra interfaces for its own. Various odd warnings start to spew
>> out, and there's a lot of caveats for the users to get around...
>>
>> On the other hand, if we "teach" those cloud users to update the
>> userspace toolstack just for trading a feature they don't need, no one
>> is likely going to embrace the change. As such there's just no real
>> value of adopting this in-kernel bypass facility for any cloud service
>> provider. It does not look more appealing than just configure generic
>> bonding using its own set of daemons or scripts. But again, cloud
>> users don't welcome that facility. And basically it would get to
>> nearly the same set of problems if leaving userspace alone.
>>
>> IMHO we're not hiding the devices, think it the way we're adding a
>> feature transparent to user. Those auto-managed slaves are ones users
>> don't care about much. And user is still able to see and configure the
>> lower netdevs if they really desires to do so. But generally the
>> target user for this feature won't need to know that. Why they care
>> how many interfaces a VM virtually has rather than how many interfaces
>> are actually _useable_ to them??
>>
>> Thanks,
>> -Siwei
>>
>>
>>> netdevs are netdevs.  If they have special attributes, mark them as
>>> such and the tools base their actions upon that.
>>>
>>> "Hiding", or changing classes, doesn't make any sense to me still.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
Michael S. Tsirkin April 19, 2018, 4:41 a.m. UTC | #32
On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > I ran this with a few folks offline and gathered some good feedbacks
> > that I'd like to share thus revive the discussion.
> > 
> > First of all, as illustrated in the reply below, cloud service
> > providers require transparent live migration. Specifically, the main
> > target of our case is to support SR-IOV live migration via kernel
> > upgrade while keeping the userspace of old distros unmodified. If it's
> > because this use case is not appealing enough for the mainline to
> > adopt, I will shut up and not continue discussing, although
> > technically it's entirely possible (and there's precedent in other
> > implementation) to do so to benefit any cloud service providers.
> > 
> > If it's just the implementation of hiding netdev itself needs to be
> > improved, such as implementing it as attribute flag or adding linkdump
> > API, that's completely fine and we can look into that. However, the
> > specific issue needs to be undestood beforehand is to make transparent
> > SR-IOV to be able to take over the name (so inherit all the configs)
> > from the lower netdev, which needs some games with uevents and name
> > space reservation. So far I don't think it's been well discussed.
> > 
> > One thing in particular I'd like to point out is that the 3-netdev
> > model currently missed to address the core problem of live migration:
> > migration of hardware specific feature/state, for e.g. ethtool configs
> > and hardware offloading states. Only general network state (IP
> > address, gateway, for eg.) associated with the bypass interface can be
> > migrated. As a follow-up work, bypass driver can/should be enhanced to
> > save and apply those hardware specific configs before or after
> > migration as needed. The transparent 1-netdev model being proposed as
> > part of this patch series will be able to solve that problem naturally
> > by making all hardware specific configurations go through the central
> > bypass driver, such that hardware configurations can be replayed when
> > new VF or passthrough gets plugged back in. Although that
> > corresponding function hasn't been implemented today, I'd like to
> > refresh everyone's mind that is the core problem any live migration
> > proposal should have addressed.
> > 
> > If it would make things more clear to defer netdev hiding until all
> > functionalities regarding centralizing and replay are implemented,
> > we'd take advices like that and move on to implementing those features
> > as follow-up patches. Once all needed features get done, we'd resume
> > the work for hiding lower netdev at that point. Think it would be the
> > best to make everyone understand the big picture in advance before
> > going too far.
> 
> I think we should get the 3-netdev model integrated and add any additional
> ndo_ops/ethool ops that we would like to support/migrate before looking into
> hiding the lower netdevs.

Once they are exposed, I don't think we'll be able to hide them -
they will be a kernel ABI.

Do you think everyone needs to hide the SRIOV device?
Or that only some users need this?
Samudrala, Sridhar April 19, 2018, 5 a.m. UTC | #33
On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>> I ran this with a few folks offline and gathered some good feedbacks
>>> that I'd like to share thus revive the discussion.
>>>
>>> First of all, as illustrated in the reply below, cloud service
>>> providers require transparent live migration. Specifically, the main
>>> target of our case is to support SR-IOV live migration via kernel
>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>> because this use case is not appealing enough for the mainline to
>>> adopt, I will shut up and not continue discussing, although
>>> technically it's entirely possible (and there's precedent in other
>>> implementation) to do so to benefit any cloud service providers.
>>>
>>> If it's just the implementation of hiding netdev itself needs to be
>>> improved, such as implementing it as attribute flag or adding linkdump
>>> API, that's completely fine and we can look into that. However, the
>>> specific issue needs to be undestood beforehand is to make transparent
>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>> from the lower netdev, which needs some games with uevents and name
>>> space reservation. So far I don't think it's been well discussed.
>>>
>>> One thing in particular I'd like to point out is that the 3-netdev
>>> model currently missed to address the core problem of live migration:
>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>> and hardware offloading states. Only general network state (IP
>>> address, gateway, for eg.) associated with the bypass interface can be
>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>> save and apply those hardware specific configs before or after
>>> migration as needed. The transparent 1-netdev model being proposed as
>>> part of this patch series will be able to solve that problem naturally
>>> by making all hardware specific configurations go through the central
>>> bypass driver, such that hardware configurations can be replayed when
>>> new VF or passthrough gets plugged back in. Although that
>>> corresponding function hasn't been implemented today, I'd like to
>>> refresh everyone's mind that is the core problem any live migration
>>> proposal should have addressed.
>>>
>>> If it would make things more clear to defer netdev hiding until all
>>> functionalities regarding centralizing and replay are implemented,
>>> we'd take advices like that and move on to implementing those features
>>> as follow-up patches. Once all needed features get done, we'd resume
>>> the work for hiding lower netdev at that point. Think it would be the
>>> best to make everyone understand the big picture in advance before
>>> going too far.
>> I think we should get the 3-netdev model integrated and add any additional
>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>> hiding the lower netdevs.
> Once they are exposed, I don't think we'll be able to hide them -
> they will be a kernel ABI.
>
> Do you think everyone needs to hide the SRIOV device?
> Or that only some users need this?

Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
think it is a hard requirement. And also,  as we don't yet have a consensus on how to hide
the lower netdevs, we could make it as another feature bit to hide lower netdevs once
we have an acceptable solution.
Michael S. Tsirkin April 19, 2018, 5:07 a.m. UTC | #34
On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> > On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> > > On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > > > I ran this with a few folks offline and gathered some good feedbacks
> > > > that I'd like to share thus revive the discussion.
> > > > 
> > > > First of all, as illustrated in the reply below, cloud service
> > > > providers require transparent live migration. Specifically, the main
> > > > target of our case is to support SR-IOV live migration via kernel
> > > > upgrade while keeping the userspace of old distros unmodified. If it's
> > > > because this use case is not appealing enough for the mainline to
> > > > adopt, I will shut up and not continue discussing, although
> > > > technically it's entirely possible (and there's precedent in other
> > > > implementation) to do so to benefit any cloud service providers.
> > > > 
> > > > If it's just the implementation of hiding netdev itself needs to be
> > > > improved, such as implementing it as attribute flag or adding linkdump
> > > > API, that's completely fine and we can look into that. However, the
> > > > specific issue needs to be undestood beforehand is to make transparent
> > > > SR-IOV to be able to take over the name (so inherit all the configs)
> > > > from the lower netdev, which needs some games with uevents and name
> > > > space reservation. So far I don't think it's been well discussed.
> > > > 
> > > > One thing in particular I'd like to point out is that the 3-netdev
> > > > model currently missed to address the core problem of live migration:
> > > > migration of hardware specific feature/state, for e.g. ethtool configs
> > > > and hardware offloading states. Only general network state (IP
> > > > address, gateway, for eg.) associated with the bypass interface can be
> > > > migrated. As a follow-up work, bypass driver can/should be enhanced to
> > > > save and apply those hardware specific configs before or after
> > > > migration as needed. The transparent 1-netdev model being proposed as
> > > > part of this patch series will be able to solve that problem naturally
> > > > by making all hardware specific configurations go through the central
> > > > bypass driver, such that hardware configurations can be replayed when
> > > > new VF or passthrough gets plugged back in. Although that
> > > > corresponding function hasn't been implemented today, I'd like to
> > > > refresh everyone's mind that is the core problem any live migration
> > > > proposal should have addressed.
> > > > 
> > > > If it would make things more clear to defer netdev hiding until all
> > > > functionalities regarding centralizing and replay are implemented,
> > > > we'd take advices like that and move on to implementing those features
> > > > as follow-up patches. Once all needed features get done, we'd resume
> > > > the work for hiding lower netdev at that point. Think it would be the
> > > > best to make everyone understand the big picture in advance before
> > > > going too far.
> > > I think we should get the 3-netdev model integrated and add any additional
> > > ndo_ops/ethool ops that we would like to support/migrate before looking into
> > > hiding the lower netdevs.
> > Once they are exposed, I don't think we'll be able to hide them -
> > they will be a kernel ABI.
> > 
> > Do you think everyone needs to hide the SRIOV device?
> > Or that only some users need this?
> 
> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
> think it is a hard requirement.

OK, fine.

> And also,  as we don't yet have a consensus on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower netdevs once
> we have an acceptable solution.

Guest/host interface isn't more flexible than the userspace/kernel
interface.  The feature bit you propose would say what exactly?
Hypervisor has no idea what guest kernel shows guest userspace.
Note that the backup flag doesn't tell guest kernel what to do,
it just tells guest that there is or will be a faster main device
connected to the same backend, so the backup should only be used
when main device is not present.
Samudrala, Sridhar April 19, 2018, 6:10 a.m. UTC | #35
On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>> that I'd like to share thus revive the discussion.
>>>>>
>>>>> First of all, as illustrated in the reply below, cloud service
>>>>> providers require transparent live migration. Specifically, the main
>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>> because this use case is not appealing enough for the mainline to
>>>>> adopt, I will shut up and not continue discussing, although
>>>>> technically it's entirely possible (and there's precedent in other
>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>
>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>> API, that's completely fine and we can look into that. However, the
>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>> from the lower netdev, which needs some games with uevents and name
>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>
>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>> model currently missed to address the core problem of live migration:
>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>> and hardware offloading states. Only general network state (IP
>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>> save and apply those hardware specific configs before or after
>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>> part of this patch series will be able to solve that problem naturally
>>>>> by making all hardware specific configurations go through the central
>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>> new VF or passthrough gets plugged back in. Although that
>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>> refresh everyone's mind that is the core problem any live migration
>>>>> proposal should have addressed.
>>>>>
>>>>> If it would make things more clear to defer netdev hiding until all
>>>>> functionalities regarding centralizing and replay are implemented,
>>>>> we'd take advices like that and move on to implementing those features
>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>> best to make everyone understand the big picture in advance before
>>>>> going too far.
>>>> I think we should get the 3-netdev model integrated and add any additional
>>>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>>>> hiding the lower netdevs.
>>> Once they are exposed, I don't think we'll be able to hide them -
>>> they will be a kernel ABI.
>>>
>>> Do you think everyone needs to hide the SRIOV device?
>>> Or that only some users need this?
>> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
>> think it is a hard requirement.
> OK, fine.
>
>> And also,  as we don't yet have a consensus on how to hide
>> the lower netdevs, we could make it as another feature bit to hide lower netdevs once
>> we have an acceptable solution.
> Guest/host interface isn't more flexible than the userspace/kernel
> interface.  The feature bit you propose would say what exactly?
> Hypervisor has no idea what guest kernel shows guest userspace.
> Note that the backup flag doesn't tell guest kernel what to do,
> it just tells guest that there is or will be a faster main device
> connected to the same backend, so the backup should only be used
> when main device is not present.

The current bypass module supports 3-netdev and 2-netdev models via 2 sets of interfaces
bypass_master_create/destroy and bypass_master_register/unregister.  So theoretically
we can support the 2 models via 2 different feature bits. BACKUP and BACKUP_2_NETDEV.

Similarly if we can figure out a way to hide both the netdevs, we can add BACKUP_1_NETDEV
feature bit and update the bypass module to provide another set of interfaces that can
be used by virtio_net to support this model.

Now that we are leaning towards 'standby' as the name for the lower virtio-net, should we
change the feature bit name also to VIRTIO_NET_F_STANDBY?

>
Siwei Liu April 19, 2018, 6:31 a.m. UTC | #36
On Wed, Apr 18, 2018 at 10:00 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>
>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>> that I'd like to share thus revive the discussion.
>>>>
>>>> First of all, as illustrated in the reply below, cloud service
>>>> providers require transparent live migration. Specifically, the main
>>>> target of our case is to support SR-IOV live migration via kernel
>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>> because this use case is not appealing enough for the mainline to
>>>> adopt, I will shut up and not continue discussing, although
>>>> technically it's entirely possible (and there's precedent in other
>>>> implementation) to do so to benefit any cloud service providers.
>>>>
>>>> If it's just the implementation of hiding netdev itself needs to be
>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>> API, that's completely fine and we can look into that. However, the
>>>> specific issue needs to be undestood beforehand is to make transparent
>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>> from the lower netdev, which needs some games with uevents and name
>>>> space reservation. So far I don't think it's been well discussed.
>>>>
>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>> model currently missed to address the core problem of live migration:
>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>> and hardware offloading states. Only general network state (IP
>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>> save and apply those hardware specific configs before or after
>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>> part of this patch series will be able to solve that problem naturally
>>>> by making all hardware specific configurations go through the central
>>>> bypass driver, such that hardware configurations can be replayed when
>>>> new VF or passthrough gets plugged back in. Although that
>>>> corresponding function hasn't been implemented today, I'd like to
>>>> refresh everyone's mind that is the core problem any live migration
>>>> proposal should have addressed.
>>>>
>>>> If it would make things more clear to defer netdev hiding until all
>>>> functionalities regarding centralizing and replay are implemented,
>>>> we'd take advices like that and move on to implementing those features
>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>> the work for hiding lower netdev at that point. Think it would be the
>>>> best to make everyone understand the big picture in advance before
>>>> going too far.
>>>
>>> I think we should get the 3-netdev model integrated and add any
>>> additional
>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>> into
>>> hiding the lower netdevs.
>>
>> Once they are exposed, I don't think we'll be able to hide them -
>> they will be a kernel ABI.
>>
>> Do you think everyone needs to hide the SRIOV device?
>> Or that only some users need this?
>
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV
> device. So i don't
> think it is a hard requirement. And also,  as we don't yet have a consensus
This is a vague point as Hyper-V is mostly Windows oriented: the
target users don't change adapter settings in device manager much as
it's hidden too deep already. Actually it does not address the general
case for SR-IOV live migration but just a subset, why are we making
such comparison?

Note it's always the hard requirement for live migration that *all
states* should be migrated no matter what the implementation it is
going to be. The current 3-netdev model is remote to be useful for
real world scenario and it has no advantage compared to using
userspace generic bonding.

-Siwei

> on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower
> netdevs once
> we have an acceptable solution.
>
Siwei Liu April 19, 2018, 6:43 a.m. UTC | #37
On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>>> that I'd like to share thus revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently missed to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advices like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional
>>>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>>>> into
>>>>> hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V is currently supporting live migration without hiding the SR-IOV
>>> device. So i don't
>>> think it is a hard requirement.
>>
>> OK, fine.
>>
>>> And also,  as we don't yet have a consensus on how to hide
>>> the lower netdevs, we could make it as another feature bit to hide lower
>>> netdevs once
>>> we have an acceptable solution.
>>
>> Guest/host interface isn't more flexible than the userspace/kernel
>> interface.  The feature bit you propose would say what exactly?
>> Hypervisor has no idea what guest kernel shows guest userspace.
>> Note that the backup flag doesn't tell guest kernel what to do,
>> it just tells guest that there is or will be a faster main device
>> connected to the same backend, so the backup should only be used
>> when main device is not present.
>
>
> The current bypass module supports 3-netdev and 2-netdev models via 2 sets
> of interfaces
> bypass_master_create/destroy and bypass_master_register/unregister.  So
> theoretically
> we can support the 2 models via 2 different feature bits. BACKUP and
> BACKUP_2_NETDEV.

I'm still trying to understand the value of so many models to support.
If we all agree eventually the transparent 1-netdev model can address
the more general case while 2-netdev or 3-netdev is unable to, what's
the point for supporting these many features?

-Siwei
>
> Similarly if we can figure out a way to hide both the netdevs, we can add
> BACKUP_1_NETDEV
> feature bit and update the bypass module to provide another set of
> interfaces that can
> be used by virtio_net to support this model.
>
> Now that we are leaning towards 'standby' as the name for the lower
> virtio-net, should we
> change the feature bit name also to VIRTIO_NET_F_STANDBY?
>
>>
>
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1..1a70f3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1380,6 +1380,7 @@  struct net_device_ops {
  * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
+ * @IFF_HIDDEN: device is not visible to userspace
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1410,6 +1411,7 @@  enum netdev_priv_flags {
 	IFF_RXFH_CONFIGURED		= 1<<25,
 	IFF_PHONY_HEADROOM		= 1<<26,
 	IFF_MACSEC			= 1<<27,
+	IFF_HIDDEN			= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1439,6 +1441,7 @@  enum netdev_priv_flags {
 #define IFF_TEAM			IFF_TEAM
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
+#define IFF_HIDDEN			IFF_HIDDEN
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1659,6 +1662,7 @@  enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
+	struct hlist_node	name_cmpl_hlist;
 	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
@@ -1680,6 +1684,7 @@  struct net_device {
 	unsigned long		state;
 
 	struct list_head	dev_list;
+	struct list_head	dev_cmpl_list;
 	struct list_head	napi_list;
 	struct list_head	unreg_list;
 	struct list_head	close_list;
@@ -2326,6 +2331,7 @@  struct netdev_lag_lower_state_info {
 #define NETDEV_UDP_TUNNEL_PUSH_INFO	0x001C
 #define NETDEV_UDP_TUNNEL_DROP_INFO	0x001D
 #define NETDEV_CHANGE_TX_QUEUE_LEN	0x001E
+#define NETDEV_PRE_GETNAME	0x001F
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
@@ -2393,6 +2399,8 @@  static inline void netdev_notifier_info_init(struct netdev_notifier_info *info,
 		for_each_netdev_rcu(&init_net, slave)	\
 			if (netdev_master_upper_dev_get_rcu(slave) == (bond))
 #define net_device_entry(lh)	list_entry(lh, struct net_device, dev_list)
+#define for_each_netdev_complete(net, d)		\
+		list_for_each_entry(d, &(net)->dev_cmpl_head, dev_cmpl_list)
 
 static inline struct net_device *next_net_device(struct net_device *dev)
 {
@@ -2462,6 +2470,10 @@  static inline void unregister_netdevice(struct net_device *dev)
 	unregister_netdevice_queue(dev, NULL);
 }
 
+void netdev_set_hidden(struct net_device *dev);
+int hide_netdevice(struct net_device *dev);
+void unhide_netdevice(struct net_device *dev);
+
 int netdev_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
 void netdev_freemem(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0490084..f9ce9b4 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -80,7 +80,9 @@  struct net {
 	struct sock		*genl_sock;
 
 	struct list_head 	dev_base_head;
+	struct list_head 	dev_cmpl_head;
 	struct hlist_head 	*dev_name_head;
+	struct hlist_head 	*dev_name_cmpl_head;
 	struct hlist_head	*dev_index_head;
 	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
 	int			ifindex;
diff --git a/net/core/dev.c b/net/core/dev.c
index 613fb40..a991b35 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -211,6 +211,13 @@  static inline struct hlist_head *dev_name_hash(struct net *net, const char *name
 	return &net->dev_name_head[hash_32(hash, NETDEV_HASHBITS)];
 }
 
+static inline struct hlist_head *dev_cname_hash(struct net *net, const char *name)
+{
+	unsigned int hash = full_name_hash(net, name, strnlen(name, IFNAMSIZ));
+
+	return &net->dev_name_cmpl_head[hash_32(hash, NETDEV_HASHBITS)];
+}
+
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 {
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
@@ -237,11 +244,19 @@  static void list_netdevice(struct net_device *dev)
 
 	ASSERT_RTNL();
 
+
 	write_lock_bh(&dev_base_lock);
-	list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
-	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
-	hlist_add_head_rcu(&dev->index_hlist,
-			   dev_index_hash(net, dev->ifindex));
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
+		hlist_add_head_rcu(&dev->name_hlist,
+				   dev_name_hash(net, dev->name));
+		hlist_add_head_rcu(&dev->index_hlist,
+				   dev_index_hash(net, dev->ifindex));
+	}
+	list_add_tail_rcu(&dev->dev_cmpl_list,
+			  &net->dev_cmpl_head);
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(net);
@@ -256,9 +271,13 @@  static void unlist_netdevice(struct net_device *dev)
 
 	/* Unlink dev from the device chain */
 	write_lock_bh(&dev_base_lock);
-	list_del_rcu(&dev->dev_list);
-	hlist_del_rcu(&dev->name_hlist);
-	hlist_del_rcu(&dev->index_hlist);
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_del_rcu(&dev->dev_list);
+		hlist_del_rcu(&dev->name_hlist);
+		hlist_del_rcu(&dev->index_hlist);
+	}
+	list_del_rcu(&dev->dev_cmpl_list);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(dev_net(dev));
@@ -736,11 +755,15 @@  int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 struct net_device *__dev_get_by_name(struct net *net, const char *name)
 {
 	struct net_device *dev;
-	struct hlist_head *head = dev_name_hash(net, name);
+	struct hlist_head *head = dev_cname_hash(net, name);
+	bool hidden_name = (*name == ':');
 
-	hlist_for_each_entry(dev, head, name_hlist)
+	hlist_for_each_entry(dev, head, name_cmpl_hlist) {
+		if (hidden_name && !(dev->priv_flags & IFF_HIDDEN))
+			continue;
 		if (!strncmp(dev->name, name, IFNAMSIZ))
 			return dev;
+	}
 
 	return NULL;
 }
@@ -1015,15 +1038,7 @@  struct net_device *__dev_get_by_flags(struct net *net, unsigned short if_flags,
 }
 EXPORT_SYMBOL(__dev_get_by_flags);
 
-/**
- *	dev_valid_name - check if name is okay for network device
- *	@name: name string
- *
- *	Network device names need to be valid file names to
- *	to allow sysfs to work.  We also disallow any kind of
- *	whitespace.
- */
-bool dev_valid_name(const char *name)
+static bool __dev_valid_name(const char *name, bool hidden)
 {
 	if (*name == '\0')
 		return false;
@@ -1033,12 +1048,27 @@  bool dev_valid_name(const char *name)
 		return false;
 
 	while (*name) {
-		if (*name == '/' || *name == ':' || isspace(*name))
+		if (*name == '/' || isspace(*name))
+			return false;
+		if (!hidden && *name == ':')
 			return false;
 		name++;
 	}
 	return true;
 }
+
+/**
+ *	dev_valid_name - check if name is okay for network device
+ *	@name: name string
+ *
+ *	Network device names need to be valid file names to
+ *	to allow sysfs to work.  We also disallow any kind of
+ *	whitespace.
+ */
+bool dev_valid_name(const char *name)
+{
+	return __dev_valid_name(name, false);
+}
 EXPORT_SYMBOL(dev_valid_name);
 
 /**
@@ -1064,9 +1094,6 @@  static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 	unsigned long *inuse;
 	struct net_device *d;
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	p = strchr(name, '%');
 	if (p) {
 		/*
@@ -1082,7 +1109,7 @@  static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 		if (!inuse)
 			return -ENOMEM;
 
-		for_each_netdev(net, d) {
+		for_each_netdev_complete(net, d) {
 			if (!sscanf(d->name, name, &i))
 				continue;
 			if (i < 0 || i >= max_netdevices)
@@ -1139,18 +1166,18 @@  static int dev_alloc_name_ns(struct net *net,
 
 int dev_alloc_name(struct net_device *dev, const char *name)
 {
+	if (!dev_valid_name(name))
+		return -EINVAL;
+
 	return dev_alloc_name_ns(dev_net(dev), dev, name);
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-int dev_get_valid_name(struct net *net, struct net_device *dev,
-		       const char *name)
+static int __dev_get_name(struct net *net, struct net_device *dev,
+			  const char *name)
 {
 	BUG_ON(!net);
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	if (strchr(name, '%'))
 		return dev_alloc_name_ns(net, dev, name);
 	else if (__dev_get_by_name(net, name))
@@ -1160,6 +1187,15 @@  int dev_get_valid_name(struct net *net, struct net_device *dev,
 
 	return 0;
 }
+
+int dev_get_valid_name(struct net *net, struct net_device *dev,
+		       const char *name)
+{
+	if (!__dev_valid_name(name, (dev->priv_flags & IFF_HIDDEN)))
+		return -EINVAL;
+
+	return __dev_get_name(net, dev, name);
+}
 EXPORT_SYMBOL(dev_get_valid_name);
 
 /**
@@ -1221,12 +1257,15 @@  int dev_change_name(struct net_device *dev, const char *newname)
 
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	synchronize_rcu();
 
 	write_lock_bh(&dev_base_lock);
 	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	ret = call_netdevice_notifiers(NETDEV_CHANGENAME, dev);
@@ -1594,7 +1633,7 @@  int register_netdevice_notifier(struct notifier_block *nb)
 	if (dev_boot_phase)
 		goto unlock;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
 			err = notifier_to_errno(err);
 			if (err)
@@ -1614,7 +1653,7 @@  int register_netdevice_notifier(struct notifier_block *nb)
 rollback:
 	last = dev;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev == last)
 				goto outroll;
 
@@ -1659,7 +1698,7 @@  int unregister_netdevice_notifier(struct notifier_block *nb)
 		goto unlock;
 
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev->flags & IFF_UP) {
 				call_netdevice_notifier(nb, NETDEV_GOING_DOWN,
 							dev);
@@ -7642,6 +7681,11 @@  int register_netdevice(struct net_device *dev)
 	spin_lock_init(&dev->addr_list_lock);
 	netdev_set_addr_lockdep_class(dev);
 
+	ret = call_netdevice_notifiers(NETDEV_PRE_GETNAME, dev);
+	ret = notifier_to_errno(ret);
+	if (ret)
+		goto out;
+
 	ret = dev_get_valid_name(net, dev, dev->name);
 	if (ret < 0)
 		goto out;
@@ -8461,6 +8505,166 @@  int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 }
 EXPORT_SYMBOL_GPL(dev_change_net_namespace);
 
+/**
+ *	netdev_set_hidden - indicate a hidden netdev before or at
+ *			    early point of driver registration
+ *	@dev: device
+ *
+ *	Callers must hold the rtnl semaphore, typically before or
+ *	at some early point (e.g in NETDEV_PRE_GETNAME notifier)
+ *	of driver registrationr, or it won't take effect to hide
+ *	the netdev post registration.
+ */
+void netdev_set_hidden(struct net_device *dev)
+{
+	dev->priv_flags |= IFF_HIDDEN;
+	strlcpy(dev->name, ":eth%d", IFNAMSIZ);
+}
+EXPORT_SYMBOL(netdev_set_hidden);
+
+/**
+ *	hide_netdevice - hide device from userspace's visibility
+ *	@dev: device
+ *
+ *	This function shuts down a device interface and removes it
+ *	from all userspace visible dev lists, and moves it to 
+ *	comprehensive dev lists containing both userspace-visible
+ *	and kernel-only devices. On success 0 is returned, on
+ *	a failure a netagive errno code is returned.
+ */
+int hide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+
+	err = 0;
+	/* Get out if there is nothing to do */
+	if (dev->priv_flags & IFF_HIDDEN)
+		goto out;
+
+	err = -EINVAL;
+	/* Ensure the device has been registrered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, ":eth%d");
+       	if (err < 0)
+		goto out;
+
+	/*
+	 * And now a mini version of register_netdevice unregister_netdevice.
+	 */
+
+	/* If device is running close it first. */
+	dev_close(dev);
+
+	/* And unlink it from device chain */
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	/* Notify protocols, that we are about to destroy
+	 * this device. They should clean all the things.
+	 *
+	 * Note that dev->reg_state stays at NETREG_REGISTERED.
+	 * This is wanted because this way 8021q and macvlan know
+	 * the device is just moving and can keep their slaves up.
+	 */
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
+
+	/*
+	 *	Flush the unicast and multicast chains
+	 */
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-removed uevent to the old namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
+	netdev_adjacent_del_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	dev->priv_flags |= IFF_HIDDEN;
+	list_netdevice(dev);
+
+	/* Notify protocols, that a new device appeared. */
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	synchronize_net();
+	err = 0;
+out:
+	rtnl_unlock();
+	return err;
+}
+EXPORT_SYMBOL(hide_netdevice);
+
+/**
+ *	unhide_netdevice - make a hidden device visible to userspace
+ *	@dev: device
+ *
+ *	This function moves a hidden device to userspace visible
+ *	interfaces. A %NETDEV_REGISTER message will be sent to
+ *	the netdev notifier chain.
+ */
+void unhide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+	/* Get out if there is nothing to do */
+	if (!(dev->priv_flags & IFF_HIDDEN))
+		goto out;
+
+	/* Ensure the device has been registrered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, "eth%d");
+	WARN_ON(err < 0);
+
+	/* If device is running close it first. */
+	dev_close(dev);
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-add uevent to the new namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
+	netdev_adjacent_add_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	/* Add the device back in the hashes */
+	dev->priv_flags &= ~IFF_HIDDEN;
+	list_netdevice(dev);
+
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
+	synchronize_net();
+out:
+	rtnl_unlock();
+}
+EXPORT_SYMBOL(unhide_netdevice);
+
 static int dev_cpu_dead(unsigned int oldcpu)
 {
 	struct sk_buff **list_skb;
@@ -8571,13 +8775,19 @@  static struct hlist_head * __net_init netdev_create_hash(void)
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
-	if (net != &init_net)
+	if (net != &init_net) {
 		INIT_LIST_HEAD(&net->dev_base_head);
+		INIT_LIST_HEAD(&net->dev_cmpl_head);
+	}
 
 	net->dev_name_head = netdev_create_hash();
 	if (net->dev_name_head == NULL)
 		goto err_name;
 
+	net->dev_name_cmpl_head = netdev_create_hash();
+	if (net->dev_name_cmpl_head == NULL)
+		goto err_cname;
+
 	net->dev_index_head = netdev_create_hash();
 	if (net->dev_index_head == NULL)
 		goto err_idx;
@@ -8585,6 +8795,8 @@  static int __net_init netdev_init(struct net *net)
 	return 0;
 
 err_idx:
+	kfree(net->dev_name_cmpl_head);
+err_cname:
 	kfree(net->dev_name_head);
 err_name:
 	return -ENOMEM;
@@ -8676,9 +8888,12 @@  void func(const struct net_device *dev, const char *fmt, ...)	\
 static void __net_exit netdev_exit(struct net *net)
 {
 	kfree(net->dev_name_head);
+	kfree(net->dev_name_cmpl_head);
 	kfree(net->dev_index_head);
-	if (net != &init_net)
+	if (net != &init_net) {
 		WARN_ON_ONCE(!list_empty(&net->dev_base_head));
+		WARN_ON_ONCE(!list_empty(&net->dev_cmpl_head));
+	}
 }
 
 static struct pernet_operations __net_initdata netdev_net_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be..1c399e9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -37,6 +37,7 @@ 
 struct net init_net = {
 	.count		= ATOMIC_INIT(1),
 	.dev_base_head	= LIST_HEAD_INIT(init_net.dev_base_head),
+	.dev_cmpl_head	= LIST_HEAD_INIT(init_net.dev_cmpl_head),
 };
 EXPORT_SYMBOL(init_net);