
PATCH: Network Device Naming mechanism and policy

Message ID 20091009210909.GA9836@auslistsprd01.us.dell.com
State RFC, archived
Delegated to: David Miller

Commit Message

Matt Domsch Oct. 9, 2009, 9:09 p.m. UTC
On Fri, Oct 09, 2009 at 09:00:01AM -0500, Narendra K wrote:
> On Fri, Oct 09, 2009 at 07:12:07PM +0530, K, Narendra wrote:
> > > example udev config:
> > > SUBSYSTEM=="net", SYMLINK+="net/by-mac/$sysfs{ifindex}.$sysfs{address}"
> > 
> > work as well.  But coupling the ifindex to the MAC address like this
> > doesn't work.  (In general, coupling any two unrelated attributes when
> > trying to do persistent names doesn't work.)
> > 
> Attaching the latest patch incorporating review comments.

Same patch, rebased to linux-next.

By creating character devices for every network device, we can use
udev to maintain alternate naming policies for devices, including
additional names for the same device, without interfering with the
name that the kernel assigns a device.

This is conditionalized on CONFIG_NET_CDEV.  If enabled (the default),
device nodes will automatically be created in /dev/netdev/ for each
network device.  (/dev/net/ is already populated by the tun device.)

These device nodes are not functional at the moment - open() returns
-ENOSYS.  Their only purpose is to provide userspace with a kernel
name to ifindex mapping, in a form that udev can easily manage.

Signed-off-by: Jordan Hargrave <Jordan_Hargrave@dell.com>
Signed-off-by: Narendra K <Narendra_K@dell.com>
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>

---
 include/linux/netdevice.h |    4 ++++
 net/Kconfig               |   10 ++++++++++
 net/core/Makefile         |    1 +
 net/core/cdev.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
 net/core/cdev.h           |   13 +++++++++++++
 net/core/dev.c            |   10 ++++++++++
 net/core/net-sysfs.c      |   13 +++++++++++++
 7 files changed, 93 insertions(+), 0 deletions(-)
 create mode 100644 net/core/cdev.c
 create mode 100644 net/core/cdev.h

Comments

stephen hemminger Oct. 10, 2009, 2:44 a.m. UTC | #1
On Fri, 9 Oct 2009 16:09:09 -0500
Matt Domsch <Matt_Domsch@dell.com> wrote:

> On Fri, Oct 09, 2009 at 09:00:01AM -0500, Narendra K wrote:
> > On Fri, Oct 09, 2009 at 07:12:07PM +0530, K, Narendra wrote:
> > > > example udev config:
> > > > SUBSYSTEM=="net", SYMLINK+="net/by-mac/$sysfs{ifindex}.$sysfs{address}"
> > > 
> > > work as well.  But coupling the ifindex to the MAC address like this
> > > doesn't work.  (In general, coupling any two unrelated attributes when
> > > trying to do persistent names doesn't work.)
> > > 
> > Attaching the latest patch incorporating review comments.
> 
> Same patch, rebased to linux-next.
> 
> By creating character devices for every network device, we can use
> udev to maintain alternate naming policies for devices, including
> additional names for the same device, without interfering with the
> name that the kernel assigns a device.
> 
> This is conditionalized on CONFIG_NET_CDEV.  If enabled (the default),
> device nodes will automatically be created in /dev/netdev/ for each
> network device.  (/dev/net/ is already populated by the tun device.)
> 
> These device nodes are not functional at the moment - open() returns
> -ENOSYS.  Their only purpose is to provide userspace with a kernel
> name to ifindex mapping, in a form that udev can easily manage.
> 
> Signed-off-by: Jordan Hargrave <Jordan_Hargrave@dell.com>
> Signed-off-by: Narendra K <Narendra_K@dell.com>
> Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>

Maybe I'm dense but can't see why having useless /dev/net/ symlinks
is a good interface choice. Perhaps you should explain the race between
PCI scan and udev in more detail, and why solving it in either of those
places won't work. As it stands you are proposing yet another wart to
the already complex set of network interface APIs which has implications
for security as well as increasing the number of possible bugs.
Matt Domsch Oct. 10, 2009, 4:40 a.m. UTC | #2
On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote:
> Maybe I'm dense but can't see why having a useless /dev/net/ symlinks
> is a good interface choice. Perhaps you should explain the race between
> PCI scan and udev in more detail, and why solving it in either of those
> places won't work. As it stands you are proposing yet another wart to
> the already complex set of network interface API's which has implications
> for security as well as increasing the number of possible bugs.

The fundamental challenge is that system administrators, particularly
those of server-class hardware with multiple network ports present
(some on the motherboard, some on add-in cards), have the
not-so-unreasonable expectation that there is a deterministic mapping
between those ports and the name one uses to address those ports.

The fundamental roadblock to this is that enumeration != naming,
except that it is for network devices, and we keep changing the
enumeration order.

Today, port naming is completely nondeterministic.  If you have but
one NIC, there are few chances to get the name wrong (it'll be eth0).
If you have >1 NIC, chances increase to get it wrong.

The complexity arises at multiple levels.

First, device driver load order.  In the 2.4 kernel days, and even
mostly early 2.6 kernel days, the order in which network drivers
loaded played a role in determining the name of the device.  Drivers
loaded first would get their devices named first.  If I have two types
of devices, say an e100-driven NIC and a tg3-driven NIC, I could
figure out that the names would be eth0=e100 and eth1=tg3 by setting
the load order in /etc/modules.conf (now modprobe.conf).  If I wanted
the other order, fine, just switch it around in modules.conf and
reboot.  OS installers, being the first running instance of Linux,
before modprobe.conf existed to set that ordering, had to have other
mechanisms to load drivers (often manually; when done programmatically,
such as in a kickstart or autoyast file, the order was still somewhat fixed).

With the advent of modaliases + udev, now modprobe.conf doesn't
contain this ordering anymore, and udev loads the drivers.  So while
it wasn't perfect, it was better than nothing, and that's gone now.

It gets even worse as, to speed up boot time, modprobes can be run in
parallel, and even within individual drivers, the NICs get initialized
(and named) in parallel.  Further confusing things, some devices need
firmware loaded into them before getting names assigned, which is done
from userspace, and they race.

Second, PCI device list order.  In the 2.4 kernel days, the PCI device
list was scanned "breadth-first" (for each bus; for each device; for
each function; do load...).  FWIW, Windows still does this.  It gives
BIOS, which assigns PCI bus numbers, a chance to put LOMs at a lower
bus number than add-in cards.  Module load order still mattered, but
at least if you had say 2 e1000 ports as LOMs, and 2 e1000 ports on
add-in cards, you pretty much knew the ordering would be eth0 as
lowest bdf on the motherboard, eth1 as next bdf on the motherboard,
and eth2 and 3 as the add-in cards in ascending slot order.

With the advent of PCI hot plug in the 2.5 kernel series, the
breadth-first ordering became depth-first (for each bus; for each
device; if the device is a bridge, scan the busses behind it).  This
caused NICs on bus 0 device 5, and bus 1 device 3, (eth0 and 1
respectively) to be enumerated differently due to a bridge from
bus 0 to bus 1 at 0:4.  My crude hack of pci=bfsort, with some dmi
strings to match and auto-enable, at least reverted this back to the
ordering the 2.4 kernel and Windows used.  Now we have to keep adding
systems to this DMI list (Dell has a number of systems on this list
today; HP has even more).  And it doesn't completely solve the
problem, just masks it.
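
As an illustration of the ordering change described above, here is a
minimal sketch in Python.  It is a toy model, not kernel code: the BUSES
table and the function names are made up for this example, and it only
encodes the topology from the paragraph (a bridge at 0:4 leading to bus 1,
a NIC at 0:5, a NIC at 1:3).

```python
from collections import deque

# Toy topology: bus number -> {device number: (kind, child bus or None)}.
# Bus 0 has a bridge at device 4 (to bus 1) and a NIC at device 5;
# bus 1 has a NIC at device 3.
BUSES = {
    0: {4: ("bridge", 1), 5: ("nic", None)},
    1: {3: ("nic", None)},
}


def breadth_first(buses, root=0):
    """Scan all devices on a bus before visiting busses behind its bridges."""
    found, queue = [], deque([root])
    while queue:
        bus = queue.popleft()
        for dev in sorted(buses[bus]):
            kind, child = buses[bus][dev]
            if kind == "nic":
                found.append((bus, dev))
            else:
                queue.append(child)  # visit the child bus later
    return found


def depth_first(buses, bus=0):
    """Descend behind each bridge as soon as it is encountered."""
    found = []
    for dev in sorted(buses[bus]):
        kind, child = buses[bus][dev]
        if kind == "nic":
            found.append((bus, dev))
        else:
            found.extend(depth_first(buses, child))  # descend immediately
    return found


def names(order):
    """Assign ethN names in discovery order, as enumeration-based naming does."""
    return {f"eth{i}": f"{bus}:{dev}" for i, (bus, dev) in enumerate(order)}
```

In this model, names(breadth_first(BUSES)) maps eth0 to 0:5 and eth1 to
1:3, while names(depth_first(BUSES)) swaps them, because the bridge at 0:4
is followed before device 0:5 is reached.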

So, to address the ordering problem, I placed a constraint on our
server hardware teams, forcing them to lay out their boards and assign
PCIe lanes and bus numbers, such that at least the designed "first"
LOM would get found first in either depth-first or breadth-first
order.  Our 10G and 11G servers have this restriction in place, though
it wasn't easy.  And it's gotten even harder, as the PCIe switches
expand the number of lanes available.  We no longer have the
traditional tiered buses architecture, but the PCI layer for this
purpose thinks we do.  I need to remove this constraint on the
hardware teams - it's gotten to be impossible for the chipset lanes to
be laid out efficiently with this constraint.

All of the above just papered over the enumeration != naming problem.

Third, stateless computing is becoming more and more commonplace.  The
Field Replaceable Unit is the server itself.  Got a bad server?  Pull
it out, move the disks to an identical unit, insert the new server,
and go.  Fix the bad server offline and bring it back.  In this model,
having MAC addresses as the mechanism that is providing the
determinism (/etc/mactab or udev persistent naming rules) breaks,
because the MAC addresses of the ports on the new server won't be the
same as on the old server.  HP even has a technology to solve _this_
problem (in their blade chassis) - Virtual Connect.  The MACs get
assigned by the chassis to the blades at POST, and are fixed to the
slot.  Slick, and Dell has a similar, even more flexible feature,
FlexAddress.  This doesn't solve the OS installer problem of "which of
these NICs should I use to do an install?" but it does recognize the
problem space and tries to overcome it.

Fourth, for OS installers, choosing which NIC to use at installtime,
when all the NICs are plugged in, can be difficult.  PXE environments,
using pxelinux and its IPAPPEND 2 option, will append
"BOOTIF=xx:xx:xx:xx:xx:xx" to the kernel command line,
containing the MAC address of the NIC used for PXE.  Neat trick.  Yes,
we then had to teach the OS installers to recognize and use this.  But
it only works if you PXE boot, and only for that one NIC.
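
A rough sketch of what "recognize and use this" could look like in an
installer, in Python.  The function names are hypothetical.  One
assumption to flag: pxelinux is generally understood to write the value
with dashes and a leading hardware-type octet (for example
BOOTIF=01-aa-bb-cc-dd-ee-ff), so the parser below accepts that form as
well as the colon-separated form shown above.

```python
import os
import re


def bootif_mac(cmdline):
    """Return the MAC from a BOOTIF= argument as aa:bb:cc:dd:ee:ff, or None."""
    for arg in cmdline.split():
        if not arg.startswith("BOOTIF="):
            continue
        octets = re.split(r"[:-]", arg[len("BOOTIF="):].lower())
        if len(octets) == 7:
            # Assumed leading hardware-type octet ("01" for Ethernet); drop it.
            octets = octets[1:]
        if len(octets) == 6 and all(
            re.fullmatch(r"[0-9a-f]{2}", o) for o in octets
        ):
            return ":".join(octets)
    return None


def find_interface(mac, sysfs_net="/sys/class/net"):
    """Return the interface whose sysfs 'address' file equals mac, or None."""
    for name in sorted(os.listdir(sysfs_net)):
        try:
            with open(os.path.join(sysfs_net, name, "address")) as f:
                if f.read().strip().lower() == mac:
                    return name
        except OSError:
            continue  # entry without a readable address file; skip it
    return None
```

An installer could feed the contents of /proc/cmdline to bootif_mac() and
pass the result to find_interface().  As noted, this only identifies the
one NIC that was used for PXE.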

Fifth, network devices can have only a single name.  eth0.  If we look
at disks, we see udev manages a tree of symlinks for
/dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid. And as a
system admin, if I wanted to also create a udev rule for
/dev/disk/by-function (boot, swap, mattsstorage), it's trivial to do
so.  Why can't we have this flexibility for network devices too?

So, how do we get deterministic naming for all the NICs in a system?
That's what I'm going for.  Picture a network switch, with several
blades, and several ports on each blade.  The network admin addresses
each port as say 1/16 (the 16th port on blade 1, clearly labeled).
The parallel on servers is the chassis label printed on the outside
(say, "Gb1").  But due to the above, there is no guarantee, and in fact
little chance, that Gb1 will be consistently named eth0 - it may vary
from boot to boot.  That's full of fail.

For a concrete example, take the 4 bnx2 chips in my PowerEdge R610:
with a current 2.6 kernel, loading only one driver, the ports get
assigned names in nondeterministic order on each boot.  Given that the
ifcfg-eth* rules, netfilter rules, and the rest all expect
deterministic naming, massive failure ensues unless some form of
determinism is brought back in.

The idea to use a character device node to expose the ifindex value,
and udev to manage a tree of symlinks to it, really follows the model
used today for disks.  It allows us to get deterministic names for
devices (albeit, the names are symlinks), and multiple names for
devices (through multiple symlink rules).  That some people want to
use the char device to call ioctl() and read/write, as is possible on
the BSDs, would just be gravy IMHO.

It does require a change in behavior for a system administrator.
Instead of hard-coding 'eth0' into her scripts, she uses
'/dev/net/by-function/boot' or somesuch.  But then that name is
guaranteed to always refer to the "right" NIC.  Every admin I've
spoken to is willing to make this kind of change, as long as they get
the consistent, deterministic naming they expect but don't have
today.  And it does require patching userspace apps to take either a
kernel device name or a path, and to resolve the path to device name
or ifindex.  We wrote libnetdevname (really, one function), and have
patches for several userspace apps to use it, to prove it can be done.

One alternative would be to do something using the sysfs ifindex value
already exported.  e.g.
  /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/0000:06:07.0/net/eth0/ifindex

but we have never had symlinks from /dev into /sys before (doesn't
mean we couldn't though).  In that case, udev would grow to manage
/dev/net/by-chassis-label/Embedded_NIC_1 -> /sys/devices/.../net/eth0,
and libnetdevname would be used to follow the symlink in applications.
This approach could solve my problem without (many or any?) kernel
changes needed, but wouldn't help those who want to do
ioctl/read/write to a devnode.
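
To make the sysfs alternative concrete, here is a minimal sketch in
Python of the lookup an application would do.  This is not libnetdevname:
the function name and its sysfs_net parameter are invented for the
example.  It assumes only what is described above, namely that the
symlink resolves to a sysfs directory .../net/<name> containing an
ifindex file.

```python
import os


def resolve_netdev(name_or_path, sysfs_net="/sys/class/net"):
    """Return (kernel_name, ifindex) for a kernel name or a symlink path."""
    if os.sep in name_or_path:
        # A path such as /dev/net/by-chassis-label/Embedded_NIC_1:
        # follow symlinks to the sysfs directory .../net/<name>.
        target = os.path.realpath(name_or_path)
    else:
        # A plain kernel device name such as eth0.
        target = os.path.join(sysfs_net, name_or_path)
    with open(os.path.join(target, "ifindex")) as f:
        ifindex = int(f.read().strip())
    return os.path.basename(target), ifindex
```

An application that today accepts "eth0" could pass its argument through
such a function and keep using the returned kernel name or ifindex
unchanged.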

Given the problem, I really do need a solution.  I've proposed one
method, and an alternative, but I can't afford to let the problem stay
unaddressed any longer, and need a clear direction to be chosen.  The
char device gives me what I need, and others what they want also.

Thanks for listening to the diatribe.  For more examples and
workarounds that we've been telling our customers for several years,
check out http://linux.dell.com/papers.shtml for the Network Interface
Card Naming whitepaper.
Greg KH Oct. 10, 2009, 5:23 a.m. UTC | #3
On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> The fundamental roadblock to this is that enumeration != naming,
> except that it is for network devices, and we keep changing the
> enumeration order.

No, the hardware changes the enumeration order, it places _no_
guarantees on what order stuff will be found in.  So this is not the
kernel changing, just to be clear.

Again, I have a machine here that likes to reorder PCI devices every 4th
or so boot times, and that's fine according to the PCI spec.  Yeah, it's
a crappy BIOS, but the manufacturer rightly pointed out that it is not
in violation of anything.

> Today, port naming is completely nondeterministic.  If you have but
> one NIC, there are few chances to get the name wrong (it'll be eth0).
> If you have >1 NIC, chances increase to get it wrong.

That is why all distros name network devices based on the only
deterministic thing they have today, the MAC address.  I still fail to
see why you do not like this solution, it is honestly the only way to
properly name network devices in a sane manner.

All distros also provide a way to easily rename the network devices, to
place a specific name on a specific MAC address, so again, this should
all be solved already.

No matter how badly your BIOS teams mess up the PCI enumeration order :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sujit K M Oct. 10, 2009, 8:17 a.m. UTC | #4
Greg,


> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in.  So this is not the
> kernel changing, just to be clear.
> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.
>

I think the open call should be implemented then. By the patch very little
knowledge is being shared on type of network implementation it is trying to
do. Also it is messing with core datastructure and procedures. This seems
to be simplified by changing implementing the other operations like poll().

> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.

This is feature that needs to be implemented. As per the rules followed.

>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.
>
> No matter how badly your BIOS teams mess up the PCI enumeration order :)

This is a problem, but I think this can be solved by implementing some of the
routines in the network device.
Matt Domsch Oct. 10, 2009, 12:47 p.m. UTC | #5
On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> > The fundamental roadblock to this is that enumeration != naming,
> > except that it is for network devices, and we keep changing the
> > enumeration order.
> 
> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in.  So this is not the
> kernel changing, just to be clear.

Over time the kernel has changed its enumeration mechanisms, and
introduced parallelism into the process (which is a good thing),
which, from a user perspective, makes names nondeterministic.  Yes,
fixing this up by hard-coding MAC addresses after install has been
the traditional mechanism to address this.  I think there's a better
way.

> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.

I haven't encountered this myself, but yes, it's valid but annoying.
 
> > Today, port naming is completely nondeterministic.  If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
> 
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.

It's not the only way, it introduces state where there's a desire for
a stateless solution, it's useless for getting all the names right at
initial OS install time, and it restricts us to a single "name" for a
given device.

We can get additional information from BIOS.  SMBIOS 2.6 (types 9 and
41) has the fields to let us get a "label" for a device at a given
b/d/f.  On my PowerEdge R610, I see "Embedded NIC 1" .. "Embedded NIC
4" for the 4 LOMs.  These labels have a clear correlation to the
labels on the back of the chassis at these ports.  biosdevname can
parse and report this.  HP made a similar vendor-specific extension to
SMBIOS for their platforms, which biosdevname also parses.  Even if
BIOS decides they need to renumber the busses on every boot, it can
keep this table correct.  (insert general mistrust of BIOS authors
rant; that's not the point here.)

biosdevname can be used in udev rules to create multiple names for a
given device.  Rules such as:

 PROGRAM="/sbin/biosdevname --policy=all_names -i %k", SYMLINK+="net/by-slot-name/%c", OPTIONS+="string_escape=replace"
 PROGRAM="/sbin/biosdevname --policy=smbios_names -i %k", SYMLINK+="net/by-chassis-label/%c", OPTIONS+="string_escape=replace"

SMBIOS has its own problems, specifically that it's not hot-plug
aware (it's a static table created during POST).  And if a better way
is found (perhaps through the PCI SIG or ACPI), great, biosdevname can
be extended to use it.  But, without at least a change in udev or the
kernel, it doesn't do any good.
 
> No matter how badly your BIOS teams mess up the PCI enumeration
> order :)

In my case, the BIOS for a given system always configures the ports
the same way, and assigns b/d/f the same way.  With no change in the
BIOS or hardware, I still see the ports enumerated differently on each
boot. :-(
Greg KH Oct. 10, 2009, 4:25 p.m. UTC | #6
On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote:
> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
> > On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> > > The fundamental roadblock to this is that enumeration != naming,
> > > except that it is for network devices, and we keep changing the
> > > enumeration order.
> > 
> > No, the hardware changes the enumeration order, it places _no_
> > guarantees on what order stuff will be found in.  So this is not the
> > kernel changing, just to be clear.
> 
> Over time the kernel has changed its enumeration mechanisms, and
> introduced parallelism into the process (which is a good thing),
> which, from a user perspective, makes names nondeterministic.  Yes,
> fixing this up by hard-coding MAC addresses after install has been
> the traditional mechanism to address this.  I think there's a better
> way.

Ok, but that way can be done in userspace, without the need for this
char device, right?

> > > Today, port naming is completely nondeterministic.  If you have but
> > > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > > If you have >1 NIC, chances increase to get it wrong.
> > 
> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address.  I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
> >
> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
> 
> It's not the only way, it introduces state where there's a desire for
> a stateless solution, it's useless for getting all the names right at
> initial OS install time, and it restricts us to a single "name" for a
> given device.
> 
> We can get additional information from BIOS.  SMBIOS 2.6 (types 9 and
> 41) has the fields to let us get a "label" for an device at a given
> b/d/f.  On my PowerEdge R610, I see "Embedded NIC 1" .. "Embedded NIC
> 4" for the 4 LOMs.  These labels have a clear correlation to the
> labels on the back of the chassis at these ports.  biosdevname can
> parse and report this.  HP made a similar vendor-specific extension to
> SMBIOS for their platforms, which biosdevname also parses.  Even if
> BIOS decides they need to renumber the busses on every boot, it can
> keep this table correct.  (insert general mistrust of BIOS authors
> rant; that's not the point here.)
> 
> biosdevname can be used in udev rules to create multiple names for a
> given device.  Rules such as:

Yes, if you want multiple ways to name a network device, then you need
the char nodes.  But without that, you can just pick "always use the
biosdevname" type option from your distro setup screen and go with that.
Then you have everything always working properly from the very
beginning.

> > No matter how badly your BIOS teams mess up the PCI enumeration
> > order :)
> 
> In my case, the BIOS for a given system always configures the ports
> the same way, and assigns b/d/f the same way.  With no change in the
> BIOS or hardware, I still see the ports enumerated differently on each
> boot. :-(

Again, that's legal from a PCI standpoint :)

So you really want this for multiple ways to name the same network
device.  That's a choice the network developers are going to have to
make, as to if that is going to be a legal thing to have happen or not.

But this code is not a requirement to "solve" the fact that network
devices can show up in different order, that problem can be solved as
long as the user picks a single way to name the devices, using tools
that are already present today in distros.

thanks,

greg k-h
Greg KH Oct. 10, 2009, 4:27 p.m. UTC | #7
On Sat, Oct 10, 2009 at 01:47:39PM +0530, Sujit K M wrote:
> Greg,
> 
> 
> > No, the hardware changes the enumeration order, it places _no_
> > guarantees on what order stuff will be found in.  So this is not the
> > kernel changing, just to be clear.
> > Again, I have a machine here that likes to reorder PCI devices every 4th
> > or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> > a crappy BIOS, but the manufacturer rightly pointed out that it is not
> > in violation of anything.
> >
> 
> I think the open call should be implemented then. By the patch very little
> knowledge is being shared on type of network implementation it is trying to
> do.

What would open() accomplish?  What good would the file descriptor be?
What could you use it for?

> Also it is messing with core datastructure and procedures. This seems
> to be simplified by changing implementing the other operations like poll().

I don't understand.

> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address.  I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
> 
> This is feature that needs to be implemented. As per the rules followed.

This feature is already implemented today, all distros have it.

> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
> >
> > No matter how badly your BIOS teams mess up the PCI enumeration order :)
> 
> This is an problem, But I think this can be solved by implementing some of the
> routines in the network device.

I don't; see the rules that your distro ships today for persistent
network devices, it's already there, no need to change the kernel at
all.

thanks,

greg k-h
Bryan Kadzban Oct. 10, 2009, 5:34 p.m. UTC | #8
Greg KH wrote:
> On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote:
>> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
>>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
>>>> The fundamental roadblock to this is that enumeration !=
>>>> naming, except that it is for network devices, and we keep
>>>> changing the enumeration order.
>>> No, the hardware changes the enumeration order, it places _no_ 
>>> guarantees on what order stuff will be found in.  So this is not
>>> the kernel changing, just to be clear.
>> Over time the kernel has changed its enumeration mechanisms, and 
>> introduced parallelism into the process (which is a good thing), 
>> which, from a user perspective, makes names nondeterministic.  Yes,
>> fixing this up by hard-coding MAC addresses after install has been
>> the traditional mechanism to address this.  I think there's a
>> better way.
> 
> Ok, but that way can be done in userspace, without the need for this 
> char device, right?

For the record -- when I tried to send a patch that did exactly this
(provided an option to use by-path persistence for network drivers), it
was rejected because "that doesn't work for USB".

True, it doesn't.  But by-mac (what we have today) doesn't work for
replacing motherboards in a random home system (that can't override the
MAC address in the BIOS), either.

So why not provide both alternatives?

As you say below, it's up to the network devs whether this should be
allowed...

>> biosdevname can be used in udev rules to create multiple names for
>> a given device.  Rules such as:
> 
> Yes, if you want multiple ways to name a network device, then you
> need the char nodes.  But without that, you can just pick "always use
> the biosdevname" type option from your distro setup screen and go
> with that. Then you have everything always working properly from the
> very beginning.

*If* biosdevname works on your system.  It doesn't on mine: this SMBIOS
extension doesn't exist.  :-)

> So you really want this for multiple ways to name the same network 
> device.  That's a choice the network developers are going to have to 
> make, as to if that is going to be a legal thing to have happen or
> not.

Yes.  So do I, actually (for what little that's worth)...

> But this code is not a requirement to "solve" the fact that network 
> devices can show up in different order, that problem can be solved as
> long as the user picks a single way to name the devices, using tools
> that are already present today in distros.

This code is not a requirement, no.  But -- as you say -- it does
provide a halfway-decent way to assign multiple names to a NIC.  And
that provides admins the choice to use a couple different persistence
schemes, depending on how they expect their hardware to work.

(It *may* even be possible to use some kind of layer-2 traffic to see
what else is on the connected network and provide symlinks based on
that.  IPv6 autoconfig type of thing, maybe.  That's probably a *lot*
more complicated, and may be impossible, but would be even closer to
what I think Dell customers are asking for based on Matt's posts.)
Bill Fink Oct. 10, 2009, 6:11 p.m. UTC | #9
On Fri, 9 Oct 2009, Greg KH wrote:

> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> > The fundamental roadblock to this is that enumeration != naming,
> > except that it is for network devices, and we keep changing the
> > enumeration order.
> 
> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in.  So this is not the
> kernel changing, just to be clear.
> 
> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.
> 
> > Today, port naming is completely nondeterministic.  If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
> 
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
> 
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.
> 
> No matter how badly your BIOS teams mess up the PCI enumeration order :)

No comment on the specific implementation decision, but I am in the
process of setting up a large number of test systems with identical
hardware configurations, and using a master disk image to clone all the
test systems.  The biggest pain in this process is identifying the MAC
addresses for each of the six or more network interfaces in each test
system (we want eth0...ethN to always reference the same physical port
on the test systems), and then having to modify the 70-persistent-net.rules
udev file and the HWADDR entry for all the ifcfg-ethX files to reflect
the correct MAC addresses.  It would be fantastic if there were some
mechanism for making this part of the process unnecessary.

						-Bill
stephen hemminger Oct. 10, 2009, 6:32 p.m. UTC | #10
On Fri, 9 Oct 2009 23:40:57 -0500
Matt Domsch <Matt_Domsch@dell.com> wrote:

> On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote:
> > Maybe I'm dense but can't see why having useless /dev/net/ symlinks
> > is a good interface choice. Perhaps you should explain the race between
> > PCI scan and udev in more detail, and why solving it in either of those
> > places won't work. As it stands you are proposing yet another wart to
> > the already complex set of network interface APIs which has implications
> > for security as well as increasing the number of possible bugs.
> 
> The fundamental challenge is that system administrators, particularly
> those of server-class hardware with multiple network ports present
> (some on the motherboard, some on add-in cards), have the
> not-so-unreasonable expectation that there is a deterministic mapping
> between those ports and the name one uses to address those ports.
> 
> The fundamental roadblock to this is that enumeration != naming,
> except that it is for network devices, and we keep changing the
> enumeration order.
> 
> Today, port naming is completely nondeterministic.  If you have but
> one NIC, there are few chances to get the name wrong (it'll be eth0).
> If you have >1 NIC, chances increase to get it wrong.
> 
> The complexity arises at multiple levels.
> 
> First, device driver load order.  In the 2.4 kernel days, and even
> mostly early 2.6 kernel days, the order in which network drivers
> loaded played a role in determining the name of the device.  Drivers
> loaded first would get their devices named first.  If I have two types
> of devices, say an e100-driven NIC and a tg3-driven NIC, I could
> figure out that the names would be eth0=e100 and eth1=tg3 by setting
> the load order in /etc/modules.conf (now modprobe.conf).  If I wanted
> the other order, fine, just switch it around in modules.conf and
> reboot.  OS installers, being the first running instance of Linux,
> before modprobe.conf existed to set that ordering, had to have other
> mechanisms to load drivers (often manually, or if programmatically
> such as in a kickstart or autoyast file, was still somewhat fixed).
> 
> With the advent of modaliases + udev, now modprobe.conf doesn't
> contain this ordering anymore, and udev loads the drivers.  So while
> it wasn't perfect, it was better than nothing, and that's gone now.
> 
> It gets even worse as, to speed up boot time, modprobes can be run in
> parallel, and even within individual drivers, the NICs get initialized
> (and named) in parallel.  Further confusing things, some devices need
> firmware loaded into them before getting names assigned, which is done
> from userspace, and they race.
> 
> Second, PCI device list order.  In the 2.4 kernel days, the PCI device
> list was scanned "breadth-first" (for each bus; for each device; for
> each function; do load...).  FWIW, Windows still does this.  It gives
> BIOS, which assigns PCI bus numbers, a chance to put LOMs at a lower
> bus number than add-in cards.  Module load order still mattered, but
> at least if you had say 2 e1000 ports as LOMs, and 2 e1000 ports on
> add-in cards, you pretty much knew the ordering would be eth0 as
> lowest bdf on the motherboard, eth1 as next bdf on the motherboard,
> and eth2 and 3 as the add-in cards in ascending slot order.
> 
> With the advent of PCI hot plug in the 2.5 kernel series, the
> breadth-first ordering became depth-first (for each bus; for each
> device; if the device is a bridge, scan the busses behind it).  This
> caused NICs on bus 0 device 5, and bus 1 device 3, (eth0 and 1
> respectively) to be enumerated differently due to a bridge from
> bus 0 to bus 1 at 0:4.  My crude hack of pci=bfsort, with some dmi
> strings to match and auto-enable, at least reverted this back to the
> ordering the 2.4 kernel and Windows used.  Now we have to keep adding
> systems to this DMI list (Dell has a number of systems on this list
> today; HP has even more).  And it doesn't completely solve the
> problem, just masks it.
> 
> So, to address the ordering problem, I placed a constraint on our
> server hardware teams, forcing them to lay out their boards and assign
> PCIe lanes and bus numbers, such that at least the designed "first"
> LOM would get found first in either depth-first or breadth-first
> order.  Our 10G and 11G servers have this restriction in place, though
> it wasn't easy.  And it's gotten even harder, as the PCIe switches
> expand the number of lanes available.  We no longer have the
> traditional tiered buses architecture, but the PCI layer for this
> purpose thinks we do.  I need to remove this constraint on the
> hardware teams - it's gotten to be impossible for the chipset lanes to
> be laid out efficiently with this constraint.
> 
> All of the above just papered over the enumeration != naming problem.
> 
> Third, stateless computing is becoming more and more commonplace.  The
> Field Replaceable Unit is the server itself.  Got a bad server?  Pull
> it out, move the disks to an identical unit, insert the new server,
> and go.  Fix the bad server offline and bring it back.  In this model,
> having MAC addresses as the mechanism that is providing the
> determinism (/etc/mactab or udev persistent naming rules) breaks,
> because the MAC addresses of the ports on the new server won't be the
> same as on the old server.  HP even has a technology to solve _this_
> problem (in their blade chassis) - Virtual Connect.  The MACs get
> assigned by the chassis to the blades at POST, and are fixed to the
> slot.  Slick, and Dell has an even more flexible similar feature
> FlexAddress.  This doesn't solve the OS installer problem of "which of
> these NICs should I use to do an install?" but it does recognize the
> problem space and tries to overcome it.
> 
> Fourth, for OS installers, choosing which NIC to use at installtime,
> when all the NICs are plugged in, can be difficult.  PXE environments,
> using pxelinux and its IPAPPEND 2 option, will append
> "BOOTIF=xx:xx:xx:xx:xx:xx" to the kernel command line, a value
> containing the MAC address of the NIC used for PXE.  Neat trick.  Yes,
> we then had to teach the OS installers to recognize and use this.  But
> it only works if you PXE boot, and only for that one NIC.
> 
> Fifth, network devices can have only a single name.  eth0.  If we look
> at disks, we see udev manages a tree of symlinks for
> /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid. And as a
> system admin, if I wanted to also create a udev rule for
> /dev/disk/by-function (boot, swap, mattsstorage), it's trivial to do
> so.  Why can't we have this flexibility for network devices too?
> 
> So, how do we get deterministic naming for all the NICs in a system?
> That's what I'm going for.  Picture a network switch, with several
> blades, and several ports on each blade.  The network admin addresses
> each port as say 1/16 (the 16th port on blade 1, clearly labeled).
> The parallel on servers is the chassis label printed on the outside
> (say, "Gb1").  But due to the above, there is no guarantee, and in fact
> little chance, that Gb1 will be consistently named eth0 - it may vary
> from boot to boot.  That's full of fail.
> 
> For a concrete example, the 4 bnx2 chips in my PowerEdge R610 with a
> current 2.6 kernel, loading only one driver, the ports get assigned
> names in nondeterministic order on each boot.  Given that the
> ifcfg-eth* rules, netfilter rules, and the rest all expect
> deterministic naming, massive failure ensues unless some form of
> determinism is brought back in.
> 
> The idea to use a character device node to expose the ifindex value,
> and udev to manage a tree of symlinks to it, really follows the model
> used today for disks.  It allows us to get deterministic names for
> devices (albeit, the names are symlinks), and multiple names for
> devices (through multiple symlink rules).  That some people want to
> use the char device to call ioctl() and read/write, as is possible on
> the BSDs, would just be gravy IMHO.
> 
> It does require a change in behavior for a system administrator.
> Instead of hard-coding 'eth0' into her scripts, she uses
> '/dev/net/by-function/boot' or somesuch.  But then that name is
> guaranteed to always refer to the "right" NIC.  Every admin I've
> spoken to is willing to make this kind of change, as long as they get
> the consistent, deterministic naming they expect but don't have
> today.  And it does require patching userspace apps to take either a
> kernel device name or a path, and to resolve the path to device name
> or ifindex.  We wrote libnetdevname (really, one function), and have
> patches for several userspace apps to use it, to prove it can be done.
> 
> One alternative would be to do something using the sysfs ifindex value
> already exported.  e.g.
>   /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/0000:06:07.0/net/eth0/ifindex
> 
> but we have never had symlinks from /dev into /sys before (doesn't
> mean we couldn't though).  In that case, udev would grow to manage
> /dev/net/by-chassis-label/Embedded_NIC_1 -> /sys/devices/.../net/eth0,
> and libnetdevname would be used to follow the symlink in applications.
> This approach could solve my problem without (many or any?) kernel
> changes needed, but wouldn't help those who want to do
> ioctl/read/write to a devnode.
> 
> Given the problem, I really do need a solution.  I've proposed one
> method, and an alternative, but I can't afford to let the problem stay
> unaddressed any longer, and need a clear direction to be chosen.  The
> char device gives me what I need, and others what they want also.
> 
> Thanks for listening to the diatribe.  For more examples and
> workarounds that we've been telling our customers for several years,
> check out http://linux.dell.com/papers.shtml for the Network Interface
> Card Naming whitepaper.
> 
> 

Why isn't the information available through sysfs enough?  If it isn't,
why not add the necessary attributes there?
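
As a rough illustration of what sysfs already offers (a sketch, not part
of the patch): the ifindex attribute shown in the quoted path can be read
with plain file I/O.  The helper name sysfs_ifindex() and the use of the
/sys/class/net/<name>/ shortcut path are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the ifindex that sysfs exports for a kernel
 * interface name.  Returns the ifindex, or -1 if the attribute cannot be
 * read (for instance because no such interface exists). */
static int sysfs_ifindex(const char *ifname)
{
    char path[256];
    int ifindex = -1;

    assert(ifname != NULL);

    int n = snprintf(path, sizeof(path), "/sys/class/net/%s/ifindex", ifname);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* name too long for the path buffer */

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;              /* attribute missing or unreadable */

    if (fscanf(f, "%d", &ifindex) != 1)
        ifindex = -1;           /* attribute did not contain a number */
    fclose(f);
    return ifindex;
}
```

This only covers the lookup by kernel name; it does not by itself provide
the additional names that the symlink scheme is after.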

BTW, for our distro, we are looking into device renaming based on PCI slot
because that is what router OS's do. Customers expect if they replace the card
in slot 0, it will come back with the same name.  This is not what server
customers expect.
Kay Sievers Oct. 10, 2009, 6:35 p.m. UTC | #11
On Sat, Oct 10, 2009 at 20:11, Bill Fink <billfink@mindspring.com> wrote:
> No comment on the specific implementation decision, but I am in the
> process of setting up a large number of test systems with identical
> hardware configurations, and using a master disk image to clone all the
> test systems.  The biggest pain in this process is identifying the MAC
> addresses for each of the six or more network interfaces in each test
> system (we want eth0...ethN to always reference the same physical port
> on the test systems), and then having to modify the 70-persistent-net.rules
> udev file and the HWADDR entry for all the ifcfg-ethX files to reflect
> the correct MAC addresses.  It would be fantastic if there were some
> mechanism for making this part of the process unnecessary.

Udev creates the persistent rules only if no other rule set a name.
Adding something like:
  SUBSYSTEM=="net", KERNEL=="eth*", NAME="eth%n"
in any earlier rules file before the udev generated one will skip all
of the automatic udev rule creation.

Kay
Ben Hutchings Oct. 10, 2009, 7 p.m. UTC | #12
On Sat, 2009-10-10 at 09:27 -0700, Greg KH wrote:
> On Sat, Oct 10, 2009 at 01:47:39PM +0530, Sujit K M wrote:
> > Greg,
> > 
> > 
> > > No, the hardware changes the enumeration order, it places _no_
> > > guarantees on what order stuff will be found in.  So this is not the
> > > kernel changing, just to be clear.
> > > Again, I have a machine here that likes to reorder PCI devices every 4th
> > > or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> > > a crappy BIOS, but the manufacturer rightly pointed out that it is not
> > > in violation of anything.
> > >
> > 
> > I think the open call should be implemented then. By the patch very little
> > knowledge is being shared on type of network implementation it is trying to
> > do.
> 
> What would open() accomplish?  What good would the file descriptor be?
> What could you use it for?

Currently all net device ioctls are carried out through arbitrary
sockets and identify the device by name (aside from one to look up the
name by ifindex).  Ever since it became possible to rename net devices,
it has been possible for a sequence of ioctls intended for one device to
race with renaming of that device.  Adding open() and ioctl() to the
character device (which seems reasonably easy) would provide a way to
avoid this.

On the other hand, the netlink configuration APIs already use ifindex so
it may be better just to say that the device ioctls are deprecated and
applications should use netlink.

> > Also it is messing with core datastructure and procedures. This seems
> > to be simplified by changing implementing the other operations like poll().
> 
> I don't understand.
> 
> > > That is why all distros name network devices based on the only
> > > deterministic thing they have today, the MAC address.  I still fail to
> > > see why you do not like this solution, it is honestly the only way to
> > > properly name network devices in a sane manner.
> > 
> > This is feature that needs to be implemented. As per the rules followed.
> 
> This feature is already implemented today, all distros have it.

No, see below.

> > > All distros also provide a way to easily rename the network devices, to
> > > place a specific name on a specific MAC address, so again, this should
> > > all be solved already.
> > >
> > > No matter how badly your BIOS teams mess up the PCI enumeration order :)
> > 
> > This is a problem, but I think this can be solved by implementing some of the
> > routines in the network device.
> 
> I don't, see the rules that your distro ships today for persistent
> network devices, it's already there, no need to change the kernel at
> all.

The udev persistent net rules work tolerably well for a single system
with a stable set of net devices.

They do not solve the problem Matt's talking about, which is lack of
consistency between multiple systems, because the initial enumeration
order is not predictable.

They also result in name changes when a NIC (or motherboard) is swapped.
For some users, that's fine; for others, it's not.

The ability to specify NICs by port name or PCI address should solve
these problems.

Ben.
Greg KH Oct. 10, 2009, 9:06 p.m. UTC | #13
On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> 
> BTW, for our distro, we are looking into device renaming based on PCI slot
> because that is what router OS's do. Customers expect if they replace the card
> in slot 0, it will come back with the same name.  This is not what server
> customers expect.

If your bios exposes the PCI slots to userspace (through the proper ACPI
namespace), doing this type of naming should be trivial with some simple
udev rules, no additional kernel infrastructure is needed.

thanks,

greg k-h
Greg KH Oct. 10, 2009, 9:10 p.m. UTC | #14
On Sat, Oct 10, 2009 at 08:00:30PM +0100, Ben Hutchings wrote:
> On the other hand, the netlink configuration APIs already use ifindex so
> it may be better just to say that the device ioctls are deprecated and
> applications should use netlink.

I thought that is what was already encouraged to happen.

> > > > That is why all distros name network devices based on the only
> > > > deterministic thing they have today, the MAC address.  I still fail to
> > > > see why you do not like this solution, it is honestly the only way to
> > > > properly name network devices in a sane manner.
> > > 
> > > This is feature that needs to be implemented. As per the rules followed.
> > 
> > This feature is already implemented today, all distros have it.
> 
> No, see below.

Yes, if not, file a bug in your distro, all of the infrastructure is
already in place, and the udev rules and scripts are already written.

> > > > All distros also provide a way to easily rename the network devices, to
> > > > place a specific name on a specific MAC address, so again, this should
> > > > all be solved already.
> > > >
> > > > No matter how badly your BIOS teams mess up the PCI enumeration order :)
> > > 
> > > This is a problem, but I think this can be solved by implementing some of the
> > > routines in the network device.
> > 
> > I don't, see the rules that your distro ships today for persistent
> > network devices, it's already there, no need to change the kernel at
> > all.
> 
> The udev persistent net rules work tolerably well for a single system
> with a stable set of net devices.
> 
> They do not solve the problem Matt's talking about, which is lack of
> consistency between multiple systems, because the initial enumeration
> order is not predictable.

Again, you name the device as a MAC address.  Or something else that the
BIOS exports in a unique manner (PCI slot name, etc.).  That is
consistent.  If not, then fix the BIOS.

> They also result in name changes when a NIC (or motherboard) is swapped.
> For some users, that's fine; for others, it's not.
> 
> The ability to specify NICs by port name or PCI address should solve
> these problems.

That can be done today quite easily.  But note that PCI addresses are
not guaranteed to be stable, as lots of machines are known to
demonstrate.

Again, none of this requires any kernel changes today at all, let alone
adding dummy char devices for network devices.

thanks,

greg k-h
Greg KH Oct. 10, 2009, 9:13 p.m. UTC | #15
On Sat, Oct 10, 2009 at 10:34:16AM -0700, Bryan Kadzban wrote:
> Greg KH wrote:
> > On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote:
> >> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
> >>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> >>>> The fundamental roadblock to this is that enumeration !=
> >>>> naming, except that it is for network devices, and we keep
> >>>> changing the enumeration order.
> >>> No, the hardware changes the enumeration order, it places _no_ 
> >>> guarantees on what order stuff will be found in.  So this is not
> >>> the kernel changing, just to be clear.
> >> Over time the kernel has changed its enumeration mechanisms, and 
> >> introduced parallelism into the process (which is a good thing), 
> >> which, from a user perspective, makes names nondeterministic.  Yes,
> >> fixing this up by hard-coding MAC addresses after install has been
> >> the traditional mechanism to address this.  I think there's a
> >> better way.
> > 
> > Ok, but that way can be done in userspace, without the need for this 
> > char device, right?
> 
> For the record -- when I tried to send a patch that did exactly this
> (provided an option to use by-path persistence for network drivers), it
> was rejected because "that doesn't work for USB".
> 
> True, it doesn't.  But by-mac (what we have today) doesn't work for
> replacing motherboards in a random home system (that can't override the
> MAC address in the BIOS), either.

If you replace a motherboard, you honestly expect no configuration to be
needed to be changed?  If so, then don't use the MAC naming scheme for
your systems.

> > But this code is not a requirement to "solve" the fact that network 
> > devices can show up in different order, that problem can be solved as
> > long as the user picks a single way to name the devices, using tools
> > that are already present today in distros.
> 
> This code is not a requirement, no.  But -- as you say -- it does
> provide a halfway-decent way to assign multiple names to a NIC.  And
> that provides admins the choice to use a couple different persistence
> schemes, depending on how they expect their hardware to work.

But the names need to then be resolved back to a "real" kernel name in
order to do anything with that network connection, as the char devices
are not real ones.  So that adds an additional layer of complexity on
all of the system configuration tools.

thanks,

greg k-h
Marco d'Itri Oct. 11, 2009, 12:37 a.m. UTC | #16
On Oct 10, Matt Domsch <Matt_Domsch@dell.com> wrote:

> It does require a change in behavior for a system administrator.
> Instead of hard-coding 'eth0' into her scripts, she uses
> '/dev/net/by-function/boot' or somesuch.  But then that name is
> guaranteed to always refer to the "right" NIC.  Every admin I've
> spoken to is willing to make this kind of change, as long as they get
> the consistent, deterministic naming they expect but don't have
> today.  And it does require patching userspace apps to take either a
> kernel device name or a path, and to resolve the path to device name
> or ifindex.  We wrote libnetdevname (really, one function), and have
> patches for several userspace apps to use it, to prove it can be done.
For the record, before being a distribution developer I am a system
administrator (who designed and manages many firewalls with multiple
network interfaces) and I am still unconvinced that what you are
proposing is a practical solution and that its benefits justify the
significant changes both in software and in system administration
practices that it requires.
The first issue which greatly concerns me is the need to modify *every*
userspace application and kernel tool (what about iptables? What about
the kernel logs?): from a user experience point of view it would be
very annoying if different applications used different names to refer to
the same network device.
I am also concerned with the practical implications of trying to use
such long (and unusual) names: IFNAMSIZ is 16, so user interfaces tend
to assume both short names and that they match something like
/^[a-z0-9]+$/. What about e.g. distribution scripts which use the
interface name as a file system path component? Do you already have a
(standard) scheme to losslessly convert the names to a form without
slashes?
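
One possible shape for such a conversion, offered purely as an
illustration and not as anything proposed in this thread: escape every
byte outside [A-Za-z0-9.-] as %XX, which is reversible because '%' itself
gets escaped.  The function names escape_name() and fits_kernel_name()
are hypothetical.  The sketch also shows the size concern: the escaped
form of a path-like name easily exceeds what a kernel interface name can
hold.

```c
#include <assert.h>
#include <ctype.h>
#include <net/if.h>
#include <string.h>

/* Hypothetical helper: write a slash-free, reversible form of `in` to
 * `out`.  Bytes outside [A-Za-z0-9.-] become %XX (uppercase hex).
 * Returns the escaped length, or -1 if `out` is too small. */
static int escape_name(const char *in, char *out, size_t outlen)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    assert(in != NULL && out != NULL);

    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c) || c == '.' || c == '-') {
            if (o + 1 >= outlen)
                return -1;      /* no room for this byte plus the NUL */
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outlen)
                return -1;      /* no room for %XX plus the NUL */
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    if (o >= outlen)
        return -1;              /* no room for the terminating NUL */
    out[o] = '\0';
    return (int)o;
}

/* True if the string would fit in a kernel interface name buffer
 * (IF_NAMESIZE bytes including the terminating NUL). */
static int fits_kernel_name(const char *s)
{
    return strlen(s) < IF_NAMESIZE;
}
```

With this scheme "by-function/boot" becomes "by-function%2Fboot", which is
slash-free but 18 characters long, so it does not fit in IFNAMSIZ and
does not match the /^[a-z0-9]+$/ assumption either.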
David Zeuthen Oct. 11, 2009, 4:40 p.m. UTC | #17
On Sat, 2009-10-10 at 09:25 -0700, Greg KH wrote:
> Ok, but that way can be done in userspace, without the need for this
> char device, right?

It might actually be nice to have a device file anyway since you can use
existing udev infrastructure to adjust permissions (e.g. chown it to the
netdev group) and add ACLs. This would allow running some software as an
unprivileged user instead of uid 0.

      David


Greg KH Oct. 11, 2009, 6:47 p.m. UTC | #18
On Sun, Oct 11, 2009 at 12:40:18PM -0400, David Zeuthen wrote:
> On Sat, 2009-10-10 at 09:25 -0700, Greg KH wrote:
> > Ok, but that way can be done in userspace, without the need for this
> > char device, right?
> 
> It might actually be nice to have a device file anyway since you can use
> existing udev infrastructure to adjust permissions (e.g. chown it to the
> netdev group) and add ACLs. This would allow running some software as an
> unprivileged user instead of uid 0.

But as the char nodes would not actually control access to anything, how
would this help?  Remember, these device nodes are "dummies" with
nothing behind them (open() fails).

thanks,

greg k-h
Rob Townley Oct. 11, 2009, 9:10 p.m. UTC | #19
On Sat, Oct 10, 2009 at 12:23 AM, Greg KH <greg@kroah.com> wrote:
> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
>> The fundamental roadblock to this is that enumeration != naming,
>> except that it is for network devices, and we keep changing the
>> enumeration order.
>
> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in.  So this is not the
> kernel changing, just to be clear.
>
> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec.  Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.
>
>> Today, port naming is completely nondeterministic.  If you have but
>> one NIC, there are few chances to get the name wrong (it'll be eth0).
>> If you have >1 NIC, chances increase to get it wrong.
>
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.
>
> No matter how badly your BIOS teams mess up the PCI enumeration order :)
>
> thanks,
>
> greg k-h

So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
the add-in cards will come before the motherboard NICs.   i don't like it.

But please whatever is done, make sure ping and tracert still works when
telling it to use a ethX source interface:

eth0 = 4.3.2.8, the default gateway is thru eth1.
ping -I eth0 208.67.222.222              FAILS
ping -I 4.3.2.8 208.67.222.222          WORKS
tracert -i eth0 -I 208.67.222.222        FAILS
tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
tracert -i eth0 208.67.222.222           FAILS
tracert -s 4.3.2.8 208.67.222.222      WORKS
Matt Domsch Oct. 11, 2009, 11:04 p.m. UTC | #20
On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
> the add-in cards will come before the motherboard NICs.   i don't like it.

Actually, MAC address has nothing to do with device naming/ordering at
all.  Often systems will have onboard NICs in ascending MAC address
order, but that's not a requirement, and I've seen systems not do
that.  And once you get to add-in vs onboard, BIOS wouldn't be able to
enforce such an ordering anyhow (in general).

But yes, you raise the point that, without using MAC-assigned names or
another naming mechanism designed to cope with this, adding or
removing a card can cause a difference in device enumeration, and thus
name.
Greg KH Oct. 12, 2009, 3 a.m. UTC | #21
On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
> the add-in cards will come before the motherboard NICs.   i don't like it.

Huh?  Have you used the MAC persistent rules?  If you add a new card,
what does it pick for it?

> But please whatever is done, make sure ping and tracert still works when
> telling it to use a ethX source interface:
> 
> eth0 = 4.3.2.8, the default gateway is thru eth1.
> ping -I eth0 208.67.222.222              FAILS
> ping -I 4.3.2.8 208.67.222.222          WORKS
> tracert -i eth0 -I 208.67.222.222        FAILS
> tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
> tracert -i eth0 208.67.222.222           FAILS
> tracert -s 4.3.2.8 208.67.222.222      WORKS

Again, is what we currently have broken?  I am confused as to what this
is referring to.

greg k-h
Bryan Kadzban Oct. 12, 2009, 6:21 a.m. UTC | #22
Greg KH wrote:
> On Sat, Oct 10, 2009 at 10:34:16AM -0700, Bryan Kadzban wrote:
>> Greg KH wrote:
>>> On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote:
>>>> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
>>>>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
>>>>>> The fundamental roadblock to this is that enumeration != 
>>>>>> naming, except that it is for network devices, and we keep 
>>>>>> changing the enumeration order.
>>>>> No, the hardware changes the enumeration order, it places
>>>>> _no_ guarantees on what order stuff will be found in.  So
>>>>> this is not the kernel changing, just to be clear.
>>>> Over time the kernel has changed its enumeration mechanisms,
>>>> and introduced parallelism into the process (which is a good
>>>> thing), which, from a user perspective, makes names
>>>> nondeterministic.  Yes, fixing this up by hard-coding MAC
>>>> addresses after install has been the traditional mechanism to
>>>> address this.  I think there's a better way.
>>> Ok, but that way can be done in userspace, without the need for
>>> this char device, right?
>> For the record -- when I tried to send a patch that did exactly
>> this (provided an option to use by-path persistence for network
>> drivers), it was rejected because "that doesn't work for USB".
>> 
>> True, it doesn't.  But by-mac (what we have today) doesn't work for
>> replacing motherboards in a random home system (that can't override
>> the MAC address in the BIOS), either.
> 
> If you replace a motherboard, you honestly expect no configuration to
> be needed to be changed?  If so, then don't use the MAC naming scheme
> for your systems.

What else is there?  biosdevname doesn't work with this BIOS.  It looks
like at least path_id has been updated to work with NICs now, so that
might work, with a bit of custom rule hacking.

Or at least, it won't work any more poorly than for disks, which seem to
work pretty well...  :-)

>>> But this code is not a requirement to "solve" the fact that
>>> network devices can show up in different order, that problem can
>>> be solved as long as the user picks a single way to name the
>>> devices, using tools that are already present today in distros.
>> This code is not a requirement, no.  But -- as you say -- it does 
>> provide a halfway-decent way to assign multiple names to a NIC.
>> And that provides admins the choice to use a couple different
>> persistence schemes, depending on how they expect their hardware to
>> work.
> 
> But the names need to then be resolved back to a "real" kernel name
> in order to do anything with that network connection, as the char
> devices are not real ones.  So that adds an additional layer of
> complexity on all of the system configuration tools.

Yes, that is true -- and no, this change isn't perfect.  But it lets me
have multiple "names" per interface, and "names" that are longer than
IFNAMSIZ, which is why I like it.

(Now, if open() would return effectively a netlink socket bound to that
ifindex already, such that the program didn't need to fill in the
various ifindex fields for e.g. rtnetlink... but it's probably really
hard to do that, so this isn't a serious suggestion.)
Kurt Van Dijck Oct. 12, 2009, 7:30 a.m. UTC | #23
On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> > On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote:
[...]
> 
> Why isn't the information available through sysfs enough?  If it isn't,
> why not add the necessary attributes there?
True. If sysfs is not sufficient, what exact naming scheme could be
applied that the chardev based naming could use?
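
For reference, a minimal sketch of how a userspace tool can already read
per-interface attributes such as ifindex or address from sysfs. The helper
name read_net_attr and its sysfs_root parameter (normally "/sys") are
assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one attribute of a network interface from sysfs into buf and
 * strip the trailing newline.  Returns 0 on success, -1 on error.
 *
 * sysfs_root is normally "/sys"; it is a parameter only so the helper
 * can be pointed at another tree (an assumption of this sketch).
 */
static int read_net_attr(const char *sysfs_root, const char *ifname,
			 const char *attr, char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(sysfs_root && ifname && attr && buf && len > 0);

	n = snprintf(path, sizeof(path), "%s/class/net/%s/%s",
		     sysfs_root, ifname, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Calling read_net_attr("/sys", "eth0", "address", buf, sizeof(buf)) would
fill buf with the MAC address string if an interface named eth0 exists.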
> 
[...]

Kurt
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bryan Kadzban Oct. 12, 2009, 4:19 p.m. UTC | #24
Bryan Kadzban wrote:
> (Now, if open() would return effectively a netlink socket bound to
> that ifindex already, such that the program didn't need to fill in
> the various ifindex fields for e.g. rtnetlink... but it's probably
> really hard to do that, so this isn't a serious suggestion.)

Wait, scratch that.  It's not "really hard", it's "almost impossible".

At open() time, you have no idea which netlink family the program wants
to communicate with.  bind() is also hard.  (In theory, you could
support bind() on this new FD -- but then why is userspace using a file
in the first place, and not a socket?)

So this is even less of a serious suggestion now.
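
For context, a minimal sketch of what a program has to do today: the netlink
family (NETLINK_ROUTE for rtnetlink) is chosen when the socket is created,
and the program itself places the ifindex into each request. The struct name
link_req and the helper build_getlink_req are hypothetical; the header
fields and constants come from the Linux UAPI headers.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* An RTM_GETLINK request: netlink header followed by the link selector. */
struct link_req {
	struct nlmsghdr hdr;
	struct ifinfomsg ifi;
};

/*
 * Fill in a request asking rtnetlink about one interface.  The caller
 * supplies the ifindex; nothing about the socket implies it.
 */
static void build_getlink_req(struct link_req *req, int ifindex)
{
	assert(req != NULL && ifindex > 0);

	memset(req, 0, sizeof(*req));
	req->hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req->hdr.nlmsg_type = RTM_GETLINK;
	req->hdr.nlmsg_flags = NLM_F_REQUEST;
	req->ifi.ifi_family = AF_UNSPEC;
	req->ifi.ifi_index = ifindex;
}
```

The request would be sent on a socket opened with
socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE), which is where the family is
fixed and why a plain open() on a device node has no way to pick it.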

I'd still like to be able to refer to NICs by multiple names though, if
we can find a way that works...
Bill Nottingham Oct. 12, 2009, 5:45 p.m. UTC | #25
Greg KH (greg@kroah.com) said: 
> > Today, port naming is completely nondeterministic.  If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
> 
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address.  I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
> 
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.

No, it's not solved. Even if you have persistent names once you install,
if you ever re-image, you're likely to get *different* persistent names;
the first load will always be non-deterministic.

The only way around this would be to have some sort of screen like:

  Would you like your network devices to be enumerated by

  [ ] MAC address
  [ ] PCI device order
  [ ] Driver name
  [ ] Other

which is just all sorts of fail in and of itself. Especially since
once you get to the point where you can coherently ask this in a
native installer, the drivers have already loaded.

Bill
Greg KH Oct. 12, 2009, 5:55 p.m. UTC | #26
On Mon, Oct 12, 2009 at 01:45:28PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said: 
> > > Today, port naming is completely nondeterministic.  If you have but
> > > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > > If you have >1 NIC, chances increase to get it wrong.
> > 
> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address.  I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
> > 
> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
> 
> No, it's not solved. Even if you have persistent names once you install,
> if you ever re-image, you're likely to get *different* persistent names;
> the first load will always be non-deterministic.
> 
> The only way around this would be to have some sort of screen like:
> 
>   Would you like your network devices to be enumerated by
> 
>   [ ] MAC address
>   [ ] PCI device order
>   [ ] Driver name
>   [ ] Other

[ ] PCI slot name

That's one that modern systems are now reporting, and should solve
Matt's problem as well, right?

> which is just all sorts of fail in and of itself. Especially since
> once you get to the point where you can coherently ask this in a
> native installer, the drivers have already loaded.

No, the driver load order doesn't determine this, you need the drivers
loaded first before you can rename anything :)

And I don't see how Matt's proposed patch helps resolve this type of
issue any better than what we currently have today, do you?

thanks,

greg k-h
Bill Nottingham Oct. 12, 2009, 6:07 p.m. UTC | #27
Greg KH (greg@kroah.com) said: 
> > No, it's not solved. Even if you have persistent names once you install,
> > if you ever re-image, you're likely to get *different* persistent names;
> > the first load will always be non-deterministic.
> > 
> > The only way around this would be to have some sort of screen like:
> > 
> >   Would you like your network devices to be enumerated by
> > 
> >   [ ] MAC address
> >   [ ] PCI device order
> >   [ ] Driver name
> >   [ ] Other
> 
> [ ] PCI slot name
> 
> That's one that modern systems are now reporting, and should solve
> Matt's problem as well, right?

... maybe. On my laptop, the first 'slot' enumerated appears to be
the cardbus bridge, before the on-board ethernet. And on the desktop
next to me, the slot driver shows nothing.

> And I don't see how Matt's proposed patch helps resolve this type of
> issue any better than what we currently have today, do you?

It allows multiple addressing schemes to be active at once, which
can allow the admin to choose post-install without making an
active choice at installation. This is an improvement, even if
it doesn't solve the world.

Bill
Greg KH Oct. 12, 2009, 6:15 p.m. UTC | #28
On Mon, Oct 12, 2009 at 02:07:42PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said: 
> > > No, it's not solved. Even if you have persistent names once you install,
> > > if you ever re-image, you're likely to get *different* persistent names;
> > > the first load will always be non-deterministic.
> > > 
> > > The only way around this would be to have some sort of screen like:
> > > 
> > >   Would you like your network devices to be enumerated by
> > > 
> > >   [ ] MAC address
> > >   [ ] PCI device order
> > >   [ ] Driver name
> > >   [ ] Other
> > 
> > [ ] PCI slot name
> > 
> > That's one that modern systems are now reporting, and should solve
> > Matt's problem as well, right?
> 
> ... maybe. On my laptop, the first 'slot' enumerated appears to be
> the cardbus bridge, before the on-board ethernet. And on the desktop
> next to me, the slot driver shows nothing.

On servers, where this matters (multiple ethernet pci devices), this
should all be present if the manufacturer wants it to be, as it is just
an ACPI table entry.

> > And I don't see how Matt's proposed patch helps resolve this type of
> > issue any better than what we currently have today, do you?
> 
> It allows multiple addressing schemes to be active at once, which
> can allow the admin to choose post-install without making an
> active choice at installation. This is an improvement, even if
> it doesn't solve the world.

But these different names can not be used by the networking stack, or in
scripts, as others have pointed out.  Which seems to be the big problem
here.

thanks,

greg k-h
Rob Townley Oct. 12, 2009, 6:35 p.m. UTC | #29
On Sun, Oct 11, 2009 at 10:00 PM, Greg KH <greg@kroah.com> wrote:
> On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
>> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
>> the add-in cards will come before the motherboard NICs.   i don't like it.
>
> Huh?  Have you used the MAC persistent rules?  If you add a new card,
> what does it pick for it?

I have an HP DL360 (two NICs) with a fibre-optic add-in NIC.  On a
fresh install, the add-in is eth0.  I didn't like it, but ran it for
years.

>
>> But please whatever is done, make sure ping and tracert still works when
>> telling it to use a ethX source interface:
>>
>> eth0 = 4.3.2.8, the default gateway is thru eth1.
>> ping -I eth0 208.67.222.222              FAILS
>> ping -I 4.3.2.8 208.67.222.222          WORKS
>> tracert -i eth0 -I 208.67.222.222        FAILS
>> tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
>> tracert -i eth0 208.67.222.222           FAILS
>> tracert -s 4.3.2.8 208.67.222.222      WORKS
>
> Again, is what we currently have broken?  I am confused as to what this
> is referring to.

Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
On a multinic, multigatewayed machine, passing ethX instead of the IP
address will give the false result: "Destination Host Unreachable"
when the machine's default gateway is reached thru the other nic.   In
the following example, the default gateway is thru eth1, not eth0.
Pay attention to the text between the '*****'.

# ping -c 1 -B -I eth0 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 eth0 *****: 56(84) bytes of data.
From 4.3.2.8 icmp_seq=1 Destination Host Unreachable

# ping -c 1 -B -I 4.3.2.8 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 *****: 56(84) bytes of data.
64 bytes from 208.67.222.222: icmp_seq=1 ttl=55 time=562 ms



>
> greg k-h
>
Matt Domsch Oct. 12, 2009, 6:44 p.m. UTC | #30
On Mon, Oct 12, 2009 at 01:35:25PM -0500, Rob Townley wrote:
> > Again, is what we currently have broken?  I am confused as to what this
> > is referring to.
> 
> Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
> On a multinic, multigatewayed machine, passing ethX instead of the IP
> address will give the false result: "Destination Host Unreachable"
> when the machine's default gateway is reached thru the other nic.   In
> the following example, the default gateway is thru eth1, not eth0.

Unrelated to this thread.  We're having a hard enough time making sure
this conversation accurately reflects the views and needs of everyone
involved.  Please let's not throw in another tangent.

Thanks,
Matt
Dan Williams Oct. 13, 2009, 6:02 p.m. UTC | #31
On Sat, 2009-10-10 at 14:06 -0700, Greg KH wrote:
> On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> > 
> > BTW, for our distro, we are looking into device renaming based on PCI slot
> > because that is what router OS's do. Customers expect if they replace the card
> > in slot 0, it will come back with the same name.  This is not what server
> > customers expect.
> 
> If your bios exposes the PCI slots to userspace (through the proper ACPI
> namespace), doing this type of naming should be trivial with some simple
> udev rules, no additional kernel infrastructure is needed.

By and large, the people that care most about persistent network device
names based on *location in the machine* are server users.  This allows
hotswap of cards or single-image-multiple-machine without needing
configuration changes, which is nice.

Those users can reasonably be expected to choose hardware whose BIOS
supports the ACPI tables that (mostly) guarantee to provide actual,
stable names for their hardware.  If there's even a 10% chance that on
consumer-level systems the names won't be stable on a given boot (and I
can't see how, without BIOS support, we can guarantee 100% stability)
then it's a worthless guarantee.

If the BIOS support exists, it is trivial to use udev to create the
correct naming mechanism for your machine, either using MAC address or
BIOS-provided slot naming.  No kernel patch is required.

If the BIOS support does not exist, you are not guaranteed a stable
naming mechanism except by MAC address, because the BIOS may randomly
change enumeration based on the time of day, or it may not.  A 90 or 95%
stability guarantee is not a guarantee at all.

Third, USB enumeration will always be unstable.  Thus we have an
unsolvable discrepancy in behavior between PCI and USB.

Is this correct?

Dan


Narendra K Oct. 13, 2009, 6:53 p.m. UTC | #32
>If the BIOS support exists, it is trivial to use udev to 
>create the correct naming mechanism for your machine, either 
>using MAC address or BIOS-provided slot naming.  No kernel 
>patch is required.
>

Yes, in case we want to rename only once.  MAC addresses or slot names do
provide persistent naming: they help retain whatever names were assigned
at install time, which is the first instantiation of the OS.  But these
names may not be what is expected (for example, the first on-board network
interface is expected to be "eth0", which is not always the case, and the
names might not reflect what is written on the chassis label, such as "Gb1"
and "Gb2"), which would cause unattended installs to break.  Also,
image-based deployments face problems when state such as the MAC address
is introduced.

With regards,
Narendra K

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b332eef..a2f23b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@ 
 #include <linux/workqueue.h>
 
 #include <linux/ethtool.h>
+#include <linux/cdev.h>
 #include <net/net_namespace.h>
 #include <net/dsa.h>
 #ifdef CONFIG_DCB
@@ -916,6 +917,9 @@  struct net_device
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+#ifdef CONFIG_NET_CDEV
+	struct cdev cdev;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..bdc5bd7 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -43,6 +43,16 @@  config COMPAT_NETLINK_MESSAGES
 	  Newly written code should NEVER need this option but do
 	  compat-independent messages instead!
 
+config NET_CDEV
+	bool "/dev files for network devices"
+	default y
+	help
+	  This option causes /dev entries to be created for each
+	  network device.  This allows the use of udev to create
+	  alternate device naming policies.
+
+	  If unsure, say Y.
+
 menu "Networking options"
 
 source "net/packet/Kconfig"
diff --git a/net/core/Makefile b/net/core/Makefile
index 796f46e..0b40d2c 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,4 +19,5 @@  obj-$(CONFIG_NET_DMA) += user_dma.o
 obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
+obj-$(CONFIG_NET_CDEV) += cdev.o
 
diff --git a/net/core/cdev.c b/net/core/cdev.c
new file mode 100644
index 0000000..1f36076
--- /dev/null
+++ b/net/core/cdev.c
@@ -0,0 +1,42 @@ 
+#include <linux/fs.h>
+#include <linux/cdev.h>
+#include <linux/netdevice.h>
+#include <linux/device.h>
+
+/* Used for network dynamic major number */
+static dev_t netdev_devt;
+
+static int netdev_cdev_open(struct inode *inode, struct file *filep)
+{
+	/* no operations on this device are implemented */
+	return -ENOSYS;
+}
+
+static const struct file_operations netdev_cdev_fops = {
+	.owner = THIS_MODULE,
+	.open = netdev_cdev_open,
+};
+
+void netdev_cdev_alloc(void)
+{
+	alloc_chrdev_region(&netdev_devt, 0, 1<<20, "net");
+}
+
+void netdev_cdev_init(struct net_device *dev)
+{
+	cdev_init(&dev->cdev, &netdev_cdev_fops);
+	cdev_add(&dev->cdev, MKDEV(MAJOR(netdev_devt), dev->ifindex), 1);
+
+}
+
+void netdev_cdev_del(struct net_device *dev)
+{
+	if (dev->cdev.dev)
+		cdev_del(&dev->cdev);
+}
+
+void netdev_cdev_kobj_init(struct device *dev, struct net_device *net)
+{
+	if (net->cdev.dev)
+		dev->devt = net->cdev.dev;
+}
diff --git a/net/core/cdev.h b/net/core/cdev.h
new file mode 100644
index 0000000..9cf5a90
--- /dev/null
+++ b/net/core/cdev.h
@@ -0,0 +1,13 @@ 
+#include <linux/netdevice.h>
+
+#ifdef CONFIG_NET_CDEV
+void netdev_cdev_alloc(void);
+void netdev_cdev_init(struct net_device *dev);
+void netdev_cdev_del(struct net_device *dev);
+void netdev_cdev_kobj_init(struct device *dev, struct net_device *net);
+#else
+static inline void netdev_cdev_alloc(void) {}
+static inline void netdev_cdev_init(struct net_device *dev) {}
+static inline void netdev_cdev_del(struct net_device *dev) {}
+static inline void netdev_cdev_kobj_init(struct device *dev, struct net_device *net) {}
+#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index a74c8fd..d771438 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@ 
 #include <trace/events/napi.h>
 
 #include "net-sysfs.h"
+#include "cdev.h"
 
 /* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8
@@ -4684,6 +4685,7 @@  static void rollback_registered(struct net_device *dev)
 
 	/* Remove entries from kobject tree */
 	netdev_unregister_kobject(dev);
+	netdev_cdev_del(dev);
 
 	synchronize_net();
 
@@ -4835,6 +4837,8 @@  int register_netdevice(struct net_device *dev)
 	if (dev->features & NETIF_F_SG)
 		dev->features |= NETIF_F_GSO;
 
+	netdev_cdev_init(dev);
+
 	netdev_initialize_kobject(dev);
 
 	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
@@ -4870,6 +4874,7 @@  out:
 	return ret;
 
 err_uninit:
+	netdev_cdev_del(dev);
 	if (dev->netdev_ops->ndo_uninit)
 		dev->netdev_ops->ndo_uninit(dev);
 	goto out;
@@ -5377,6 +5382,7 @@  int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	dev_addr_discard(dev);
 
 	netdev_unregister_kobject(dev);
+	netdev_cdev_del(dev);
 
 	/* Actually switch the network namespace */
 	dev_net_set(dev, net);
@@ -5393,6 +5399,8 @@  int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 			dev->iflink = dev->ifindex;
 	}
 
+	netdev_cdev_init(dev);
+
 	/* Fixup kobjects */
 	err = netdev_register_kobject(dev);
 	WARN_ON(err);
@@ -5626,6 +5634,8 @@  static int __init net_dev_init(void)
 
 	BUG_ON(!dev_boot_phase);
 
+	netdev_cdev_alloc();
+
 	if (dev_proc_init())
 		goto out;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 753c420..f4ee557 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -19,6 +19,7 @@ 
 #include <net/wext.h>
 
 #include "net-sysfs.h"
+#include "cdev.h"
 
 #ifdef CONFIG_SYSFS
 static const char fmt_hex[] = "%#x\n";
@@ -501,6 +502,14 @@  static void netdev_release(struct device *d)
 	kfree((char *)dev - dev->padded);
 }
 
+#ifdef CONFIG_NET_CDEV
+static char *netdev_devnode(struct device *d, mode_t *mode)
+{
+	struct net_device *dev = to_net_dev(d);
+	return kasprintf(GFP_KERNEL, "netdev/%s", dev->name);
+}
+#endif
+
 static struct class net_class = {
 	.name = "net",
 	.dev_release = netdev_release,
@@ -510,6 +519,9 @@  static struct class net_class = {
 #ifdef CONFIG_HOTPLUG
 	.dev_uevent = netdev_uevent,
 #endif
+#ifdef CONFIG_NET_CDEV
+	.devnode = netdev_devnode,
+#endif
 };
 
 /* Delete sysfs entries but hold kobject reference until after all
@@ -536,6 +548,7 @@  int netdev_register_kobject(struct net_device *net)
 	dev->class = &net_class;
 	dev->platform_data = net;
 	dev->groups = groups;
+	netdev_cdev_kobj_init(dev, net);
 
 	dev_set_name(dev, "%s", net->name);