[net-next,1/2] add iovnl netlink support

Message ID 20100419191807.10423.84600.stgit@savbu-pc100.cisco.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Scott Feldman April 19, 2010, 7:18 p.m. UTC
From: Scott Feldman <scofeldm@cisco.com>

IOV netlink (IOVNL) adds I/O Virtualization control support to a master
device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
design allows for the case where master and slave are the
same netdev interface.

One control setting example is MAC/VLAN settings for a VF.  Another
example control setting is a port-profile for a VF.  A port-profile is an
identifier that defines policy-based settings on the network port
backing the VF.  The network port settings examples are VLAN membership,
QoS settings, and L2 security settings, typical of a data center network.

This patch adds the iovnl interface definitions and an iovnl module.

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
---
 include/linux/iovnl.h     |  124 +++++++++++++++++++++
 include/linux/netdevice.h |    4 +
 include/linux/rtnetlink.h |    5 +
 include/net/iovnl.h       |   36 ++++++
 net/Kconfig               |    1 
 net/Makefile              |    3 +
 net/iovnl/Kconfig         |   10 ++
 net/iovnl/Makefile        |    1 
 net/iovnl/iovnl.c         |  260 +++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 444 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Arnd Bergmann April 20, 2010, 1:48 p.m. UTC | #1
On Monday 19 April 2010, Scott Feldman wrote:

> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
> design allows for the case where master and slave are the
> same netdev interface.

What is the reason for controlling the slave device through the master,
rather than talking to the slave directly? The kernel always knows
the master for each slave, so it seems to me that this information
is redundant.

Is this new interface only for the case that you have a switch integrated
in the NIC, or also for the case where you do an LLDP and EDP exchange
with an adjacent bridge and put the device into VEPA mode?

> One control setting example is MAC/VLAN settings for a VF.  Another
> example control setting is a port-profile for a VF.  A port-profile is an
> identifier that defines policy-based settings on the network port
> backing the VF.  The network port settings examples are VLAN membership,
> QoS settings, and L2 security settings, typical of a data center network.
> 
> This patch adds the iovnl interface definitions and an iovnl module.

How does this relate to the existing DCB netlink interface? My feeling
is that there is some overlap in how it would get used, and some parts
that are very distinct. In particular, I'd guess that you'd want to
be able to set DCB parameters for each VF, but not all DCB adapters
would support SR-IOV.

Did you consider making this code an extension to the DCB interface
instead of a separate one? What was the reason for your decision
to keep it separate?

Also, do you expect your interface to be supported by dcbd/lldpad,
or is there a good reason to create a new tool for iovnl?

> + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)

As mentioned above, why not drop one of these, and just pass the VF's IFNAME?

> + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> + *   (NLA_NUL_STRING)

How does the definition of the port profile get into the NIC's switch?
Is there any way to list the available port profiles?

> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)

Can you elaborate more on what these do? Who is the 'client' and the 'host'
in this case, and why do you need to identify them?

> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])

Just one mac address? What happens if we want to assign multiple mac
addresses to the VF later? Also, how is this defined specifically?
Will a SIOCSIFHWADDR with a different MAC address on the VF fail
later, or is this just the default value?

> + * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16)

Same here: Should you be able to set multiple MAC addresses, or
trunk mode? Can the VF override it?
Also, for the new multi-channel VEPA, I'd guess that you also need
to supply an 802.1ad S-VLAN ID.
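
For readers unfamiliar with how attributes such as IOV_ATTR_VF_IFNAME
(NLA_NUL_STRING) or IOV_ATTR_VLAN (NLA_U16) travel inside a netlink message,
here is a minimal sketch of the generic netlink attribute layout: a 4-byte
header (16-bit length, 16-bit type) followed by the payload, padded to a
4-byte boundary.  It is not code from the patch.  The EX_ATTR_* numbers and
the attr_put*() helpers are hypothetical names made up for illustration;
real code would use the kernel's nla_put() family or a userspace library
such as libnl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN 4u  /* u16 length + u16 type */

/* Hypothetical attribute numbers; the patch's real values are not shown here. */
enum { EX_ATTR_VF_IFNAME = 1, EX_ATTR_PORT_PROFILE = 2, EX_ATTR_VLAN = 3 };

/* Round a length up to the 4-byte alignment used for netlink attributes. */
static size_t attr_align(size_t len)
{
    return (len + 3) & ~(size_t)3;
}

/*
 * Append one attribute at offset 'off' in 'buf' (capacity 'cap').
 * Returns the offset just past the padded attribute, or 0 if it does not fit.
 */
static size_t attr_put(uint8_t *buf, size_t cap, size_t off,
                       uint16_t type, const void *data, uint16_t len)
{
    size_t total;
    uint16_t nla_len;

    assert(buf != NULL && data != NULL);
    if (len > UINT16_MAX - ATTR_HDRLEN)
        return 0;                          /* length field would overflow */
    total = attr_align(ATTR_HDRLEN + (size_t)len);
    if (off > cap || total > cap - off)
        return 0;

    nla_len = (uint16_t)(ATTR_HDRLEN + len);  /* header + payload, no padding */
    memset(buf + off, 0, total);              /* zero the padding bytes */
    memcpy(buf + off, &nla_len, sizeof(nla_len));   /* host byte order */
    memcpy(buf + off + 2, &type, sizeof(type));
    memcpy(buf + off + ATTR_HDRLEN, data, len);
    return off + total;
}

/* NUL-terminated string payload; the terminator is part of the payload. */
static size_t attr_put_string(uint8_t *buf, size_t cap, size_t off,
                              uint16_t type, const char *s)
{
    size_t n = strlen(s) + 1;

    if (n > UINT16_MAX)
        return 0;
    return attr_put(buf, cap, off, type, s, (uint16_t)n);
}

/* 16-bit payload, e.g. a VLAN ID. */
static size_t attr_put_u16(uint8_t *buf, size_t cap, size_t off,
                           uint16_t type, uint16_t value)
{
    return attr_put(buf, cap, off, type, &value, sizeof(value));
}
```

With this layout a message that carries only the VF's ifname and a VLAN ID
is two attributes back to back; dropping the PF ifname, as suggested above,
simply means one attribute fewer.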

	Arnd
Chris Wright April 20, 2010, 2:34 p.m. UTC | #2
* Arnd Bergmann (arnd@arndb.de) wrote:
> On Monday 19 April 2010, Scott Feldman wrote:
> 
> > IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> > device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
> > control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
> > design allows for the case where master and slave are the
> > same netdev interface.
> 
> What is the reason for controlling the slave device through the master,
> rather than talking to the slave directly? The kernel always knows
> the master for each slave, so it seems to me that this information
> is redundant.

Not all devices have this relationship explicit (i.e. not all are pure
sr-iov devices).  If there's always a way to discover the master from
the device, then I agree we only need the slave.

> Is this new interface only for the case that you have a switch integrated
> in the NIC, or also for the case where you do an LLDP and EDP exchange
> with an adjacent bridge and put the device into VEPA mode?

It should be useful for both.  That's part of the reason for using
netlink, a userspace daemon running the VDP state machine (like lldpad)
can listen for these messages and see a set_port_profile request when
the user starts up a VM.

> > One control setting example is MAC/VLAN settings for a VF.  Another
> > example control setting is a port-profile for a VF.  A port-profile is an
> > identifier that defines policy-based settings on the network port
> > backing the VF.  The network port settings examples are VLAN membership,
> > QoS settings, and L2 security settings, typical of a data center network.
> > 
> > This patch adds the iovnl interface definitions and an iovnl module.
> 
> How does this relate to the existing DCB netlink interface? My feeling
> is that there is some overlap in how it would get used, and some parts
> that are very distinct. In particular, I'd guess that you'd want to
> be able to set DCB parameters for each VF, but not all DCB adapters
> would support SR-IOV.
> 
> Did you consider making this code an extension to the DCB interface
> instead of a separate one? What was the reason for your decision
> to keep it separate?

Well, aside from the fact that DCB and VDP have some low level
similarities in the PDU and they are both communication between
the host and the switch, they are doing different things.

> Also, do you expect your interface to be supported by dcbd/lldpad,
> or is there a good reason to create a new tool for iovnl?

lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
will send as well.

> > + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
> 
> As mentioned above, why not drop one of these, and just pass the VF's IFNAME?
> 
> > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > + *   (NLA_NUL_STRING)
> 
> How does the definition of the port profile get into the NIC's switch?
> Is there any way to list the available port profiles?

The port profile is a concept external to the NIC's switch.  It's a value
that exists in the external physical layer 2 switching infrastructure.
So an admin knows this value and is informing the adjacent switch that a
new virtual interface is coming up and needs some particular port profile.

> > + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> > + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> 
> Can you elaborate more on what these do? Who is the 'client' and the 'host'
> in this case, and why do you need to identify them?
> 
> > + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> 
> Just one mac address? What happens if we want to assign multiple mac
> addresses to the VF later? Also, how is this defined specifically?
> Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> later, or is this just the default value?
> 
> > + * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16)
> 
> Same here: Should you be able to set multiple MAC addresses, or
> trunk mode? Can the VF override it?
> Also, for the new multi-channel VEPA, I'd guess that you also need
> to supply an 802.1ad S-VLAN ID.

Something like set_port_profile() would initiate the negotiation for the
s-vlan id for a particular channel, not sure it's needed as part of the
netlink interface or not.

thanks,
-chris
Arnd Bergmann April 20, 2010, 2:57 p.m. UTC | #3
On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Monday 19 April 2010, Scott Feldman wrote:
> > 
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
> 
> Not all devices have this relationship explicit (i.e. not all are pure
> sr-iov devices).  If there's always a way to discover the master from
> the device, then I agree we only need the slave.

Hmm, is there an actual example of a card where the relationship is not
known to the kernel?

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> 
> It should be useful for both.  That's part of the reason for using
> netlink, a userspace daemon running the VDP state machine (like lldpad)
> can listen for these messages and see a set_port_profile request when
> the user starts up a VM.

After thinking some more about this case, I now believe we should do
it the other way around, and have lldpad in control of this interface
from the user space side, and letting user programs (lldptool, libvirt,
...) talk to lldpad in order to set it up.
 
> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
> 
> lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
> will send as well.

Not sure. We need lldpad to do this exchange for the case of VEPA with
VDP, so always using lldpad would let us unify the user interface for
both cases. We can of course have iproute2 talk to lldpad, in the
same way that libvirt does.

> > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > + *   (NLA_NUL_STRING)
> > 
> > How does the definition of the port profile get into the NIC's switch?
> > Is there any way to list the available port profiles?
> 
> The port profile is a concept external to the NIC's switch.  It's a value
> that exists in the external physical layer 2 switching infrastructure.
> So an admin knows this value and is informing the adjacent switch that a
> new virtual interface is coming up and needs some particular port profile.

But that's only the case if the NIC itself is in VEPA mode. If that
were the case, there would be no need for a kernel interface at all,
because then we could just drive the port profile selection from user
space.

The proposed interface only seems to make sense if you use it to
configure the NIC itself! Why should it care about the port profile
otherwise?

> > Same here: Should you be able to set multiple MAC addresses, or
> > trunk mode? Can the VF override it?
> > Also, for the new multi-channel VEPA, I'd guess that you also need
> > to supply an 802.1ad S-VLAN ID.
> 
> Something like set_port_profile() would initiate the negotiation for the
> s-vlan id for a particular channel, not sure it's needed as part of the
> netlink interface or not.

Well, you have to set up the s-vlan ID in order to have something to
set the port profile in.

	Arnd
Chris Wright April 20, 2010, 3:22 p.m. UTC | #4
* Arnd Bergmann (arnd@arndb.de) wrote:
> On Tuesday 20 April 2010, Chris Wright wrote:
> > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > On Monday 19 April 2010, Scott Feldman wrote:
> > > 
> > > What is the reason for controlling the slave device through the master,
> > > rather than talking to the slave directly? The kernel always knows
> > > the master for each slave, so it seems to me that this information
> > > is redundant.
> > 
> > Not all devices have this relationship explicit (i.e. not all are pure
> > sr-iov devices).  If there's always a way to discover the master from
> > the device, then I agree we only need the slave.
> 
> Hmm, is there an actual example of a card where the relationship is not
> known to the kernel?
> 
> > > Is this new interface only for the case that you have a switch integrated
> > > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > > with an adjacent bridge and put the device into VEPA mode?
> > 
> > It should be useful for both.  That's part of the reason for using
> > netlink, a userspace daemon running the VDP state machine (like lldpad)
> > can listen for these messages and see a set_port_profile request when
> > the user starts up a VM.
> 
> After thinking some more about this case, I now believe we should do
> it the other way around, and have lldpad in control of this interface
> from the user space side, and letting user programs (lldptool, libvirt,
> ...) talk to lldpad in order to set it up.

lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
so this seems backwards.

> > > Also, do you expect your interface to be supported by dcbd/lldpad,
> > > or is there a good reason to create a new tool for iovnl?
> > 
> > lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
> > will send as well.
> 
> Not sure. We need lldpad to do this exchange for the case of VEPA with
> VDP, so always using lldpad would let us unify the user interface for
> both cases. We can of course have iproute2 talk to lldpad, in the
> same way that libvirt does.
> 
> > > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > > + *   (NLA_NUL_STRING)
> > > 
> > > How does the definition of the port profile get into the NIC's switch?
> > > Is there any way to list the available port profiles?
> > 
> > The port profile is a concept external to the NIC's switch.  It's a value
> > that exists in the external physical layer 2 switching infrastructure.
> > So an admin knows this value and is informing the adjacent switch that a
> > new virtual interface is coming up and needs some particular port profile.
> 
> But that's only the case if the NIC itself is in VEPA mode. If that
> were the case, there would be no need for a kernel interface at all,
> because then we could just drive the port profile selection from user
> space.
> 
> The proposed interface only seems to make sense if you use it to
> configure the NIC itself! Why should it care about the port profile
> otherwise?

In the case of devices that can do adjacent switch negotiations directly.

> > > Same here: Should you be able to set multiple MAC addresses, or
> > > trunk mode? Can the VF override it?
> > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > to supply an 802.1ad S-VLAN ID.
> > 
> > Something like set_port_profile() would initiate the negotiation for the
> > s-vlan id for a particular channel, not sure it's needed as part of the
> > netlink interface or not.
> 
> Well, you have to set up the s-vlan ID in order to have something to
> set the port profile in.

Right, it depends on whether they use the port profile to establish the
channel and negotiate the s-vlan ID.  I don't recall the order there.

thanks,
-chris
Arnd Bergmann April 20, 2010, 4:19 p.m. UTC | #5
On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Chris Wright wrote:
> >
> > After thinking some more about this case, I now believe we should do
> > it the other way around, and have lldpad in control of this interface
> > from the user space side, and letting user programs (lldptool, libvirt,
> > ...) talk to lldpad in order to set it up.
> 
> lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
> so this seems backwards.

Well, that part is still the matter of this discussion, as far as I can tell ;-)

> > But that's only the case if the NIC itself is in VEPA mode. If that
> > were the case, there would be no need for a kernel interface at all,
> > because then we could just drive the port profile selection from user
> > space.
> > 
> > The proposed interface only seems to make sense if you use it to
> > configure the NIC itself! Why should it care about the port profile
> > otherwise?
> 
> In the case of devices that can do adjacent switch negotiations directly.

I thought the idea to deal with those devices was to beat sense into
the respective developers until they do the negotiation in software 8-)

> > > > Same here: Should you be able to set multiple MAC addresses, or
> > > > trunk mode? Can the VF override it?
> > > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > > to supply an 802.1ad S-VLAN ID.
> > > 
> > > Something like set_port_profile() would initiate the negotiation for the
> > > s-vlan id for a particular channel, not sure it's needed as part of the
> > > netlink interface or not.
> > 
> > Well, you have to set up the s-vlan ID in order to have something to
> > set the port profile in.
> 
> Right, it depends on whether they use the port profile to establish the
> channel and negotiate the s-vlan ID.  I don't recall the order there.

I'm pretty sure that setting up the channel (for 802.1Qbg) is done before
any port profile comes in.

	Arnd
Scott Feldman April 20, 2010, 7:56 p.m. UTC | #6
On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Monday 19 April 2010, Scott Feldman wrote:
> 
>> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
>> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
>> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
>> design allows for the case where master and slave are the
>> same netdev interface.
> 
> What is the reason for controlling the slave device through the master,
> rather than talking to the slave directly? The kernel always knows
> the master for each slave, so it seems to me that this information
> is redundant.

The interface would allow talking to the slave directly. In fact, that's the
example with enic port-profile in patch 2/2.  But, it would be nice not to
rule out the case where the master proxies slave control and the master is
exclusively controlled by the hypervisor.
 
> Is this new interface only for the case that you have a switch integrated
> in the NIC, or also for the case where you do an LLDP and EDP exchange
> with an adjacent bridge and put the device into VEPA mode?

All of the above.  Basing this on netlink gives us flexibility to work with
user-space mgmt tools or directly with kernel netdev as in the enic case.
I'm not trying to make assumptions about where (user space or kernel) the
netlink msg is sourced or sunk, or by which entity.
 
>> One control setting example is MAC/VLAN settings for a VF.  Another
>> example control setting is a port-profile for a VF.  A port-profile is an
>> identifier that defines policy-based settings on the network port
>> backing the VF.  The network port settings examples are VLAN membership,
>> QoS settings, and L2 security settings, typical of a data center network.
>> 
>> This patch adds the iovnl interface definitions and an iovnl module.
> 
> How does this relate to the existing DCB netlink interface? My feeling
> is that there is some overlap in how it would get used, and some parts
> that are very distinct. In particular, I'd guess that you'd want to
> be able to set DCB parameters for each VF, but not all DCB adapters
> would support SR-IOV.
>
> Did you consider making this code an extension to the DCB interface
> instead of a separate one? What was the reason for your decision
> to keep it separate?

Considered it but DCB interface is well defined for DCB and it didn't seem
right gluing on interfaces not specified within DCB.  I agree that there is
some overlap in the sense that both interfaces are used to configure a netdev
with some properties interesting for the data center, but the DCB interface
is for local setting of the properties on the host whereas iovnl is about
pushing the setting of those properties to the network for policy-based
control.
 
> Also, do you expect your interface to be supported by dcbd/lldpad,
> or is there a good reason to create a new tool for iovnl?

Lldpad supporting this interface would seem right, for those cases where
lldpad is responsible for configuring the netdev.
 
>> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
>> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> 
> Can you elaborate more on what these do? Who is the 'client' and the 'host'
> in this case, and why do you need to identify them?

Those are optional and are useful, for example, to a network mgmt tool for
presenting a view such as:

    - blade 1/2                     // known by host uuid
        - vm-rhel5-eth0             // client name
            - port-profile: xyz

Something like that.
 
>> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> 
> Just one mac address? What happens if we want to assign multiple mac
> addresses to the VF later? Also, how is this defined specifically?
> Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> later, or is this just the default value?

Depends on how the VF wants to handle this.  For our use-case with enic we
only need the port-profile op so I'm not sure what the best design is for
mac+vlan on a VF.  Looking for advice from folks like yourself.  If it's not
needed, let's scratch it.

-scott

Scott Feldman April 20, 2010, 8:26 p.m. UTC | #7
On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> But that's only the case if the NIC itself is in VEPA mode. If that
>>> were the case, there would be no need for a kernel interface at all,
>>> because then we could just drive the port profile selection from user
>>> space.
>>> 
>>> The proposed interface only seems to make sense if you use it to
>>> configure the NIC itself! Why should it care about the port profile
>>> otherwise?
>> 
>> In the case of devices that can do adjacent switch negotiations directly.
> 
> I thought the idea to deal with those devices was to beat sense into
> the respective developers until they do the negotiation in software 8-)

When the device can do the negotiation directly with the switch, why does it
make sense to bypass that and use software on the host?  I don't think we'd
want to give up on link speed/duplex auto-negotiation and punt those settings
back to the user/host like in the old days.

-scott

Arnd Bergmann April 21, 2010, 11:26 a.m. UTC | #8
On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> > On Monday 19 April 2010, Scott Feldman wrote:
> > 
> >> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> >> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
> >> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
> >> design allows for the case where master and slave are the
> >> same netdev interface.
> > 
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
> 
> The interface would allow talking to the slave directly. In fact, that's the
> example with enic port-profile in patch 2/2.  But, it would be nice not to
> rule out the case where the master proxies slave control and the master is
> exclusively controlled by the hypervisor.

Not sure I understand. Do you mean the case where this code runs in
the hypervisor (e.g. KVM), or a different scenario with the setup being
done in a guest driver?

So far, I have assumed that we would always do the setup on the host
side, which always has access to both the master, and a slave proxy.
In particular, your interface requires access to the slave AFAICT,
because otherwise the VF IFNAME does not have any significance.

Take the case where you use network namespaces and put the VF into
a separate namespace. With your interface, the PF is still in the
root namespace, but passing both interface names in this interface
won't help you because they are never visible in the same namespace
(e.g. both might be named eth0 in their respective containers).

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> 
> All of the above.  Basing this on netlink gives us flexibility to work with
> user-space mgmt tools or directly with kernel netdev as in the enic case.
> I'm not trying to make assumptions about where (user space or kernel) the
> netlink msg is sourced or sunk, or by which entity.

ok.

> > Did you consider making this code an extension to the DCB interface
> > instead of a separate one? What was the reason for your decision
> > to keep it separate?
> 
> Considered it but DCB interface is well defined for DCB and it didn't seem
> right gluing on interfaces not specified within DCB.  I agree that there is
> some overlap in the sense that both interfaces are used to configure a netdev
> with some properties interesting for the data center, but the DCB interface
> is for local setting of the properties on the host whereas iovnl is about
> pushing the setting of those properties to the network for policy-based
> control.
>
> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
> 
> Lldpad supporting this interface would seem right, for those cases where
> lldpad is responsible for configuring the netdev.

I believe we meant different things here, because I misunderstood the
intention of the code. My question was whether lldpad would send the
netlink messages to iovnl, but from what you and Chris write, the
real idea was that both lldpad and kernel/iovnl can receive the
same messages, right?
 
> >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> > 
> > Can you elaborate more on what these do? Who is the 'client' and the 'host'
> > in this case, and why do you need to identify them?
> 
> Those are optional and are useful, for example, to a network mgmt tool for
> presenting a view such as:
> 
>     - blade 1/2                     // known by host uuid
>         - vm-rhel5-eth0             // client name
>             - port-profile: xyz
> 
> Something like that.

Hmm, but how do they get from the device driver to the network
management tool then? Also, these are similar to the attributes
that are passed in the 802.1Qbg VDP protocol, but not compatible.
If the idea is to use the same netlink protocol for both your internal
representation and for the standards-based protocol, I think we should
make them compatible.

Instead of a string identifying the port profile, this needs to pass
a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
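
As a rough illustration of what such a four-byte field could look like in
place of the port-profile string, here is a minimal sketch that packs a
24-bit VSI type and an 8-bit VSI manager ID into four bytes and back.  The
byte order (manager ID first, then the type in big-endian order) is an
assumption made for the example, not something taken from the patch, and
struct vsi_ident, vsi_pack() and vsi_unpack() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the identifier. */
struct vsi_ident {
    uint8_t  mgr_id;   /* VSI manager ID, 1 byte */
    uint32_t type_id;  /* VSI type, only the low 24 bits are valid */
};

/*
 * Pack into 4 bytes: out[0] = manager ID, out[1..3] = type, most
 * significant byte first (assumed order).  Returns 0 on success,
 * -1 if the type does not fit in 24 bits.
 */
static int vsi_pack(const struct vsi_ident *v, uint8_t out[4])
{
    assert(v != NULL && out != NULL);
    if (v->type_id > 0xFFFFFFu)
        return -1;
    out[0] = v->mgr_id;
    out[1] = (uint8_t)(v->type_id >> 16);
    out[2] = (uint8_t)(v->type_id >> 8);
    out[3] = (uint8_t)(v->type_id);
    return 0;
}

/* Inverse of vsi_pack(). */
static void vsi_unpack(const uint8_t in[4], struct vsi_ident *v)
{
    assert(in != NULL && v != NULL);
    v->mgr_id  = in[0];
    v->type_id = ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |
                  (uint32_t)in[3];
}
```

A fixed-size binary attribute like this could be carried next to, or
instead of, the NLA_NUL_STRING port-profile name.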

There is also a UUID in VDP, but it identifies the guest, not the host,
so this is really confusing.

VDP also needs a list of MAC addresses and VLAN IDs (normally only
one of each), but that would be separate from what you tell the adapter,
see below: 

> >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > 
> > Just one mac address? What happens if we want to assign multiple mac
> > addresses to the VF later? Also, how is this defined specifically?
> > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > later, or is this just the default value?
> 
> Depends on how the VF wants to handle this.  For our use-case with enic we
> only need the port-profile op so I'm not sure what the best design is for
> mac+vlan on a VF.  Looking for advice from folks like yourself.  If it's not
> needed, let's scratch it.

In order to make VEPA work, it's absolutely required to impose a hard limit
on what MAC+VLAN IDs are visible to the VF, because the switch identifies
the guest by those and forwards any frames to/from that address according
to the VSI type.
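
To make the "hard limit" concrete, here is a minimal sketch of a per-VF
allow-list of MAC+VLAN pairs that the host side would program and that
frames to or from the VF would be checked against.  It is purely
illustrative: struct vf_filter, vf_filter_add(), vf_filter_permits() and
the table size of 4 are hypothetical, and a real NIC would enforce this in
hardware or firmware rather than in a C loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN        6
#define VF_MAX_FILTERS 4   /* arbitrary size chosen for the sketch */

struct mac_vlan {
    uint8_t  mac[MAC_LEN];
    uint16_t vlan;          /* 802.1Q VLAN ID, 1..4094 */
};

struct vf_filter {
    struct mac_vlan allowed[VF_MAX_FILTERS];
    size_t count;
};

/* Add one permitted MAC+VLAN pair; fails if the table is full or the ID is invalid. */
static bool vf_filter_add(struct vf_filter *f,
                          const uint8_t mac[MAC_LEN], uint16_t vlan)
{
    assert(f != NULL && mac != NULL);
    if (f->count >= VF_MAX_FILTERS || vlan == 0 || vlan > 4094)
        return false;
    memcpy(f->allowed[f->count].mac, mac, MAC_LEN);
    f->allowed[f->count].vlan = vlan;
    f->count++;
    return true;
}

/* True only if this exact MAC+VLAN pair was configured for the VF. */
static bool vf_filter_permits(const struct vf_filter *f,
                              const uint8_t mac[MAC_LEN], uint16_t vlan)
{
    size_t i;

    assert(f != NULL && mac != NULL);
    for (i = 0; i < f->count; i++) {
        if (f->allowed[i].vlan == vlan &&
            memcmp(f->allowed[i].mac, mac, MAC_LEN) == 0)
            return true;
    }
    return false;
}
```

In this model a later attempt by the VF to use a different MAC address
would simply not match any entry.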

However, I feel that we should strictly separate the steps of configuring
the adapter from talking to the switch. When we do the VDP association
in user land, we still need to set up the VLAN and MAC configuration for
the VF through a kernel interface. If we ignore the port profile stuff
for a moment, your netlink interface looks like a good fit for that.

Since it seems what you really want to do is to do the exchange with the
switch from here, maybe the hardware configuration part should be moved
to the DCB interface?

	Arnd
Bijay Singh April 21, 2010, 11:42 a.m. UTC | #9
I am noticing strange and troublesome behavior with tcp md5 checksums. Some packets are going out with invalid md5 checksums.

The only thing that changes between the packets with valid and invalid md5 checksums is the ack number. Most packets have correct md5 checksums, but a few (on the order of 1 in 1000) have md5 checksum errors.

I am on 2.6.26 and I know that there have been significant changes in this area since that version. I have gone through them, but none of the issues they address seems to be the cause of this problem.

I have the scatter/gather and tcp segmentation disabled in the card.

The packet captures are attached.

Bijay
Arnd Bergmann April 21, 2010, 1:17 p.m. UTC | #10
On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> >> In the case of devices that can do adjacent switch negotiations directly.
> > 
> > I thought the idea to deal with those devices was to beat sense into
> > the respective developers until they do the negotiation in software 8-)
> 
> When the device can do the negotiation directly with the switch, why does it
> make sense to bypass that and use software on the host?  I don't think we'd
> want to give up on link speed/duplex auto-negotiation and punt those settings
> back to the user/host like in the old days.

For the link negotiation, the card is the right place because it's necessary
to get the link working before the OS can talk to the switch.
For VDP, that's different because the hypervisor needs to talk to the switch
before the guest can communicate, so there is no interdependency.

More importantly, the card cannot possibly do the protocol by itself,
because the information that gets exchanged is specific to the hypervisor and
the guest, not to the hardware. What you have implemented is another protocol
between the hypervisor and the NIC that exchanges the exact same data that
then gets sent to the switch. We already need to have an implementation that
sends this data to the switch from user space for all cards that don't do
it in firmware, so doing an alternative path in the adapter really creates
more work for the users, and means that when we fix bugs or add features
to the common code, you don't get them ;-).

	Arnd

Chris Wright April 21, 2010, 4:18 p.m. UTC | #11
* Arnd Bergmann (arnd@arndb.de) wrote:
> On Tuesday 20 April 2010, Scott Feldman wrote:
> > On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > > Also, do you expect your interface to be supported by dcbd/lldpad,
> > > or is there a good reason to create a new tool for iovnl?
> > 
> > Lldpad supporting this interface would seem right, for those cases where
> > lldpad is responsible for configuring the netdev.
> 
> I believe we meant different things here, because I misunderstood the
> intention of the code. My question was whether lldpad would send the
> netlink messages to iovnl, but from what you and Chris write, the
> real idea was that both lldpad and kernel/iovnl can receive the
> same messages, right?

Correct.  An example set of steps for initiating host to switch
negotiation and subsequently launching a VM would be (expect user below
to be a mgmt tool like libvirt):

1) user sends netlink message w/ relevant host interface and port profile id
2) recipient picks this up (enic, lldpad, whatever)
3) recipient does negotiation w/ adjacent switch
4) user creates macvtap associated w/ relevant host interface
5) user launches guest
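
For illustration, step 1 might look roughly like the following from user space. This is only a sketch under stated assumptions: the numeric values are copied from the enums in this proposed patch (they are not in any released header), put_str() and build_set_port_profile() are hypothetical helper names, and actually sending the buffer over an AF_NETLINK/NETLINK_ROUTE socket is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Values taken from the enums in this patch; not present in mainline headers. */
#define RTM_SETIOV               83
#define IOV_CMD_SET_PORT_PROFILE 1
#define IOV_ATTR_IFNAME          1
#define IOV_ATTR_PORT_PROFILE    3

struct iovnlmsg {
	uint8_t  family;
	uint8_t  cmd;
	uint16_t pad;
};

/* Append one NUL-terminated string attribute; 0 on success, -1 if it does not fit. */
static int put_str(struct nlmsghdr *nlh, size_t cap, uint16_t type, const char *s)
{
	size_t len = NLA_HDRLEN + strlen(s) + 1;
	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
	struct nlattr *nla;

	if (len > UINT16_MAX || off + NLA_ALIGN(len) > cap)
		return -1;
	nla = (struct nlattr *)((char *)nlh + off);
	nla->nla_type = type;
	nla->nla_len = (uint16_t)len;
	memcpy((char *)nla + NLA_HDRLEN, s, strlen(s) + 1);
	nlh->nlmsg_len = (uint32_t)(off + NLA_ALIGN(len));
	return 0;
}

/*
 * Build the request for step 1: host interface plus port profile id.
 * Returns the total message length, or -1 if buf is too small.
 * buf must be 4-byte aligned.
 */
static int build_set_port_profile(void *buf, size_t cap,
				  const char *ifname, const char *profile)
{
	struct nlmsghdr *nlh = buf;
	struct iovnlmsg *iov;

	assert((uintptr_t)buf % NLMSG_ALIGNTO == 0);
	if (cap < (size_t)NLMSG_LENGTH(sizeof(*iov)))
		return -1;
	memset(buf, 0, cap);
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*iov));
	nlh->nlmsg_type = RTM_SETIOV;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	iov = NLMSG_DATA(nlh);
	iov->family = AF_UNSPEC;
	iov->cmd = IOV_CMD_SET_PORT_PROFILE;

	if (put_str(nlh, cap, IOV_ATTR_IFNAME, ifname) ||
	    put_str(nlh, cap, IOV_ATTR_PORT_PROFILE, profile))
		return -1;
	return (int)nlh->nlmsg_len;
}
```

Whoever picks the message up in step 2 (enic, lldpad, whatever) would parse the same two attributes, so the sender does not need to know which recipient is present.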

> > >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> > >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> > > 
> > > Can you elaborate more on what these do? Who is the 'client' and the 'host'
> > > in this case, and why do you need to identify them?
> > 
> > Those are optional and useful, for example, by the network mgmt tool for
> > presenting a view such as:
> > 
> > >     - blade 1/2                     // known by host uuid
> >         - vm-rhel5-eth0             // client name
> >             - port-profile: xyz
> > 
> > Something like that.
> 
> > Hmm, but how do they get from the device driver to the network
> > management tool then? Also, these are similar to the attributes
> > that are passed in the 802.1Qbg VDP protocol, but not compatible.
> > If the idea is to use the same netlink protocol for both your internal
> representation and for the standard based protocol, I think we should
> make them compatible.

Indeed, that's my expectation.

> Instead of a string identifying the port profile, this needs to pass
> a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).

I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
at least as large as enic expects.
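
A rough sketch of what such a variable-length identifier could look like, covering both the 4-byte VDP form and a name as enic uses. Everything here is an assumption for illustration: the struct and function names are hypothetical, the 32-byte maximum simply mirrors the string limit in the patch's attribute policy, and the byte order (manager ID first, then the 3-byte VSI type, most significant byte first) is a guess rather than something taken from the VDP draft.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VDP_PROFILE_LEN 4   /* 1-byte VSI manager ID + 3-byte VSI type */
#define PROFILE_MAXLEN  32  /* assumed: same limit as the patch's string policy */

/* Hypothetical container: opaque bytes plus a length, no NUL terminator. */
struct port_profile_id {
	uint8_t len;
	uint8_t data[PROFILE_MAXLEN];
};

/* VDP form: data[0] = manager ID, data[1..3] = VSI type (assumed byte order). */
static int profile_from_vdp(struct port_profile_id *p, uint8_t mgr_id,
			    uint32_t vsi_type)
{
	assert(p != NULL);
	if (vsi_type > 0xFFFFFFu)   /* VSI type must fit in 3 bytes */
		return -1;
	memset(p, 0, sizeof(*p));
	p->len = VDP_PROFILE_LEN;
	p->data[0] = mgr_id;
	p->data[1] = (uint8_t)(vsi_type >> 16);
	p->data[2] = (uint8_t)(vsi_type >> 8);
	p->data[3] = (uint8_t)vsi_type;
	return 0;
}

/* Name form: the port-profile name copied in as raw bytes. */
static int profile_from_name(struct port_profile_id *p, const char *name)
{
	size_t n = strlen(name);

	assert(p != NULL);
	if (n == 0 || n > PROFILE_MAXLEN)
		return -1;
	memset(p, 0, sizeof(*p));
	p->len = (uint8_t)n;
	memcpy(p->data, name, n);
	return 0;
}
```

With one opaque byte array, the netlink attribute stays the same for both cases and only the recipient needs to know how to interpret the bytes.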

> There is also a UUID in VDP, but it identifies the guest, not the host,
> so this is really confusing.

Yes, I had same confusion.  I expected guest, enic wants to send host as
well.

> VDP also needs a list of MAC addresses and VLAN IDs (normally only
> one of each), but that would be separate from what you tell the adapter,
> see below: 
> 
> > >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > > 
> > > Just one mac address? What happens if we want to assign multiple mac
> > > addresses to the VF later? Also, how is this defined specifically?
> > > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > > later, or is this just the default value?
> > 
> > Depends on how the VF wants to handle this.  For our use-case with enic we
> > only need the port-profile op so I'm not sure what the best design is for
> > > mac+vlan on a VF.  Looking for advice from folks like yourself.  If it's not
> > needed, let's scratch it.
> 
> In order to make VEPA work, it's absolutely required to impose a hard limit
> on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> the guest by those and forwards any frames to/from that address according
> to the VSI type.
> 
> However, I feel that we should strictly separate the steps of configuring
> the adapter from talking to the switch. When we do the VDP association
> in user land, we still need to set up the VLAN and MAC configuration for
> the VF through a kernel interface. If we ignore the port profile stuff
> for a moment, your netlink interface looks like a good fit for that.
> 
> Since it seems what you really want to do is to do the exchange with the
> switch from here, maybe the hardware configuration part should be moved
> to the DCB interface?

I suppose this would work  (although it's a bit odd being out of scope
of DCB spec).  I don't expect mgmt app to care about the implementation
specifics of an adapter, so it will always send this and iovnl message
too.  All as part of same setup.

thanks,
-chris
Scott Feldman April 21, 2010, 4:28 p.m. UTC | #12
On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Tuesday 20 April 2010, Scott Feldman wrote:
>> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>> 
>>>> In the case of devices that can do adjacent switch negotiations directly.
>>> 
>>> I thought the idea to deal with those devices was to beat sense into
>>> the respective developers until they do the negotiation in software 8-)
>> 
>> When the device can do the negotiation directly with the switch, why does it
>> make sense to bypass that and use software on the host?  I don't think we'd
>> want to give up on link speed/duplex auto-negotiation and punt those settings
>> back to the user/host like in the old days.
> 
> For the link negotiation, the card is the right place because it's necessary
> to get the link working before the OS can talk to the switch.
> For VDP, that's different because the hypervisor needs to talk to the switch
> before the guest can communicate, so there is no interdependency.
> 
> More importantly, the card cannot possibly do the protocol by itself,
> because the information that gets exchanged is specific to the hypervisor and
> the guest, not to the hardware. What you have implemented is another protocol
> between the hypervisor and the NIC that exchanges the exact same data that
> then gets sent to the switch. We already need to have an implementation that
> sends this data to the switch from user space for all cards that don't do
> it in firmware, so doing an alternative path in the adapter really creates
> more work for the users, and means that when we fix bugs or add features
> to the common code, you don't get them ;-).

But the point of iovnl was to provide a single mechanism for both types of
adapters (w/ or w/o firmware assist) to exchange this data with the switch,
therefore making the difference in the adapters transparent to the user.  So
I'm missing your point about more work for the users.

-scott

Arnd Bergmann April 21, 2010, 5:52 p.m. UTC | #13
On Wednesday 21 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Scott Feldman wrote:
> > I believe we meant different things here, because I misunderstood the
> > intention of the code. My question was whether lldpad would send the
> > netlink messages to iovnl, but from what you and Chris write, the
> > real idea was that both lldpad and kernel/iovnl can receive the
> > same messages, right?
> 
> Correct.  An example set of steps for initiating host to switch
> negotiation and subsequently launching a VM would be (expect user below
> to be a mgmt tool like libvirt):
> 
> 1) user sends netlink message w/ relevant host interface and port profile id
> 2) recipient picks this up (enic, lldpad, whatever)
> 3) recipient does negotiation w/ adjacent switch
> 4) user creates macvtap associated w/ relevant host interface
> 5) user launches guest

I'd move point 4 before 1, but otherwise it makes sense and it would still
work either way.

> > > If the idea is to use the same netlink protocol for both your internal
> > representation and for the standard based protocol, I think we should
> > make them compatible.
> 
> Indeed, that's my expectation.
>
> [...]
>
> > Instead of a string identifying the port profile, this needs to pass
> > a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
> 
> I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
> at least as large as enic expects.
> 
> > There is also a UUID in VDP, but it identifies the guest, not the host,
> > so this is really confusing.
> 
> Yes, I had same confusion.  I expected guest, enic wants to send host as
> well.

So given all these differences, how compatible can we make them?

With the current definition, most of the fields are at least slightly
different. The differences seem to stem mostly from the fact that
Cisco switches use a nonstandard protocol, rather than the difference
between the firmware and userland implementations of the protocol,
and of course we shouldn't confuse the two.

> > In order to make VEPA work, it's absolutely required to impose a hard limit
> > on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> > the guest by those and forwards any frames to/from that address according
> > to the VSI type.
> > 
> > However, I feel that we should strictly separate the steps of configuring
> > the adapter from talking to the switch. When we do the VDP association
> > in user land, we still need to set up the VLAN and MAC configuration for
> > the VF through a kernel interface. If we ignore the port profile stuff
> > for a moment, your netlink interface looks like a good fit for that.
> > 
> > Since it seems what you really want to do is to do the exchange with the
> > switch from here, maybe the hardware configuration part should be moved
> > > to the DCB interface?
> 
> I suppose this would work  (although it's a bit odd being out of scope
> of DCB spec).

It could be anywhere, it doesn't have to be the DCB interface, but could
be anything ranging from ethtool to iplink I guess. And we should define
it in a way that works for any SR-IOV card, whether it's using Cisco's
protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
none of the above and just provides an internal switch like all the
existing NICs.

> I don't expect mgmt app to care about the implementation
> specifics of an adapter, so it will always send this and iovnl message
> too.  All as part of same setup.

Why? I really see these things as separate. Obviously a management
tool like libvirt would need to do both these things eventually, but
each of them has multiple options that can be combined in various
ways:

1. Setting up the slave device
 a) create an SR-IOV VF to assign to a guest
 b) create a macvtap device to pass to qemu or vhost
 c) attach a tap device to a bridge
 d) create a macvlan device and put it into a container
 e) create a virtual interface for a VMDq adapter

2. Registering the slave with the switch
 a) use Cisco protocol in enic firmware (see patch 2/2)
 b) use standard VDP in lldpad
 c) use reverse-engineered Cisco protocol in some user tool for
    non-enic adapters.
 d) use standard VDP in firmware (hopefully this never happens)
 e) do nothing at all (as we do today)

Some of the cases can be treated identically, e.g. 1d) and 1e), or
2a) and 2c), but in general the management app needs to have some
idea of which combination it's going to set up.

	Arnd
Arnd Bergmann April 21, 2010, 6:04 p.m. UTC | #14
On Wednesday 21 April 2010, Scott Feldman wrote:
> On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > More importantly, the card cannot possibly do the protocol by itself,
> > because the information that gets exchanged is specific to the hypervisor and
> > the guest, not to the hardware. What you have implemented is another protocol
> > between the hypervisor and the NIC that exchanges the exact same data that
> > then gets sent to the switch. We already need to have an implementation that
> > sends this data to the switch from user space for all cards that don't do
> > it in firmware, so doing an alternative path in the adapter really creates
> > more work for the users, and means that when we fix bugs or add features
> > to the common code, you don't get them ;-).
> 
> But the point of iovnl was to provide a single mechanism for both types of
> adapters (w/ or w/o firmware assist) to exchange this data with the switch,
> therefore making the difference in the adapters transparent to the user.  So
> I'm missing your point about more work for the users.

It creates an extra step: Normally we'd simply implement the network protocol
in user space, e.g. in lldpad and have other code use the lldptool command
line interface to start the negotiation.

Now we have a user protocol based on netlink that is about as complex as the
wire protocol itself, at least if you want to implement both the standard
VDP and the Cisco variant, and do all the interesting parts like guest migration
and synchronously waiting for the negotiation to complete.

	Arnd
Arnd Bergmann April 21, 2010, 9:22 p.m. UTC | #15
On Tuesday 20 April 2010, Arnd Bergmann wrote:
> > + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
> 
> As mentioned above, why not drop one of these, and just pass the VF's IFNAME?
> 

Coming back to this point, I now think it would be ideal if we could actually
leave out IOV_ATTR_VF_IFNAME and just pass the master IFNAME and the slave
MAC address. Since we're not actually doing anything with the slave itself
but really talking to the switch, it should not be needed at all.

That would solve all problems with the slave having moved to another namespace
already, and make it totally clear that this is not about configuring the
slave but about registering it.

Scott, would that still work with your driver?

	Arnd
David Miller April 22, 2010, 6:48 a.m. UTC | #16
From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +#define IOVNL_PROTO_VERSION 1
> +

Please delete this in the final version, the macro isn't even used by
the code.

We don't do protocol versioning in netlink.  Instead we get the base
stuff solid from the beginning, and then if something needs fixing up
we handle this using new attributes in a way which is both backward
and forward compatible.

Thanks.
David Miller April 22, 2010, 6:52 a.m. UTC | #17
From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +	if (tb[IOV_ATTR_VF_IFNAME])
> +		vf_dev = dev_get_by_name(&init_net,
> +			nla_data(tb[IOV_ATTR_VF_IFNAME]));

It's probably best to check this for NULL and notify
the user with an error in that case (don't forget to
put 'dev' in that error path :-)

As things stand it looks like if we can't find vf_dev, we'll just send
NULL down to the vf_dev arg of the various operations and possibly
silently succeed.

That's not desirable, semantically.
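
The shape of the error path being asked for, modeled in user space so it can be compiled stand-alone. This is a sketch, not the patch: struct fake_dev, fake_get_by_name() and fake_put() are hypothetical stand-ins for struct net_device, dev_get_by_name() and dev_put(), and -ENODEV is an assumed choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct net_device: just a name and a reference count. */
struct fake_dev {
	const char *name;
	int refcnt;
};

static struct fake_dev devs[] = { { "eth0", 0 }, { "eth0_vf1", 0 } };

/* Model of dev_get_by_name(): takes a reference on success, NULL if absent. */
static struct fake_dev *fake_get_by_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
		if (strcmp(devs[i].name, name) == 0) {
			devs[i].refcnt++;
			return &devs[i];
		}
	}
	return NULL;
}

/* Model of dev_put(): drops one reference. */
static void fake_put(struct fake_dev *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* vf_name is NULL when the VF attribute was not supplied at all. */
static int lookup_and_run(const char *ifname, const char *vf_name)
{
	struct fake_dev *dev, *vf_dev = NULL;
	int ret = 0;

	dev = fake_get_by_name(ifname);
	if (!dev)
		return -ENODEV;

	if (vf_name) {
		vf_dev = fake_get_by_name(vf_name);
		if (!vf_dev) {
			/* name given but not found: report it, do not pass NULL on */
			ret = -ENODEV;
			goto out_put_dev;
		}
	}

	/* ... dispatch to the driver op with dev and vf_dev here ... */

	if (vf_dev)
		fake_put(vf_dev);
out_put_dev:
	fake_put(dev);   /* 'dev' is released on the error path too */
	return ret;
}
```

The distinction that matters is between "attribute absent" (vf_dev legitimately NULL) and "attribute present but lookup failed" (an error the user should see).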
Arnd Bergmann April 22, 2010, 10:53 a.m. UTC | #18
On Thursday 22 April 2010, David Miller wrote:
> From: Scott Feldman <scofeldm@cisco.com>
> Date: Mon, 19 Apr 2010 12:18:07 -0700
> 
> > +     if (tb[IOV_ATTR_VF_IFNAME])
> > +             vf_dev = dev_get_by_name(&init_net,
> > +                     nla_data(tb[IOV_ATTR_VF_IFNAME]));
> 
> It's probably best to check this for NULL and notify
> the user with an error in that case (don't forget to
> put 'dev' in that error path :-)

Since you brought up that hunk: shouldn't the namespace better
be current->nsproxy->net_ns instead of init_net? If the sender
is confined in a separate network namespace, I would expect
that it should be able to modify devices in its own namespace
but none that are in the root namespace.

	Arnd
David Miller April 22, 2010, 10:56 a.m. UTC | #19
From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 22 Apr 2010 12:53:11 +0200

> On Thursday 22 April 2010, David Miller wrote:
>> From: Scott Feldman <scofeldm@cisco.com>
>> Date: Mon, 19 Apr 2010 12:18:07 -0700
>> 
>> > +     if (tb[IOV_ATTR_VF_IFNAME])
>> > +             vf_dev = dev_get_by_name(&init_net,
>> > +                     nla_data(tb[IOV_ATTR_VF_IFNAME]));
>> 
>> It's probably best to check this for NULL and notify
>> the user with an error in that case (don't forget to
>> put 'dev' in that error path :-)
> 
> Since you brought up that hunk: shouldn't the namespace better
> be current->nsproxy->net_ns instead of init_net? If the sender
> is confined in a separate network namespace, I would expect
> that it should be able to modify devices in its own namespace
> but none that are in the root namespace.

Yes, the namespace needs to be handled better.

But reading other parts of the discussion it seems that
IOV_ATTR_VF_IFNAME and some other bits will likely be
removed in the initial implementation of this stuff.
Arnd Bergmann April 22, 2010, 11:12 a.m. UTC | #20
On Thursday 22 April 2010, David Miller wrote:
> But reading other parts of the discussion it seems that
> IOV_ATTR_VF_IFNAME and some other bits will likely be
> removed in the initial implementation of this stuff.

That's what I suggested, yes. However, I'm still waiting for
a reply from Scott whether it's actually possible to remove
it based on the way that the enic firmware works.

	Arnd
Patch

diff --git a/include/linux/iovnl.h b/include/linux/iovnl.h
new file mode 100644
index 0000000..ac5fcd3
--- /dev/null
+++ b/include/linux/iovnl.h
@@ -0,0 +1,124 @@ 
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __LINUX_IOVNL_H__
+#define __LINUX_IOVNL_H__
+
+#include <linux/types.h>
+
+#define IOVNL_PROTO_VERSION 1
+
+/**
+ * IOV netlink (IOVNL) adds I/O Virtualization control support to a master
+ * device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
+ * control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
+ * design allows for the degenerate case where master and slave are the
+ * same netdev interface.
+ *
+ * One control setting example is MAC/VLAN settings for a VF.  Another
+ * example control setting is a port-profile for a VF.  A port-profile is an
+ * identifier that defines policy-based settings on the network port
+ * backing the VF.  The network port settings examples are VLAN membership,
+ * QoS settings, and L2 security settings, typical of a data center network.
+ *
+ * This file defines an rtnetlink interface to allow setting of IOVNL
+ * on capable netdev devices.
+ */
+
+struct iovnlmsg {
+	__u8	family;
+	__u8	cmd;
+	__u16	pad;
+};
+
+/**
+ * enum iovnl_cmds - supported IOV commands
+ *
+ * @IOV_CMD_UNDEFINED: unspecified command to catch errors
+ * @IOV_CMD_SET_PORT_PROFILE: set the port-profile on the device
+ * @IOV_CMD_UNSET_PORT_PROFILE: clear port-profile on the device
+ * @IOV_CMD_GET_PORT_PROFILE_STATUS: return status of last
+ *   IOV_CMD_SET_PORT_PROFILE command
+ * @IOV_CMD_SET_MAC_VLAN: Set the MAC address and VLAN on the device
+ */
+enum iovnl_cmds {
+	IOV_CMD_UNDEFINED,
+
+	IOV_CMD_SET_PORT_PROFILE,
+	IOV_CMD_UNSET_PORT_PROFILE,
+	IOV_CMD_GET_PORT_PROFILE_STATUS,
+
+	IOV_CMD_SET_MAC_VLAN,
+
+	__IOV_CMD_ENUM_MAX,
+	IOV_CMD_MAX = __IOV_CMD_ENUM_MAX - 1,
+};
+
+/**
+ * enum iovnl_attrs - IOV top-level netlink attributes
+ *
+ * @IOV_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
+ * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
+ * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
+ *   (NLA_NUL_STRING)
+ * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
+ * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
+ * @IOV_ATTR_PORT_PROFILE_STATUS: status of last IOV_CMD_SET_PORT_PROFILE
+ *   command (NLA_U8)
+ * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
+ * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16)
+ * @IOV_ATTR_STATUS: cmd return status code
+ */
+enum iovnl_attrs {
+	IOV_ATTR_UNDEFINED,
+
+	IOV_ATTR_IFNAME,
+	IOV_ATTR_VF_IFNAME,
+
+	IOV_ATTR_PORT_PROFILE,
+	IOV_ATTR_CLIENT_NAME,
+	IOV_ATTR_HOST_UUID,
+	IOV_ATTR_PORT_PROFILE_STATUS,
+
+	IOV_ATTR_MAC_ADDR,
+	IOV_ATTR_VLAN,
+
+	IOV_ATTR_STATUS,
+
+	__IOV_ATTR_ENUM_MAX,
+	IOV_ATTR_MAX = __IOV_ATTR_ENUM_MAX - 1,
+};
+
+/**
+ * enum iovnl_port_profile_status - IOV_ATTR_PORT_PROFILE_STATUS status
+ * return codes
+ *
+ * @IOV_PORT_PROFILE_STATUS_UNKNOWN: unspecified to catch errors
+ * @IOV_PORT_PROFILE_STATUS_SUCCESS:  port-profile applied successfully
+ * @IOV_PORT_PROFILE_STATUS_ERROR: port-profile setting had error
+ * @IOV_PORT_PROFILE_STATUS_INPROGRESS: port-profile setting in-progress
+ */
+enum iovnl_port_profile_status {
+	IOV_PORT_PROFILE_STATUS_UNKNOWN,
+	IOV_PORT_PROFILE_STATUS_SUCCESS,
+	IOV_PORT_PROFILE_STATUS_ERROR,
+	IOV_PORT_PROFILE_STATUS_INPROGRESS,
+};
+
+#endif /* __LINUX_IOVNL_H__ */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..b531b0d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -50,6 +50,7 @@ 
 #ifdef CONFIG_DCB
 #include <net/dcbnl.h>
 #endif
+#include <net/iovnl.h>
 
 struct vlan_group;
 struct netpoll_info;
@@ -1048,6 +1049,9 @@  struct net_device {
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
 
+	/* IOV netlink ops */
+	const struct iovnl_ops *iovnl_ops;
+
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index d1c7c90..aafadf7 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -113,6 +113,11 @@  enum {
 	RTM_SETDCB,
 #define RTM_SETDCB RTM_SETDCB
 
+	RTM_GETIOV = 82,
+#define RTM_GETIOV RTM_GETIOV
+	RTM_SETIOV,
+#define RTM_SETIOV RTM_SETIOV
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/include/net/iovnl.h b/include/net/iovnl.h
new file mode 100644
index 0000000..c353eee
--- /dev/null
+++ b/include/net/iovnl.h
@@ -0,0 +1,36 @@ 
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __NET_IOVNL_H__
+#define __NET_IOVNL_H__
+
+/*
+ * Ops struct for the netlink callbacks.  Used by IOVNL-enabled drivers through
+ * the netdevice struct.
+ */
+struct iovnl_ops {
+	int (*set_port_profile)(struct net_device *, struct net_device *,
+		char *, u8 *, char *, char *);
+	int (*unset_port_profile)(struct net_device *, struct net_device *);
+	int (*get_port_profile_status)(struct net_device *,
+		struct net_device *);
+	int (*set_mac_vlan)(struct net_device *, struct net_device *,
+		u8 *, u16);
+};
+
+#endif /* __NET_IOVNL_H__ */
diff --git a/net/Kconfig b/net/Kconfig
index 0d68b40..aca5de0 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -203,6 +203,7 @@  source "net/phonet/Kconfig"
 source "net/ieee802154/Kconfig"
 source "net/sched/Kconfig"
 source "net/dcb/Kconfig"
+source "net/iovnl/Kconfig"
 
 config RPS
 	boolean
diff --git a/net/Makefile b/net/Makefile
index cb7bdc1..23589e9 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -61,6 +61,9 @@  obj-$(CONFIG_CAIF)		+= caif/
 ifneq ($(CONFIG_DCB),)
 obj-y				+= dcb/
 endif
+ifneq ($(CONFIG_IOVNL),)
+obj-y				+= iovnl/
+endif
 obj-y				+= ieee802154/
 
 ifeq ($(CONFIG_NET),y)
diff --git a/net/iovnl/Kconfig b/net/iovnl/Kconfig
new file mode 100644
index 0000000..4548417
--- /dev/null
+++ b/net/iovnl/Kconfig
@@ -0,0 +1,10 @@ 
+config IOVNL
+	tristate "IOV rtnetlink support"
+	default n
+	---help---
+	  This enables support for configuring IOV
+	  on Ethernet adapters via rtnetlink.  Say 'Y'
+	  if you have an Ethernet adapter which supports network
+	  configuration using IOV rtnetlink.
+
+	  If unsure, say N.
diff --git a/net/iovnl/Makefile b/net/iovnl/Makefile
new file mode 100644
index 0000000..9256d01
--- /dev/null
+++ b/net/iovnl/Makefile
@@ -0,0 +1 @@ 
+obj-$(CONFIG_IOVNL) += iovnl.o
diff --git a/net/iovnl/iovnl.c b/net/iovnl/iovnl.c
new file mode 100644
index 0000000..ce9db50
--- /dev/null
+++ b/net/iovnl/iovnl.c
@@ -0,0 +1,260 @@ 
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <linux/iovnl.h>
+#include <net/netlink.h>
+#include <net/rtnetlink.h>
+#include <net/iovnl.h>
+#include <net/sock.h>
+
+MODULE_AUTHOR("Roopa Prabhu <roprabhu@cisco.com>, "
+	"Scott Feldman <scofeldm@cisco.com>");
+MODULE_DESCRIPTION("IOV netlink");
+MODULE_LICENSE("GPL");
+
+/* IOVNL netlink attributes policy */
+static const struct nla_policy iovnl_rtnl_policy[IOV_ATTR_MAX + 1] = {
+	[IOV_ATTR_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 },
+	[IOV_ATTR_VF_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 },
+	[IOV_ATTR_PORT_PROFILE] =  { .type = NLA_NUL_STRING, .len = 32 },
+	[IOV_ATTR_CLIENT_NAME] = { .type = NLA_NUL_STRING, .len = 32 },
+	[IOV_ATTR_HOST_UUID] = { .type = NLA_NUL_STRING, .len = 64 },
+	[IOV_ATTR_PORT_PROFILE_STATUS] = { .type = NLA_U8 },
+	[IOV_ATTR_MAC_ADDR] = { .len = 6 },
+	[IOV_ATTR_VLAN] = { .type = NLA_U16 },
+	[IOV_ATTR_STATUS] = { .type = NLA_U8 },
+};
+
+/* standard netlink reply call */
+static int iovnl_reply(u8 value, u8 event, u8 cmd, u8 attr, u32 pid,
+	u32 seq, u16 flags)
+{
+	struct sk_buff *skb;
+	struct iovnlmsg *iov;
+	struct nlmsghdr *nlh;
+	int ret = -EINVAL;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ret;
+
+	nlh = NLMSG_NEW(skb, pid, seq, event, sizeof(*iov), flags);
+
+	iov = NLMSG_DATA(nlh);
+	iov->family = AF_UNSPEC;
+	iov->cmd = cmd;
+	iov->pad = 0;
+
+	ret = nla_put_u8(skb, attr, value);
+	if (ret)
+		goto err;
+
+	/* end the message, assign the nlmsg_len. */
+	nlmsg_end(skb, nlh);
+	ret = rtnl_unicast(skb, &init_net, pid);
+	if (ret)
+		return -EINVAL;
+
+	return 0;
+nlmsg_failure:
+err:
+	kfree_skb(skb);
+	return ret;
+}
+
+static int iovnl_get_port_profile_status(struct net_device *dev,
+	struct net_device *vf_dev, u32 pid, u32 seq, u16 flags)
+{
+	int ret;
+
+	if (!dev->iovnl_ops->get_port_profile_status)
+		return -EINVAL;
+
+	ret = dev->iovnl_ops->get_port_profile_status(dev, vf_dev);
+
+	return  iovnl_reply(ret, RTM_GETIOV,
+		IOV_CMD_GET_PORT_PROFILE_STATUS, IOV_ATTR_PORT_PROFILE_STATUS,
+		pid, seq, flags);
+}
+
+
+static int iovnl_set_port_profile(struct net_device *dev,
+	struct net_device *vf_dev, struct nlattr **tb,
+	u32 pid, u32 seq, u16 flags)
+{
+	int i, ret;
+	char *port_profile = NULL;
+	u8 *mac_addr = NULL;
+	char *client_name = NULL;
+	char *host_uuid = NULL;
+
+	if (!tb[IOV_ATTR_PORT_PROFILE] || !dev->iovnl_ops->set_port_profile)
+		return -EINVAL;
+
+	for (i = 0; i <= IOV_ATTR_MAX; i++) {
+		if (!tb[i])
+			continue;
+		switch (nla_type(tb[i])) {
+		case IOV_ATTR_PORT_PROFILE:
+			port_profile = nla_data(tb[i]);
+			break;
+		case IOV_ATTR_MAC_ADDR:
+			mac_addr = nla_data(tb[i]);
+			break;
+		case IOV_ATTR_CLIENT_NAME:
+			client_name = nla_data(tb[i]);
+			break;
+		case IOV_ATTR_HOST_UUID:
+			host_uuid = nla_data(tb[i]);
+			break;
+		}
+	}
+
+	ret = dev->iovnl_ops->set_port_profile(dev, vf_dev,
+		port_profile, mac_addr, client_name, host_uuid);
+
+	return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_SET_PORT_PROFILE,
+		IOV_ATTR_STATUS, pid, seq, flags);
+}
+
+static int iovnl_set_mac_vlan(struct net_device *dev,
+	struct net_device *vf_dev, struct nlattr **tb,
+	u32 pid, u32 seq, u16 flags)
+{
+	int i, ret;
+	u8 *mac_addr = NULL;
+	u16 vlan = 0;
+
+	if (!dev->iovnl_ops->set_mac_vlan)
+		return -EINVAL;
+
+	for (i = 0; i <= IOV_ATTR_MAX; i++) {
+		if (!tb[i])
+			continue;
+		switch (nla_type(tb[i])) {
+		case IOV_ATTR_MAC_ADDR:
+			mac_addr = nla_data(tb[i]);
+			break;
+		case IOV_ATTR_VLAN:
+			vlan = nla_get_u16(tb[i]);
+			break;
+		}
+	}
+
+	ret = dev->iovnl_ops->set_mac_vlan(dev, vf_dev,
+		mac_addr, vlan);
+
+	return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_SET_MAC_VLAN,
+		IOV_ATTR_STATUS, pid, seq, flags);
+}
+
+static int iovnl_unset_port_profile(struct net_device *dev,
+	struct net_device *vf_dev, struct nlattr **tb,
+	u32 pid, u32 seq, u16 flags)
+{
+	int ret;
+
+	if (!dev->iovnl_ops->unset_port_profile)
+		return -EINVAL;
+
+	ret = dev->iovnl_ops->unset_port_profile(dev, vf_dev);
+
+	return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_UNSET_PORT_PROFILE,
+		IOV_ATTR_STATUS, pid, seq, flags);
+}
+
+static int iovnl_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
+{
+	struct net *net = sock_net(skb->sk);
+	struct net_device *dev;
+	struct net_device *vf_dev = NULL;
+	struct iovnlmsg *iov = NLMSG_DATA(nlh);
+	struct nlattr *tb[IOV_ATTR_MAX + 1];
+	u32 pid = NETLINK_CB(skb).pid;
+	int ret;
+
+	if (!net_eq(net, &init_net))
+		return -EINVAL;
+
+	ret = nlmsg_parse(nlh, sizeof(*iov), tb, IOV_ATTR_MAX,
+		iovnl_rtnl_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[IOV_ATTR_IFNAME])
+		return -EINVAL;
+
+	dev = dev_get_by_name(&init_net, nla_data(tb[IOV_ATTR_IFNAME]));
+	if (!dev)
+		return -ENODEV;
+
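+	if (tb[IOV_ATTR_VF_IFNAME]) {
+		vf_dev = dev_get_by_name(&init_net,
+			nla_data(tb[IOV_ATTR_VF_IFNAME]));
+		/* a named VF that does not exist must not fall back to the MD */
+		if (!vf_dev) {
+			ret = -ENODEV;
+			goto out;
+		}
+	}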
+
+	if (!dev->iovnl_ops)
+		goto errout;
+
+	switch (iov->cmd) {
+	case IOV_CMD_SET_PORT_PROFILE:
+		ret = iovnl_set_port_profile(dev, vf_dev,
+			tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags);
+		goto out;
+	case IOV_CMD_UNSET_PORT_PROFILE:
+		ret = iovnl_unset_port_profile(dev, vf_dev,
+			tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags);
+		goto out;
+	case IOV_CMD_GET_PORT_PROFILE_STATUS:
+		ret = iovnl_get_port_profile_status(dev, vf_dev,
+			pid, nlh->nlmsg_seq, nlh->nlmsg_flags);
+		goto out;
+	case IOV_CMD_SET_MAC_VLAN:
+		ret = iovnl_set_mac_vlan(dev, vf_dev,
+			tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags);
+		goto out;
+	default:
+		goto errout;
+	}
+errout:
+	ret = -EINVAL;
+out:
+	dev_put(dev);
+	if (vf_dev)
+		dev_put(vf_dev);
+
+	return ret;
+}
+
+static int __init iovnl_init(void)
+{
+	rtnl_register(PF_UNSPEC, RTM_GETIOV, iovnl_doit, NULL);
+	rtnl_register(PF_UNSPEC, RTM_SETIOV, iovnl_doit, NULL);
+
+	return 0;
+}
+module_init(iovnl_init);
+
+static void __exit iovnl_exit(void)
+{
+	rtnl_unregister(PF_UNSPEC, RTM_GETIOV);
+	rtnl_unregister(PF_UNSPEC, RTM_SETIOV);
+}
+module_exit(iovnl_exit);
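
Note on the status byte: iovnl_reply() takes its value as a u8, while the set/unset/get handlers pass it the int returned by the driver op. Assuming a driver op can return a negative errno (the op signatures are not visible in this file), user space would receive only the low byte of that number. The following is a minimal user-space sketch of that narrowing, not code from the patch; reply_value(), describe() and check_small_status() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the "u8 value" parameter of iovnl_reply():
 * an int result is narrowed to one byte, as happens when an int is
 * passed for a u8 argument.
 */
static inline uint8_t reply_value(int ret)
{
	return (uint8_t)ret;
}

/* Hypothetical helper: format what a listener would see for a result. */
static inline int describe(int ret, char *buf, size_t len)
{
	return snprintf(buf, len, "op returned %d, reply carries %u",
			ret, (unsigned int)reply_value(ret));
}

/* Sanity check: status codes in 0..255 survive the narrowing unchanged. */
static inline void check_small_status(void)
{
	assert(reply_value(0) == 0);
	assert(reply_value(255) == 255);
}
```

With -22 (the value of -EINVAL on Linux) the reply byte is 234, so a reader of IOV_ATTR_STATUS cannot tell an errno from a large status code unless the encoding is agreed on. One option would be to carry the result in a wider signed attribute; that is a suggestion, not something the patch does.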