Patchwork [V2,09/12] net/eipoib: Add main driver functionality

Submitter Or Gerlitz
Date Aug. 1, 2012, 5:09 p.m.
Message ID <1343840975-3252-10-git-send-email-ogerlitz@mellanox.com>
Permalink /patch/174539/
State Changes Requested
Delegated to: David Miller

Comments

Or Gerlitz - Aug. 1, 2012, 5:09 p.m.
From: Erez Shitrit <erezsh@mellanox.co.il>

The eipoib driver provides a standard Ethernet netdevice over
the InfiniBand IPoIB interface.

Some services can run only on top of Ethernet L2 interfaces, and cannot be
bound to an IPoIB interface. With this new driver, these services can run
seamlessly.

Main use case of the driver is the Ethernet Virtual Switching used in
virtualized environments, where an eipoib netdevice can be used as a
Physical Interface (PIF) in the hypervisor domain, and allow other
guests Virtual Interfaces (VIF) connected to the same Virtual Switch
to run over the InfiniBand fabric.

This driver supports L2 Switching (Direct Bridging) as well as other L3
Switching modes (e.g. NAT).

Whenever an IPoIB interface is created, one eIPoIB PIF netdevice
will be created. The default naming scheme follows other Ethernet
interfaces: ethX. For example, on a system with two IPoIB interfaces,
ib0 and ib1, two interfaces will be created, ethX and ethX+1, where "X"
is the next free Ethernet number in the system.

Using "ethtool -i" on the new interface tells which IPoIB interface
that PIF sits above.  For example: driver: eth_ipoib:ib0
indicates that eth3 is the Ethernet interface over the ib0 IPoIB interface.
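Concretely, the lookup just described might look like this (the interface
names are the examples used in this cover letter):

```shell
# Ask ethtool which driver backs the eIPoIB Ethernet interface; per the
# cover letter, the driver string also names the underlying IPoIB parent.
ethtool -i eth3
# driver: eth_ipoib:ib0   -> eth3 is the Ethernet interface over ib0
```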

The driver can be used as an independent interface, or serve in a
virtualization environment as the physical layer for the virtual
interfaces of the guests.

The driver interface (the eipoib interface, also referred to as the parent)
uses slave interfaces, IPoIB clones, which are the VIFs described above.

VIF interfaces are enslaved to/released from the eipoib driver on demand,
according to the management interface provided to user space.

Note: each ethX interface has at least one ibX.Y slave to serve the PIF
itself; in the VIFs list of ethX you'll notice that ibX.1 is always created
to serve applications running on the Hypervisor directly on top of the ethX
interface.

For IB applications that require native IPoIB interfaces (e.g. RDMA-CM), the
original ipoib interfaces ibX can still be used.  For example, the RDMA-CM
and eth_ipoib drivers can co-exist and both make use of IPoIB.

Support for Live migration:

The driver exposes a sysfs interface through which the manager can give
notice of migrated-out guests ("detached VIF"). The abandoned eIPoIB driver
instance sends a defined number of ARP requests to the network in order to
trigger the migrated guest to publish the MAC address of the new VIF that
hosts it. When the guest responds to that ARP, the eIPoIB driver on the host
that owns that guest sends a gratuitous ARP on behalf of the guest, so that
all peers on the network which communicate with this VM are notified of the
new VIF address used by the guest.
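The flow above can be sketched roughly as follows. Note that the cover
letter does not name the actual sysfs node, so the path and node name below
are purely hypothetical, invented here for illustration:

```shell
# HYPOTHETICAL sketch only: the "detach_vif" node name is not taken from
# the patch; it stands in for whatever sysfs entry the manager uses to
# report a migrated-out guest (VIF) to the abandoned eIPoIB instance.
echo "<vif-mac>" > /sys/class/net/eth3/detach_vif

# Per the description above, the driver then:
#  1. sends a defined number of ARP requests to provoke the migrated guest,
#  2. waits for the guest to answer from the VIF on its new host,
#  3. on the new host, sends gratuitous ARP on the guest's behalf so that
#     all peers learn the new VIF address.
```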

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_main.c | 1953 +++++++++++++++++++++++++++++++++++
 1 files changed, 1953 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_main.c
Eric W. Biederman - Aug. 2, 2012, 5:15 p.m.
Or Gerlitz <ogerlitz@mellanox.com> writes:

> From: Erez Shitrit <erezsh@mellanox.co.il>
>
> The eipoib driver provides a standard Ethernet netdevice over
> the InfiniBand IPoIB interface .
>
> Some services can run only on top of Ethernet L2 interfaces, and cannot be
> bound to an IPoIB interface. With this new driver, these services can run
> seamlessly.

Do I read this code correctly that what you are doing is not tunneling
ethernet over IB but instead you are removing an ethernet header and
replacing it with an IB header?

Do I also read this code correctly if you can't find your destination
mac address in your ""neighbor table"" you do a normal IPoIB arp
for the infiniband GUID?

Do I read this right that if presented with a non-IPv4 or ARP packet
this code will do something undefined and unpredictable?

Maybe this makes some sense but just skimming it looks like you
are trying to force a square peg into a round hole resulting in
some weird code and some very weird maintainability issues.

I am honestly surprised at this approach.  I would think it would be
faster and simpler to run an IB queue pair directly to the hypervisor or
possibly even the guest operating system bypassing the kernel and doing
all of this translation in userspace.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ali Ayoub - Aug. 3, 2012, 8:31 p.m.
On 8/2/2012 10:15 AM, Eric W. Biederman wrote:
> Or Gerlitz <ogerlitz@mellanox.com> writes:
> 
>> From: Erez Shitrit <erezsh@mellanox.co.il>
>>
>> The eipoib driver provides a standard Ethernet netdevice over
>> the InfiniBand IPoIB interface .
>>
>> Some services can run only on top of Ethernet L2 interfaces, and cannot be
>> bound to an IPoIB interface. With this new driver, these services can run
>> seamlessly.
> 
> Do I read this code correctly that what you are doing is not tunneling
> ethernet over IB but instead you are removing an ethernet header and
> replacing it with an IB header?
Correct.
eIPoIB runs standard IPoIB on the wire; it doesn't encapsulate the
Ethernet frame on top of IPoIB, but rather translates it to an IPoIB
packet. This allows us to expose an Ethernet L2 network device and
still keep interoperability with existing IPoIB endpoints. Running full
encapsulation (i.e. EoIPoIB) would break interoperability.

> Do I also read this code correctly if you can't find your destination
> mac address in your ""neighbor table"" you do a normal IPoIB arp
> for the infiniband GUID?
Correct.
Wire protocol remains IPoIB.

> Do I read this right that if presented with a non-IPv4 or ARP packet
> this code will do something undefined and unpredictable?
The current code drops IPv6 packets (see IS_E_IPOIB_PROTO); IPv6 support
will be added later on.

> Maybe this makes some sense but just skimming it looks like you
> are trying to force a square peg into a round hole resulting in
> some weird code and some very weird maintainability issues.
> 
> I am honestly surprised at this approach.  I would think it would be
> faster and simpler to run an IB queue pair directly to the hypervisor or
> possibly even the guest operating system bypassing the kernel and doing
> all of this translation in userspace.
With the eIPoIB architecture, the VM sees a standard Ethernet emulator,
allowing the administrator to enslave the eIPoIB PIF to the vSwitch/vBridge
as if it were standard Ethernet. Other approaches that expose an IB QP to
the VM (with or without bypassing the kernel) won't be possible with the
current emulators and management tools.

--
Ali;
David Miller - Aug. 3, 2012, 9:33 p.m.
From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 13:31:35 -0700

> With eIPoIB architecture, the VM sees standard Ethernet emulator,
> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
> as if it was standard Ethernet. Other approaches that exposes IB QP to
> the VM (with w/o bypassing the kernel) won't be possible with the
> current emulators and management tools.

So then fix the emulators and management tools to handle IB instead
of adding this bogus new protocol?

This new protocol seems to exist only because you don't want to have
to enhance the emulators and tools, and I'm sorry that isn't a valid
reason to do something like this.
Ali Ayoub - Aug. 3, 2012, 10:39 p.m.
On 8/3/2012 2:33 PM, David Miller wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 13:31:35 -0700
> 
>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>> the VM (with w/o bypassing the kernel) won't be possible with the
>> current emulators and management tools.
> 
> So then fix the emulators and management tools to handle IB instead
> of adding this bogus new protocol?
> 
> This new protocol seems to exist only because you don't want to have
> to enhance the emulators and tools, and I'm sorry that isn't a valid
> reason to do something like this.

This driver exists to allow the user to have an Ethernet interface on
top of a high-speed InfiniBand (IB) interconnect.
Users would like to use the sockets API from the VM without re-writing their
applications on top of IB verbs; this driver is meant to allow such a user
to do so.

Exposing IB emulators and having IB support in the management tools for
the VM/Hypervisor won't address the use cases this driver is meant for.

With this driver, existing VMs, and their existing IP applications, can
run as-is on an InfiniBand network.

David Miller - Aug. 3, 2012, 11:36 p.m.
From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 15:39:36 -0700

> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.

That's what IPoIB was for, the application writers who don't want to have
to be knowledgable about IB verbs.

You're messing with the link layer here, and that's what is upsetting me.

It's a complete cop-out to changing the VM tools and emulators properly to
handle a new link layer.

The applications writers already have a way to use IB whilst using
something familiar, like IPv4, via IPoIB.  You're doing something
completely different here, and it stinks.

Ali Ayoub - Aug. 4, 2012, 12:02 a.m.
On 8/3/2012 3:39 PM, Ali Ayoub wrote:
> On 8/3/2012 2:33 PM, David Miller wrote:
>> From: Ali Ayoub <ali@mellanox.com>
>> Date: Fri, 03 Aug 2012 13:31:35 -0700
>>
>>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>>> the VM (with w/o bypassing the kernel) won't be possible with the
>>> current emulators and management tools.
>>
>> So then fix the emulators and management tools to handle IB instead
>> of adding this bogus new protocol?
>>
>> This new protocol seems to exist only because you don't want to have
>> to enhance the emulators and tools, and I'm sorry that isn't a valid
>> reason to do something like this.
> 
> This driver exists to allow the user to have an Ethernet interface on
> top of a high-speed InfiniBand (IB) interconnect.
> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.
> 
> Exposing IB emulators and having IB support in the management tools for
> the VM/Hypervisor won't address the usecases that this driver meant for.
> 
> With this driver, existing VMs, and their existing IP applications, can
> run as-is on InfiniBand network.

This driver exists to allow the user to have an Ethernet interface on
top of a high-speed InfiniBand (IB) interconnect.
Users would like to use sockets API from the VM without re-writing their
applications on top of IB verbs, this driver meant to allow such a user
to do so.

Exposing IB emulators and having IB support in the management tools for
the VM/Hypervisor won't address the usecases that this driver meant for.

With this driver, existing VMs, and their existing IP applications, can
run as-is on InfiniBand network.

David Miller - Aug. 4, 2012, 12:05 a.m.
From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 17:02:35 -0700

> On 8/3/2012 3:39 PM, Ali Ayoub wrote:
>> On 8/3/2012 2:33 PM, David Miller wrote:
>>> From: Ali Ayoub <ali@mellanox.com>
>>> Date: Fri, 03 Aug 2012 13:31:35 -0700
>>>
>>>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>>>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>>>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>>>> the VM (with w/o bypassing the kernel) won't be possible with the
>>>> current emulators and management tools.
>>>
>>> So then fix the emulators and management tools to handle IB instead
>>> of adding this bogus new protocol?
>>>
>>> This new protocol seems to exist only because you don't want to have
>>> to enhance the emulators and tools, and I'm sorry that isn't a valid
>>> reason to do something like this.
>> 
>> This driver exists to allow the user to have an Ethernet interface on
>> top of a high-speed InfiniBand (IB) interconnect.
>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.
>> 
>> Exposing IB emulators and having IB support in the management tools for
>> the VM/Hypervisor won't address the usecases that this driver meant for.
>> 
>> With this driver, existing VMs, and their existing IP applications, can
>> run as-is on InfiniBand network.
> 
> This driver exists to allow the user to have an Ethernet interface on
> top of a high-speed InfiniBand (IB) interconnect.
> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.
> 
> Exposing IB emulators and having IB support in the management tools for
> the VM/Hypervisor won't address the usecases that this driver meant for.
> 
> With this driver, existing VMs, and their existing IP applications, can
> run as-is on InfiniBand network.

Just saying the same thing twice doesn't make your argument stronger.
Eric W. Biederman - Aug. 4, 2012, 1:34 a.m.
Ali Ayoub <ali@mellanox.com> wrote:

>
>With this driver, existing VMs, and their existing IP applications, can
>run as-is on InfiniBand network.

Actually it doesn't work like that.

If my application really needs ethernet it will not work with your driver.  I can not run decnet or appletalk or ATAoE or PPPoE or LACP or use VLANs or any of a thousand other things that require real live ethernet to function.

Most VMs will talk to linux with a tap interface, and the output of a tap interface can be routed just fine, and that works with existing tools, and existing already deployed  kernels.

Alternatively if you are silly you can implement a tap interface on top of IPoIB and have the same interface you do now, and it works with existing already deployed kernels.

So eIPoIB does not make sense from a time to market perspective.

Similarly eIPoIB does not make sense from a performance standpoint, because the best performance requires that the applications and hypervisors speak InfiniBand directly, which means teaching the apps to cope.

eIPoIB also has considerable maintenance overhead as it is complex code doing some crazy things.

Now personally, NAPT44 just about made the internet unusable for peer-to-peer applications.  NAT66 aka network prefix translation introduces much less breakage, but it still requires someone to run STUN on IPv6 to keep applications working, ick.  NATEIB aka eIPoIB just looks complex and already breaks most of ethernet, and history says that kind of breakage actually matters a lot.   I think everyone who suggests any kind of NAT is a good idea should have to implement RFC 5245 Interactive Connectivity Establishment.  It takes two 100+ page RFCs to come up with a way that allows applications to cope with the address munging over the open internet.   That way takes 3000+ lines of code in each application.   That is the kind of complexity you are asking people upstream of you to deal with by your NATting of ethernet to infiniband.  So I totally think eIPoIB stinks, and that it does not at all let existing applications work without modification.

Eric
Or Gerlitz - Aug. 4, 2012, 9:23 p.m.
On Sat, Aug 4, 2012 at 2:36 AM, David Miller <davem@davemloft.net> wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 15:39:36 -0700

>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.

> That's what IPoIB was for, the application writers who don't want to have
> to be knowledgable about IB verbs.
>
> You're messing with the link layer here, and that's what is upsetting me.
>
> It's a complete cop-out to changing the VM tools and emulators properly to
> handle a new link layer.
>
> The applications writers already have a way to use IB whilst using
> something familiar, like IPv4, via IPoIB.  You're doing something
> completely different here, and it stinks.


Dave,

Just quick recap of things to make sure we're in sync on the facts
here and (me) trying to better follow your arguments:

Bringing IB to app usage is done through IB verbs / RDMA CM, which has
been supported for quite a long while in native mode; BTW, as for SRIOV,
there's now a running patch set @ linux-rdma submitted to Roland
http://marc.info/?l=linux-rdma&m=134398354428293&w=2

IPoIB is meant to allow IP apps to use IB, defined by IETF spec (RFC 4391/4392)

The idea in eIPoIB was to allow IP apps running on VMs under a
para-virtual mindset, i.e. when the Linux PV networking stack comes
into play, to use that stack without modifying it.

When looking at that, we thought so far so good, and went the way
posted here. Reusing your last sentence... this driver provides a
way for apps to use the PV stack AND IB whilst using something
familiar, like IPv4.


I'd like to better understand your "messing with the link layer here,
and that's what is upsetting" claim. Since the service here is IP,
and this service has well defined mechanisms, eIPoIB is basically a
shim layer that provides this service for VMs interacting with the
emulators, drivers and tools used in Linux and the related guest OS
PV drivers (virtio and friends). For example, that messing mostly
comes down to translating ARP requests/replies from being carried in
an Ethernet frame to being carried in an IPoIB packet; is that really
too hacky? Why, and what?

Or.
Or Gerlitz - Aug. 4, 2012, 9:33 p.m.
On Sat, Aug 4, 2012 at 4:34 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> wrote:

>>With this driver, existing VMs, and their existing IP applications, can
>>run as-is on InfiniBand network.

> Actually it doesn't work like that.
> If my application really needs ethernet it will not work with your driver.

Eric,

Starting to address your email

If your application really needs Ethernet, it will not work with
IPoIB either, since the latter only supports IP apps. So this isn't the
problem we're trying to solve here; rather, we allow an IP app running
in PV mode to use IPoIB.

> I can not run decnet or appletalk or ATAoE or PPPoE or LACP

[...]

> or use VLANs

YES you can: vlan devices can be set over eipoib devices, and the
eipoib driver maps the VLAN ID to an IB partition using a mapping defined
by the admin.
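For instance, a standard VLAN netdevice on top of an eIPoIB parent would be
created the usual way (interface names illustrative; the VLAN-ID-to-partition
mapping itself is configured by the admin through the driver, whose exact
interface is not shown here):

```shell
# Create VLAN 100 on top of the eIPoIB parent eth3; per the reply above,
# the eipoib driver maps the VLAN ID to an IB partition (P_Key) according
# to an admin-defined mapping.
ip link add link eth3 name eth3.100 type vlan id 100
ip link set eth3.100 up
```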

> or any of a thousand other things that require real live ethernet to function.
[...]
> Similarly eIPoIB does not make sense from a performance standpoint because
> the best performance requires the applications and hypervisors are infiniband
> and teaching the apps to cope.

Agree 100% re the "best performance" part of this sentence; if your
app looks for best perf, you can use IB in either native or SRIOV
mode, as I mentioned in my response to Dave.

Or.

Or Gerlitz - Aug. 4, 2012, 9:44 p.m.
On Sun, Aug 5, 2012 at 12:23 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:

> The idea in eIPoIB was to allow IP apps running on VMs under a Para-Virtual set of
> mind, e.g when the Linux PV networking stack comes into play, to use that stack w.o
> modifying it. When looking on that, we thought so far so good, and went in the way
> posted here. If reusing your last sentence... this driver provides a way for apps to use
> the PV stack AND IB whilst using something familiar, like IPv4.

OK, when I said the PV stack, I meant the portion of the PV stack
which assumes an Ethernet link layer, of course... If someone uses routing
they don't need this driver.

Again, since the app only uses IP, which is well defined, etc., the
work done by the eipoib driver didn't seem like hackish messing, so I'm
back to that WW (Why/What) question from earlier.

Or.
Eric W. Biederman - Aug. 4, 2012, 11:19 p.m.
Or Gerlitz <or.gerlitz@gmail.com> wrote:

>On Sun, Aug 5, 2012 at 12:23 AM, Or Gerlitz <or.gerlitz@gmail.com>
>wrote:
>
>> The idea in eIPoIB was to allow IP apps running on VMs under a
>> Para-Virtual set of mind, e.g when the Linux PV networking stack comes
>> into play, to use that stack w.o modifying it. When looking on that, we
>> thought so far so good, and went in the way posted here. If reusing your
>> last sentence... this driver provides a way for apps to use the PV stack
>> AND IB whilst using something familiar, like IPv4.
>
>OK, when I said the PV stack, I meant the portion of the PV stack
>which assumes Ethernet link layer, ofcourse... If someone uses routing
>they don't need this driver.
>
>Again, since the app only uses IP,  which is well defined, etc. the
>work done by the eipoib driver, didn't seem as hackish messing, so I'm
>again with that WW (Why/What) question from 5m ago.


Something is missing from your sentence. 

What is the alternative that you view as worse?

Just as a point of information: in general, when bridging is desired but not possible, people deploy proxy ARP.

I completely fail to see how having the VM output to a tun interface, and then routing that, would not be supported by current solutions.  I believe that is where VM solutions all started, networking-wise.  Outputting to an interface is needed to support interfaces like 802.11, where bridging frequently does not work.
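The routed-tap-plus-proxy-ARP alternative Eric refers to can be sketched as
follows (addresses and interface names illustrative):

```shell
# Route (rather than bridge) a VM's tap interface onto the IPoIB interface.
ip addr add 192.168.100.1/24 dev tap0      # host side of the guest's tap
ip link set tap0 up
sysctl -w net.ipv4.ip_forward=1            # forward between tap0 and ib0
sysctl -w net.ipv4.conf.ib0.proxy_arp=1    # proxy ARP where bridging is impossible
ip route add 192.168.100.2/32 dev tap0     # the guest's address lives behind tap0
```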

Eric

Michael S. Tsirkin - Aug. 5, 2012, 6:50 p.m.
On Thu, Aug 02, 2012 at 10:15:23AM -0700, Eric W. Biederman wrote:
> Or Gerlitz <ogerlitz@mellanox.com> writes:
> 
> > From: Erez Shitrit <erezsh@mellanox.co.il>
> >
> > The eipoib driver provides a standard Ethernet netdevice over
> > the InfiniBand IPoIB interface .
> >
> > Some services can run only on top of Ethernet L2 interfaces, and cannot be
> > bound to an IPoIB interface. With this new driver, these services can run
> > seamlessly.
> 
> Do I read this code correctly that what you are doing is not tunneling
> ethernet over IB but instead you are removing an ethernet header and
> replacing it with an IB header?
> 
> Do I also read this code correctly if you can't find your destination
> mac address in your ""neighbor table"" you do a normal IPoIB arp
> for the infiniband GUID?
> 
> Do I read this right that if presented with a non-IPv4 or ARP packet
> this code will do something undefined and unpredictable?
> 
> Maybe this makes some sense but just skimming it looks like you
> are trying to force a square peg into a round hole resulting in
> some weird code and some very weird maintainability issues.
> 
> I am honestly surprised at this approach.  I would think it would be
> faster and simpler to run an IB queue pair directly to the hypervisor or
> possibly even the guest operating system bypassing the kernel and doing
> all of this translation in userspace.
> 
> Eric

I'm on vacation and I have not looked at the patches, at Erez' request,
just reacting to the presentation and the discussion.

Bypassing the kernel has its own set of issues, not the
least of which is the need to lock all of guest memory which breaks
overcommit. Running an IB queue pair directly to the hypervisor
will also break live migration.

Another problem with exposing IB to guests has to do with the fact that
IB addresses such as combinations of LIDs, GIDs and QPNs to best of my
knowledge do not support soft hardware address setting, which interferes
with live migration.

So it seems that a sane solution would involve an extra level of
indirection, with guest addresses being translated to host IB addresses.

As long as you do this, maybe using an ethernet frame format makes
sense.

So far the things that make sense. Here are some that don't, to me:

- Is a pdf presentation all you have in terms of documentation?
  We are talking communication protocols here - I would expect a
  proper spec, and some effort to standardize, otherwise where's the
  guarantee it won't change in an incompatible way?
  Other things that I would expect to be addressed in such a spec is
  interaction with other IPoIB features, such as connected
  mode, checksum offloading etc, and IB features such as multipath etc.

- The way you encode LID/QPN in the MAC seems questionable. IIRC there's
  more to IB addressing than just the LID.  Since everyone on the subnet
  needs access to this translation, I think it makes sense to store it in
  the SM. I think this would also obviate some IPv4-specific hacks
  in the kernel.

- IGMP/MAC snooping in a driver is just too hairy.
  As you point out, bridge currently needs the uplink in promisc mode.
  I don't think a driver should work around that limitation.
  For some setups, it might be interesting to remove the
  promisc mode requirement, failing that,
  I think you could use macvtap passthrough.

- Currently migration works without host kernel help, would be
  preferable to keep it that way.


Hope this helps,
MST

Ali Ayoub - Aug. 7, 2012, 12:14 a.m.
On 8/3/2012 4:36 PM, David Miller wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 15:39:36 -0700
> 
>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.
> 
> That's what IPoIB was for, the application writers who don't want to have
> to be knowledgable about IB verbs.
> 
> You're messing with the link layer here, and that's what is upsetting me.
> 
> It's a complete cop-out to changing the VM tools and emulators properly to
> handle a new link layer.
> 
> The applications writers already have a way to use IB whilst using
> something familiar, like IPv4, via IPoIB.  You're doing something
> completely different here, and it stinks.

Indeed, the IPoIB driver is meant to allow IP applications to run over
InfiniBand, but IPoIB cannot serve IPoE traffic.

The goal of the eIPoIB driver is to show the user an Ethernet L2 netdev
by translating IPoE packets to IPoIB. It keeps the same IPoIB wire
protocol, and exposes an ethX interface to the host; all IPoE packets
are then translated to IPoIB.

Among other things, the main benefit we're targeting is to allow IPoE
traffic within the VM to go through the (Ethernet) vBridge down to the
eIPoIB PIF, and eventually to IPoIB and to the IB network.

In a para-virtualized environment, the VM emulator sends/receives packets
with an Ethernet header, and the vBridge also performs L2 switching based
on the Ethernet header, in addition to other tools that expect an
Ethernet link layer. We'd like to support them on top of IPoIB.

I see your point about changing the tools and VM emulations to handle
the IPoIB link layer, but this involves not only changing many
components/layers such as netfront/vbridge/vconfig/etc.; also, unlike
Ethernet, having an IPoIB-aware network device emulation in the VM domain
requires giving the VM access to the IB hardware, and this
association will break upon VM migration, as Tsirkin indicated in the
other email.

Another issue with an IPoIB-aware emulator is that the IPoIB frame doesn't
include the destination link-layer address per data packet (RFC 4391),
therefore the neighbour address needs to be passed from the VM domain down
to the ipoib-aware vBridge.

I don't see in other alternatives a solution for the problem we're
trying to solve. If there are changes/suggestions to improve eIPoIB
netdev driver to avoid "messing with the link layer" and make it
acceptable, we can discuss and apply them.

Eric W. Biederman - Aug. 7, 2012, 12:44 a.m.
Ali Ayoub <ali@mellanox.com> writes:

> Among other things, the main benefit we're targeting is to allow IPoE
> traffic within the VM to go through the (Ethernet) vBridge down to the
> eIPoIB PIF, and eventually to IPoIB and to the IB network.

That works today without code changes.  It is called routing.

> In Para virtualized environment, the VM emulator sends/receives packets
> with Ethernet header, and the vBridge also performs L2 switching based
> on the Ethernet header, in addition to other tools that expect an
> Ethernet link layer. We'd like to support them on top of IPoIB.

See routing.  The code is already done.

> I don't see in other alternatives a solution for the problem we're
> trying to solve. If there are changes/suggestions to improve eIPoIB
> netdev driver to avoid "messing with the link layer" and make it
> acceptable, we can discuss and apply them.

Nothing needs to be applied the code is done.  Routing from
IPoE to IPoIB works.

There is nothing in what anyone has posted as requirements that needs
work to implement.

I totally fail to see how getting packets out of the VM as ethernet
frames, and then IP-layer routing those packets, is not an
option.  What requirement am I missing?

All VMs should support that mode of operation, and certainly the kernel
does.

Implementations involving bridges like macvlan and macvtap are
performance optimizations, and the optimizations don't even apply in
areas like 802.11, where only one mac address is supported per adapter.

Bridging can occasionally be an administrative simplification as
well, but you should be able to achieve a similar simplification
with a DHCP relay and proxy ARP.

Eric
Naoto MATSUMOTO - Aug. 7, 2012, 1:21 a.m.
Hi Eric, Ali

On Mon, 06 Aug 2012 17:44:06 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.

We had a nearly identical performance result in a similar situation; sharing it with you.

-----------------------------------------------------------------------
Ethernet over EoIB using gretap with "Rapid STP" performance result.
http://twitpic.com/agd1pp

How to use EoIB with gretap. 
http://twitpic.com/agd4io


ifconfig ib0 10.1.1.1/24 up 
ip link add eoib0 type gretap remote 10.1.1.2 local 10.1.1.1
ip link add eoib1 type gretap remote 10.1.1.3 local 10.1.1.1
ip link add type veth
ifconfig eoib0 up
ifconfig eoib1 up
ifconfig veth0 up
ifconfig veth1 up

brctl addbr br0
brctl addif br0 eoib0
brctl addif br0 eoib1
brctl addif br0 veth1

rstpd
rstpctl rstp br0 on
rstpctl setbridgeprio br0 40960
rstpctl setportpathcost br0 vpn1 40000
rstpctl sethello br0 1
rstpctl setmaxage br0 6
rstpctl setfdelay br0 4
brctl stp br0 on
brctl sethello br0 1
brctl setfd br0 4
brctl setmaxage br0 6
-----------------------------------------------------------------------

regards,
Eric W. Biederman - Aug. 7, 2012, 3:33 a.m.
ebiederm@xmission.com (Eric W. Biederman) writes:

> Ali Ayoub <ali@mellanox.com> writes:
>
>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.

Oh yes.  It just occurred to me there is a huge problem with eIPoIB as
currently presented in these patches.  It breaks DHCPv4 the same way
it breaks ARP, but DHCPv4 is not fixed up.

I am stunned to realize there has been so much push for a solution
that doesn't even support DHCP.

How can anyone possibly argue that eIPoIB works or is useful?  I just
can't imagine.

Totally stunned,
Eric
Joseph Glanville - Aug. 7, 2012, 3:37 a.m.
On 7 August 2012 10:44, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> writes:
>
>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.
>
> That works today without code changes.  It is called routing.
>
>> In Para virtualized environment, the VM emulator sends/receives packets
>> with Ethernet header, and the vBridge also performs L2 switching based
>> on the Ethernet header, in addition to other tools that expect an
>> Ethernet link layer. We'd like to support them on top of IPoIB.
>
> See routing.  The code is already done.
>
>> I don't see in other alternatives a solution for the problem we're
>> trying to solve. If there are changes/suggestions to improve eIPoIB
>> netdev driver to avoid "messing with the link layer" and make it
>> acceptable, we can discuss and apply them.
>
> Nothing needs to be applied the code is done.  Routing from
> IPoE to IPoIB works.
>
> There is nothing in what anyone has posted as requirements that needs
> work to implement.
>
> I totally fail to see how getting packets out of the VM as ethernet
> frames, and then IP-layer routing those packets over IP, is not an
> option.  What requirement am I missing?
>
> All VMs should support that mode of operation, and certainly the kernel
> does.
>
> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.
>
> Bridging can occasionally be an administrative simplification as
> well, but you should be able to achieve a similar simplification
> with a DHCP relay and proxy ARP.
>
> Eric

Hi,

While I agree this driver doesn't offer an adequate solution to the
problem, I think the problem still very much exists.

Ultimately, if you have an InfiniBand fabric and you want to use it for
inter-guest communication across hosts, you are left with two options:
tunneling and routing.
Routing can be OK sometimes, but it's not a great solution for the vast
majority of people; it requires considerable edge intelligence in the
hypervisor and IS-IS routing setups.
The biggest issue is that it's just not L2; guest applications on
guest operating systems expect to have access to fully fledged
Ethernet devices.
This is especially an issue for hosting providers etc where they have
little to no control over what applications their customers want to
use.

Tunnelling solves some of these problems; it is generally done over L3,
using the existing IP fabric to transport Ethernet L2 frames to the
destination hypervisors.
That is better because now you have true L2 encapsulation, but to be
honest it's slow, chews a lot of cycles, and generally has poor PPS
performance.

IMO this driver doesn't make sense compared to a true Ethernet over IB
encapsulation, which isn't actually all that much extra work. (If I am
right you already have a driver that does just this.)
The point about retaining compatibility with existing IPoIB boggles my
mind a little.
It adds little if any real benefit, because only IP and ARP will work;
no other L2 protocol will work correctly, as others have noted.

With full encapsulation you can make use of all the existing
infrastructure, linux bridge, OpenvSwitch, ebtables/netfilter etc.

Joseph.
Or Gerlitz - Aug. 8, 2012, 5:23 a.m.
On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:

[...]
> So it seems that a sane solution would involve an extra level of
> indirection, with guest addresses being translated to host IB addresses.
> As long as you do this, maybe using an ethernet frame format makes sense.
[...]

Yep, that's among the points we're trying to make, the way you've put
it makes it clearer.

> So far the things that make sense. Here are some that don't, to me:

> - Is a pdf presentation all you have in terms of documentation?
>   We are talking communication protocols here - I would expect a
>   proper spec, and some effort to standardize, otherwise where's the
>   guarantee it won't change in an incompatible way?

To be precise, the solution uses the IPoIB wire protocol 100%, so we
don't see a need for any spec change / standardization effort. This
might go to the first point you brought up: improve the documentation;
will do that. The pdf you looked at was presented at a conference.

>   Other things that I would expect to be addressed in such a spec is
>   interaction with other IPoIB features, such as connected
>   mode, checksum offloading etc, and IB features such as multipath etc.

For the eipoib interface, it doesn't really matter if the underlying
ipoib clones used by it (we call them VIFs) use connected or datagram
mode; what does matter is the MTU and offload features supported by
these VIFs, for which the eipoib interface will take the minimum among
all these VIFs. Since for a given eipoib NIC all its VIFs must
originate from the same IPoIB PIF (e.g. ib0), it's an easy admin job to
make sure they all have the same MTU / features needed for that eipoib
NIC, e.g. by using the same mode (connected/datagram) for all of them;
hope this is clear.
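To illustrate the min-over-VIFs rule described above, here is a toy sketch. The values are assumptions for illustration: 2044 is the usual IPoIB datagram-mode MTU and 65520 the connected-mode maximum; a live system would read /sys/class/net/<vif>/mtu instead of hardcoding them.

```sh
#!/bin/sh
# Sketch: the parent eipoib NIC takes the minimum MTU across its
# slave VIFs; two hypothetical VIF MTUs stand in for sysfs reads.
parent_mtu=
for m in 2044 65520; do
    if [ -z "$parent_mtu" ] || [ "$m" -lt "$parent_mtu" ]; then
        parent_mtu=$m
    fi
done
echo "parent mtu: $parent_mtu"
```

This also shows why mixing modes among the VIFs of one parent would drag the whole interface down to the smallest MTU, hence the advice above to keep all VIFs in the same mode.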


> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>   more to IB addressing than just the LID.  Since everyone on the subnet
>   need access to this translation, I think it makes sense to store it in
>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.

The idea behind the encoding was uniqueness; a LID/QPN pair is unique
per IB HCA end-node. I'm not sure I understood the comment re the IPv4
hacks.
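As a rough sketch of that uniqueness argument (the driver's actual bit layout is not spelled out in this thread, so the prefix byte and field order below are assumptions): a 16-bit LID plus a 24-bit QPN occupy 40 of a MAC's 48 bits, leaving one byte free, e.g. for a locally-administered prefix.

```sh
#!/bin/sh
# Hypothetical packing: 0x02 prefix (locally-administered bit) +
# 16-bit LID + 24-bit QPN = 48 bits, i.e. exactly one MAC address.
# A sketch of why the pair fits, not the driver's real format.
lid=4660        # 0x1234, example LID
qpn=1193046     # 0x123456, example QPN
mac=$(printf '%02x%04x%06x' 2 "$lid" "$qpn" | sed 's/../&:/g; s/:$//')
echo "$mac"     # 02:12:34:12:34:56
```

The flip side, raised elsewhere in the thread, is that any such MAC changes whenever the LID or QPN changes (e.g. on migration or failover).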

> - IGMP/MAC snooping in a driver is just too hairy.

mmm, any rough idea/direction for how to do that otherwise?

>   As you point out, bridge currently needs the uplink in promisc mode.
>   I don't think a driver should work around that limitation.
>   For some setups, it might be interesting to remove the
>   promisc mode requirement, failing that,
>   I think you could use macvtap passthrough.

That's in the plans; the current code doesn't assume that the eipoib
has a bridge on top. For VM networking it works with bridge + tap and
bridge + macvtap, but it would easily work with passthrough once we
allow creating multiple eipoib interfaces on the same ipoib PIF (e.g.
today for the ib0 PIF we create eipoib eth0, and then two VIFs ib0.1
and ib0.2 that are enslaved by eth0, but next we will create eth1 and
eth2 which will use ib0.1 and ib0.2 respectively).

> - Currently migration works without host kernel help, would be
>   preferable to keep it that way.

OK
Or Gerlitz - Aug. 8, 2012, 6:04 a.m.
Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Ali Ayoub <ali@mellanox.com> writes:

>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.

> Oh yes.  It just occurred to me there is a huge problem with eIPoIB as
> currently presented in these patches.  It breaks DHCPv4 the same way
> it breaks ARP, but DHCPv4 is not fixed up.

To put things in place: DHCPv4 is supported with eIPoIB. The DHCP
UDP/IP payload isn't touched; you only need to set the BOOTP broadcast
flag in the DHCP server config file.
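For concreteness, with ISC dhcpd the change Or describes is a one-line server option (assuming ISC dhcpd; other DHCP servers have equivalent knobs):

```
# dhcpd.conf fragment: always send replies to the broadcast address,
# i.e. the BOOTP broadcast flag behavior described above.
always-broadcast on;
```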

Or.
Or Gerlitz - Aug. 8, 2012, 7:32 a.m.
Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> writes:
[...]
>> I don't see in other alternatives a solution for the problem we're
>> trying to solve. If there are changes/suggestions to improve eIPoIB
>> netdev driver to avoid "messing with the link layer" and make it
>> acceptable, we can discuss and apply them.

> Nothing needs to be applied the code is done.  Routing from
> IPoE to IPoIB works. There is nothing in what anyone has posted as requirements
>  that needs work to implement.

> I totally fail to see how getting packets of of the VM as ethernet
> frames, and then  IP layer routing those packets over IP is not an
> option.  What requirement am I missing.


As you've indicated, routing with/without proxy-ARP is an option; however,

> All VMs should suport that mode of operation, and certainly the kernel does.
> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.
> Bridging can ocassionally also be an administrative simplification as
> well, but you should be able to achieve the a similar simplification
> with a dhcprelay and proxy arp.

as you wrote here, when performance and ease-of-use are in the
spotlight, VM deployments tend not to use routing.

This is because it involves more overhead in packet forwarding and
more administration work, for example setting routing rules that
involve the VM IP address, something which AFAIK the hypervisor has
no clue about; it's also unclear to me if/how live migration can work
in such a setting.

For this exact reason, there's a bunch of use-cases by tools and
cloud stacks (such as OpenStack, oVirt, and more) which do use bridged
mode and the rest of the Ethernet envelope, such as virtual L2
vlan domains, ebtables-based rules, etc. These are not applicable
to ipoib, but are working fine with eipoib.

You mentioned that bridging mode doesn't apply to environments such as
802.11, and hence routing mode is used; we are trying to make a point
here that bridging mode does apply to ipoib with the approach suggested
by eipoib.

Also, if we extend the discussion a bit, there are two more aspects to throw in:

The first is the performance point we have already started to mention
-- specifically, approaches for RX zero copy (into the VM buffer) use
designs such as a vhost + macvtap NIC in passthrough mode, which is
likely to be set over a per-VM hypervisor NIC, e.g. such as the ones
provided by the VMDQ patches John Fastabend started to post (see
http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
clone children are IPoIB VMDQ NICs if you like, and by setting an
eipoib NIC on top of each, they can be plugged into that design.

The second aspect is non-VM environments, where a NIC with an Ethernet
look and feel is required for IP traffic, but this has to live within
an ecosystem that fully uses IPoIB.
In other words, in a use case where IPoIB has to be under the covers
for a set of specific apps or nodes that do IP interaction with other
apps/nodes and gateways which use IPoIB, the eIPoIB driver provides
that functionality.

So, to sum up, routing / proxy-arp seems to be off from where we are targeting.

Or.
Eric W. Biederman - Aug. 8, 2012, 8:36 a.m.
Or Gerlitz <or.gerlitz@gmail.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> Ali Ayoub <ali@mellanox.com> writes:
>
>>> Among other things, the main benefit we're targeting is to allow IPoE
>>> traffic within the VM to go through the (Ethernet) vBridge down to the
>>> eIPoIB PIF, and eventually to IPoIB and to the IB network.
>
>> Oh yes.  It just occurred to me there is a huge problem with eIPoIB as
>> currently presented in these patches.  It breaks DHCPv4 the same way
>> it breaks ARP, but DHCPv4 is not fixed up.
>
> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
> flag in the dhcp server config file.

Wrong.  DHCPv4 is broken over eIPoIB. 

Coming from ethernet:
htype == 1, not 32 as required by RFC 4390
hlen == 6, not 0 as required by RFC 4390
The chaddr field has 6 bytes of the ethernet MAC address, not the
required 16 bytes of 0.

The client-identifier field is optional over ethernet.

An ethernet DHCPv4 client simply does not generate a dhcp packet that
conforms to RFC4390.
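The mismatch is visible in the first four octets of the BOOTP header (op, htype, hlen, hops); the hex values below are illustrative of the two cases being contrasted here, not captured traffic:

```sh
#!/bin/sh
# First four BOOTP octets as an Ethernet client emits them versus
# what RFC 4390 requires on IPoIB (htype 32 = 0x20, hlen 0,
# plus an all-zero chaddr not shown here).
set -- 01 01 06 00   # Ethernet DHCPv4 client
echo "ethernet: htype=0x$2 hlen=0x$3"
set -- 01 20 00 00   # RFC 4390 client
echo "rfc4390:  htype=0x$2 hlen=0x$3"
```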

Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
may reasonably look at the DHCP packet and drop it because it is
garbage.

You might find a forgiving dhcp server that doesn't drop insane packets
on the floor and tries to make things work.

I am sorry.  eIPoIB is broken as designed.  eIPoIB most assuredly is not
compatible with ethernet.  eIPoIB most definitely does not work even for
the general case of transporting IP traffic.  Claiming that eIPoIB is
anything else is a lie.

Eric
Eric W. Biederman - Aug. 8, 2012, 9:17 a.m.
Or Gerlitz <or.gerlitz@gmail.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Ali Ayoub <ali@mellanox.com> writes:
> [...]
>>> I don't see in other alternatives a solution for the problem we're
>>> trying to solve. If there are changes/suggestions to improve eIPoIB
>>> netdev driver to avoid "messing with the link layer" and make it
>>> acceptable, we can discuss and apply them.
>
>> Nothing needs to be applied the code is done.  Routing from
>> IPoE to IPoIB works. There is nothing in what anyone has posted as requirements
>>  that needs work to implement.
>
>> I totally fail to see how getting packets out of the VM as ethernet
>> frames, and then IP-layer routing those packets over IP, is not an
>> option.  What requirement am I missing?
>
>
> As you've indicated routing w/w.o using proxy-arp is an option, however,
>
>> All VMs should support that mode of operation, and certainly the kernel does.
>> Implementations involving bridges like macvlan and macvtap are
>> performance optimizations, and the optimizations don't even apply in
>> areas like 802.11, where only one mac address is supported per adapter.
>> Bridging can occasionally be an administrative simplification as
>> well, but you should be able to achieve a similar simplification
>> with a dhcprelay and proxy arp.
>
> as you wrote here, when performance and ease-of-use is under the spot,
> VM deployments tend to not to use routing.
>
> This is b/c it involves more over-head on the packet forwarding, and
> more administration work, for example, for setting routing rules that
> involve the VM IP address, something which AFAIK the hypervisor have
> no clue on, also its unclear to me if/how live migration can work in
> such setting.

All you need to make proxy-arp essentially pain free is a smart dhcp
relay, that sets up the routes.
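A minimal sketch of what this proposes (interface names and addresses are assumptions; the host route at the end is the piece a smart DHCP relay would install automatically on seeing the lease):

```sh
#!/bin/sh
# Routed/proxy-ARP alternative to bridging; requires root and real
# ib0/tap0 interfaces, shown for illustration only.
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.ib0.proxy_arp=1    # answer ARP on behalf of the VM
sysctl -w net.ipv4.conf.tap0.proxy_arp=1
# Host route toward the VM's leased address (example address):
ip route add 10.1.1.55/32 dev tap0
```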

> For this exact reason, there's a bunch of use-cases by tools and
> cloud stacks (such as OpenStack, oVirt, and more) which do use bridged
> mode and the rest of the Ethernet envelope, such as virtual L2
> vlan domains, ebtables-based rules, etc. These are not applicable
> to ipoib, but are working fine with eipoib.

Yes, I am certain all of their IPv6 traffic works fine.

Regardless, those are open source projects and can be modified to
cleanly support infiniband.

> You mentioned that bridging mode doesn't apply to environment such as
> 802.11, and hence routing mode is used, we are trying to make a point
> here that bridging mode applies to ipoib with the approach suggested
> by eipoib.

You are completely failing.  Every time I look I see something about
eIPoIB that is even more broken.  Given that eIPoIB is a NAT
implementation, that isn't really a surprise, but still.

eIPoIB imposes enough overhead that I expect routing is cheaper,
so your performance advantages go right out the window.

eIPoIB is seriously incompatible with ethernet, breaking almost
everything and barely allowing IPv4 to work.

> Also, if we extend the discussion a bit, there are two more aspects to throw in:
>
> The first is the performance thing we have already started to mention
> -- specifically, the approach for RX zero copy (into the VM buffer),
> use designs such as vhost + macvtap NIC in passthrough mode which is
> likely to be set over a per VM hypervisor NIC, e.g such as the ones
> provided by VMDQ patches John Fastabend started to post (see
> http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
> clone child are IPoIB VMDQ NICs if you like, and setting an eipoib NIC
> on top of each they can be plugged to that design.

If you care about performance link-layer NAT is not the way to go.
Teach the pieces you care about how to talk infiniband.

> The 2nd aspect, is NON VM environments where a NIC with Ethernet look
> and feel is required for IP traffic, but this have to live within an
> ecosystem that fully uses IPoIB.
> In other words, a use case where IPoIB has to be below the cover for
> set of some specific apps, or nodes but do IP interaction with other
> apps/nodes and gateways who use IPoIB, the eIPoIB driver provides that
> functionality.

ip link add type dummy.

There now you have an interface with ethernet look and feel, and
routing can happily avoid it.
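Spelled out a bit (device names and addresses below are hypothetical), that dummy-device recipe would look like:

```sh
#!/bin/sh
# Requires root; gives local apps an Ethernet-looking NIC while
# forwarding actually happens over IPoIB via ordinary routing.
ip link add dum0 type dummy
ip link set dum0 up
ip addr add 192.0.2.10/32 dev dum0                # example service address
ip route add 10.1.1.0/24 dev ib0 src 192.0.2.10   # example IPoIB subnet
```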

> So, to sum up, routing / proxy-arp seems to be off where we are
> targeting.

My condolences.

The existence of routing / proxy-arp means that solutions do exist
(unlike your previous claim); you just don't like the idea of deploying
them.

Infiniband is standard enough you could quite easily implement virtual
infiniband bridging as an alternative to ethernet bridging.


At this stage of the game eIPoIB is interesting the same way a zombie is
interesting.  It is fascinating to see it still moving as chunks of
flesh fall to the floor removing any doubt that it is dead.

Eric
Or Gerlitz - Aug. 9, 2012, 4:06 a.m.
Eric W. Biederman <ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:

>> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
>> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
>> flag in the dhcp server config file.

> Wrong.  DHCPv4 is broken over eIPoIB. Coming from ethernet
> htype == 1 not 32 as required by RFC4390
> hlen == 6 not 0 as required by RFC4390
> The chaddr field has 6 bytes of the ethernet mac address not the
> required 16 bytes of 0. The client-identifier field is optional over ethernet.
> An ethernet DHCPv4 client simply does not generate a dhcp packet that
> conforms to RFC4390.
>
> Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
> may reasonably look at the DHCP packet and drop it because it is garbage.
>
> You might find a forgiving dhcp server that doesn't drop insane packets
> on the floor and tries to make things work.

Under the eIPoIB design, the VM DHCP interaction follows
Ethernet DHCP, not IPoIB DHCP (RFC 4390).

The DHCP server has no reason to drop such packets.

DHCP is an L7 (L5 to be precise) construct; I don't see
why the fact that an IPoIB DHCP RFC exists means/mandates
that the DHCP server care about the link-layer type.

Or.
Or Gerlitz - Aug. 9, 2012, 4:34 a.m.
Eric W. Biederman <ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:

>> as you wrote here, when performance and ease-of-use are in the
>> spotlight, VM deployments tend not to use routing.
>> This is b/c it involves more over-head on the packet forwarding, and
>> more administration work, for example, for setting routing rules that
>> involve the VM IP address, something which AFAIK the hypervisor have
>> no clue on, also its unclear to me if/how live migration can work in such setting.

> All you need to make proxy-arp essentially pain free is a smart dhcp
> relay, that sets up the routes.

So a DHCP relay would maybe help avoid some of the pain; however,
can you elaborate on if/how live migration is supported under this scheme?

>> For this exact reason, there's a bunch of use-cases by tools and
>> cloud stacks (such as OpenStack, oVirt, and more) which do use bridged
>> mode and the rest of the Ethernet envelope, such as virtual L2
>> vlan domains, ebtables-based rules, etc. These are not applicable
>> to ipoib, but are working fine with eipoib.

> Yes I am certain all of their IPv6 traffic works fine.

I'm not sure I follow this comment.

> Regardless, those are open source projects and can be modified to
> cleanly support infiniband.

Open source code can be modified indeed, but since exposing the IPoIB
link layer to tools/emulators and VMs doesn't really make sense (see
below), we brought this approach as a way to allow people to use
bridging mode when the fabric is IB.

>> You mentioned that bridging mode doesn't apply to environment such as
>> 802.11, and hence routing mode is used, we are trying to make a point
>> here that bridging mode applies to ipoib with the approach suggested by eipoib.

> You are completely failing.  Every time I look I see something about
> eIPoIB that is even more broken.  Given that eIPoIB is a NAT
> implementation that isn't really a surprise but still.

> eIPoIB imposes enough overhead that I expect that routing is cheaper,
> so your performance advantages go right out the window.
>
> eIPoIB is seriously incompatible with ethernet breaking almost
> everything and barely allowing IPv4 to work.

I don't agree with this incompatibility statement; you had a claim
about DHCP and I addressed it. Beyond that, you don't like the basic
eIPoIB idea/design, but that can't be the basis of an incompatibility
argument.


>> Also, if we extend the discussion a bit, there are two more aspects to throw in:
>> The first is the performance thing we have already started to mention
>> -- specifically, the approach for RX zero copy (into the VM buffer),
>> use designs such as vhost + macvtap NIC in passthrough mode which is
>> likely to be set over a per VM hypervisor NIC, e.g such as the ones
>> provided by VMDQ patches John Fastabend started to post (see
>> http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
>> clone child are IPoIB VMDQ NICs if you like, and setting an eipoib NIC
>> on top of each they can be plugged to that design.

> If you care about performance link-layer NAT is not the way to go.
> Teach the pieces you care about how to talk infiniband.
>
>> The 2nd aspect, is NON VM environments where a NIC with Ethernet look
>> and feel is required for IP traffic, but this have to live within an
>> ecosystem that fully uses IPoIB.
>> In other words, a use case where IPoIB has to be below the cover for
>> set of some specific apps, or nodes but do IP interaction with other
>> apps/nodes and gateways who use IPoIB, the eIPoIB driver provides that
>> functionality.


> ip link add type dummy.

> There now you have an interface with ethernet look and feel, and
> routing can happily avoid it.

Again, I'm not sure I follow; you mean "routing can happily use it",
correct? That is, do routing between the dummy interface and the IPoIB
interface?


>> So, to sum up, routing / proxy-arp seems to be off where we are targeting.

> My condolences.
> The existence of router / proxy-arp means that solutions do exist
> (unlike your previous claim) you just don't like the idea of deploying them.

We don't like them for the set of arguments we are covering here; re
manageability, it still needs to be clarified if/how live migration
works and what it means to always mandate a DHCP relay.

> Infiniband is standard enough you could quite easily implement virtual
> infiniband bridging as an alternative to ethernet bridging.

Not really; as Michael indicated in his response on this thread,
http://marc.info/?l=linux-netdev&m=134419288218373&w=2
IPoIB link-layer addresses use IB HW constructs for which soft
hardware-address setting isn't supported, and this interferes
with live migration.

Or.
Michael S. Tsirkin - Aug. 12, 2012, 10:22 a.m.
On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> [...]
> > So it seems that a sane solution would involve an extra level of
> > indirection, with guest addresses being translated to host IB addresses.
> > As long as you do this, maybe using an ethernet frame format makes sense.
> [...]
> 
> Yep, that's among the points we're trying to make, the way you've put
> it makes it clearer.
> 
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> 
> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> see a need
> for any spec change / standardization effort.

Yes, I am guessing this is the real reason you pack LID/QPN
into the MAC: to make it all local. But it's really a hack,
and if you start storing it all in the SM you will need
to document the format so others can interoperate.

> This might go to the 1st
> point you've
> brought... improve the documentation, will do that. The pdf you looked
> at was presented
> in a conference.
> 
> >   Other things that I would expect to be addressed in such a spec is
> >   interaction with other IPoIB features, such as connected
> >   mode, checksum offloading etc, and IB features such as multipath etc.
> 
> For the eipoib interface, it doesn't really matter if the underlying
> ipoib clones used by it (we call them VIFs) use connected or datagram
> mode; what does matter is the MTU and offload features supported by
> these VIFs, for which the eipoib interface will take the minimum among
> all these VIFs. Since for a given eipoib NIC all its VIFs must
> originate from the same IPoIB PIF (e.g. ib0), it's an easy admin job
> to make sure they all have the same MTU / features needed for that
> eipoib NIC, e.g. by using the same mode (connected/datagram) for all
> of them; hope this is clear.
> 

Just pointing out all this needs to be documented.

> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >   more to IB addressing than just the LID.  Since everyone on the subnet
> >   need access to this translation, I think it makes sense to store it in
> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> 
> The idea behind the encoding was uniqueness; LID/QPN is unique per IB
> HCA end-node.

But then it breaks with VM migration, IB failover, softmac setting in
the guest, probably more?

> I wasn't sure to understand the comment re the IPv4 hacks.

This refers to the ARP hack that you use to fix
VM migration.

> > - IGMP/MAC snooping in a driver is just too hairy.
> 
> mmm, any rough idea/direction how to do that otherwise?

Sure, even two ways, ideally you'd do both :)
A. fix macvtap
1. Use netdev_for_each_mc_addr etc. to get multicast addresses
2. teach macvtap to fill that in (it currently floods multicasts
   for guest-to-guest communication, so we need to fix it anyway)

B. fix bridge
   teach bridge to work for VMs without using promisc mode


> >   As you point out, bridge currently needs the uplink in promisc mode.
> >   I don't think a driver should work around that limitation.
> >   For some setups, it might be interesting to remove the
> >   promisc mode requirement, failing that,
> >   I think you could use macvtap passthrough.
> 
> That's in the plans; the current code doesn't assume that the eipoib
> has a bridge on top. For VM networking it works with bridge + tap and
> bridge + macvtap, but it would easily work with passthrough once we
> allow creating multiple eipoib interfaces on the same ipoib PIF (e.g.
> today for the ib0 PIF we create eipoib eth0, and then two VIFs ib0.1
> and ib0.2 that are enslaved by eth0, but next we will create eth1 and
> eth2 which will use ib0.1 and ib0.2 respectively).

The whole promisc mode emulation is there for the bridge, no?
Since you don't support promisc, ideally we'd check a hardware
capability and fail gracefully, though naturally this is not a top
priority.

> > - Currently migration works without host kernel help, would be
> >   preferable to keep it that way.
> 
> OK
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin - Aug. 12, 2012, 10:36 a.m.
On Thu, Aug 09, 2012 at 07:34:23AM +0300, Or Gerlitz wrote:
> > Infiniband is standard enough you could quite easily implement virtual
> > infiniband bridging as an alternative to ethernet bridging.
> 
> Not really, as Michael indicated in his response over this thread
> http://marc.info/?l=linux-netdev&m=134419288218373&w=2
> IPoIB link layer addresses use IB HW constructs for which soft
> hardware address setting isn't supported, and this interferes
> with live migration.
> 
> Or.

But I'd just like to point out your code does not support softmac
either.
Michael S. Tsirkin - Aug. 12, 2012, 10:54 a.m.
On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> 
> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> see a need
> for any spec change / standardization effort.

Thinking about it some more, the solution is not really 100% local:
- ARP packets used at migration are visible on the network
- QPN/LID in MAC are guest visible, and could be exported on the network

Or Gerlitz - Aug. 12, 2012, 1:09 p.m.
On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
>> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> [...]
>>> So it seems that a sane solution would involve an extra level of
>>> indirection, with guest addresses being translated to host IB addresses.
>>> As long as you do this, maybe using an ethernet frame format makes sense.
>> [...]
>>
>> Yep, that's among the points we're trying to make, the way you've put
>> it makes it clearer.
>>
>>> So far the things that make sense. Here are some that don't, to me:
>>
>>> - Is a pdf presentation all you have in terms of documentation?
>>>   We are talking communication protocols here - I would expect a
>>>   proper spec, and some effort to standardize, otherwise where's the
>>>   guarantee it won't change in an incompatible way?
>>
>> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
>> see a need
>> for any spec change / standardization effort.
> Yes, I am guessing this is the real reason you pack LID/QPN
> in the MAC - to make it all local. But it's a hack really,
> and if you start storing it all in the SM you will need
> to document the format so others can inter-operate.

I'd like to review the way we generate these MAC addresses, maybe it
can be done differently.


>
>
>> This might go to the 1st point you've
>> brought... improve the documentation, will do that. The pdf you looked
>> at was presented in a conference.
>>
>>>   Other things that I would expect to be addressed in such a spec is
>>>   interaction with other IPoIB features, such as connected
>>>   mode, checksum offloading etc, and IB features such as multipath etc.
>>
>> For the eipoib interface, it doesn't really matter if the underlying
>> ipoib clones used by it (we call them VIFs) use connected or datagram
>> mode, what does matter is the MTU and offload features supported by
>> these VIFs, for which the eipoib interface will have the min among all
>> these VIFs. Since for a given eipoib nic, all its VIFs must originate
>> from the same IPoIB PIF (e.g ib0) it's an easy admin job to make sure they
>> all have the same mtu / features which are needed for that eipoib nic,
>> e.g by using the same mode (connected/datagram) for all of them, hope
>> this is clear.
>>
> Just pointing out all this needs to be documented.

OK, will do

>
>
>>> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>>>   more to IB addressing than just the LID.  Since everyone on the subnet
>>>   need access to this translation, I think it makes sense to store it in
>>>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
>>
>> The idea behind the encoding was uniqueness, LID/QPN is unique per IB
>> HCA end-node.
> But then it breaks with VM migration, IB failover, softmac setting in guest, probably more?

With the current design/code the remote mac of a VM changes when that
VM migrates or IB LIDs are changed. As for softmac setting in the
guest, we don't send the guest MAC on the wire anyway, since the
Ethernet header is removed.

Or.
Or Gerlitz - Aug. 12, 2012, 1:15 p.m.
On 12/08/2012 13:22, Michael S. Tsirkin wrote:
>>> - IGMP/MAC snooping in a driver is just too hairy.
>>
>> mmm, any rough idea/direction how to do that otherwise?
> Sure, even two ways, ideally you'd do both :)
> A. fix macvtap
> 1. Use netdev_for_each_mc_addr etc to get multicast addresses
> 2. teach macvtap to fill that in (it currently floods multicasts
>    for guest to guest communication so we need to fix it anyway)
>
> B. fix bridge
>    teach bridge to work for VMs without using promisc mode

I wasn't sure I fully follow... need some more bits of info, the
macvtap fix relates only to IGMP snooping, correct? As for the bridge
fix, does this somehow relate to ARP snooping we do in the driver? How?

Or.

Or Gerlitz - Aug. 12, 2012, 1:19 p.m.
On 12/08/2012 13:54, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
>>> So far the things that make sense. Here are some that don't, to me:
>>> - Is a pdf presentation all you have in terms of documentation?
>>>    We are talking communication protocols here - I would expect a
>>>    proper spec, and some effort to standardize, otherwise where's the
>>>    guarantee it won't change in an incompatible way?
>> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
>> see a need for any spec change / standardization effort.
> Thinking about it some more, the solution is not really 100% local:
> - ARP packets used at migration are visible on the network

The ARP packets on the wire are IPoIB ARP packets -- indeed, when the
guest moves, the IPoIB link layer address which was associated with the
VIF that used to serve this guest on the host we migrated from changes
to be that of a VIF on the host that we migrated to.


> - QPN/LID in MAC are guest visible, and could be exported on the network
>

Michael S. Tsirkin - Aug. 12, 2012, 1:41 p.m.
On Sun, Aug 12, 2012 at 04:09:35PM +0300, Or Gerlitz wrote:
> On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> >On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> >> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>
> >> [...]
> >>> So it seems that a sane solution would involve an extra level of
> >>> indirection, with guest addresses being translated to host IB addresses.
> >>> As long as you do this, maybe using an ethernet frame format makes sense.
> >> [...]
> >>
> >> Yep, that's among the points we're trying to make, the way you've put
> >> it makes it clearer.
> >>
> >>> So far the things that make sense. Here are some that don't, to me:
> >>
> >>> - Is a pdf presentation all you have in terms of documentation?
> >>>   We are talking communication protocols here - I would expect a
> >>>   proper spec, and some effort to standardize, otherwise where's the
> >>>   guarantee it won't change in an incompatible way?
> >>
> >> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> >> see a need
> >> for any spec change / standardization effort.
> > Yes, I am guessing this is the real reason you pack LID/QPN
> > in the MAC - to make it all local. But it's a hack really,
> > and if you start storing it all in the SM you will need
> > to document the format so others can inter-operate.
>
> I'd like to review the way we generate these MAC addresses, maybe it
> can be done differently.
>
>
> >
> >
> >> This might go to the 1st point you've
> >> brought... improve the documentation, will do that. The pdf you looked
> >> at was presented in a conference.
> >>
> >>>   Other things that I would expect to be addressed in such a spec is
> >>>   interaction with other IPoIB features, such as connected
> >>>   mode, checksum offloading etc, and IB features such as multipath etc.
> >>
> >> For the eipoib interface, it doesn't really matter if the underlying
> >> ipoib clones used by it (we call them VIFs) use connected or datagram
> >> mode, what does matter is the MTU and offload features supported by
> >> these VIFs, for which the eipoib interface will have the min among all
> >> these VIFs. Since for a given eipoib nic, all its VIFs must originate
> >> from the same IPoIB PIF (e.g ib0) it's an easy admin job to make sure they
> >> all have the same mtu / features which are needed for that eipoib nic,
> >> e.g by using the same mode (connected/datagram) for all of them, hope
> >> this is clear.
> >>
> > Just pointing out all this needs to be documented.
>
> OK, will do
>
> >
> >
> >>> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >>>   more to IB addressing than just the LID.  Since everyone on the subnet
> >>>   need access to this translation, I think it makes sense to store it in
> >>>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> >>
> >> The idea behind the encoding was uniqueness, LID/QPN is unique per IB
> >> HCA end-node.
> > But then it breaks with VM migration, IB failover, softmac setting in guest, probably more?
> 
> With the current design/code the remote mac of a VM changes when that
> VM migrates or IB LIDs are changed.

Which is exactly the problem with IB and VM migration.

> As for softmac setting in the guest, we  don't
> send the guest MAC on the wire
> anyway, since the Ethernet header is removed.
> 
> Or.

Yes but you generate remote addresses automatically so admin can not
change the local address without risking conflicts.
Michael S. Tsirkin - Aug. 12, 2012, 1:55 p.m.
On Sun, Aug 12, 2012 at 04:15:52PM +0300, Or Gerlitz wrote:
> On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> >>> - IGMP/MAC snooping in a driver is just too hairy.
> >>
> >> mmm, any rough idea/direction how to do that otherwise?
> > Sure, even two ways, ideally you'd do both :)
> > A. fix macvtap
> > 1. Use netdev_for_each_mc_addr etc to get multicast addresses
> > 2. teach macvtap to fill that in (it currently floods multicasts
> >    for guest to guest communication so we need to fix it anyway)
> >
> > B. fix bridge
> >    teach bridge to work for VMs without using promisc mode
> 
> I wasn't sure I fully follow... need some more bits of info, the
> macvtap fix relates only to IGMP snooping, correct? As for the bridge
> fix, does this somehow relate to ARP snooping we do in the driver? How?
> 
> Or.

I didn't realize you do ARP snooping. Why?

I know you mangle outgoing ARP packets, this will go
away if you maintain a mapping in SM accessible to all guests.
Michael S. Tsirkin - Aug. 12, 2012, 2:05 p.m.
On Thu, Aug 09, 2012 at 07:06:46AM +0300, Or Gerlitz wrote:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Or Gerlitz <or.gerlitz@gmail.com> writes:
> 
> >> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
> >> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
> >> flag in the dhcp server config file.
> 
> > Wrong.  DHCPv4 is broken over eIPoIB. Coming from ethernet
> > htype == 1 not 32 as required by RFC4390
> > hlen == 6 not 0 as required by RFC4390
> > The chaddr field has 6 bytes of the ethernet mac address not the
> > required 16 bytes of 0. The client-identifier field is optional over ethernet.
> > An ethernet DHCPv4 client simply does not generate a dhcp packet that
> > conforms to RFC4390.
> >
> > Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
> > may reasonably look at the DHCP packet and drop it because it is garbage.
> >
> > You might find a forgiving dhcp server that doesn't drop insane packets
> > on the floor and tries to make things work.
> 
> Under the eIPoIB design, the VM DHCP interaction follows
> Ethernet DHCP, and not the IPoIB DHCP (RFC 4390).
> 
> The DHCP server has no reason to drop such packets.
> 
> DHCP is an L7 (L5 to be precise) construct, I don't see
> why the fact that an IPoIB DHCP RFC exists means/mandates
> that the DHCP server care about the link layer type.
> 
> Or.

For example DHCP server could be configured with
HW address/IP address table.
Or Gerlitz - Aug. 12, 2012, 2:13 p.m.
On 12/08/2012 16:55, Michael S. Tsirkin wrote:
> I didn't realize you do ARP snooping. Why? I know you mangle outgoing 
> ARP packets, 

Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP
packets, from/to Ethernet ARPs to/from IPoIB ARPs.


> this will go away if you maintain a mapping in SM accessible to all 
> guests. 

guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
another driver in the host. But what mapping exactly? And remember that
this code (VM through eipoib) can talk to any IPoIB element on the
fabric, native, virtualized, HW/SW gateways, etc etc.

Or.
Eric W. Biederman - Aug. 12, 2012, 3:40 p.m.
Or Gerlitz <or.gerlitz@gmail.com> writes:

> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> [...]
>> So it seems that a sane solution would involve an extra level of
>> indirection, with guest addresses being translated to host IB addresses.
>> As long as you do this, maybe using an ethernet frame format makes sense.
> [...]
>
> Yep, that's among the points we're trying to make, the way you've put
> it makes it clearer.
>
>> - IGMP/MAC snooping in a driver is just too hairy.
>
> mmm, any rough idea/direction how to do that otherwise?

Let me give you a non-hack recommendation.

- Give up on being wire compatible with IPoIB.

- Define and implement ethernet over infiniband aka EoIB.

With EoIB:
- The SM would map ethernet address to infiniband hardware addresses.
- You discover which multicast addresses are of interest from the
  IP layer above so no snooping is necessary.
- You could run queue pairs directly to hosts.

Shrug.  It is trivial and it will work.  It will probably run into the
same problems that have historically been a problem for using IPoIB
(lack of stateless offloads) but shrug that is mostly a NIC firmware
problem.  The switches will have no trouble and interoperability will
be assured.

If you want to map ethernet over infiniband please map ethernet over
infiniband.  Don't poorly NAT ethernet into infiniband.

Eric
Michael S. Tsirkin - Aug. 12, 2012, 8:54 p.m.
On Sun, Aug 12, 2012 at 05:13:43PM +0300, Or Gerlitz wrote:
> On 12/08/2012 16:55, Michael S. Tsirkin wrote:
> >I didn't realize you do ARP snooping. Why? I know you mangle
> >outgoing ARP packets,
> 
> Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP
> packets, from/to Ethernet ARPs
> to/from IPoIB ARPs.
> 
> 
> >this will go away if you maintain a mapping in SM accessible to
> >all guests.
> 
> guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
> another driver in the host. But what mapping exactly?

Well we are getting into protocol design here.  So here's a sketch
showing how you could build a protocol that does work.  But note it is
not *exactly* IPoIB. It is I think close enough that you can
use existing NIC hardware/firmware, which is why it
differs slightly from what Eric described, and is
more complex.  But it still shares the same property of no hacks, no
packet snooping in driver, etc.


And if you want to go that route, you really should talk to some IB
protocol people to figure out what works, write a spec and try to
standardize. lkml/netdev is not the right place.

But since you asked, if I had to, I would probably try to
do it like this:

- Each device registers with the SA specifying the
  mac address (+ vlan?), SA stores the translation from that
  to IPoIB address.
- alternatively, SA admin configures the translation statically
- you get a packet with 6 byte mac address,
  query the SA for a mapping to IPoIB address, strip
  ethernet frame and send
- multicast GID addresses can be similar, filled either when registering
  for multicast or by SA admin

I think it's possible that you could also convert a mac address to
EUI-64 and prepend a prefix to get a legal GID. But maybe I'm missing
something.  This could be handy for multicast.
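The MAC-to-GID conversion suggested here could be sketched along the lines of the modified EUI-64 mapping from RFC 4291: flip the universal/local bit, insert the 0xFF 0xFE filler bytes in the middle of the MAC, and prepend a 64-bit subnet prefix. A hypothetical helper (not from the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of mapping a 48-bit MAC to a 128-bit GID via a modified
 * EUI-64 (RFC 4291 style), as suggested above.  Illustrative only. */
static void mac_to_gid(const uint8_t mac[6], const uint8_t prefix[8],
		       uint8_t gid[16])
{
	memcpy(gid, prefix, 8);		/* 64-bit subnet prefix */
	gid[8]  = mac[0] ^ 0x02;	/* flip the universal/local bit */
	gid[9]  = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;			/* EUI-64 filler bytes */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}
```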


In both cases:
- SA could return GID that you then resolve to
  LID using another query, or it could return LID so you save a roundtrip
- results can be cached locally
- SA can send updates when translation changes to flush this cache
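The "cache locally / flush on SA update" steps above could look roughly like this; the structures and names are hypothetical, not from the patch, and a real implementation would also handle hash collisions and locking:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local cache for MAC -> IPoIB address lookups, with an
 * invalidate hook driven by SA change notifications.  Illustrative. */
struct ipoib_addr { uint16_t lid; uint32_t qpn; };

struct cache_ent {
	uint8_t mac[6];
	struct ipoib_addr addr;
	int valid;
};

#define CACHE_SIZE 64
static struct cache_ent cache[CACHE_SIZE];

static unsigned mac_hash(const uint8_t mac[6])
{
	return ((mac[4] << 8) | mac[5]) % CACHE_SIZE;
}

static int cache_lookup(const uint8_t mac[6], struct ipoib_addr *out)
{
	struct cache_ent *e = &cache[mac_hash(mac)];

	if (e->valid && !memcmp(e->mac, mac, 6)) {
		*out = e->addr;
		return 1;
	}
	return 0;	/* miss: caller queries the SA, then cache_insert() */
}

static void cache_insert(const uint8_t mac[6], struct ipoib_addr addr)
{
	struct cache_ent *e = &cache[mac_hash(mac)];

	memcpy(e->mac, mac, 6);
	e->addr = addr;
	e->valid = 1;
}

/* called when the SA reports that a translation changed */
static void cache_flush(const uint8_t mac[6])
{
	struct cache_ent *e = &cache[mac_hash(mac)];

	if (e->valid && !memcmp(e->mac, mac, 6))
		e->valid = 0;
}
```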


Now above means protocols such as ARP and DHCP use 48 bit addresses so
you can not mix this new protocol with IPoIB.  Maybe IPoIB could simply
ignore irrelevant packets, but it's best not to try, get a
different all-broadcast group and CM ID instead to avoid confusion.

One other interesting thing you can do is forward multicast
registration data from the router, translate to mgid
by the SA and do appropriate IB mcast registrations.

My memory of how IB works is a bit rusty so I probably made some
mistakes above but should roughly work, I think.

> and remember that
> this code (VM through eipoib) can talk to any IPoIB element on the
> fabric, native,
> virtualized,  HW/SW gateways, etc etc.
> 
> Or.

If you want this, then you really want a limited form of IPoIB bridging.

Alternatively, decide that what you do is not IPoIB, have a proper
protocol of your own.
Or Gerlitz - Aug. 13, 2012, 8:33 a.m.
On 12/08/2012 18:40, Eric W. Biederman wrote:
> Let me give you a non-hack recommendation.
>
> - Give up on being wire compatible with IPoIB.
>
> - Define and implement ethernet over infiniband aka EoIB.
>
> With EoIB:
> - The SM would map ethernet address to infiniband hardware addresses.
> - You discover which multicast addresses are of interest from the
>    IP layer above so no snooping is necessary.
> - You could run queue pairs directly to hosts.
>
> Shrug.  It is trivial and it will work.  It will probably run into the
> same problems that have historically been a problem for using IPoIB
> (lack of stateless offloads) but shrug that is mostly a NIC firmware
> problem.  The switches will have no trouble and interoperability will
> be assured.
>
> If you want to map ethernet over infiniband please map ethernet over
> infiniband.  Don't poorly NAT ethernet into infiniband.
>
>


EoIB is a valid suggestion and we will look into it as well, BUT:

Providing EoIB is a separate discussion, obviously defining and
standardizing a new protocol takes what it takes (a lot of time, longish
term effort), and will also take time to develop/debug/mature e.g as you
mentioned, some of the features/offloads might require new NIC HW, etc
-- compared to IPoIB which is here for many years

In practice there is already a huge install base for IPoIB software and
hardware products, in different operating environments/OS. We can't just
throw away everything and tell people to replace it all with a new
protocol, e.g. bridging devices, storage systems/appliances, VMware,
Windows, .. systems in production environments --- so
the interoperability concern you've mentioned is gonna hit very hard.

The eIPoIB driver comes to provide a way to work with IPoIB in 
virtualized environments, where still, the suggestions/concerns raised 
in this thread should be addressed.

Or.


Eric W. Biederman - Aug. 13, 2012, 4:08 p.m.
Or Gerlitz <ogerlitz@mellanox.com> wrote:

>On 12/08/2012 18:40, Eric W. Biederman wrote:
>> Let me give you a non-hack recommendation.
>>
>> - Give up on being wire compatible with IPoIB.
>>
>> - Define and implement ethernet over infiniband aka EoIB.
>>
>> With EoIB:
>> - The SM would map ethernet address to infiniband hardware addresses.
>> - You discover which multicast addresses are of interest from the
>>    IP layer above so no snooping is necessary.
>> - You could run queue pairs directly to hosts.
>>
>> Shrug.  It is trivial and it will work.  It will probably run into the
>> same problems that have historically been a problem for using IPoIB
>> (lack of stateless offloads) but shrug that is mostly a NIC firmware
>> problem.  The switches will have no trouble and interoperability will
>> be assured.
>>
>> If you want to map ethernet over infiniband please map ethernet over
>> infiniband.  Don't poorly NAT ethernet into infiniband.
>>
>>
>
>
>EoIB is a valid suggestion and we will look into it as well, BUT:
>
>Providing EoIB is a separate discussion, obviously defining and
>standardizing a new protocol takes what it takes (a lot of time, longish
>term effort), and will also take time to develop/debug/mature e.g as

If you follow Michael Tsirkin's suggestion and use the same wire encoding as IPoIB and infer the mac address from the lids and queue pair numbers as you are already doing for eIPoIB, except for defining exactly how to get the subnet manager to store the mac address to lid/qpn mapping, you are done.

If you don't involve a committee and simply define a de facto standard it will take less effort than this conversation and less effort than implementing your eIPoIB driver.


>you 
>mentioned, some of the features/offloads might require new NIC HW, etc 
>-- compared to IPoIB which is here for many years

So deploy routing and proxy arp and you are done.

>In practice there is already a huge install base for IPoIB software and
>hardware products, in different operating environments/OS. We can't just
>throw away everything and tell people to replace it all with a new
>protocol, e.g. bridging devices, storage systems/appliances, VMware,
>Windows, .. systems in production environments --- so
>the interoperability concern you've mentioned is gonna hit very hard.

There is no need to throw anything away.  Just put them on different IP subnets.

Shrug.

>The eIPoIB driver comes to provide a way to work with IPoIB in 
>virtualized environments, where still, the suggestions/concerns raised 
>in this thread should be addressed.

eIPoIB does not work.

I can't get an IP address without a specially configured dhcp server, and special dhcp clients.

eIPoIB does not work with IPv6.

As David Miller already said this code has no chance of being merged.

Shrug.  I have been polite and pointed out implementation choices that actually work.  Solutions that are less effort and less code, and provide more interoperability.

If after patient explanation you can not appreciate why people consider eIPoIB to be totally unacceptable that is your problem.

Good luck in your future endeavours,

Eric



Or Gerlitz - Aug. 14, 2012, 7:41 a.m.
On 12/08/2012 13:22, Michael S. Tsirkin wrote:
>>> - IGMP/MAC snooping in a driver is just too hairy.
>>
>> mmm, any rough idea/direction how to do that otherwise?
> Sure, even two ways, ideally you'd do both :)
> A. fix macvtap
> 1. Use netdev_for_each_mc_addr etc to get multicast addresses
> 2. teach macvtap to fill that in (it currently floods multicasts
>    for guest to guest communication so we need to fix it anyway)
>
> B. fix bridge
>    teach bridge to work for VMs without using promisc mode

OK, I think I'm with you now... you suggest avoiding our direction of
implementing the promiscuous multicast mode which is applied by today's
bridge, macvtap and friends, by fixing these elements to support a non
promisc multicast mode. Yep, sure, sounds like a win/win, which will
eliminate the need to do IGMP snooping in the driver.

Or.
Or Gerlitz - Aug. 14, 2012, 8:44 a.m.
On 12/08/2012 23:54, Michael S. Tsirkin wrote:
> On Sun, Aug 12, 2012 at 05:13:43PM +0300, Or Gerlitz wrote:
>> On 12/08/2012 16:55, Michael S. Tsirkin wrote:
>>> I didn't realize you do ARP snooping. Why? I know you mangle
>>> outgoing ARP packets,
>>
>> Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP
>> packets, from/to Ethernet ARPs to/from IPoIB ARPs.
>>
>>
>>> this will go away if you maintain a mapping in SM accessible to all guests.
>>
>> guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
>> another driver in the host. But what mapping exactly?
>
> Well we are getting into protocol design here.

wait... reading your responses again, I realized that we first and
foremost have to (try and) agree on the problem statement before
going/jumping to solutions.

AFAIU your email(s) you maybe think that we mandate the admin to set a
specific MAC to the VM which is derived from the LID/QPN of the IPoIB
VIF serving it; well, this is wrong, we don't. OTOH, indeed, the VM
source mac isn't sent on the wire, since the Ethernet header is dropped,
and on the receiving side it is constructed from the LID/QPN the IB
packet arrived from, see next.

This reconstruction of what we call the REMAC (remote ethernet mac) is
based in the current submission on the LID/QPN, and as I said earlier on
this thread, we are revisiting this approach -- where your idea below
sounds good: the eipoib driver can register with the SA an IB "service
record" entry mapping from LID/QPN to the VM mac whenever a VM is to be
served by this eipoib instance, and remove the entry when the VM
shouldn't be served any more. This will allow preserving 1:1 the
Ethernet MAC header sent by VMs on the receiving side.



> So here's a sketch showing how you could build a protocol that does work.  But note it is not *exactly* IPoIB.

HOWEVER, this doesn't touch the IPoIB wire protocol, and hence on the 
wire it IS exactly IPoIB. We only make use of your lovely suggestion to 
apply this SA assistance, so the change doesn't involve 
hardware/firmware nor the wire protocol.

Or.


> It is I think close enough that you can use existing NIC hardware/firmware, which is why it differs slightly from what Eric described, and is more complex.  But it still shares the same property of no hacks, no packet snooping in driver, etc.
>
>
> And if you want to go that route, you really should talk to some IB
> protocol people to figure out what works, write a spec and try to
> standardize. lkml/netdev is not the right place.
>
> But since you asked, if I had to, I would probably try to
> do it like this:
>
> - Each device registers with the SA specifying the
>    mac address (+ vlan?), SA stores the translation from that
>    to IPoIB address.
> - alternatively, SA admin configures the translation statically
> - you get a packet with 6 byte mac address,
>    query the SA for a mapping to IPoIB address, strip
>    ethernet frame and send
> - multicast GID addresses can be similar, filled either when registering
>    for multicast or by SA admin
>
> I think it's possible that you could also convert a mac address to
> EUI-64 and prepend a prefix to get a legal GID. But maybe I'm missing
> something.  This could be handy for multicast.
>
>
> In both cases:
> - SA could return GID that you then resolve to
>    LID using another query, or it could return LID so you save a roundtrip
> - results can be cached locally
> - SA can send updates when translation changes to flush this cache
>
>
> Now above means protocols such as ARP and DHCP use 48 bit addresses so
> you can not mix this new protocol with IPoIB.  Maybe IPoIB could simply
> ignore irrelevant packets, but it's best not to try, get a
> different all-broadcast group and CM ID instead to avoid confusion.
>
> One other interesting thing you can do is forward multicast
> registration data from the router, translate to mgid
> by the SA and do appropriate IB mcast registrations.
>
>
Naoto MATSUMOTO - Aug. 15, 2012, 9:10 a.m.
Hi All.

Our lab's additional infomation(result) for all.

Ethernet over L2TPv3 over IPoIB Pseudo-wire Multiplexing (RFC4719)
using "ip l2tp" command on linux.   http://twitpic.com/ajn2nx

On Tue, 07 Aug 2012 10:21:33 +0900
Naoto MATSUMOTO <n-matsumoto@sakura.ad.jp> wrote:

> -----------------------------------------------------------------------
> Ethernet over EoIB using gretap with "Raid STP" performance result.
> http://twitpic.com/agd1pp
> 
> How to use EoIB with gretap. 
> http://twitpic.com/agd4io

regards,
Michael S. Tsirkin - Aug. 20, 2012, 6:57 p.m.
On Sun, Aug 12, 2012 at 11:54:57PM +0300, Michael S. Tsirkin wrote:
> > and remember that
> > this code (VM through eipoib) can talk to any IPoIB element on the
> > fabric, native,
> > virtualized,  HW/SW gateways, etc etc.
> > 
> > Or.
> 
> If you want this, then you really want a limited form of IPoIB bridging.


And to clarify that statement, here is how I would make such
IPoIB "bridging" work:

Guest side:

- Implement virtio-ipoib. This would be a device like virtio-net,
  but instead of ethernet packets, it would pass packets
  that consist of:
	IPoIB destination address
	IP packet

- this is passed to/from host without modifications, possibly with addition
  of header such as virtio net header

- flags such as broadcast can also be added to header

- like virtio net get capabilities from host and expose
  as netdev capabilities

Host side:
- create macvtap -passthrough like device that can sit on top of an
  ipoib interface
- expose this device QPN and GID to guest as hardware address
- as we get a packet, forward it on UD QPN or CM as appropriate
  depending on size, checksum and admin preference
- expose capabilities such as TSO
- can expose capability such as max MTU to guest too

Above means hardware address changes with migration.
So we need to notify guest when this happens.

This can be addressed from host by notifying all
neighbours.

Alternatively guest can notify all neighbours.

Notification can be done by broadcast.
This second option seems preferable.

this ipoib-vtap can support two modes
	- bridge like mode:
	  guest to guest and guest to host packets
	  can be detected by macvtap and passed
	  to/from guest directly like macvlan bridge mode

	- vepa like mode
	  guest to guest and guest to host packets
	  are sent out and looped back by IB switch
	  like macvlan vepa mode

As compared to the custom protocol I sent, it has:
Advantages: interoperates cleanly with ipoib
Disadvantages: no support for legacy (ethernet-only) guests
Or Gerlitz - Aug. 23, 2012, 6:45 a.m.
On 20/08/2012 21:57, Michael S. Tsirkin wrote:
> On Sun, Aug 12, 2012 at 11:54:57PM +0300, Michael S. Tsirkin wrote:
>
>> If you want this, then you really want a limited form of IPoIB bridging.
>
> And to clarify that statement, here is how I would make such IPoIB "bridging" work:
>
> Guest side:
>
> - Implement virtio-ipoib. This would be a device like virtio-net,
>    but instead of ethernet packets, it would pass packets
>    that consist of:
> 	IPoIB destination address
> 	IP packet
>
> - this is passed to/from host without modifications, possibly with addition
>    of header such as virtio net header
>
> - flags such as broadcast can also be added to header
>
> - like virtio net get capabilities from host and expose
>    as netdev capabilities
>
> Host side:
> - create macvtap -passthrough like device that can sit on top of an
>    ipoib interface
> - expose this device QPN and GID to guest as hardware address
> - as we get packet forward it on UD QPN or CM as appropriate
>    depending on size,checksum and admin preference
> - expose capabilities such as TSO
> - can expose capability such as max MTU to guest too
>
> Above means hardware address changes with migration.
> So we need to notify guest when this happens.
>
> This can be addressed from host by notifying all neighbours.
>
> Alternatively guest can notify all neighbours.
>
> Notification can be done by broadcast.
> This second option seems preferable.
>
> this ipoib-vtap can support two modes
> 	- bridge like mode:
> 	  guest to guest and guest to host packets
> 	  can be detected by macvtap and passed
> 	  to/from guest directly like macvlan bridge mode
>
> 	- vepa like mode
> 	  guest to guest and guest to host packets
> 	  are sent out and looped back by IB switch
> 	  like macvlan vepa mode
>
> As compared to the custom protocol I sent, it has -
> Advantages: interoperates cleanly with ipoib
> Disadvantages: no support for legacy (ethernet-only) guest
>

Hi Michael,

As you mentioned, the approach doesn't address legacy guests, which
either don't have a virtio driver at all, or don't have one patched to
support virtio-ipoib -- and that conflicts with a strong requirement I got.

Other than this giant obstacle, I liked the suggestion and it seems
valid and viable -- BTW IB HW has loopback capability, so the VM-to-VM
packets wouldn't actually go to the IB switch, but would remain within
the HCA.

Or.
Or Gerlitz - Sept. 3, 2012, 8:53 p.m.
Michael S. Tsirkin <mst@redhat.com> wrote:

> [...] so it seems that a sane solution would involve an extra level of
> indirection, with guest addresses being translated to host IB addresses.
> As long as you do this, maybe using an ethernet frame format makes sense.

> So far the things that make sense. Here are some that don't, to me:

> - Is a pdf presentation all you have in terms of documentation?
>   We are talking communication protocols here - I would expect a
>   proper spec, and some effort to standardize, otherwise where's the
>   guarantee it won't change in an incompatible way?
>   Other things that I would expect to be addressed in such a spec is
>   interaction with other IPoIB features, such as connected
>   mode, checksum offloading etc, and IB features such as multipath etc.
>
> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>   more to IB addressing than just the LID.  Since everyone on the subnet
>   need access to this translation, I think it makes sense to store it in
>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.

> - IGMP/MAC snooping in a driver is just too hairy.
>   As you point out, bridge currently needs the uplink in promisc mode.
>   I don't think a driver should work around that limitation.
>   For some setups, it might be interesting to remove the promisc
>  mode requirement, failing that, I think you could use macvtap passthrough.
>
> - Currently migration works without host kernel help, would be
>   preferable to keep it that way.

Hi Michael,

If we rewind to this point, basically, you had a few concerns:

0. not enough documentation

1. the sender VM MAC isn't preserved when the packet is received

2. the IGMP snooping we planned to do within netdevice - isn't good practice

3. mangling of ARPs within netdevice - isn't good practice as well.

For 0, 1 and 2 we have a way to address them (see below).

So we are left with #3 - the ARPs. Thinking on this a little further,
FWIW there --are-- components in the kernel which mangle/generate ARPs
and expose a netdevice, such as openvswitch. Anyway:

does it make sense to forward ARPs received into / sent over the
eIPoIB netdevice (e.g. using some sort of rule) to some outer entity,
such as a user-space daemon, for interception and later re-injection
into eIPoIB?

Or.

Documentation we will fix,

For preserving the remote VM MAC at the receiver we have a few
directions for a solution, e.g. either along your suggestion with SA
records and/or using "alias GUIDs" (details TBD when the submission resumes).

For multicast we accept the direction you suggested - implement support
for non-promiscuous multicast in the elements "above" eIPoIB (bridge,
macvtap, etc).
Michael S. Tsirkin - Sept. 3, 2012, 9:22 p.m.
On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > [...] so it seems that a sane solution would involve an extra level of
> > indirection, with guest addresses being translated to host IB addresses.
> > As long as you do this, maybe using an ethernet frame format makes sense.
> 
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> >   Other things that I would expect to be addressed in such a spec is
> >   interaction with other IPoIB features, such as connected
> >   mode, checksum offloading etc, and IB features such as multipath etc.
> >
> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >   more to IB addressing than just the LID.  Since everyone on the subnet
> >   need access to this translation, I think it makes sense to store it in
> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> 
> > - IGMP/MAC snooping in a driver is just too hairy.
> >   As you point out, bridge currently needs the uplink in promisc mode.
> >   I don't think a driver should work around that limitation.
> >   For some setups, it might be interesting to remove the promisc
> >  mode requirement, failing that, I think you could use macvtap passthrough.
> >
> > - Currently migration works without host kernel help, would be
> >   preferable to keep it that way.
> 
> Hi Michael,
> 
> If we rewind to this point, basically, you had few concerns

I think some other people gave feedback too; you need to address it in
the patch (as opposed to by mail - even if it's in documentation or
comments), don't just focus on what I wrote.

> 
> 0. not enough documentation
> 
> 1. the sender VM MAC isn't preserved when the packet is received
> 
> 2. the IGMP snooping we planned to do within netdevice - isn't good practice
> 
> 3. mangling of ARPs within netdevice - isn't good practice as well.
> 
> For 0,1,2 we have a way to address  (see below)
> 
> So we are remained with #3 - the ARPs -- thinking on this a little
> further, FWIW there --are-- components in the kernel which
> mangle/generate ARPs and are exposing netdevice, such as openvswitch,
> anyway:
> 
> does it make sense to forward ARPs received into / sent over the
> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
> such as user-space
> daemon  for interception and later re-injection into eIPoIB?
> 
> Or.

Well if this is all you want to do, you can bind a packet socket to the
interface, and drop them at the nic.  It is harder to do for incoming
ARP requests though.

I would do something else: send ARPs out to some defined IB address.
This could be the local host, or queried from some SA property.  Said
remote side could send you the responses in ethernet format so you do
not need to mangle responses at all.  Similarly for incoming ARP requests.

The rule to do this can also just redirect non IP packets -
this is IPoIB after all.

> Documentation we will fix,

And just to stress the point, document the limitations as well.

> Preserving remote VM mac at the receiver we have few directions for
> solution, e.g either along your suggestion with SA records and/or with
> using "alias GUIDs" (details TBD when the submission resumes).
> 
> Multicast we accept the direction you suggested - implement  support
> for multicast non promiscuous in the elements "above" eIPoIB (bridge,
> macvtap, etc).
Or Gerlitz - Sept. 4, 2012, 6:50 p.m.
On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:

>> So we are remained with #3 - the ARPs -- thinking on this a little
>> further, FWIW there --are-- components in the kernel which
>> mangle/generate ARPs and are exposing netdevice, such as openvswitch, anyway:

>> does it make sense to forward ARPs received into / sent over the
>> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
>> such as user-space daemon  for interception and later re-injection into eIPoIB?

> Well if this is all you want to do, you can bind a packet socket to the
> interface, and drop them at the nic.  It is harder to do for incoming
> ARP requests though.

> I would do something else: send ARPs out to some defined IB address.
> This could be local host or queries from some SA property.  Said remote
> side could send you the responses in ethernet format so you do not need
> to mangle responses at all.  Similarly for incoming ARP requests.

> The rule to do this can also just redirect non IP packets - this is IPoIB after all.

Thanks for the heads up on the possible implementation route, will
look into that.

>> Documentation we will fix,

> And just to stress the point, document the limitations as well.

sure, not that I see concrete limitations for the **user** at this point, but
if there are such, we will document them clearly.

Or.
Or Gerlitz - Sept. 4, 2012, 6:57 p.m.
On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
>> Michael S. Tsirkin <mst@redhat.com> wrote:
>> > [...] so it seems that a sane solution would involve an extra level of
>> > indirection, with guest addresses being translated to host IB addresses.
>> > As long as you do this, maybe using an ethernet frame format makes sense.
>>
>> > So far the things that make sense. Here are some that don't, to me:
>>
>> > - Is a pdf presentation all you have in terms of documentation?
>> >   We are talking communication protocols here - I would expect a
>> >   proper spec, and some effort to standardize, otherwise where's the
>> >   guarantee it won't change in an incompatible way?
>> >   Other things that I would expect to be addressed in such a spec is
>> >   interaction with other IPoIB features, such as connected
>> >   mode, checksum offloading etc, and IB features such as multipath etc.
>> >
>> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>> >   more to IB addressing than just the LID.  Since everyone on the subnet
>> >   need access to this translation, I think it makes sense to store it in
>> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
>>
>> > - IGMP/MAC snooping in a driver is just too hairy.
>> >   As you point out, bridge currently needs the uplink in promisc mode.
>> >   I don't think a driver should work around that limitation.
>> >   For some setups, it might be interesting to remove the promisc
>> >  mode requirement, failing that, I think you could use macvtap passthrough.
>> >
>> > - Currently migration works without host kernel help, would be
>> >   preferable to keep it that way.

>> If we rewind to this point, basically, you had few concerns

> I think some other people gave feedback too, you need to address it in
> the patch (as opposed to by mail - even if it's in documentation or
> comments) don't just focus on what I wrote.

The other feedback was:

1. Suggesting to introduce a new link layer for the para-virtualized
network stack, a direction pointed to by Dave and you, for which I
responded that it doesn't address a hard requirement I got, which is to
provide service to an arbitrary VM running any OS with a virtual
Ethernet NIC, emulated or para-virtualized.

2. Suggestions to use solutions which involve routing and/or
proxy-arp, for which I responded that in most practical cases this
will require the host to know the VM IP, something which isn't a
valid assumption in many cases, AND that a bunch of cloud stacks,
specifically the leading ones, don't even support that option, which
also violates a hard requirement we got: to support the stacks which
are in use by customers.

3. Suggestions to invent EoIB -- I said OK, but this is a long-term
process that we're looking at; it might be around at some future point,
but we are still required to support whole ecosystems which use
IPoIB, and there's no point in "routing" between EoIB segments and
IPoIB segments; that is back to #2, which we didn't accept.

4. Other feedback saying the eIPoIB driver is messy and "we don't like
it" -- which is hard to address in concrete code changes.

----> All in all, we were offered a few directions which don't allow
us to address the problem statement, so there's no way to change the
code so that they are fulfilled, plus some non-concrete comments which
are also hard to address --> we took the route of design changes along
your **concrete** comments; doesn't that make sense?

Or.
Eric W. Biederman - Sept. 4, 2012, 7:31 p.m.
Or Gerlitz <or.gerlitz@gmail.com> writes:

> On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
>
>>> So we are remained with #3 - the ARPs -- thinking on this a little
>>> further, FWIW there --are-- components in the kernel which
>>> mangle/generate ARPs and are exposing netdevice, such as openvswitch, anyway:
>
>>> does it make sense to forward ARPs received into / sent over the
>>> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
>>> such as user-space daemon  for interception and later re-injection into eIPoIB?
>
>> Well if this is all you want to do, you can bind a packet socket to the
>> interface, and drop them at the nic.  It is harder to do for incoming
>> ARP requests though.
>
>> I would do something else: send ARPs out to some defined IB address.
>> This could be local host or queries from some SA property.  Said remote
>> side could send you the responses in ethernet format so you do not need
>> to mangle responses at all.  Similarly for incoming ARP requests.
>
>> The rule to do this can also just redirect non IP packets - this is IPoIB after all.
>
> Thanks for the heads up on the possible implementation route, will
> look into that.
>
>>> Documentation we will fix,
>
>> And just to stress the point, document the limitations as well.
>
> sure, not that I see concrete limitations for the **user** at this point, but
> if there are such, will put them clearly written.

So far you are still playing with a design that is strongly NOT
ethernet.  So calling it eIPoIB will continue to be a LIE.

You are still playing with an implementation that doesn't even dream
of supporting IPv6 which makes it so far from ethernet I can't imagine
anyone taking your code seriously.

All ethernet protocols not working except IPv4 is a huge concrete
limitation.

Any implementation that breaks a naive ARP implementation also breaks
IPv6.  Not to mention everything else that runs over ethernet.

If you are clever you can use the current IPoIB hardware acceleration,
but you need to do something different so that you can either encode
or imply the MAC address, so you won't have to munge ethernet protocols.

Just for fun you might want to consider what it takes to support 2 VMs
in the same VLAN that share the same IP address (but different MAC
addresses) for failover purposes.

Eric
Or Gerlitz - Sept. 4, 2012, 7:47 p.m.
On Tue, Sep 4, 2012 at 10:31 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:
>> On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> Documentation we will fix,
>>> And just to stress the point, document the limitations as well.
>> sure, not that I see concrete limitations for the **user** at this point, but
>> if there are such, will put them clearly written.

> All ethernet protocols not working except IPv4 is a huge concrete limitation.

Oh, sure, that WILL be documented; currently eIPoIB can deliver only what
IPoIB can, which is IPv4, ARP/RARP, and IPv6+ND. For IPv6, see next.

> So far you are still playing with a design that is strongly NOT
> ethernet.  So calling it eIPoIB will continue to be a LIE.

> You are still playing with an implementation that doesn't even dream
> of supporting IPv6 which makes it so far from ethernet I can't imagine
> anyone taking your code seriously.

This design can and will support IPv6; the IPv6 ND handling will follow
the path we are discussing now for the IPv4 ARPs, e.g. not within the
driver, etc. Could you be more specific?

> Any implementation that breaks a naive ARP implementation also breaks
> IPv6.  Not to mention everything else that runs over ethernet.

I'm not sure I follow the naive-implementation comment; the eIPoIB
solution will include a kernel driver along with a supporting
user-space portion, following a comment made by the community on
mangling ARPs in a network driver.

> If you are clever you can use the current IPoIB hardware accelleration
> but you need to do something different so that you can either encode
> or imply the MAC address so you won't have to munge ethernet protocols.

Here too I'm not sure I follow; we have a new design under which the
original VM MAC is preserved on the RX side - we will no longer
generate a MAC for Ethernet frames sent by VMs which we reconstruct on
the RX side.

> Just for fun you might want to consider what it takes to support 2 VMs
> in the same VLAN that share the same IP address (but different MAC
> addresses) for failover purposes.

Will look into that.

Or.
Michael S. Tsirkin - Sept. 4, 2012, 9:21 p.m.
On Tue, Sep 04, 2012 at 09:50:09PM +0300, Or Gerlitz wrote:
> > And just to stress the point, document the limitations as well.
> 
> sure, not that I see concrete limitations for the **user** at this point, but
> if there are such, will put them clearly written.

Hmm, I'm afraid you mistook a short list of some major
bugs that jumped out at me for an exhaustive list.
This was not intended as such.

Here's how to find some of the limitations in your design
1. look through list archives. Some were pointed out to you
2. list everything ethernet does that you don't, or do differently
3. list everything ipoib does that you don't, or do differently
4. list any extra setup work required on behalf of the user
5. check various overheads, compare with native ipoib and alternatives
   such as routing
6. if you still have an empty list of limitations and disadvantages,
   look again :)

The point is to have documentation that is useful both
for reviewers - so they can know which bugs are known
and which need to be reported; and for users -
who should not be expected to be familiar with
internals of your implementation.

Patch

diff --git a/drivers/net/eipoib/eth_ipoib_main.c b/drivers/net/eipoib/eth_ipoib_main.c
new file mode 100644
index 0000000..0a8e7f4
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_main.c
@@ -0,0 +1,1953 @@ 
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "eth_ipoib.h"
+#include <net/ip.h>
+#include <linux/if_link.h>
+
+#define EMAC_IP_GC_TIME (10 * HZ)
+
+#define MIG_OUT_ARP_REQ_ISSUE_TIME (HZ / 2)
+
+#define MIG_OUT_MAX_ARP_RETRIES 5
+
+#define LIVE_MIG_PACKET 1
+
+#define PARENT_MAC_MASK 0xe7
+
+/* forward declaration */
+static rx_handler_result_t eipoib_handle_frame(struct sk_buff **pskb);
+static int eipoib_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr);
+static void free_ip_mem_in_rec(struct guest_emac_info *emac_info);
+
+static const char * const version =
+	DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n";
+
+LIST_HEAD(parent_dev_list);
+
+/* name space sys/fs functions */
+int eipoib_net_id __read_mostly;
+
+static int __net_init eipoib_net_init(struct net *net)
+{
+	int rc;
+	struct eipoib_net *eipoib_n = net_generic(net, eipoib_net_id);
+
+	eipoib_n->net = net;
+	rc = mod_create_sysfs(eipoib_n);
+
+	return rc;
+}
+
+static void __net_exit eipoib_net_exit(struct net *net)
+{
+	struct eipoib_net *eipoib_n = net_generic(net, eipoib_net_id);
+
+	mod_destroy_sysfs(eipoib_n);
+}
+
+static struct pernet_operations eipoib_net_ops = {
+	.init = eipoib_net_init,
+	.exit = eipoib_net_exit,
+	.id   = &eipoib_net_id,
+	.size = sizeof(struct eipoib_net),
+};
+
+/* set mac fields emac=<qpn><lid> */
+static inline
+void build_neigh_mac(u8 *_mac, u32 _qpn, u16 _lid)
+{
+	/* _qpn: 3B _lid: 2B */
+	*((__be32 *)(_mac)) = cpu_to_be32(_qpn);
+	*(u8 *)(_mac) = 0x2; /* set LG bit */
+	*(__be16 *)(_mac + sizeof(_qpn)) = cpu_to_be16(_lid);
+}
+
+static inline
+struct slave *get_slave_by_dev(struct parent *parent,
+			       struct net_device *slave_dev)
+{
+	struct slave *slave, *slave_tmp;
+	int found = 0;
+
+	parent_for_each_slave(parent, slave_tmp) {
+		if (slave_tmp->dev == slave_dev) {
+			found = 1;
+			slave = slave_tmp;
+			break;
+		}
+	}
+
+	return found ? slave : NULL;
+}
+
+static inline
+struct slave *get_slave_by_mac_and_vlan(struct parent *parent, u8 *mac,
+					u16 vlan)
+{
+	struct slave *slave, *slave_tmp;
+	int found = 0;
+
+	parent_for_each_slave(parent, slave_tmp) {
+		if ((!memcmp(slave_tmp->emac, mac, ETH_ALEN)) &&
+		    (slave_tmp->vlan == vlan)) {
+			found = 1;
+			slave = slave_tmp;
+			break;
+		}
+	}
+
+	return found ? slave : NULL;
+}
+
+
+static inline
+struct guest_emac_info *get_mac_ip_info_by_mac_and_vlan(struct parent *parent,
+							u8 *mac, u16 vlan)
+{
+	struct guest_emac_info *emac_info, *emac_info_ret;
+	int found = 0;
+
+	list_for_each_entry(emac_info, &parent->emac_ip_list, list) {
+		if ((!memcmp(emac_info->emac, mac, ETH_ALEN)) &&
+		    vlan == emac_info->vlan) {
+			found = 1;
+			emac_info_ret = emac_info;
+			break;
+		}
+	}
+
+	return found ? emac_info_ret : NULL;
+}
+
+/*
+ * searches for the relevant guest_emac_info in the parent.
+ * if found, checks whether it contains the required ip.
+ * returns 0 if there is no such guest_emac_info object or no such ip,
+ * otherwise returns 1 and sets the guest_emac_info obj.
+ */
+static inline
+int is_mac_info_contain_ip(struct parent *parent, u8 *mac, __be32 ip,
+			  struct guest_emac_info *emac_info, u16 vlan)
+{
+	struct ip_member *ipm;
+	int found = 0;
+
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, mac, vlan);
+	if (!emac_info)
+		return 0;
+
+	list_for_each_entry(ipm, &emac_info->ip_list, list) {
+		if (ipm->ip == ip) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found;
+}
+
+static inline int netdev_set_parent_master(struct net_device *slave,
+					   struct net_device *master)
+{
+	int err;
+
+	ASSERT_RTNL();
+
+	err = netdev_set_master(slave, master);
+	if (err)
+		return err;
+	if (master) {
+		slave->priv_flags |= IFF_EIPOIB_VIF;
+		/* prevent bonding from enslaving it */
+		slave->flags |= IFF_SLAVE;
+	} else {
+		slave->priv_flags &= ~(IFF_EIPOIB_VIF);
+		slave->flags &= ~(IFF_SLAVE);
+	}
+
+	return 0;
+}
+
+static inline int is_driver_owner(struct net_device *dev, char *name)
+{
+	struct ethtool_drvinfo drvinfo;
+
+	if (dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) {
+		memset(&drvinfo, 0, sizeof(drvinfo));
+		dev->ethtool_ops->get_drvinfo(dev, &drvinfo);
+		if (!strstr(drvinfo.driver, name))
+			return 0;
+	} else
+		return 0;
+
+	return 1;
+}
+
+static inline int is_parent(struct net_device *dev)
+{
+	return (dev->priv_flags & IFF_EIPOIB_PIF) &&
+		is_driver_owner(dev, DRV_NAME);
+}
+
+static inline int is_parent_mac(struct net_device *dev, u8 *mac)
+{
+	return is_parent(dev) && !memcmp(mac, dev->dev_addr, dev->addr_len);
+}
+
+static inline int __is_slave(struct net_device *dev)
+{
+	return dev->master && is_parent(dev->master);
+}
+
+static inline int is_slave(struct net_device *dev)
+{
+	return (dev->priv_flags & IFF_EIPOIB_VIF) &&
+		is_driver_owner(dev, SDRV_NAME) && __is_slave(dev);
+}
+
+/*
+ * ------------------------------- Link status ------------------
+ * set parent carrier:
+ * link is up if at least one slave has link up
+ * otherwise, bring link down
+ * return 1 if parent carrier changed, zero otherwise
+ */
+static int parent_set_carrier(struct parent *parent)
+{
+	struct slave *slave;
+
+	if (parent->slave_cnt == 0)
+		goto down;
+
+	/* bring parent link up if one slave (at least) is up */
+	parent_for_each_slave(parent, slave) {
+		if (netif_carrier_ok(slave->dev)) {
+			if (!netif_carrier_ok(parent->dev)) {
+				netif_carrier_on(parent->dev);
+				return 1;
+			}
+			return 0;
+		}
+	}
+
+down:
+	if (netif_carrier_ok(parent->dev)) {
+		pr_info("bring down carrier\n");
+		netif_carrier_off(parent->dev);
+		return 1;
+	}
+	return 0;
+}
+
+static int parent_set_mtu(struct parent *parent)
+{
+	struct slave *slave, *f_slave;
+	unsigned int mtu;
+
+	if (parent->slave_cnt == 0)
+		return 0;
+
+	/* find min mtu */
+	f_slave = list_first_entry(&parent->slave_list, struct slave, list);
+	mtu = f_slave->dev->mtu;
+
+	parent_for_each_slave(parent, slave)
+		mtu = min(slave->dev->mtu, mtu);
+
+	if (parent->dev->mtu != mtu) {
+		dev_set_mtu(parent->dev, mtu);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ *--------------------------- slave list handling ------
+ *
+ * This function attaches the slave to the end of the list.
+ * note that the caller must hold parent->lock.
+ */
+static void parent_attach_slave(struct parent *parent,
+				struct slave *new_slave)
+{
+	list_add_tail(&new_slave->list, &parent->slave_list);
+	parent->slave_cnt++;
+}
+
+static void parent_detach_slave(struct parent *parent, struct slave *slave)
+{
+	list_del(&slave->list);
+	parent->slave_cnt--;
+}
+
+static netdev_features_t parent_fix_features(struct net_device *dev,
+					     netdev_features_t features)
+{
+	struct slave *slave;
+	struct parent *parent = netdev_priv(dev);
+	netdev_features_t mask;
+
+	read_lock_bh(&parent->lock);
+
+	mask = features;
+	features &= ~NETIF_F_ONE_FOR_ALL;
+	features |= NETIF_F_ALL_FOR_ALL;
+
+	parent_for_each_slave(parent, slave)
+		features = netdev_increment_features(features,
+						     slave->dev->features,
+						     mask);
+
+	features &= ~NETIF_F_VLAN_CHALLENGED;
+	read_unlock_bh(&parent->lock);
+	return features;
+}
+
+static int parent_compute_features(struct parent *parent)
+{
+	struct net_device *parent_dev = parent->dev;
+	u64 hw_features, features;
+	struct slave *slave;
+
+	if (list_empty(&parent->slave_list))
+		goto done;
+
+	/* starts with the max set of features mask */
+	hw_features = features = ~0LL;
+
+	/* gets the common features from all slaves */
+	parent_for_each_slave(parent, slave) {
+		features &= slave->dev->features;
+		hw_features &= slave->dev->hw_features;
+	}
+
+	features = features | PARENT_VLAN_FEATURES;
+	hw_features = hw_features | PARENT_VLAN_FEATURES;
+
+	hw_features &= ~NETIF_F_VLAN_CHALLENGED;
+	features &= hw_features;
+
+	parent_dev->hw_features = hw_features;
+	parent_dev->features = features;
+	parent_dev->vlan_features = parent_dev->features & ~PARENT_VLAN_FEATURES;
+done:
+	pr_info("%s: %s: Features: 0x%llx\n",
+		__func__, parent_dev->name, parent_dev->features);
+
+	return 0;
+}
+
+static inline u16 slave_get_pkey(struct net_device *dev)
+{
+	u16 pkey = (dev->broadcast[8] << 8) + dev->broadcast[9];
+
+	return pkey;
+}
+
+static void parent_setup_by_slave(struct net_device *parent_dev,
+				  struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	const struct net_device_ops *slave_ops = slave_dev->netdev_ops;
+
+	parent_dev->mtu = slave_dev->mtu;
+	parent_dev->hard_header_len = slave_dev->hard_header_len;
+
+	if (slave_ops->ndo_neigh_setup)
+		slave_ops->ndo_neigh_setup(slave_dev, &parent->nparms);
+}
+
+/* enslave device <slave> to parent device <master> */
+int parent_enslave(struct net_device *parent_dev, struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *new_slave = NULL;
+	netdev_features_t old_features = parent_dev->features;
+	int res = 0;
+
+	/* the slave must be claimed by ipoib */
+	if (!is_driver_owner(slave_dev, SDRV_NAME))
+		return -EOPNOTSUPP;
+
+	/* parent must be initialized by parent_open() before enslaving */
+	if (!(parent_dev->flags & IFF_UP)) {
+		pr_warn("%s: parent must be up before enslaving\n",
+			parent_dev->name);
+		return -EPERM;
+	}
+
+	/* already enslaved */
+	if ((slave_dev->flags & IFF_SLAVE) ||
+		(slave_dev->priv_flags & IFF_EIPOIB_VIF)) {
+		pr_err("%s is already enslaved\n", slave_dev->name);
+		return -EBUSY;
+	}
+
+	/* mark it as ipoib clone vif */
+	slave_dev->priv_flags |= IFF_EIPOIB_VIF;
+
+	/* set parent netdev attributes */
+	if (parent->slave_cnt == 0)
+		parent_setup_by_slave(parent_dev, slave_dev);
+	else {
+		/* check netdev attr match */
+		if (slave_dev->hard_header_len != parent_dev->hard_header_len) {
+			pr_err("%s slave %s has different HDR len %d != %d\n",
+			       parent_dev->name, slave_dev->name,
+			       slave_dev->hard_header_len,
+			       parent_dev->hard_header_len);
+			res = -EINVAL;
+			goto err_undo_flags;
+		}
+
+		if (slave_dev->type != ARPHRD_INFINIBAND ||
+		    slave_dev->addr_len != INFINIBAND_ALEN) {
+			pr_err("%s slave type/addr_len is invalid (%d/%d)\n",
+			       parent_dev->name, slave_dev->type,
+			       slave_dev->addr_len);
+			res = -EINVAL;
+			goto err_undo_flags;
+		}
+	}
+
+	/*
+	 * verify that this (slave) device belongs to the relevant PIF;
+	 * abort if the slave is not named per the usual ipoib convention
+	 */
+	if (!strstr(slave_dev->name, parent->ipoib_main_interface)) {
+		pr_err("%s slave name (%s) doesn't contain parent name (%s)\n",
+		       parent_dev->name, slave_dev->name,
+		       parent->ipoib_main_interface);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
+	new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL);
+	if (!new_slave) {
+		res = -ENOMEM;
+		goto err_undo_flags;
+	}
+
+	INIT_LIST_HEAD(&new_slave->neigh_list);
+
+	/* save the slave's pkey */
+	new_slave->pkey = slave_get_pkey(slave_dev);
+
+	res = netdev_set_parent_master(slave_dev, parent_dev);
+	if (res) {
+		pr_err("%s: error %d calling netdev_set_parent_master\n",
+		       slave_dev->name, res);
+		goto err_free;
+	}
+
+	res = dev_open(slave_dev);
+	if (res) {
+		pr_info("%s: dev_open failed\n", slave_dev->name);
+		goto err_unset_master;
+	}
+
+	new_slave->dev = slave_dev;
+
+	write_lock_bh(&parent->lock);
+
+	parent_attach_slave(parent, new_slave);
+
+	parent_compute_features(parent);
+
+	write_unlock_bh(&parent->lock);
+
+	read_lock_bh(&parent->lock);
+
+	parent_set_carrier(parent);
+
+	read_unlock_bh(&parent->lock);
+
+	res = create_slave_symlinks(parent_dev, slave_dev);
+	if (res)
+		goto err_close;
+
+	/* register handler */
+	res = netdev_rx_handler_register(slave_dev, eipoib_handle_frame,
+					 new_slave);
+	if (res) {
+		pr_warn("%s: error %d calling netdev_rx_handler_register\n",
+			parent_dev->name, res);
+		goto err_close;
+	}
+
+	pr_info("%s: enslaving %s\n", parent_dev->name, slave_dev->name);
+
+	/* enslave is successful */
+	return 0;
+
+/* Undo stages on error */
+err_close:
+	dev_close(slave_dev);
+
+err_unset_master:
+	netdev_set_parent_master(slave_dev, NULL);
+
+err_free:
+	kfree(new_slave);
+
+err_undo_flags:
+	parent_dev->features = old_features;
+
+	return res;
+}
+
+static void slave_free(struct parent *parent, struct slave *slave)
+{
+	struct neigh *neigh, *neigh_tmp;
+
+	list_for_each_entry_safe(neigh, neigh_tmp, &slave->neigh_list, list) {
+		list_del(&neigh->list);
+		kfree(neigh);
+	}
+
+	netdev_rx_handler_unregister(slave->dev);
+
+	kfree(slave);
+}
+
+int parent_release_slave(struct net_device *parent_dev,
+			 struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave;
+	struct guest_emac_info *emac_info;
+
+	/* slave is not a slave or master is not master of this slave */
+	if (!(slave_dev->flags & IFF_SLAVE) ||
+	    (slave_dev->master != parent_dev)) {
+		pr_err("%s cannot release %s.\n",
+		       parent_dev->name, slave_dev->name);
+		return -EINVAL;
+	}
+
+	write_lock_bh(&parent->lock);
+
+	slave = get_slave_by_dev(parent, slave_dev);
+	if (!slave) {
+		/* not a slave of this parent */
+		pr_warn("%s not enslaved %s\n",
+			parent_dev->name, slave_dev->name);
+		write_unlock_bh(&parent->lock);
+		return -EINVAL;
+	}
+
+	pr_info("%s: releasing interface %s\n", parent_dev->name,
+		slave_dev->name);
+
+	/* for live migration, mark its mac/ip record as migrated out */
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, slave->emac,
+						    slave->vlan);
+	if (!emac_info) {
+		pr_warn("%s %s didn't find emac: %pM\n",
+			parent_dev->name, slave_dev->name, slave->emac);
+	} else {
+		emac_info->rec_state = MIGRATED_OUT;
+		/* start GC work */
+		pr_info("%s: sending clean task for slave mac: %pM\n",
+			__func__, slave->emac);
+		queue_delayed_work(parent->wq, &parent->migrate_out_work, 0);
+		queue_delayed_work(parent->wq, &parent->emac_ip_work,
+				   EMAC_IP_GC_TIME);
+	}
+
+	/* release the slave from its parent */
+	parent_detach_slave(parent, slave);
+
+	parent_compute_features(parent);
+
+	if (parent->slave_cnt == 0)
+		parent_set_carrier(parent);
+
+	write_unlock_bh(&parent->lock);
+
+	/* must do this from outside any spinlocks */
+	destroy_slave_symlinks(parent_dev, slave_dev);
+
+	netdev_set_parent_master(slave_dev, NULL);
+
+	dev_close(slave_dev);
+
+	slave_free(parent, slave);
+
+	return 0;  /* deletion OK */
+}
+
+static int parent_release_all(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave, *slave_tmp;
+	struct net_device *slave_dev;
+	struct neigh *neigh_cmd, *neigh_cmd_tmp;
+	struct guest_emac_info *emac_info, *emac_info_tmp;
+
+	write_lock_bh(&parent->lock);
+
+	netif_carrier_off(parent_dev);
+
+	if (parent->slave_cnt == 0)
+		goto out;
+
+	list_for_each_entry_safe(slave, slave_tmp, &parent->slave_list, list) {
+		slave_dev = slave->dev;
+
+		/* remove slave from parent's slave-list */
+		parent_detach_slave(parent, slave);
+
+		parent_compute_features(parent);
+
+		write_unlock_bh(&parent->lock);
+
+		destroy_slave_symlinks(parent_dev, slave_dev);
+
+		netdev_set_parent_master(slave_dev, NULL);
+
+		dev_close(slave_dev);
+
+		slave_free(parent, slave);
+
+		write_lock_bh(&parent->lock);
+	}
+
+	list_for_each_entry_safe(neigh_cmd, neigh_cmd_tmp,
+				 &parent->neigh_add_list, list) {
+		list_del(&neigh_cmd->list);
+		kfree(neigh_cmd);
+	}
+
+	list_for_each_entry_safe(emac_info, emac_info_tmp,
+				 &parent->emac_ip_list, list) {
+		free_ip_mem_in_rec(emac_info);
+		list_del(&emac_info->list);
+		kfree(emac_info);
+	}
+
+	pr_info("%s: released all slaves\n", parent_dev->name);
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return 0;
+}
+
+/* -------------------------- Device entry points --------------------------- */
+static struct rtnl_link_stats64 *parent_get_stats(struct net_device *parent_dev,
+						  struct rtnl_link_stats64 *stats)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave;
+	struct rtnl_link_stats64 temp;
+
+	memset(stats, 0, sizeof(*stats));
+
+	read_lock_bh(&parent->lock);
+
+	parent_for_each_slave(parent, slave) {
+		const struct rtnl_link_stats64 *sstats =
+			dev_get_stats(slave->dev, &temp);
+
+		stats->rx_packets += sstats->rx_packets;
+		stats->rx_bytes += sstats->rx_bytes;
+		stats->rx_errors += sstats->rx_errors;
+		stats->rx_dropped += sstats->rx_dropped;
+
+		stats->tx_packets += sstats->tx_packets;
+		stats->tx_bytes += sstats->tx_bytes;
+		stats->tx_errors += sstats->tx_errors;
+		stats->tx_dropped += sstats->tx_dropped;
+
+		stats->multicast += sstats->multicast;
+		stats->collisions += sstats->collisions;
+
+		stats->rx_length_errors += sstats->rx_length_errors;
+		stats->rx_over_errors += sstats->rx_over_errors;
+		stats->rx_crc_errors += sstats->rx_crc_errors;
+		stats->rx_frame_errors += sstats->rx_frame_errors;
+		stats->rx_fifo_errors += sstats->rx_fifo_errors;
+		stats->rx_missed_errors += sstats->rx_missed_errors;
+
+		stats->tx_aborted_errors += sstats->tx_aborted_errors;
+		stats->tx_carrier_errors += sstats->tx_carrier_errors;
+		stats->tx_fifo_errors += sstats->tx_fifo_errors;
+		stats->tx_heartbeat_errors += sstats->tx_heartbeat_errors;
+		stats->tx_window_errors += sstats->tx_window_errors;
+	}
+
+	read_unlock_bh(&parent->lock);
+
+	return stats;
+}
+
+/* ---------------------------- Main funcs ---------------------------------- */
+static struct neigh *neigh_cmd_find_by_mac(struct slave *slave, u8 *mac)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct neigh *neigh;
+	int found = 0;
+
+	list_for_each_entry(neigh, &parent->neigh_add_list, list) {
+		if (!memcmp(neigh->emac, mac, ETH_ALEN)) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found ? neigh : NULL;
+}
+
+static struct neigh *neigh_find_by_mac(struct slave *slave, u8 *mac)
+{
+	struct neigh *neigh;
+	int found = 0;
+
+	list_for_each_entry(neigh, &slave->neigh_list, list) {
+		if (!memcmp(neigh->emac, mac, ETH_ALEN)) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found ? neigh : NULL;
+}
+
+static int neigh_learn(struct slave *slave, struct sk_buff *skb, u8 *remac)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct neigh *neigh_cmd;
+	u8 *rimac;
+	int rc;
+
+	/* linearize to ease reading the ARP payload */
+	rc = skb_linearize(skb);
+	if (rc) {
+		pr_err("%s: skb_linearize failed, rc %d\n", dev->name, rc);
+		goto out;
+	}
+	rimac = skb->data + sizeof(struct arphdr);
+
+	/* check if entry is being processed or already exists */
+	if (neigh_find_by_mac(slave, remac))
+		goto out;
+
+	if (neigh_cmd_find_by_mac(slave, remac))
+		goto out;
+
+	neigh_cmd = parent_get_neigh_cmd('+', slave->dev->name, remac, rimac);
+	if (!neigh_cmd) {
+		pr_err("%s cannot build neigh cmd\n", slave->dev->name);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	list_add_tail(&neigh_cmd->list, &parent->neigh_add_list);
+
+	/* calls neigh_learn_task() */
+	queue_delayed_work(parent->wq, &parent->neigh_learn_work, 0);
+
+out:
+	return rc;
+}
+
+static void neigh_learn_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     neigh_learn_work.work);
+	struct neigh *neigh_cmd, *neigh_cmd_tmp;
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	list_for_each_entry_safe(neigh_cmd, neigh_cmd_tmp,
+				 &parent->neigh_add_list, list) {
+		__parent_store_neighs(&parent->dev->dev, NULL,
+				      neigh_cmd->cmd, PAGE_SIZE);
+		list_del(&neigh_cmd->list);
+		kfree(neigh_cmd);
+	}
+
+out:
+	write_unlock_bh(&parent->lock);
+}
+
+static void parent_work_cancel_all(struct parent *parent)
+{
+	write_lock_bh(&parent->lock);
+	parent->kill_timers = 1;
+	write_unlock_bh(&parent->lock);
+
+	/* cancel_delayed_work() already handles the not-pending case */
+	cancel_delayed_work(&parent->neigh_learn_work);
+	cancel_delayed_work(&parent->emac_ip_work);
+	cancel_delayed_work(&parent->migrate_out_work);
+}
+
+static struct parent *get_parent_by_pif_name(char *pif_name)
+{
+	struct parent *parent;
+
+	list_for_each_entry(parent, &parent_dev_list, parent_list) {
+		if (!strcmp(parent->ipoib_main_interface, pif_name))
+			return parent;
+	}
+	return NULL;
+}
+
+static void free_ip_mem_in_rec(struct guest_emac_info *emac_info)
+{
+	struct ip_member *ipm, *tmp_ipm;
+
+	list_for_each_entry_safe(ipm, tmp_ipm, &emac_info->ip_list, list) {
+		list_del(&ipm->list);
+		kfree(ipm);
+	}
+}
+
+static inline void free_invalid_emac_ip_det(struct parent *parent)
+{
+	struct guest_emac_info *emac_info, *emac_info_tmp;
+
+	list_for_each_entry_safe(emac_info, emac_info_tmp,
+				 &parent->emac_ip_list, list) {
+		if (emac_info->rec_state == INVALID) {
+			free_ip_mem_in_rec(emac_info);
+			list_del(&emac_info->list);
+			kfree(emac_info);
+		}
+	}
+}
+
+static void emac_info_clean_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     emac_ip_work.work);
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	free_invalid_emac_ip_det(parent);
+
+out:
+	write_unlock_bh(&parent->lock);
+}
+
+static int migrate_out_gen_arp_req(struct parent *parent, u8 *emac,
+				   u16 vlan)
+{
+	struct guest_emac_info *emac_info;
+	struct ip_member *ipm;
+	struct slave *slave;
+	struct sk_buff *nskb;
+	int ret = 0;
+
+	slave = get_slave_by_mac_and_vlan(parent, parent->dev->dev_addr, vlan);
+	if (unlikely(!slave)) {
+		pr_info("%s: failed to find parent slave %pM\n",
+			__func__, parent->dev->dev_addr);
+		return -ENODEV;
+	}
+
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, emac, vlan);
+
+	if (!emac_info)
+		return 0;
+
+	/* walk all IPs attached to that mac */
+	list_for_each_entry(ipm, &emac_info->ip_list, list) {
+		/* create and send an ARP request to that IP */
+		pr_info("%s: sending ARP for migrate-out event, to %pI4 from 0.0.0.0\n",
+			parent->dev->name, &ipm->ip);
+
+		nskb = arp_create(ARPOP_REQUEST,
+				  ETH_P_ARP,
+				  ipm->ip,
+				  slave->dev,
+				  0,
+				  slave->dev->broadcast,
+				  slave->dev->broadcast,
+				  slave->dev->broadcast);
+		if (nskb) {
+			arp_xmit(nskb);
+		} else {
+			pr_err("%s: %s failed creating skb\n",
+			       __func__, slave->dev->name);
+			ret = -ENOMEM;
+		}
+	}
+	return ret;
+}
+
+static void migrate_out_work_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     migrate_out_work.work);
+	struct guest_emac_info *emac_info;
+	int is_reschedule = 0;
+	int ret;
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	list_for_each_entry(emac_info, &parent->emac_ip_list, list) {
+		if (emac_info->rec_state == MIGRATED_OUT) {
+			if (emac_info->num_of_retries <
+			    MIG_OUT_MAX_ARP_RETRIES) {
+				ret = migrate_out_gen_arp_req(parent, emac_info->emac,
+							      emac_info->vlan);
+				if (ret)
+					pr_err("%s: migrate_out_gen_arp failed: %d\n",
+					       __func__, ret);
+
+				emac_info->num_of_retries++;
+				is_reschedule = 1;
+			} else {
+				emac_info->rec_state = INVALID;
+			}
+		}
+	}
+	/* re-issue ARP requests until the entry is removed from the list */
+	if (is_reschedule)
+		queue_delayed_work(parent->wq, &parent->migrate_out_work,
+				   MIG_OUT_ARP_REQ_ISSUE_TIME);
+out:
+	write_unlock_bh(&parent->lock);
+}
+
+static inline int add_emac_ip_info(struct net_device *slave_dev, __be32 ip,
+				   u8 *mac, u16 vlan)
+{
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct guest_emac_info *emac_info = NULL;
+	struct ip_member *ipm;
+	int ret;
+	int is_just_alloc_emac_info = 0;
+
+	ret = is_mac_info_contain_ip(parent, mac, ip, emac_info, vlan);
+	if (ret)
+		return 0;
+
+	/* new IP: add it to the emac_ip object */
+	if (!emac_info) {
+		emac_info = kzalloc(sizeof *emac_info, GFP_ATOMIC);
+		if (!emac_info) {
+			pr_err("%s: Failed allocating emac_info\n",
+			       parent_dev->name);
+			return -ENOMEM;
+		}
+		memcpy(emac_info->emac, mac, ETH_ALEN);
+		INIT_LIST_HEAD(&emac_info->ip_list);
+		emac_info->rec_state = VALID;
+		emac_info->vlan = vlan;
+		emac_info->num_of_retries = 0;
+		list_add_tail(&emac_info->list, &parent->emac_ip_list);
+		is_just_alloc_emac_info = 1;
+	}
+
+	ipm = kzalloc(sizeof *ipm, GFP_ATOMIC);
+	if (!ipm) {
+		pr_err(" %s Failed allocating emac_info (ipm)\n",
+		       parent_dev->name);
+		if (is_just_alloc_emac_info)
+			kfree(emac_info);
+		return -ENOMEM;
+	}
+
+	ipm->ip = ip;
+	list_add_tail(&ipm->list, &emac_info->ip_list);
+
+	return 0;
+}
+
+/* build ipoib arp/rarp request/reply packet */
+static struct sk_buff *get_slave_skb_arp(struct slave *slave,
+					 struct sk_buff *skb,
+					 u8 *rimac, int *ret)
+{
+	struct sk_buff *nskb;
+	struct arphdr *arphdr = (struct arphdr *)
+				(skb->data + sizeof(struct ethhdr));
+	struct eth_arp_data *arp_data = (struct eth_arp_data *)
+					(skb->data + sizeof(struct ethhdr) +
+					 sizeof(struct arphdr));
+	u8 t_addr[ETH_ALEN] = {0};
+	int err = 0;
+	/* mark regular packet handling */
+	*ret = 0;
+
+	/*
+	 * live-migration support: record the new mac/ip pair.
+	 * This way the driver knows which mac/vlan <-> IP mappings were
+	 * on the guests above; whenever a migrate-out event comes it
+	 * will send ARP requests for all of these IPs.
+	 */
+	if (skb->protocol == htons(ETH_P_ARP))
+		err = add_emac_ip_info(slave->dev, arp_data->arp_sip,
+				       arp_data->arp_sha, slave->vlan);
+	if (err)
+		pr_warn("%s: failed creating emac_ip_info for ip: %pI4\n",
+			__func__, &arp_data->arp_sip);
+	/*
+	 * live migration support:
+	 * 1. check if we are in a live migration process
+	 * 2. check if the ARP response is for the parent
+	 * 3. ignore the locally administered bit, which was set to make
+	 *    sure the bridge does not drop the frame.
+	 */
+	arp_data->arp_dha[0] = arp_data->arp_dha[0] & 0xFD;
+	if (arphdr->ar_op == htons(ARPOP_REPLY) &&
+	    !memcmp(arp_data->arp_dha, slave->dev->master->dev_addr, ETH_ALEN)) {
+		/*
+		 * when the source is the parent interface, assume we are
+		 * in the middle of a live migration process, so send a
+		 * gratuitous ARP.
+		 */
+		pr_info("%s: ARP packet for parent: %s\n",
+			__func__, slave->dev->master->name);
+		/* create gratuitous ARP on behalf of the guest */
+		nskb = arp_create(ARPOP_REQUEST,
+				  be16_to_cpu(skb->protocol),
+				  arp_data->arp_sip,
+				  slave->dev,
+				  arp_data->arp_sip,
+				  NULL,
+				  slave->dev->dev_addr,
+				  t_addr);
+		if (unlikely(!nskb))
+			pr_err("%s: %s live migration: failed creating skb\n",
+			       __func__, slave->dev->name);
+	} else {
+		nskb = arp_create(be16_to_cpu(arphdr->ar_op),
+				  be16_to_cpu(skb->protocol),
+				  arp_data->arp_dip,
+				  slave->dev,
+				  arp_data->arp_sip,
+				  rimac,
+				  slave->dev->dev_addr,
+				  NULL);
+	}
+
+	return nskb;
+}
+
+/*
+ * build ipoib arp request packet according to ip header.
+ * uses for live-migration, or missing neigh for new vif.
+ */
+static void get_slave_skb_arp_by_ip(struct slave *slave,
+				    struct sk_buff *skb)
+{
+	struct sk_buff *nskb = NULL;
+	struct iphdr *iph = ip_hdr(skb);
+
+	pr_info("Sending arp on behalf of slave %s, from %pI4"
+		" to %pI4" , slave->dev->name, &(iph->saddr),
+		&(iph->daddr));
+
+	nskb = arp_create(ARPOP_REQUEST,
+			  ETH_P_ARP,
+			  iph->daddr,
+			  slave->dev,
+			  iph->saddr,
+			  slave->dev->broadcast,
+			  slave->dev->dev_addr,
+			  NULL);
+	if (nskb)
+		arp_xmit(nskb);
+	else
+		pr_err("%s: %s failed creating skb\n",
+		       __func__, slave->dev->name);
+}
+
+/* build ipoib ipv4/ipv6 packet */
+static struct sk_buff *get_slave_skb_ip(struct slave *slave,
+					struct sk_buff *skb)
+{
+	skb_pull(skb, ETH_HLEN);
+	skb_reset_network_header(skb);
+
+	return skb;
+}
+
+/*
+ * get_slave_skb -- called in TX flow
+ * get skb that can be sent thru slave xmit func,
+ * if skb was adjusted (cloned, pulled, etc..) successfully
+ * the old skb (if any) is freed here.
+ */
+static struct sk_buff *get_slave_skb(struct slave *slave, struct sk_buff *skb)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct sk_buff *nskb = NULL;
+	struct ethhdr *ethh = (struct ethhdr *)(skb->data);
+	struct neigh *neigh = NULL;
+	u8 rimac[INFINIBAND_ALEN];
+	int ret = 0;
+
+	/* set neigh mac */
+	if (is_multicast_ether_addr(ethh->h_dest)) {
+		memcpy(rimac, dev->broadcast, INFINIBAND_ALEN);
+	} else {
+		neigh = neigh_find_by_mac(slave, ethh->h_dest);
+		if (neigh) {
+			memcpy(rimac, neigh->imac, INFINIBAND_ALEN);
+		} else {
+			++parent->port_stats.tx_neigh_miss;
+			/*
+			 * assume VIF migration: try to get the neigh by
+			 * issuing an ARP request on behalf of the VIF.
+			 */
+			if (skb->protocol == htons(ETH_P_IP)) {
+				pr_info("missed neigh for slave %s, issuing ARP request\n",
+					slave->dev->name);
+				get_slave_skb_arp_by_ip(slave, skb);
+				goto out_arp_sent_instead;
+			}
+		}
+	}
+
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP)) {
+		nskb = get_slave_skb_arp(slave, skb, rimac, &ret);
+		if (!nskb && LIVE_MIG_PACKET == ret) {
+			pr_info("%s: live migration packets\n", __func__);
+			goto err;
+		}
+	} else {
+		if (!neigh)
+			goto err;
+		/* pull ethernet header here */
+		nskb = get_slave_skb_ip(slave, skb);
+	}
+
+	/* if the new skb could not be adjusted/allocated, abort */
+	if (!nskb) {
+		pr_err("%s: get_slave_skb_ip/arp failed, proto 0x%x\n",
+		       dev->name, ntohs(skb->protocol));
+		goto err;
+	}
+
+	if (neigh && nskb == skb) { /* ucast & non-arp/rarp */
+		/* dev_hard_header() only for ucast; for ARP it was done already */
+		if (dev_hard_header(nskb, dev, ntohs(skb->protocol), rimac,
+				    dev->dev_addr, nskb->len) < 0) {
+			pr_warn("%s: dev_hard_header failed\n",
+				dev->name);
+			goto err;
+		}
+	}
+
+	/*
+	 * new skb is ready to be sent, clean old skb if we hold a clone
+	 * (old skb is not shared, already checked that.)
+	 */
+	if (nskb != skb)
+		dev_kfree_skb(skb);
+
+	nskb->dev = slave->dev;
+	return nskb;
+
+out_arp_sent_instead: /* an ARP was sent instead of the IP packet */
+err:
+	/* got error after nskb was adjusted/allocated */
+	if (nskb && (nskb != skb))
+		dev_kfree_skb(nskb);
+
+	return NULL;
+}
+
+static struct sk_buff *get_parent_skb_arp(struct slave *slave,
+					  struct sk_buff *skb,
+					  u8 *remac)
+{
+	struct net_device *dev = slave->dev->master;
+	struct sk_buff *nskb;
+	struct arphdr *arphdr = (struct arphdr *)(skb->data);
+	struct ipoib_arp_data *arp_data = (struct ipoib_arp_data *)
+					(skb->data + sizeof(struct arphdr));
+	u8 *target_hw = slave->emac;
+	u8 *dst_hw = slave->emac;
+	u8 local_eth_addr[ETH_ALEN];
+
+	/* live migration: gets arp with broadcast src and dst */
+	if (!memcmp(arp_data->arp_sha, slave->dev->broadcast, INFINIBAND_ALEN) &&
+	    !memcmp(arp_data->arp_dha, slave->dev->broadcast, INFINIBAND_ALEN)) {
+		pr_info("%s: ARP with bcast src and dest send from src_hw: %pM\n",
+			__func__, slave->dev->master->dev_addr);
+		/* replace the src with the parent src: */
+		memcpy(local_eth_addr, slave->dev->master->dev_addr, ETH_ALEN);
+		/*
+		 * set the locally administered bit so that the bridge
+		 * will not drop the frame
+		 */
+		local_eth_addr[0] = local_eth_addr[0] | 0x2;
+		memcpy(remac, local_eth_addr, ETH_ALEN);
+		target_hw = NULL;
+		dst_hw = NULL;
+	}
+
+	nskb = arp_create(be16_to_cpu(arphdr->ar_op),
+			  be16_to_cpu(skb->protocol),
+			  arp_data->arp_dip,
+			  dev,
+			  arp_data->arp_sip,
+			  dst_hw,
+			  remac,
+			  target_hw);
+
+	/* reserve room for the Ethernet header */
+	if (nskb)
+		skb_reserve(nskb, ETH_HLEN);
+
+	return nskb;
+}
+
+static struct sk_buff *get_parent_skb_ip(struct slave *slave,
+					 struct sk_buff *skb)
+{
+	/* nop */
+	return skb;
+}
+
+/* get_parent_skb -- called in RX flow */
+static struct sk_buff *get_parent_skb(struct slave *slave,
+				      struct sk_buff *skb, u8 *remac)
+{
+	struct net_device *dev = slave->dev->master;
+	struct sk_buff *nskb = NULL;
+	struct ethhdr *ethh;
+
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP))
+		nskb = get_parent_skb_arp(slave, skb, remac);
+	else
+		nskb = get_parent_skb_ip(slave, skb);
+
+	/* if the new skb could not be adjusted/allocated, abort */
+	if (!nskb)
+		goto err;
+
+	/* at this point, free the old skb if it was cloned */
+	if (nskb != skb)
+		dev_kfree_skb(skb);
+
+	skb = nskb;
+
+	/* build ethernet header */
+	ethh = (struct ethhdr *)skb_push(skb, ETH_HLEN);
+	ethh->h_proto = skb->protocol;
+	memcpy(ethh->h_source, remac, ETH_ALEN);
+	memcpy(ethh->h_dest, slave->emac, ETH_ALEN);
+
+	/* zero-pad to ETH_ZLEN when needed (ARP, for example) */
+	if (unlikely(skb->len < ETH_ZLEN)) {
+		if (skb->tail + (ETH_ZLEN - skb->len) <= skb->end &&
+		    !skb_is_nonlinear(skb))
+			memset(skb_put(skb, ETH_ZLEN - skb->len), 0,
+			       ETH_ZLEN - skb->len);
+	}
+
+	/* set new skb fields */
+	skb->pkt_type = PACKET_HOST;
+	/*
+	 * use master dev, to allow netpoll_receive_skb()
+	 * in netif_receive_skb()
+	 */
+	skb->dev = dev;
+
+	/* pull the Ethernet header and update other fields */
+	skb->protocol = eth_type_trans(skb, skb->dev);
+
+	return skb;
+
+err:
+	/* got error after nskb was adjusted/allocated */
+	if (nskb && (nskb != skb))
+		dev_kfree_skb(nskb);
+
+	return NULL;
+}
+
+static int parent_rx(struct sk_buff *skb, struct slave *slave)
+{
+	struct net_device *slave_dev = skb->dev;
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct eipoib_cb_data *data = IPOIB_HANDLER_CB(skb);
+	struct napi_struct *napi = data->rx.napi;
+	struct sk_buff *nskb;
+	int rc = 0;
+	u8 remac[ETH_ALEN];
+	int vlan_tag;
+
+	build_neigh_mac(remac, data->rx.sqpn, data->rx.slid);
+
+	read_lock_bh(&parent->lock);
+
+	if (unlikely(skb_headroom(skb) < ETH_HLEN)) {
+		pr_warn("%s: small headroom %d < %d\n",
+			skb->dev->name, skb_headroom(skb), ETH_HLEN);
+		++parent->port_stats.rx_skb_errors;
+		goto drop;
+	}
+
+	/* learn neighs based on ARP snooping */
+	if (unlikely(ntohs(skb->protocol) == ETH_P_ARP)) {
+		read_unlock_bh(&parent->lock);
+		write_lock_bh(&parent->lock);
+		neigh_learn(slave, skb, remac);
+		write_unlock_bh(&parent->lock);
+		read_lock_bh(&parent->lock);
+	}
+
+	nskb = get_parent_skb(slave, skb, remac);
+	if (unlikely(!nskb)) {
+		++parent->port_stats.rx_skb_errors;
+		pr_warn("%s: failed to create parent skb\n",
+			slave_dev->name);
+		goto drop;
+	}
+	skb = nskb;
+
+	vlan_tag = slave->vlan & 0xfff;
+	if (vlan_tag) {
+		skb = __vlan_hwaccel_put_tag(skb, vlan_tag);
+		if (!skb) {
+			/* skb is NULL here, use the slave's name */
+			pr_err("%s failed to insert VLAN tag\n",
+			       slave_dev->name);
+			goto drop;
+		}
+		++parent->port_stats.rx_vlan;
+	}
+
+	if (napi)
+		rc = napi_gro_receive(napi, skb);
+	else
+		rc = netif_receive_skb(skb);
+
+	read_unlock_bh(&parent->lock);
+
+	return rc;
+
+drop:
+	dev_kfree_skb_any(skb);
+	read_unlock_bh(&parent->lock);
+
+	return NET_RX_DROP;
+}
+
+static rx_handler_result_t eipoib_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct slave *slave;
+
+	slave = eipoib_slave_get_rcu(skb->dev);
+
+	parent_rx(skb, slave);
+
+	return RX_HANDLER_CONSUMED;
+}
+
+static netdev_tx_t parent_tx(struct sk_buff *skb, struct net_device *dev)
+{
+	struct parent *parent = netdev_priv(dev);
+	struct slave *slave = NULL;
+	struct ethhdr *ethh = (struct ethhdr *)(skb->data);
+	struct sk_buff *nskb;
+	int rc;
+	u16 vlan;
+	u8 mac_no_admin_bit[ETH_ALEN];
+
+	read_lock_bh(&parent->lock);
+
+	if (unlikely(!IS_E_IPOIB_PROTO(ethh->h_proto))) {
+		++parent->port_stats.tx_proto_errors;
+		goto drop;
+	}
+	/* assume only orphaned skbs */
+	if (unlikely(skb_shared(skb))) {
+		++parent->port_stats.tx_shared;
+		goto drop;
+	}
+
+	/* obtain VLAN information if present */
+	if (vlan_tx_tag_present(skb)) {
+		vlan = vlan_tx_tag_get(skb) & 0xfff;
+		++parent->port_stats.tx_vlan;
+	} else {
+		vlan = VLAN_N_VID;
+	}
+
+	/*
+	 * for live migration: mask the locally administered bit if set,
+	 * but only in ARP packets that came from the parent's VIF
+	 * interface.
+	 */
+	if (unlikely((htons(ETH_P_ARP) == ethh->h_proto) &&
+	    !memcmp(parent->dev->dev_addr + 1, ethh->h_source + 1, ETH_ALEN - 1))) {
+		/* parent's VIF: */
+		memcpy(mac_no_admin_bit, ethh->h_source, ETH_ALEN);
+		mac_no_admin_bit[0] = mac_no_admin_bit[0] & 0xFD;
+		/* look up the slave by the masked mac */
+		slave = get_slave_by_mac_and_vlan(parent, mac_no_admin_bit, vlan);
+	}
+	/* get slave, and queue packet */
+	if (!slave)
+		slave = get_slave_by_mac_and_vlan(parent, ethh->h_source, vlan);
+	if (unlikely(!slave)) {
+		pr_info("vif: %pM with vlan: %d miss for parent: %s\n",
+			ethh->h_source, vlan, parent->ipoib_main_interface);
+		++parent->port_stats.tx_vif_miss;
+		goto drop;
+	}
+
+	nskb = get_slave_skb(slave, skb);
+	if (unlikely(!nskb)) {
+		++parent->port_stats.tx_skb_errors;
+		goto drop;
+	} else
+		skb = nskb;
+
+	/*
+	 * VST mode: strip the vlan tag on TX (it is re-added on RX);
+	 * the slave is an IPoIB device and is NETIF_F_VLAN_CHALLENGED,
+	 * so the tag must be removed.
+	 */
+	if (vlan != VLAN_N_VID)
+		skb->vlan_tci = 0;
+
+	/* arp packets: */
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP)) {
+		arp_xmit(skb);
+		goto out;
+	}
+
+	/* ip packets */
+	skb_record_rx_queue(skb, skb_get_queue_mapping(skb));
+
+	rc = dev_queue_xmit(skb);
+	if (unlikely(rc)) {
+		pr_err("slave tx method failed %d\n", rc);
+		++parent->port_stats.tx_slave_err;
+		dev_kfree_skb(skb);
+	}
+
+	goto out;
+
+drop:
+	++parent->port_stats.tx_parent_dropped;
+	dev_kfree_skb(skb);
+
+out:
+	read_unlock_bh(&parent->lock);
+	return NETDEV_TX_OK;
+}
+
+static int parent_open(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent->kill_timers = 0;
+	INIT_DELAYED_WORK(&parent->neigh_learn_work, neigh_learn_task);
+	INIT_DELAYED_WORK(&parent->emac_ip_work, emac_info_clean_task);
+	INIT_DELAYED_WORK(&parent->migrate_out_work, migrate_out_work_task);
+	return 0;
+}
+
+static int parent_close(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	write_lock_bh(&parent->lock);
+	parent->kill_timers = 1;
+	write_unlock_bh(&parent->lock);
+
+	cancel_delayed_work(&parent->neigh_learn_work);
+	cancel_delayed_work(&parent->emac_ip_work);
+	cancel_delayed_work(&parent->migrate_out_work);
+
+	return 0;
+}
+
+static void parent_deinit(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	list_del(&parent->parent_list);
+
+	parent_work_cancel_all(parent);
+}
+
+static void parent_uninit(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent_deinit(parent_dev);
+	parent_destroy_sysfs_entry(parent);
+
+	if (parent->wq)
+		destroy_workqueue(parent->wq);
+}
+
+static struct lock_class_key parent_netdev_xmit_lock_key;
+static struct lock_class_key parent_netdev_addr_lock_key;
+
+static void parent_set_lockdep_class_one(struct net_device *dev,
+					 struct netdev_queue *txq,
+					 void *_unused)
+{
+	lockdep_set_class(&txq->_xmit_lock,
+			  &parent_netdev_xmit_lock_key);
+}
+
+static void parent_set_lockdep_class(struct net_device *dev)
+{
+	lockdep_set_class(&dev->addr_list_lock,
+			  &parent_netdev_addr_lock_key);
+	netdev_for_each_tx_queue(dev, parent_set_lockdep_class_one, NULL);
+}
+
+static int parent_init(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent->wq = create_singlethread_workqueue(parent_dev->name);
+	if (!parent->wq)
+		return -ENOMEM;
+
+	parent_set_lockdep_class(parent_dev);
+
+	list_add_tail(&parent->parent_list, &parent_dev_list);
+
+	return 0;
+}
+
+static u16 parent_select_q(struct net_device *dev, struct sk_buff *skb)
+{
+	return skb_tx_hash(dev, skb);
+}
+
+static int parent_add_vif_param(struct net_device *parent_dev,
+				struct net_device *new_vif_dev,
+				u16 vlan, u8 *mac)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *new_slave, *slave_tmp;
+	int ret = 0;
+
+	if (!is_valid_ether_addr(mac)) {
+		pr_err("Invalid mac input for slave:%pM \n", mac);
+		return -EINVAL;
+	}
+
+	write_lock_bh(&parent->lock);
+
+	new_slave = get_slave_by_dev(parent, new_vif_dev);
+	if (!new_slave) {
+		pr_err("%s: no such slave %s\n",
+		       __func__, new_vif_dev->name);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!is_zero_ether_addr(new_slave->emac)) {
+		pr_err("slave %s mac already set to %pM\n",
+		       new_slave->dev->name, new_slave->emac);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* check whether another slave already has this mac/vlan pair */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!memcmp(slave_tmp->emac, mac, ETH_ALEN) &&
+		    slave_tmp->vlan == vlan) {
+			pr_err("cannot update %s, slave %s already has vlan 0x%x mac %pM\n",
+			       parent->dev->name, slave_tmp->dev->name,
+			       slave_tmp->vlan, mac);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	/* ready to go */
+	pr_info("slave %s: mac set to %pM, vlan set to %d\n",
+		new_slave->dev->name, mac, vlan);
+
+	memcpy(new_slave->emac, mac, ETH_ALEN);
+
+	new_slave->vlan = vlan;
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return ret;
+}
+
+static const struct net_device_ops parent_netdev_ops = {
+	.ndo_init		= parent_init,
+	.ndo_uninit		= parent_uninit,
+	.ndo_open		= parent_open,
+	.ndo_stop		= parent_close,
+	.ndo_start_xmit		= parent_tx,
+	.ndo_select_queue	= parent_select_q,
+	/* parent mtu is the min over the slaves' mtus */
+	.ndo_change_mtu		= NULL,
+	.ndo_fix_features	= parent_fix_features,
+	/*
+	 * the initial mac address is randomized; it can be changed
+	 * later through this callback
+	 */
+	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_get_stats64	= parent_get_stats,
+	.ndo_add_slave		= parent_enslave,
+	.ndo_del_slave		= parent_release_slave,
+	.ndo_set_vif_param	= parent_add_vif_param,
+};
+
+static void parent_setup(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	/* initialize rwlocks */
+	rwlock_init(&parent->lock);
+
+	/* Initialize pointers */
+	parent->dev = parent_dev;
+	INIT_LIST_HEAD(&parent->neigh_add_list);
+	INIT_LIST_HEAD(&parent->slave_list);
+	INIT_LIST_HEAD(&parent->emac_ip_list);
+	/* Initialize the device entry points */
+	ether_setup(parent_dev);
+	/* parent_dev->hard_header_len is adjusted later */
+	parent_dev->netdev_ops = &parent_netdev_ops;
+	parent_set_ethtool_ops(parent_dev);
+
+	/* Initialize the device options */
+	parent_dev->tx_queue_len = 0;
+	/* mark the parent interface as a pif (master of other vifs) */
+	parent_dev->priv_flags |= IFF_EIPOIB_PIF;
+
+	parent_dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM |
+		NETIF_F_RXCSUM | NETIF_F_GRO | NETIF_F_TSO;
+
+	parent_dev->features = parent_dev->hw_features;
+	parent_dev->vlan_features = parent_dev->hw_features;
+
+	parent_dev->features |= PARENT_VLAN_FEATURES;
+}
+
+/*
+ * Create a new parent netdevice based on the given IPoIB device.
+ * Caller must hold rtnl_lock, since register_netdevice() is
+ * called below.
+ */
+static struct parent *parent_create(struct net_device *ibd)
+{
+	struct net_device *parent_dev;
+	u32 num_queues;
+	int rc;
+	union ib_gid gid;
+	struct parent *parent = NULL;
+	int i, j;
+
+	memcpy(&gid, ibd->dev_addr + 4, sizeof(union ib_gid));
+	num_queues = num_online_cpus();
+	num_queues = roundup_pow_of_two(num_queues);
+
+	parent_dev = alloc_netdev_mq(sizeof(struct parent), "",
+				     parent_setup, num_queues);
+	if (!parent_dev) {
+		pr_err("%s failed to alloc netdev!\n", ibd->name);
+		rc = -ENOMEM;
+		goto out_rtnl;
+	}
+
+	rc = dev_alloc_name(parent_dev, "eth%d");
+	if (rc < 0)
+		goto out_netdev;
+
+	/* derive the eth mac address from the port GUID (eIPoIB mac format) */
+	for (i = 0, j = 0; i < 8; i++) {
+		if ((PARENT_MAC_MASK >> i) & 0x1) {
+			if (j < 6) /* only 6 bytes eth address */
+				parent_dev->dev_addr[j] =
+					gid.raw[GUID_LEN + i];
+			j++;
+		}
+	}
+
+	/* assumes that ibd->dev.parent has already been set */
+	SET_NETDEV_DEV(parent_dev, ibd->dev.parent);
+
+	dev_net_set(parent_dev, &init_net);
+
+	rc = register_netdevice(parent_dev);
+	if (rc < 0)
+		goto out_parent;
+
+	rc = parent_create_sysfs_entry(netdev_priv(parent_dev));
+	if (rc < 0)
+		goto out_unreg;
+
+	parent = netdev_priv(parent_dev);
+	memcpy(parent->gid.raw, gid.raw, GID_LEN);
+	strncpy(parent->ipoib_main_interface, ibd->name, IFNAMSIZ);
+	parent_dev->dev_id = ibd->dev_id;
+
+	return parent;
+
+out_unreg:
+	unregister_netdevice(parent_dev);
+out_parent:
+	parent_deinit(parent_dev);
+out_netdev:
+	free_netdev(parent_dev);
+out_rtnl:
+	return ERR_PTR(rc);
+}
+
+static void parent_free(struct parent *parent)
+{
+	struct net_device *parent_dev = parent->dev;
+
+	parent_work_cancel_all(parent);
+
+	parent_release_all(parent_dev);
+
+	unregister_netdevice(parent_dev);
+}
+
+static void parent_free_all(void)
+{
+	struct parent *parent, *nxt;
+
+	list_for_each_entry_safe(parent, nxt, &parent_dev_list, parent_list)
+		parent_free(parent);
+}
+
+/* netdev events handlers */
+static inline int is_ipoib_pif_intf(struct net_device *dev)
+{
+	return dev->type == ARPHRD_INFINIBAND &&
+	       (dev->priv_flags & IFF_EIPOIB_PIF);
+}
+
+static int parent_event_changename(struct parent *parent)
+{
+	parent_destroy_sysfs_entry(parent);
+
+	parent_create_sysfs_entry(parent);
+
+	return NOTIFY_DONE;
+}
+
+static int parent_master_netdev_event(unsigned long event,
+				      struct net_device *parent_dev)
+{
+	struct parent *event_parent = netdev_priv(parent_dev);
+
+	switch (event) {
+	case NETDEV_CHANGENAME:
+		pr_info("%s: got NETDEV_CHANGENAME event\n", parent_dev->name);
+		return parent_event_changename(event_parent);
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int parent_slave_netdev_event(unsigned long event,
+				     struct net_device *slave_dev)
+{
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent;
+
+	if (!parent_dev) {
+		pr_err("slave %s has no parent\n", slave_dev->name);
+		return NOTIFY_DONE;
+	}
+
+	parent = netdev_priv(parent_dev);
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		parent_release_slave(parent_dev, slave_dev);
+		break;
+	case NETDEV_CHANGE:
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+		parent_set_carrier(parent);
+		break;
+	case NETDEV_CHANGEMTU:
+		parent_set_mtu(parent);
+		break;
+	case NETDEV_CHANGENAME:
+		break;
+	case NETDEV_FEAT_CHANGE:
+		parent_compute_features(parent);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int eipoib_netdev_event(struct notifier_block *this,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = (struct net_device *)ptr;
+
+	if (dev_net(event_dev) != &init_net)
+		return NOTIFY_DONE;
+
+	if (is_parent(event_dev))
+		return parent_master_netdev_event(event, event_dev);
+
+	if (is_slave(event_dev))
+		return parent_slave_netdev_event(event, event_dev);
+	/*
+	 * general network device triggers event, check if it is new
+	 * ib interface that we want to enslave.
+	 */
+	return eipoib_device_event(this, event, ptr);
+}
+
+static struct notifier_block parent_netdev_notifier = {
+	.notifier_call = eipoib_netdev_event,
+};
+
+static int eipoib_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct parent *parent;
+
+	if (!is_ipoib_pif_intf(dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		parent = parent_create(dev);
+		if (IS_ERR(parent))
+			pr_warn("failed to create parent for %s\n", dev->name);
+		break;
+	case NETDEV_UNREGISTER:
+		parent = get_parent_by_pif_name(dev->name);
+		if (parent)
+			parent_free(parent);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int __init mod_init(void)
+{
+	int rc;
+
+	pr_info(DRV_NAME": %s", version);
+
+	rc = register_pernet_subsys(&eipoib_net_ops);
+	if (rc)
+		goto out;
+
+	rc = register_netdevice_notifier(&parent_netdev_notifier);
+	if (rc) {
+		pr_err("%s: failed to register netdevice notifier, rc %d\n",
+		       __func__, rc);
+		goto unreg_subsys;
+	}
+
+	return 0;
+
+unreg_subsys:
+	unregister_pernet_subsys(&eipoib_net_ops);
+out:
+	return rc;
+}
+
+static void __exit mod_exit(void)
+{
+	unregister_netdevice_notifier(&parent_netdev_notifier);
+
+	unregister_pernet_subsys(&eipoib_net_ops);
+
+	rtnl_lock();
+	parent_free_all();
+	rtnl_unlock();
+}
+
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+MODULE_DESCRIPTION(DRV_DESCRIPTION ", v" DRV_VERSION);
+MODULE_AUTHOR("Ali Ayoub and Erez Shitrit");