
[ovs-dev,RFC] Introduce representor port plugging support

Message ID 20210509140305.1910796-1-frode.nordahl@canonical.com
State New
Series [ovs-dev,RFC] Introduce representor port plugging support

Commit Message

Frode Nordahl May 9, 2021, 2:03 p.m. UTC
Introduce plugging module that adds and removes ports on the
integration bridge, as directed by Port_Binding options.

Traditionally it has been the CMS's responsibility to create Virtual
Interfaces (VIFs) as part of the instance (Container, Pod, Virtual
Machine, etc.) life cycle, and subsequently manage plug/unplug
operations on the Open vSwitch integration bridge.

With the advent of NICs connected to multiple distinct CPUs, we can
have a topology where the instance runs on one host while Open
vSwitch and OVN run on a different host, the smartnic CPU.

The act of plugging and unplugging the representor port in Open
vSwitch running on the smartnic host CPU would be the same for
every smartnic variant (thanks to the devlink-port[0][1]
infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
As such, it is natural to extend OVN to provide this common
functionality through its CMS-facing API.

The instance will be connected to an SR-IOV Virtual Function or an
RDMA Mediated Device on the host system (the latter not currently
addressed in this implementation). The smartnic driver will
maintain a representor port for each of the host-visible devices
on the smartnic CPU side.

It is the CMS's responsibility to maintain a mapping between
instance host and smartnic host; OVN can help by optionally
providing details such as the board serial number of the smartnic
system as part of Chassis registration.

The CMS will use its knowledge of the instance host <-> smartnic
host mapping to add an appropriate `requested-chassis` option,
along with the information OVN needs to identify the representor
port, when creating Logical Switch Ports for instances. These
options will be copied over to the Port_Binding table by
ovn-northd.

OVN will use the devlink-port[0][1] interface to look up which
representor port corresponds to the host-visible resource and
maintain the presence of this representor port on the integration
bridge.

0: https://www.kernel.org/doc/html/latest/networking/devlink/devlink-port.html
1: https://mail.openvswitch.org/pipermail/ovs-dev/2021-May/382701.html
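
For illustration, here is a minimal sketch (not part of the patch) of how
the "pf-mac" and "vf-num" Port_Binding options used by the plugging module
resolve to a representor netdev name through the devlink-derived table; the
helper name is invented for the example:

    /* Illustrative sketch only: resolve Port_Binding options to a representor
     * netdev name using the "<pf-mac>[-<vf-num>]" keying that plugging.c
     * derives from devlink-port data.  The helper name is invented here. */
    #include <stdio.h>

    #include "openvswitch/shash.h"
    #include "smap.h"

    static const char *
    lookup_representor_netdev(const struct shash *devlink_ports,
                              const struct smap *pb_options)
    {
        /* Example values: pf-mac="00:53:00:00:00:42", vf-num="42";
         * vf-num is absent for the PF representor itself. */
        const char *pf_mac = smap_get(pb_options, "pf-mac");
        const char *vf_num = smap_get(pb_options, "vf-num");
        char key[17 + 1 + 10 + 1];

        if (!pf_mac) {
            return NULL;
        }
        if (vf_num) {
            snprintf(key, sizeof key, "%s-%s", pf_mac, vf_num);
        } else {
            snprintf(key, sizeof key, "%s", pf_mac);
        }
        /* devlink_ports maps this key to a netdev name such as "pf0vf42". */
        return shash_find_data(devlink_ports, key);
    }

The CMS-facing side is the mirror image: the same options are set on the
Logical Switch Port and copied into Port_Binding by ovn-northd as described
above.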

TODO:

* Implement run-time refresh or incremental updates to the
  devlink_ports hash by opening a netlink socket and monitoring
  devlink changes.

* Make use of an index for unbound ports.

* Make use of the existing Incremental Processing support for
  Port_Bindings and call into the plugging module from the binding
  module instead of having a separate run loop.

* Implement functionality to convey smartnic-identifying
  characteristics as external-ids on Chassis registration. The
  intention is to help the CMS map relationships between hosts. The
  source can be the board serial number obtained from devlink-info,
  PCI VPD, or a value stored in the local Open vSwitch database
  maintained by the operator.

* Write tests.

* Update documentation.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 controller/automake.mk      |   2 +
 controller/binding.c        |  17 +-
 controller/lport.c          |   9 +
 controller/lport.h          |   2 +
 controller/ovn-controller.c |  12 +
 controller/plugging.c       | 422 ++++++++++++++++++++++++++++++++++++
 controller/plugging.h       |  81 +++++++
 7 files changed, 532 insertions(+), 13 deletions(-)
 create mode 100644 controller/plugging.c
 create mode 100644 controller/plugging.h

Comments

Ilya Maximets May 13, 2021, 3:12 p.m. UTC | #1
On 5/9/21 4:03 PM, Frode Nordahl wrote:
> Introduce plugging module that adds and removes ports on the
> integration bridge, as directed by Port_Binding options.
> 
> Traditionally it has been the CMSs responsibility to create Virtual
> Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> Machine etc.) life cycle, and subsequently manage plug/unplug
> operations on the Open vSwitch integration bridge.
> 
> With the advent of NICs connected to multiple distinct CPUs we can
> have a topology where the instance runs on one host and Open
> vSwitch and OVN runs on a different host, the smartnic CPU.
> 
> The act of plugging and unplugging the representor port in Open
> vSwitch running on the smartnic host CPU would be the same for
> every smartnic variant (thanks to the devlink-port[0][1]
> infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> As such it is natural to extend OVN to provide this common
> functionality through its CMS facing API.

Hi, Frode.  Thanks for putting this together, but it doesn't look
natural to me.  OVN, AFAIK, never touched physical devices or
interacted with the kernel directly.  This change introduces completely
new functionality inside OVN.  With the same effect we can run a fully
separate service on these smartnic CPUs that will do the plugging
and configuration job for the CMS.  You may even make it independent
from a particular CMS by creating a REST API for it or whatever.
This will additionally allow using the same service for non-OVN setups.

Interactions with physical devices also make OVN Linux-dependent,
at least for this use case, IIUC.

Maybe others have different opinions.

Another thought is that there is, obviously, a network connection
between the host and smartnic system.  Maybe it's possible to just
add an extra remote to the local ovsdb-server so CMS daemon on the
host system could just add interfaces over the network connection?

Best regards, Ilya Maximets.
Frode Nordahl May 13, 2021, 4:25 p.m. UTC | #2
On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > Introduce plugging module that adds and removes ports on the
> > integration bridge, as directed by Port_Binding options.
> >
> > Traditionally it has been the CMSs responsibility to create Virtual
> > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > Machine etc.) life cycle, and subsequently manage plug/unplug
> > operations on the Open vSwitch integration bridge.
> >
> > With the advent of NICs connected to multiple distinct CPUs we can
> > have a topology where the instance runs on one host and Open
> > vSwitch and OVN runs on a different host, the smartnic CPU.
> >
> > The act of plugging and unplugging the representor port in Open
> > vSwitch running on the smartnic host CPU would be the same for
> > every smartnic variant (thanks to the devlink-port[0][1]
> > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > As such it is natural to extend OVN to provide this common
> > functionality through its CMS facing API.
>
> Hi, Frode.  Thanks for putting this together, but it doesn't look
> natural to me.  OVN, AFAIK, never touched physical devices or
> interacted with the kernel directly.  This change introduces completely
> new functionality inside OVN.  With the same effect we can run a fully
> separate service on these smartnic CPUs that will do plugging
> and configuration job for CMS.  You may even make it independent
> from a particular CMS by creating a REST API for it or whatever.
> This will additionally allow using same service for non-OVN setups.

Ilya,

Thank you for taking the time to comment, much appreciated.

Yes, this is new functionality; NICs with separate control plane CPUs
and isolation from the host are also new, so this is one proposal for
how we could go about enabling the use of them.

The OVN controller does today get pretty close to the physical realm
by maintaining patch ports in Open vSwitch based on bridge mapping
configuration and the presence of bridges to physical interfaces. It
also reacts to events of physical interfaces being plugged into the
Open vSwitch instance it manages, albeit to date some other entity has
been doing the act of adding the port to the bridge.

The rationale for proposing to use the OVN database for coordinating
this is that the information about which ports to bind, and where to
bind them is already there. The timing of the information flow from
the CMS is also suitable for the task.

OVN relies on OVS library code, and all the necessary libraries for
interfacing with the kernel through netlink and friends are there or
would be easy to add. The rationale for using the netlink-devlink
interface is that it provides a generic infrastructure for these types
of NICs. So by using this interface we should be able to support most
if not all of the variants of these cards.
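
For reference, consuming that interface boils down to a short dump loop; a
trimmed sketch assuming the netlink-devlink library from the series
referenced in the commit message (plugging_init() in the patch contains the
complete version):

    /* Trimmed sketch of consuming the proposed netlink-devlink interface;
     * plugging_init() in the patch has the full version. */
    #include <linux/devlink.h>

    #include "netlink-devlink.h"

    static void
    dump_devlink_ports_example(void)
    {
        struct nl_dl_dump_state *port_dump = nl_dl_dump_init();
        struct dl_port port_entry;

        if (nl_dl_dump_init_error(port_dump)) {
            return;
        }
        nl_dl_dump_start(DEVLINK_CMD_PORT_GET, port_dump);
        while (nl_dl_port_dump_next(port_dump, &port_entry)) {
            /* port_entry describes one devlink port: flavour (PHYSICAL,
             * PCI_PF, PCI_VF, ...), netdev name, PF/VF numbers and, when the
             * kernel provides it, the function MAC address. */
        }
        nl_dl_dump_finish(port_dump);
        nl_dl_dump_destroy(port_dump);
    }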


Providing a separate OVN service to do the task could work, but would
have the cost of an extra SB DB connection, IDL and monitors.

I fear it would be quite hard to build a whole separate project with
its own API; it feels like a lot of duplicated effort when the flow of
data and APIs in OVN already aligns so well with CMSs interested in
using this.

> Interactions with physical devices also makes OVN linux-dependent
> at least for this use case, IIUC.

This specific bit would be Linux-specific in the first iteration, yes.
But the vendors manufacturing and distributing the hardware often
have drivers for other platforms, and I am sure the necessary
infrastructure will become available there too over time, if it is not
there already.

We do currently have platform specific macros in the OVN build system,
so we could enable the functionality when built on a compatible
platform.

> Maybe, others has different opinions.

I appreciate your opinion, and enjoy discussing this topic.

> Another though is that there is, obviously, a network connection
> between the host and smartnic system.  Maybe it's possible to just
> add an extra remote to the local ovsdb-server so CMS daemon on the
> host system could just add interfaces over the network connection?

There are a few issues with such an approach. One of the main goals
of providing and using a NIC with control plane CPUs is having an
extra layer of security and isolation which is separate from the
hypervisor host the card happens to share a PCI complex with and draw
power from. Requiring a connection between the two for operation would
defeat this purpose.

In addition to that, this class of cards provides visibility into
kernel interfaces, enumeration of representor ports, etc. only from the
NIC control plane CPU side of the PCI complex; this information is not
provided to the host. So if a hypervisor host CMS agent were to do the
plugging through a remote ovsdb connection, it would have to
communicate with something else running on the NIC control plane CPU
to retrieve the information it needs before it can know what to relay
back over the ovsdb connection.

--
Frode Nordahl

> Best regards, Ilya Maximets.
Han Zhou June 10, 2021, 6:36 a.m. UTC | #3
On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com>
wrote:
>
> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > Introduce plugging module that adds and removes ports on the
> > > integration bridge, as directed by Port_Binding options.
> > >
> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > operations on the Open vSwitch integration bridge.
> > >
> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > have a topology where the instance runs on one host and Open
> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > >
> > > The act of plugging and unplugging the representor port in Open
> > > vSwitch running on the smartnic host CPU would be the same for
> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > As such it is natural to extend OVN to provide this common
> > > functionality through its CMS facing API.
> >
> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > natural to me.  OVN, AFAIK, never touched physical devices or
> > interacted with the kernel directly.  This change introduces completely
> > new functionality inside OVN.  With the same effect we can run a fully
> > separate service on these smartnic CPUs that will do plugging
> > and configuration job for CMS.  You may even make it independent
> > from a particular CMS by creating a REST API for it or whatever.
> > This will additionally allow using same service for non-OVN setups.
>
> Ilya,
>
> Thank you for taking the time to comment, much appreciated.
>
> Yes, this is new functionality, NICs with separate control plane CPUs
> and isolation from the host are also new, so this is one proposal for
> how we could go about to enable the use of them.
>
> The OVN controller does today get pretty close to the physical realm
> by maintaining patch ports in Open vSwitch based on bridge mapping
> configuration and presence of bridges to physical interfaces. It also
> does react to events of physical interfaces being plugged into the
> Open vSwitch instance it manages, albeit to date some other entity has
> been doing the act of adding the port into the bridge.
>
> The rationale for proposing to use the OVN database for coordinating
> this is that the information about which ports to bind, and where to
> bind them is already there. The timing of the information flow from
> the CMS is also suitable for the task.
>
> OVN relies on OVS library code, and all the necessary libraries for
> interfacing with the kernel through netlink and friends are there or
> would be easy to add. The rationale for using the netlink-devlink
> interface is that it provides a generic infrastructure for these types
> of NICs. So by using this interface we should be able to support most
> if not all of the variants of these cards.
>
>
> Providing a separate OVN service to do the task could work, but would
> have the cost of an extra SB DB connection, IDL and monitors.
>
> I fear it would be quite hard to build a whole separate project with
> its own API, feels like a lot of duplicated effort when the flow of
> data and APIs in OVN already align so well with CMSs interested in
> using this?
>
> > Interactions with physical devices also makes OVN linux-dependent
> > at least for this use case, IIUC.
>
> This specific bit would be linux-specific in the first iteration, yes.
> But the vendors manufacturing and distributing the hardware do often
> have drivers for other platforms, I am sure the necessary
> infrastructure will become available there too over time, if it is not
> there already.
>
> We do currently have platform specific macros in the OVN build system,
> so we could enable the functionality when built on a compatible
> platform.
>
> > Maybe, others has different opinions.
>
> I appreciate your opinion, and enjoy discussing this topic.
>
> > Another though is that there is, obviously, a network connection
> > between the host and smartnic system.  Maybe it's possible to just
> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > host system could just add interfaces over the network connection?
>
> There are a few issues with such an approach. One of the main goals
> with providing and using a NIC with control plane CPUs is having an
> extra layer of security and isolation which is separate from the
> hypervisor host the card happens to share a PCI complex with and draw
> power from. Requiring a connection between the two for operation would
> defy this purpose.
>
> In addition to that, this class of cards provide visibility into
> kernel interfaces, enumeration of representor ports etc. only from the
> NIC control plane CPU side of the PCI complex, this information is not
> provided to the host. So if a hypervisor host CMS agent were to do the
> plugging through a remote ovsdb connection, it would have to
> communicate with something else running on the NIC control plane CPU
> to retrieve the information it needs before it can know what to relay
> back over the ovsdb connection.
>
> --
> Frode Nordahl
>
> > Best regards, Ilya Maximets.

Here are my 2 cents.

Initially I had similar concerns to Ilya, and it seems OVN should stay away
from the physical interface plugging. As a reference, here is how
ovn-kubernetes is doing it without adding anything to OVN:
https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing

However, thinking more about it, the proposed approach in this patch just
expands the way OVN can bind ports, utilizing the communication channel
of OVN (OVSDB connections). If all the information regarding port binding
can be specified by the CMS from NB, then it is not unnatural for
ovn-controller to perform interface binding directly (instead of passively
accepting what is attached by CMS). This kind of information already
existed to some extent - the "requested_chassis" option in OpenStack. Now
it seems this idea is just expanding it to a specific interface. The
difference is that "requested_chassis" is used for validation only, but now
we want to directly apply it. So I think at least I don't have a strong
opinion on the idea.

There are some benefits:
1) The mechanism can be reused by different CMSes, which may simplify CMS
implementation.
2) Compared with the ovn-k8s approach, it reuses OVN's communication
channel, which avoids an extra CMS communication channel on the smart NIC
side. (of course this can be achieved by a connection between the BM and
smart NIC with *restricted* API just to convey the necessary information)

As to the negative side, it would increase OVN's complexity, and, as
mentioned by Ilya, potentially break OVN's platform independence. To avoid
this, I think the *plugging* module itself needs to be independent and
pluggable. It can be extended as independent plugins. The plugin would need
to define what information is needed in LSP's "options", and then implement
corresponding drivers. With this approach, even the regular VIFs can be
attached by ovn-controller if CMS can tell the interface name. Anyway, this
is just my brief thinking.
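
To make that concrete, a purely hypothetical sketch of what such a driver
interface could look like; every name and field below is invented for
illustration and none of it exists in the patch:

    /* Hypothetical sketch of a pluggable VIF-plugging driver interface.
     * All names and fields are invented for illustration. */
    #include <stdbool.h>

    struct smap;                 /* key/value options from the LSP/Port_Binding */
    struct plug_ctx;             /* opaque per-driver state */

    struct plug_class {
        const char *type;        /* e.g. "representor"; selected via LSP options */

        /* Option keys this driver expects in the LSP "options" column,
         * e.g. "pf-mac" and "vf-num" for a representor driver. */
        const char *const *required_options;

        int (*init)(struct plug_ctx **ctxp);
        void (*destroy)(struct plug_ctx *ctx);

        /* Resolve options to a local interface name to plug into br-int;
         * returns false if no such interface exists on this chassis. */
        bool (*port_prepare)(struct plug_ctx *ctx, const struct smap *options,
                             char **netdev_namep);

        /* Clean up local state after the port has been removed from br-int. */
        void (*port_release)(struct plug_ctx *ctx, const char *netdev_name);
    };

With something along these lines the devlink-specific code would be
contained in one driver, which would also keep the platform-dependence
concern local to that driver.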

Thanks,
Han
Ilya Maximets June 10, 2021, 11:46 a.m. UTC | #4
On 6/10/21 8:36 AM, Han Zhou wrote:
> 
> 
> On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
>>
>> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>> >
>> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
>> > > Introduce plugging module that adds and removes ports on the
>> > > integration bridge, as directed by Port_Binding options.
>> > >
>> > > Traditionally it has been the CMSs responsibility to create Virtual
>> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
>> > > Machine etc.) life cycle, and subsequently manage plug/unplug
>> > > operations on the Open vSwitch integration bridge.
>> > >
>> > > With the advent of NICs connected to multiple distinct CPUs we can
>> > > have a topology where the instance runs on one host and Open
>> > > vSwitch and OVN runs on a different host, the smartnic CPU.
>> > >
>> > > The act of plugging and unplugging the representor port in Open
>> > > vSwitch running on the smartnic host CPU would be the same for
>> > > every smartnic variant (thanks to the devlink-port[0][1]
>> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
>> > > As such it is natural to extend OVN to provide this common
>> > > functionality through its CMS facing API.
>> >
>> > Hi, Frode.  Thanks for putting this together, but it doesn't look
>> > natural to me.  OVN, AFAIK, never touched physical devices or
>> > interacted with the kernel directly.  This change introduces completely
>> > new functionality inside OVN.  With the same effect we can run a fully
>> > separate service on these smartnic CPUs that will do plugging
>> > and configuration job for CMS.  You may even make it independent
>> > from a particular CMS by creating a REST API for it or whatever.
>> > This will additionally allow using same service for non-OVN setups.
>>
>> Ilya,
>>
>> Thank you for taking the time to comment, much appreciated.
>>
>> Yes, this is new functionality, NICs with separate control plane CPUs
>> and isolation from the host are also new, so this is one proposal for
>> how we could go about to enable the use of them.
>>
>> The OVN controller does today get pretty close to the physical realm
>> by maintaining patch ports in Open vSwitch based on bridge mapping
>> configuration and presence of bridges to physical interfaces. It also
>> does react to events of physical interfaces being plugged into the
>> Open vSwitch instance it manages, albeit to date some other entity has
>> been doing the act of adding the port into the bridge.
>>
>> The rationale for proposing to use the OVN database for coordinating
>> this is that the information about which ports to bind, and where to
>> bind them is already there. The timing of the information flow from
>> the CMS is also suitable for the task.
>>
>> OVN relies on OVS library code, and all the necessary libraries for
>> interfacing with the kernel through netlink and friends are there or
>> would be easy to add. The rationale for using the netlink-devlink
>> interface is that it provides a generic infrastructure for these types
>> of NICs. So by using this interface we should be able to support most
>> if not all of the variants of these cards.
>>
>>
>> Providing a separate OVN service to do the task could work, but would
>> have the cost of an extra SB DB connection, IDL and monitors.

IMHO, the CMS should never connect to the Southbound DB.  The
Southbound DB is not meant to be a public interface; it just happened
to be available for connections.  I know that OpenStack has metadata
agents that connect to the Sb DB, but if it's really required for them,
I think there should be a different way to get/set the required
information without a connection to the Southbound.

>>
>> I fear it would be quite hard to build a whole separate project with
>> its own API, feels like a lot of duplicated effort when the flow of
>> data and APIs in OVN already align so well with CMSs interested in
>> using this?
>>
>> > Interactions with physical devices also makes OVN linux-dependent
>> > at least for this use case, IIUC.
>>
>> This specific bit would be linux-specific in the first iteration, yes.
>> But the vendors manufacturing and distributing the hardware do often
>> have drivers for other platforms, I am sure the necessary
>> infrastructure will become available there too over time, if it is not
>> there already.
>>
>> We do currently have platform specific macros in the OVN build system,
>> so we could enable the functionality when built on a compatible
>> platform.
>>
>> > Maybe, others has different opinions.
>>
>> I appreciate your opinion, and enjoy discussing this topic.
>>
>> > Another though is that there is, obviously, a network connection
>> > between the host and smartnic system.  Maybe it's possible to just
>> > add an extra remote to the local ovsdb-server so CMS daemon on the
>> > host system could just add interfaces over the network connection?
>>
>> There are a few issues with such an approach. One of the main goals
>> with providing and using a NIC with control plane CPUs is having an
>> extra layer of security and isolation which is separate from the
>> hypervisor host the card happens to share a PCI complex with and draw
>> power from. Requiring a connection between the two for operation would
>> defy this purpose.
>>
>> In addition to that, this class of cards provide visibility into
>> kernel interfaces, enumeration of representor ports etc. only from the
>> NIC control plane CPU side of the PCI complex, this information is not
>> provided to the host. So if a hypervisor host CMS agent were to do the
>> plugging through a remote ovsdb connection, it would have to
>> communicate with something else running on the NIC control plane CPU
>> to retrieve the information it needs before it can know what to relay
>> back over the ovsdb connection.
>>
>> --
>> Frode Nordahl
>>
>> > Best regards, Ilya Maximets.
> 
> Here are my 2 cents.
> 
> Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing

AFAICT, a big part of the work is already done on the ovn-k8s side:
  https://github.com/ovn-org/ovn-kubernetes/pull/2005
  https://github.com/ovn-org/ovn-kubernetes/pull/2042

> 
> However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.

While it's probably OK for OVN to add a port to the OVSDB, in many
cases these ports will require a lot of extra configuration which
is typically done by os-vif or CNI/device plugins.  Imagine that OVS
is running with the userspace datapath and you need to plug in some
DPDK ports, where you have to specify the port type, the DPDK port
config, probably also the number of rx/tx queues and the number of
descriptors in these queues.  You may also want to configure affinity
for these queues per PMD thread in OVS.  For kernel interfaces it
might be easier, but they also might require some extra configuration
that OVN will have to think about now.  Configuring this kind of stuff
is a typical job for the CMS, and that is why projects like os-vif and
a large variety of CNI/device plugins exist.
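
For illustration, this is roughly the kind of per-Interface configuration
such a plugin applies today for a userspace-datapath port, expressed with
the OVS IDL setters; all values below are examples only:

    /* Example only: the extra Interface configuration referred to above,
     * as a CMS-side plugin would typically set it for a DPDK port.
     * Values are illustrative. */
    #include "lib/vswitch-idl.h"
    #include "smap.h"

    static void
    example_configure_dpdk_iface(const struct ovsrec_interface *iface)
    {
        ovsrec_interface_set_type(iface, "dpdk");

        struct smap options = SMAP_INITIALIZER(&options);
        smap_add(&options, "dpdk-devargs", "0000:05:00.0"); /* example PCI address */
        smap_add(&options, "n_rxq", "4");                   /* example rx queue count */
        smap_add(&options, "n_rxq_desc", "2048");           /* example descriptor counts */
        smap_add(&options, "n_txq_desc", "2048");
        ovsrec_interface_set_options(iface, &options);
        smap_destroy(&options);

        struct smap other_config = SMAP_INITIALIZER(&other_config);
        smap_add(&other_config, "pmd-rxq-affinity", "0:3,1:7"); /* example pinning */
        ovsrec_interface_set_other_config(iface, &other_config);
        smap_destroy(&other_config);
    }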

> 
> There are some benefits:
> 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)

The problem I see is that at least ovn-k8s, AFAIU, will require
the daemon on the Smart NIC anyway to monitor and configure
OVN components, e.g. to configure ovn-remote for ovn-controller
or run management appctl commands if required.
So, I don't see why it can't do the plugging if it's already
there.

> 
> As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.

Aside from plugging,
I also don't see the reason to have devlink code in OVN just
because it runs only once at the init stage, IIUC.  So, I don't
understand why this information about the hardware cannot be
retrieved during the host provisioning and stored somewhere
(Nova config?).  Isn't hardware inventory a job for tripleo
or something?

Best regards, Ilya Maximets.
Frode Nordahl June 10, 2021, 2:13 p.m. UTC | #5
On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/10/21 8:36 AM, Han Zhou wrote:
> >
> >
> > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >> >
> >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> >> > > Introduce plugging module that adds and removes ports on the
> >> > > integration bridge, as directed by Port_Binding options.
> >> > >
> >> > > Traditionally it has been the CMSs responsibility to create Virtual
> >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> >> > > operations on the Open vSwitch integration bridge.
> >> > >
> >> > > With the advent of NICs connected to multiple distinct CPUs we can
> >> > > have a topology where the instance runs on one host and Open
> >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> >> > >
> >> > > The act of plugging and unplugging the representor port in Open
> >> > > vSwitch running on the smartnic host CPU would be the same for
> >> > > every smartnic variant (thanks to the devlink-port[0][1]
> >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> >> > > As such it is natural to extend OVN to provide this common
> >> > > functionality through its CMS facing API.
> >> >
> >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> >> > natural to me.  OVN, AFAIK, never touched physical devices or
> >> > interacted with the kernel directly.  This change introduces completely
> >> > new functionality inside OVN.  With the same effect we can run a fully
> >> > separate service on these smartnic CPUs that will do plugging
> >> > and configuration job for CMS.  You may even make it independent
> >> > from a particular CMS by creating a REST API for it or whatever.
> >> > This will additionally allow using same service for non-OVN setups.
> >>
> >> Ilya,
> >>
> >> Thank you for taking the time to comment, much appreciated.
> >>
> >> Yes, this is new functionality, NICs with separate control plane CPUs
> >> and isolation from the host are also new, so this is one proposal for
> >> how we could go about to enable the use of them.
> >>
> >> The OVN controller does today get pretty close to the physical realm
> >> by maintaining patch ports in Open vSwitch based on bridge mapping
> >> configuration and presence of bridges to physical interfaces. It also
> >> does react to events of physical interfaces being plugged into the
> >> Open vSwitch instance it manages, albeit to date some other entity has
> >> been doing the act of adding the port into the bridge.
> >>
> >> The rationale for proposing to use the OVN database for coordinating
> >> this is that the information about which ports to bind, and where to
> >> bind them is already there. The timing of the information flow from
> >> the CMS is also suitable for the task.
> >>
> >> OVN relies on OVS library code, and all the necessary libraries for
> >> interfacing with the kernel through netlink and friends are there or
> >> would be easy to add. The rationale for using the netlink-devlink
> >> interface is that it provides a generic infrastructure for these types
> >> of NICs. So by using this interface we should be able to support most
> >> if not all of the variants of these cards.
> >>
> >>
> >> Providing a separate OVN service to do the task could work, but would
> >> have the cost of an extra SB DB connection, IDL and monitors.
>
> IMHO, CMS should never connect to Southbound DB.  It's just because the
> Southbound DB is not meant to be a public interface, it just happened
> to be available for connections.  I know that OpenStack has metadata
> agents that connects to Sb DB, but if it's really required for them, I
> think, there should be a different way to get/set required information
> without connection to the Southbound.

The CMS-facing API is the Northbound DB; I was not suggesting direct
use of the Southbound DB by services external to OVN. My suggestion
was to have a separate OVN process do this if your objection was to
handling it as part of the ovn-controller process.

> >>
> >> I fear it would be quite hard to build a whole separate project with
> >> its own API, feels like a lot of duplicated effort when the flow of
> >> data and APIs in OVN already align so well with CMSs interested in
> >> using this?
> >>
> >> > Interactions with physical devices also makes OVN linux-dependent
> >> > at least for this use case, IIUC.
> >>
> >> This specific bit would be linux-specific in the first iteration, yes.
> >> But the vendors manufacturing and distributing the hardware do often
> >> have drivers for other platforms, I am sure the necessary
> >> infrastructure will become available there too over time, if it is not
> >> there already.
> >>
> >> We do currently have platform specific macros in the OVN build system,
> >> so we could enable the functionality when built on a compatible
> >> platform.
> >>
> >> > Maybe, others has different opinions.
> >>
> >> I appreciate your opinion, and enjoy discussing this topic.
> >>
> >> > Another though is that there is, obviously, a network connection
> >> > between the host and smartnic system.  Maybe it's possible to just
> >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> >> > host system could just add interfaces over the network connection?
> >>
> >> There are a few issues with such an approach. One of the main goals
> >> with providing and using a NIC with control plane CPUs is having an
> >> extra layer of security and isolation which is separate from the
> >> hypervisor host the card happens to share a PCI complex with and draw
> >> power from. Requiring a connection between the two for operation would
> >> defy this purpose.
> >>
> >> In addition to that, this class of cards provide visibility into
> >> kernel interfaces, enumeration of representor ports etc. only from the
> >> NIC control plane CPU side of the PCI complex, this information is not
> >> provided to the host. So if a hypervisor host CMS agent were to do the
> >> plugging through a remote ovsdb connection, it would have to
> >> communicate with something else running on the NIC control plane CPU
> >> to retrieve the information it needs before it can know what to relay
> >> back over the ovsdb connection.
> >>
> >> --
> >> Frode Nordahl
> >>
> >> > Best regards, Ilya Maximets.
> >
> > Here are my 2 cents.
> >
> > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing
>
> AFAICT, a big part of the work is already done on the ovn-k8s side:
>   https://github.com/ovn-org/ovn-kubernetes/pull/2005
>   https://github.com/ovn-org/ovn-kubernetes/pull/2042

I am aware of the ongoing effort to implement support for this in
ovn-kubernetes directly. What we have identified is that there are
other CMSs that want this functionality, and with that we have an
opportunity to generalise and provide an abstraction in a common place
that all the consuming CMSes can benefit from.

> >
> > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
>
> While it's, probably, OK for OVN to add port to the OVSDB, in many
> cases these ports will require a lot of extra configuration which
> is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> is running with userspace datapath and you need to plug-in some DPDK
> ports, where you have to specify the port type, DPDK port config,
> porbably also number of rx/tx queues, number of descriptors in these
> queues.  You may also want to configure affinity for these queues per
> PMD thread in OVS.  For kernel interfaces it might be easier, but they
> also might require some extra configuration that OVN will have to
> think about now.  This is a typical job for CMS to configure this
> kind of stuff, and that is why projects like os-vif or large variety
> of CNI/device plugins exists.

CNI is Kubernetes-specific and os-vif is OpenStack-specific, and both
of them get their information from the CMS. Providing support for the
userspace datapath and DPDK would require more information, some of
which is available through devlink and some of which fits well in
key/value options. Our initial target would be to support the kernel
representor port workflow.

> >
> > There are some benefits:
> > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
>
> The problem I see is that at least ovn-k8s, AFAIU, will require
> the daemon on the Smart NIC anyways to monitor and configure
> OVN components, e.g. to configure ovn-remote for ovn-controller
> or run management appctl commands if required.
> So, I don't see the point why it can't do plugging if it's already
> there.

This pattern is Kubernetes-specific and is not the case for other
CMSes. The current proposal for enabling Smart NICs with control plane
CPUs for ovn-kubernetes could be simplified if the networking platform
provided means for coordinating more of the network-related bits.

> >
> > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
>
> Aside from plugging,
> I also don't see the reason to have devlink code in OVN just
> because it runs once on the init stage, IIUC.  So, I don't
> understand why this information about the hardware cannot be
> retrieved during the host provisioning and stored somewhere
> (Nova config?).  Isn't hardware inventory a job for tripleo
> or something?

As noted in one of the TODOs in the commit message of the RFC, one of
the work items to further develop this is to monitor devlink for
changes; the end product will not do one-time initialization.
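
A minimal sketch of what that run-time monitoring could look like, using
generic netlink helpers that already exist in the OVS tree (message parsing
and the actual refresh of the devlink_ports table are omitted):

    /* Minimal sketch: subscribe to devlink notifications so devlink_ports
     * can be refreshed at run time.  Error paths and message parsing are
     * omitted. */
    #include <linux/devlink.h>
    #include <linux/netlink.h>

    #include "netlink-socket.h"

    static struct nl_sock *
    devlink_monitor_open(void)
    {
        unsigned int mcgroup;
        struct nl_sock *sock;

        if (nl_lookup_genl_mcgroup(DEVLINK_GENL_NAME,
                                   DEVLINK_GENL_MCGRP_CONFIG_NAME, &mcgroup)
            || nl_sock_create(NETLINK_GENERIC, &sock)
            || nl_sock_join_mcgroup(sock, mcgroup)) {
            return NULL;
        }
        /* The main loop would nl_sock_wait()/nl_sock_recv() on this socket
         * and update the devlink_ports shash on DEVLINK_CMD_PORT_NEW and
         * DEVLINK_CMD_PORT_DEL notifications. */
        return sock;
    }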

While I agree with you that the current SR-IOV acceleration workflow
configuration is pretty static and can be done at deploy time, this
proposal prepares for the next-generation subfunction workflow, where
you will have much higher density and run-time configuration and
discovery of representor ports. There is a paper about it from netdev
conf[2], and all of this is part of the devlink infrastructure (Linux
kernel ABI).

2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf


--
Frode Nordahl

> Best regards, Ilya Maximets.

Patch

diff --git a/controller/automake.mk b/controller/automake.mk
index e664f1980..04e5708ec 100644
--- a/controller/automake.mk
+++ b/controller/automake.mk
@@ -26,6 +26,8 @@  controller_ovn_controller_SOURCES = \
 	controller/pinctrl.h \
 	controller/patch.c \
 	controller/patch.h \
+	controller/plugging.c \
+	controller/plugging.h \
 	controller/ovn-controller.c \
 	controller/ovn-controller.h \
 	controller/physical.c \
diff --git a/controller/binding.c b/controller/binding.c
index 4ca2b4d9a..66fa65091 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1081,15 +1081,6 @@  is_binding_lport_this_chassis(struct binding_lport *b_lport,
             b_lport->pb->chassis == chassis);
 }
 
-static bool
-can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
-                         const char *requested_chassis)
-{
-    return !requested_chassis || !requested_chassis[0]
-           || !strcmp(requested_chassis, chassis_rec->name)
-           || !strcmp(requested_chassis, chassis_rec->hostname);
-}
-
 /* Returns 'true' if the 'lbinding' has binding lports of type LP_CONTAINER,
  * 'false' otherwise. */
 static bool
@@ -1186,8 +1177,8 @@  consider_vif_lport(const struct sbrec_port_binding *pb,
                    struct hmap *qos_map)
 {
     const char *vif_chassis = smap_get(&pb->options, "requested-chassis");
-    bool can_bind = can_bind_on_this_chassis(b_ctx_in->chassis_rec,
-                                             vif_chassis);
+    bool can_bind = lport_can_bind_on_this_chassis(b_ctx_in->chassis_rec,
+                                                   vif_chassis);
 
     if (!lbinding) {
         lbinding = local_binding_find(&b_ctx_out->lbinding_data->bindings,
@@ -1278,8 +1269,8 @@  consider_container_lport(const struct sbrec_port_binding *pb,
     ovs_assert(parent_b_lport && parent_b_lport->pb);
     const char *vif_chassis = smap_get(&parent_b_lport->pb->options,
                                        "requested-chassis");
-    bool can_bind = can_bind_on_this_chassis(b_ctx_in->chassis_rec,
-                                             vif_chassis);
+    bool can_bind = lport_can_bind_on_this_chassis(b_ctx_in->chassis_rec,
+                                                   vif_chassis);
 
     return consider_vif_lport_(pb, can_bind, vif_chassis, b_ctx_in, b_ctx_out,
                                container_b_lport, qos_map);
diff --git a/controller/lport.c b/controller/lport.c
index 478fcfd82..e97cc17c5 100644
--- a/controller/lport.c
+++ b/controller/lport.c
@@ -120,3 +120,12 @@  mcgroup_lookup_by_dp_name(
 
     return retval;
 }
+
+bool
+lport_can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
+                               const char *requested_chassis)
+{
+    return !requested_chassis || !requested_chassis[0]
+           || !strcmp(requested_chassis, chassis_rec->name)
+           || !strcmp(requested_chassis, chassis_rec->hostname);
+}
diff --git a/controller/lport.h b/controller/lport.h
index 345efc184..4d9b6891d 100644
--- a/controller/lport.h
+++ b/controller/lport.h
@@ -54,5 +54,7 @@  lport_is_chassis_resident(struct ovsdb_idl_index *sbrec_port_binding_by_name,
                           const struct sbrec_chassis *chassis,
                           const struct sset *active_tunnels,
                           const char *port_name);
+bool lport_can_bind_on_this_chassis(const struct sbrec_chassis *,
+                                    const char *);
 
 #endif /* controller/lport.h */
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 67c51a86f..93b6e3e11 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -54,6 +54,7 @@ 
 #include "patch.h"
 #include "physical.h"
 #include "pinctrl.h"
+#include "plugging.h"
 #include "openvswitch/poll-loop.h"
 #include "lib/bitmap.h"
 #include "lib/hash.h"
@@ -269,6 +270,11 @@  update_sb_monitors(struct ovsdb_idl *ovnsb_idl,
          */
         sbrec_logical_flow_add_clause_logical_dp_group(&lf, OVSDB_F_NE, NULL);
     }
+    /* Monitor unbound LP_VIF ports to consider representor port plugging */
+    struct uuid zero_uuid;
+    memset(&zero_uuid, 0, sizeof(zero_uuid));
+    sbrec_port_binding_add_clause_chassis(&pb, OVSDB_F_EQ, &zero_uuid);
+    sbrec_port_binding_add_clause_type(&pb, OVSDB_F_EQ, "\"\"");
 
 out:;
     unsigned int cond_seqnos[] = {
@@ -2572,6 +2578,7 @@  main(int argc, char *argv[])
     binding_init();
     patch_init();
     pinctrl_init();
+    plugging_init();
     lflow_init();
 
     /* Connect to OVS OVSDB instance. */
@@ -3050,6 +3057,10 @@  main(int argc, char *argv[])
                             ovsrec_open_vswitch_table_get(ovs_idl_loop.idl),
                             ovsrec_port_table_get(ovs_idl_loop.idl),
                             br_int, chassis, &runtime_data->local_datapaths);
+                        plugging_run(ovs_idl_txn,
+                            sbrec_port_binding_table_get(ovnsb_idl_loop.idl),
+                            ovsrec_port_table_get(ovs_idl_loop.idl),
+                            br_int, chassis);
                         pinctrl_run(ovnsb_idl_txn,
                                     sbrec_datapath_binding_by_key,
                                     sbrec_port_binding_by_datapath,
@@ -3268,6 +3279,7 @@  loop_done:
     ofctrl_destroy();
     pinctrl_destroy();
     patch_destroy();
+    plugging_destroy();
 
     ovsdb_idl_loop_destroy(&ovs_idl_loop);
     ovsdb_idl_loop_destroy(&ovnsb_idl_loop);
diff --git a/controller/plugging.c b/controller/plugging.c
new file mode 100644
index 000000000..8e62dfcb9
--- /dev/null
+++ b/controller/plugging.c
@@ -0,0 +1,422 @@ 
+/* Copyright (c) 2021 Canonical
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/devlink.h>
+#include <net/if.h>
+
+#include "plugging.h"
+
+#include "hash.h"
+#include "lflow.h"
+#include "lib/vswitch-idl.h"
+#include "lport.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/vlog.h"
+#include "lib/ovn-sb-idl.h"
+#include "netlink-devlink.h"
+#include "ovn-controller.h"
+#include "openvswitch/shash.h"
+#include "packets.h"
+
+VLOG_DEFINE_THIS_MODULE(plugging);
+
+/* Contains netdev names of ports known to devlink, indexed by PF MAC
+ * address and logical function number (if applicable).
+ *
+ * Examples:
+ *     SR-IOV Physical Function: key "00:53:00:00:00:42"    value "pf0hpf"
+ *     SR-IOV Virtual Function:  key "00:53:00:00:00:42-42" value "pf0vf42"
+ */
+static struct shash devlink_ports;
+
+/* Max number of physical ports connected to a single NIC SoC. */
+#define MAX_NIC_PHY_PORTS 64
+/* string repr of eth MAC, '-', logical function number (uint32_t) */
+#define MAX_KEY_LEN (17 + 1 + 10 + 1)
+
+
+static bool compat_get_host_pf_mac(const char *, struct eth_addr *);
+
+static bool
+fill_devlink_ports_key_from_strs(char *buf, size_t bufsiz,
+                                const char *host_pf_mac,
+                                const char *function)
+{
+    return snprintf(buf, bufsiz,
+                    function != NULL ? "%s-%s": "%s",
+                    host_pf_mac, function) < bufsiz;
+}
+
+/* We deliberately pass the struct eth_addr by value as we would have to copy
+ * the data either way to make use of the ETH_ADDR_ARGS macro */
+static bool
+fill_devlink_ports_key_from_typed(char *buf, size_t bufsiz,
+                    struct eth_addr host_pf_mac,
+                    uint32_t function)
+{
+    return snprintf(
+        buf, bufsiz,
+        function < UINT32_MAX ? ETH_ADDR_FMT"-%"PRIu32 : ETH_ADDR_FMT,
+        ETH_ADDR_ARGS(host_pf_mac), function) < bufsiz;
+}
+
+static void
+devlink_port_add_function(struct dl_port *port_entry,
+                          struct eth_addr *host_pf_mac)
+{
+    char keybuf[MAX_KEY_LEN];
+    uint32_t function_number;
+
+    switch(port_entry->flavour) {
+    case DEVLINK_PORT_FLAVOUR_PCI_PF:
+        /* for Physical Function representor ports we only add the MAC address
+         * and no logical function number */
+        function_number = -1;
+        break;
+    case DEVLINK_PORT_FLAVOUR_PCI_VF:
+        function_number = port_entry->pci_vf_number;
+        break;
+    default:
+        VLOG_WARN("Unsupported flavour for port '%s': %s",
+            port_entry->netdev_name,
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PHYSICAL ? "PHYSICAL" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_CPU ? "CPU" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_DSA ? "DSA" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_PF ? "PCI_PF":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_VF ? "PCI_VF":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_VIRTUAL ? "VIRTUAL":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_UNUSED ? "UNUSED":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_SF ? "PCI_SF":
+            "UNKNOWN");
+        return;
+    };
+    /* Failure to fill key from typed values means calculation of the max key
+     * length is wrong, i.e. a bug. */
+    ovs_assert(fill_devlink_ports_key_from_typed(
+                            keybuf, sizeof(keybuf),
+                            *host_pf_mac, function_number));
+    shash_add(&devlink_ports, keybuf, xstrdup(port_entry->netdev_name));
+}
+
+
+void
+plugging_init(void)
+{
+    struct nl_dl_dump_state *port_dump;
+    struct dl_port port_entry;
+    int error;
+    struct eth_addr host_pf_macs[MAX_NIC_PHY_PORTS];
+
+    shash_init(&devlink_ports);
+
+    port_dump = nl_dl_dump_init();
+    if ((error = nl_dl_dump_init_error(port_dump))) {
+        VLOG_WARN(
+            "unable to start dump of ports from devlink-port interface");
+        return;
+    }
+    /* The core devlink infrastructure in the kernel keeps a linked list of
+     * the devices and each of those has a linked list of ports. These are
+     * populated by each device driver as devices are enumerated, and as such
+     * we can rely on ports being dumped in a consistent order on a device
+     * by device basis with logical numbering for each port flavour starting
+     * at 0 for each new device.
+     */
+    nl_dl_dump_start(DEVLINK_CMD_PORT_GET, port_dump);
+    while (nl_dl_port_dump_next(port_dump, &port_entry)) {
+        switch (port_entry.flavour) {
+        case DEVLINK_PORT_FLAVOUR_PHYSICAL:
+            /* The PHYSICAL flavoured port represents a network-facing port on
+             * the NIC.
+             *
+             * For kernel versions where the devlink-port infrastructure does
+             * not provide MAC address for PCI_PF flavoured ports, there exist
+             * a interface in sysfs which is relative to the name of the
+             * PHYSICAL port netdev name.
+             *
+             * Since at this point in the dump we do not know whether the MAC will
+             * be provided for the PCI_PF or not, proactively store the MAC
+             * address by looking up through the sysfs interface.
+             *
+             * If the MAC address is available once we get to the PCI_PF, we will
+             * overwrite the stored value.
+             */
+            if (port_entry.number >= MAX_NIC_PHY_PORTS) {
+                VLOG_WARN("physical port number out of range for port '%s': "
+                          "%"PRIu32,
+                          port_entry.netdev_name, port_entry.number);
+                continue;
+            }
+            compat_get_host_pf_mac(port_entry.netdev_name,
+                                   &host_pf_macs[port_entry.number]);
+            break;
+        case DEVLINK_PORT_FLAVOUR_PCI_PF: /* FALL THROUGH */
+            /* The PCI_PF flavoured port represents a host-facing port.
+             *
+             * For function flavours other than PHYSICAL pci_pf_number will be
+             * set to the logical number of the physical port the function
+             * belongs to.
+             */
+            if (!eth_addr_is_zero(port_entry.function.eth_addr)) {
+                host_pf_macs[port_entry.pci_pf_number] =
+                    port_entry.function.eth_addr;
+            }
+            /* FALL THROUGH */
+        case DEVLINK_PORT_FLAVOUR_PCI_VF:
+            /* The PCI_VF flavoured port represents a host-facing
+             * PCI Virtual Function.
+             *
+             * For function flavours other than PHYSICAL pci_pf_number will be
+             * set to the logical number of the physical port the function
+             * belongs to.
+             */
+            if (port_entry.pci_pf_number >= MAX_NIC_PHY_PORTS) {
+                VLOG_WARN("physical port number out of range for port '%s': "
+                          "%"PRIu32,
+                          port_entry.netdev_name, port_entry.pci_pf_number);
+                continue;
+            }
+            devlink_port_add_function(&port_entry,
+                                      &host_pf_macs[port_entry.pci_pf_number]);
+            break;
+        };
+    }
+    nl_dl_dump_finish(port_dump);
+    nl_dl_dump_destroy(port_dump);
+
+    struct shash_node *node;
+    SHASH_FOR_EACH (node, &devlink_ports) {
+        VLOG_INFO("HELLO %s -> %s", node->name, (char*)node->data);
+    }
+}
+
+void
+plugging_destroy(void)
+{
+    shash_destroy_free_data(&devlink_ports);
+}
+
+static bool
+match_port (const struct ovsrec_port *port, const char *name)
+{
+    return !name || !name[0]
+           || !strcmp(port->name, name);
+}
+
+/* Creates a port in bridge 'br_int' named 'name'.
+ *
+ * If such a port already exists, removes it from 'existing_ports'. */
+static void
+create_port(struct ovsdb_idl_txn *ovs_idl_txn,
+                  const char *iface_id,
+                  const struct ovsrec_bridge *br_int, const char *name,
+                  struct shash *existing_ports)
+{
+    for (size_t i = 0; i < br_int->n_ports; i++) {
+        if (match_port(br_int->ports[i], name)) {
+            VLOG_INFO("port already created: %s %s", iface_id, name);
+            shash_find_and_delete(existing_ports, br_int->ports[i]->name);
+            return;
+        }
+    }
+
+    ovsdb_idl_txn_add_comment(ovs_idl_txn,
+            "ovn-controller: plugging port '%s' into '%s'",
+            name, br_int->name);
+
+    struct ovsrec_interface *iface;
+    iface = ovsrec_interface_insert(ovs_idl_txn);
+    ovsrec_interface_set_name(iface, name);
+    const struct smap ids = SMAP_CONST2(
+        &ids,
+        "iface-id", iface_id,
+        "ovn-plugged", "true");
+    ovsrec_interface_set_external_ids(iface, &ids);
+
+    struct ovsrec_port *port;
+    port = ovsrec_port_insert(ovs_idl_txn);
+    ovsrec_port_set_name(port, name);
+    ovsrec_port_set_interfaces(port, &iface, 1);
+
+    struct ovsrec_port **ports;
+    ports = xmalloc(sizeof *ports * (br_int->n_ports + 1));
+    memcpy(ports, br_int->ports, sizeof *ports * br_int->n_ports);
+    ports[br_int->n_ports] = port;
+    ovsrec_bridge_verify_ports(br_int);
+    ovsrec_bridge_set_ports(br_int, ports, br_int->n_ports + 1);
+
+    free(ports);
+}
+
+static void
+remove_port(const struct ovsrec_bridge *br_int,
+            const struct ovsrec_port *port)
+{
+    for (size_t i = 0; i < br_int->n_ports; i++) {
+        if (br_int->ports[i] != port) {
+            continue;
+        }
+        struct ovsrec_port **new_ports;
+        new_ports = xmemdup(br_int->ports,
+                sizeof *new_ports * (br_int->n_ports - 1));
+        if (i != br_int->n_ports - 1) {
+            /* Removed port was not last */
+            new_ports[i] = br_int->ports[br_int->n_ports - 1];
+        }
+        ovsrec_bridge_verify_ports(br_int);
+        ovsrec_bridge_set_ports(br_int, new_ports, br_int->n_ports - 1);
+        free(new_ports);
+        ovsrec_port_delete(port);
+        return;
+    }
+}
+
+static bool
+can_plug(const char *vif_plugging)
+{
+    return !vif_plugging || !vif_plugging[0]
+           || !strcmp(vif_plugging, "true");
+}
+
+void
+plugging_run(struct ovsdb_idl_txn *ovs_idl_txn,
+             const struct sbrec_port_binding_table *port_binding_table,
+             const struct ovsrec_port_table *port_table,
+             const struct ovsrec_bridge *br_int,
+             const struct sbrec_chassis *chassis)
+{
+    if (!ovs_idl_txn) {
+        return;
+    }
+
+    /* Figure out what ports managed by OVN already exist. */
+    struct shash existing_ports = SHASH_INITIALIZER(&existing_ports);
+    const struct ovsrec_port *port;
+    OVSREC_PORT_TABLE_FOR_EACH (port, port_table) {
+        for (size_t i = 0; i < port->n_interfaces; i++) {
+            struct ovsrec_interface *iface = port->interfaces[i];
+            const char *port_iface_id;
+            if (can_plug(smap_get(&iface->external_ids, "ovn-plugged"))
+                && (port_iface_id = smap_get(&iface->external_ids,
+                                             "iface-id"))) {
+                shash_add(&existing_ports, port_iface_id, port);
+            }
+        }
+    }
+
+    /* Iterate over ports that are destined for, or already bound to, this
+     * chassis and check whether OVN management is requested.  Remove the
+     * ports that should remain plugged from 'existing_ports', so that only
+     * stale ports are left in it afterwards. */
+    const struct sbrec_port_binding *port_binding;
+    SBREC_PORT_BINDING_TABLE_FOR_EACH (port_binding,
+                                       port_binding_table)
+    {
+        VLOG_INFO("HELLO %s", port_binding->logical_port);
+        const char *vif_chassis = smap_get(&port_binding->options,
+                                           "requested-chassis");
+        const char *vif_plugging = smap_get_def(&port_binding->options,
+                                                "ovn-plugged",
+                                                "false");
+        VLOG_INFO("HELLO %s", port_binding->logical_port);
+        if (lport_can_bind_on_this_chassis(chassis, vif_chassis)
+            && can_plug(vif_plugging))
+        {
+            char keybuf[MAX_KEY_LEN];
+            const char *rep_port;
+            const char *pf_mac;
+            const char *vf_num;
+
+            if (!fill_devlink_ports_key_from_strs(
+                                    keybuf, sizeof(keybuf),
+                                    (pf_mac = smap_get(
+                                        &port_binding->options, "pf-mac")),
+                                    (vf_num = smap_get(
+                                        &port_binding->options, "vf-num"))))
+            {
+                /* Overflow, most likely incorrect input data from database */
+                VLOG_WARN("Southbound DB port plugging options out of range: "
+                          "pf-mac: '%s' vf-num: '%s'", pf_mac, vf_num);
+                continue;
+            }
+
+            shash_find_and_delete(&existing_ports, port_binding->logical_port);
+
+            rep_port = shash_find_data(&devlink_ports, keybuf);
+            if (!rep_port) {
+                VLOG_WARN("no representor port for lport '%s' (pf-mac: '%s' "
+                          "vf-num: '%s')", port_binding->logical_port,
+                          pf_mac, vf_num);
+                continue;
+            }
+            VLOG_INFO("plug %s (%s) -> %s",
+                      port_binding->logical_port, rep_port, br_int->name);
+            create_port(ovs_idl_txn, port_binding->logical_port,
+                        br_int, rep_port, &existing_ports);
+        }
+    }
+
+    /* Now 'existing_ports' only contains ports that exist in the
+     * database but shouldn't.  Delete them from the database. */
+    struct shash_node *port_node, *port_next_node;
+    SHASH_FOR_EACH_SAFE (port_node, port_next_node, &existing_ports) {
+        port = port_node->data;
+        shash_delete(&existing_ports, port_node);
+        VLOG_INFO("remove port %s", port->name);
+        remove_port(br_int, port);
+    }
+    shash_destroy(&existing_ports);
+}
+
+/* The kernel devlink-port interface provides a vendor neutral and standard way
+ * of discovering host visible resources such as MAC address of interfaces from
+ * a program running on the NIC SoC side.
+ *
+ * However a fairly recent kernel version is required for it to work, so until
+ * this is widely available we provide this helper to retrieve the same
+ * information from the interim sysfs solution. */
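+/* Illustrative sketch of the file contents this helper parses (the format is
+ * assumed from the parser below; only the line whose key begins with "MAC" is
+ * used and any other lines are ignored):
+ *
+ *   MAC: 00:53:00:00:00:42
+ */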
+static bool
+compat_get_host_pf_mac(const char *netdev_name, struct eth_addr *ea)
+{
+    char file_name[IFNAMSIZ+35+1];
+    FILE *stream;
+    char line[128];
+    bool retval = false;
+
+    snprintf(file_name, sizeof(file_name),
+             "/sys/class/net/%s/smart_nic/pf/config", netdev_name);
+    stream = fopen(file_name, "r");
+    if (!stream) {
+        VLOG_WARN("%s: open failed (%s)",
+                  file_name, ovs_strerror(errno));
+        *ea = eth_addr_zero;
+        return false;
+    }
+    while (fgets(line, sizeof(line), stream)) {
+        char key[16];
+        char *cp;
+        if (ovs_scan(line, "%15[^:]: ", key)
+            && key[0] == 'M' && key[1] == 'A' && key[2] == 'C')
+        {
+            /* strip any newline character */
+            if ((cp = strchr(line, '\n')) != NULL) {
+                *cp = '\0';
+            }
+            /* point cp at end of key + ': ', i.e. start of MAC address */
+            cp = line + strnlen(key, sizeof(key)) + 2;
+            retval = eth_addr_from_string(cp, ea);
+            break;
+        }
+    }
+    fclose(stream);
+    return retval;
+}
diff --git a/controller/plugging.h b/controller/plugging.h
new file mode 100644
index 000000000..324e5bdc0
--- /dev/null
+++ b/controller/plugging.h
@@ -0,0 +1,81 @@ 
+/* Copyright (c) 2021 Canonical
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef OVN_PLUGGING_H
+#define OVN_PLUGGING_H 1
+
+/* Interface Plugging
+ * ==================
+ *
+ * This module adds and removes ports on the integration bridge, as directed by
+ * Port_Binding options.
+ *
+ * Traditionally it has been the CMS's responsibility to create Virtual
+ * Interfaces as part of instance (Container, Pod, Virtual Machine etc.) life
+ * cycle, and subsequently manage plug/unplug operations on the Open vSwitch
+ * integration bridge.
+ *
+ * With the advent of NICs connected to multiple distinct CPUs we can have a
+ * topology where the instance runs on one host while Open vSwitch and OVN run
+ * on a different host, the smartnic CPU.
+ *
+ * The act of plugging and unplugging the representor port in Open vSwitch
+ * running on the smartnic host CPU would be the same for every smartnic
+ * variant (thanks to the devlink-port infrastructure), and every CMS. As
+ * such it is natural to extend OVN to provide this common functionality
+ * through its CMS facing API.
+ *
+ * The instance will be connected to a SR-IOV Virtual Function or a RDMA
+ * Mediated Device on the host system (the latter not currently addressed in
+ * this implementation). The NIC driver will maintain a representor port for
+ * each of the host visible devices on the smartnic side.
+ *
+ * It is the CMS's responsibility to maintain a mapping between instance host
+ * and smartnic host; OVN can help by optionally providing details such as the
+ * board serial number of the smartnic system as part of Chassis registration.
+ *
+ * The CMS will use its knowledge of the instance host <-> smartnic host
+ * mapping to set an appropriate `requested-chassis` option, along with the
+ * information OVN needs to identify the representor port, as options when
+ * creating Logical Switch Ports for instances. These options will be copied
+ * over to the Port_Binding table by ovn-northd.
+ *
+ * OVN will use the devlink interface to look up which representor port
+ * corresponds to the host visible resource and add this representor port to
+ * the integration bridge.
+ *
+ * Options API:
+ *   ovn-plugged: true
+ *   pf-mac: "00:53:00:00:00:42" // To distinguish between ports on NIC SoC
+ *   vf-num: 42 (optional)       // Refers to a logical PCI VF number
+ *                               // not specifying vf-num means plug PF
+ *                               // representor.
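+ *
+ * For illustration, a CMS could request plugging of the VF 42 representor
+ * with something along these lines (the logical port and chassis names used
+ * here are hypothetical):
+ *
+ *   ovn-nbctl lsp-set-options lsp0 requested-chassis=smartnic-chassis \
+ *       ovn-plugged=true pf-mac=00:53:00:00:00:42 vf-num=42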
+ */
+
+struct ovsdb_idl_txn;
+struct sbrec_port_binding_table;
+struct ovsrec_port_table;
+struct ovsrec_bridge;
+struct sbrec_chassis;
+
+void plugging_run(struct ovsdb_idl_txn *,
+                  const struct sbrec_port_binding_table *,
+                  const struct ovsrec_port_table *,
+                  const struct ovsrec_bridge *,
+                  const struct sbrec_chassis *);
+void plugging_init(void);
+void plugging_destroy(void);
+
+#endif /* controller/plugging.h */