
[ovs-dev,RFC] Introduce representor port plugging support

Message ID 20210509140305.1910796-1-frode.nordahl@canonical.com
State Superseded
Series [ovs-dev,RFC] Introduce representor port plugging support

Commit Message

Frode Nordahl May 9, 2021, 2:03 p.m. UTC
Introduce a plugging module that adds and removes ports on the
integration bridge, as directed by Port_Binding options.

Traditionally it has been the CMS's responsibility to create Virtual
Interfaces (VIFs) as part of the instance (Container, Pod, Virtual
Machine, etc.) life cycle, and subsequently to manage plug/unplug
operations on the Open vSwitch integration bridge.

With the advent of NICs connected to multiple distinct CPUs, we can
have a topology where the instance runs on one host while Open
vSwitch and OVN run on a different host, the smartnic CPU.

The act of plugging and unplugging the representor port in Open
vSwitch running on the smartnic host CPU would be the same for
every smartnic variant (thanks to the devlink-port[0][1]
infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
As such, it is natural to extend OVN to provide this common
functionality through its CMS-facing API.

The instance will be connected to an SR-IOV Virtual Function or an
RDMA Mediated Device on the host system (the latter not currently
addressed in this implementation). The smartnic driver will
maintain a representor port for each of the host-visible devices
on the smartnic CPU side.

It is the CMS's responsibility to maintain a mapping between the
instance host and the smartnic host; OVN can help by optionally
providing details such as the board serial number of the smartnic
system as part of Chassis registration.
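
For illustration, a board serial number suitable for this purpose is
available through devlink-info; the output below is illustrative only,
exact fields vary by driver and kernel version:

  $ devlink dev info pci/0000:03:00.0
  pci/0000:03:00.0:
    driver mlx5_core
    serial_number MT2042X00ABC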

The CMS will use its knowledge of the instance host <-> smartnic
host mapping to set an appropriate `requested-chassis` option, along
with the information OVN needs to identify the representor port,
when creating Logical Switch Ports for instances. These
options will be copied over to the Port_Binding table by
ovn-northd.
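
To sketch the intended CMS-side flow (`requested-chassis` is the
existing Logical_Switch_Port option; the representor-identifying
option names below are placeholders for illustration only, not the
final names):

  $ ovn-nbctl lsp-set-options lsp0 \
        requested-chassis=smartnic-chassis \
        plug-type=representor \
        plug-pf-mac=00:53:00:00:00:42 \
        plug-vf-num=1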

OVN will use the devlink-port[0][1] interface to look up which
representor port corresponds to the host-visible resource and will
maintain the presence of this representor port on the integration
bridge.

0: https://www.kernel.org/doc/html/latest/networking/devlink/devlink-port.html
1: https://mail.openvswitch.org/pipermail/ovs-dev/2021-May/382701.html
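
For illustration, on the smartnic CPU side the representor ports are
enumerated by devlink along these lines (output is illustrative only;
exact attributes depend on kernel and driver):

  $ devlink port show
  pci/0000:03:00.0/0: type eth netdev enp3s0f0 flavour physical port 0
  pci/0000:03:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0 pfnum 0 vfnum 0 external true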

TODO:

* Implement run-time refresh or incremental updates to
  devlink_ports hash by opening a netlink socket and monitoring
  devlink changes.

* Make use of an index for unbound ports.

* Make use of the existing Incremental Processing support for
  Port-Bindings and call into plugging module from the binding
  module instead of having a separate run loop.

* Implement functionality to convey smartnic identifying
  characteristics as external-ids on Chassis registration. The
  intention is to help CMS map relationships between hosts. Source
  can be board serial number obtained from devlink-info, PCI VPD,
  or a value stored in the local Open vSwitch database maintained
  by the operator.

* Write tests.

* Update documentation.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 controller/automake.mk      |   2 +
 controller/binding.c        |  17 +-
 controller/lport.c          |   9 +
 controller/lport.h          |   2 +
 controller/ovn-controller.c |  12 +
 controller/plugging.c       | 422 ++++++++++++++++++++++++++++++++++++
 controller/plugging.h       |  81 +++++++
 7 files changed, 532 insertions(+), 13 deletions(-)
 create mode 100644 controller/plugging.c
 create mode 100644 controller/plugging.h

Comments

Ilya Maximets May 13, 2021, 3:12 p.m. UTC | #1
On 5/9/21 4:03 PM, Frode Nordahl wrote:
> Introduce plugging module that adds and removes ports on the
> integration bridge, as directed by Port_Binding options.
> 
> Traditionally it has been the CMSs responsibility to create Virtual
> Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> Machine etc.) life cycle, and subsequently manage plug/unplug
> operations on the Open vSwitch integration bridge.
> 
> With the advent of NICs connected to multiple distinct CPUs we can
> have a topology where the instance runs on one host and Open
> vSwitch and OVN runs on a different host, the smartnic CPU.
> 
> The act of plugging and unplugging the representor port in Open
> vSwitch running on the smartnic host CPU would be the same for
> every smartnic variant (thanks to the devlink-port[0][1]
> infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> As such it is natural to extend OVN to provide this common
> functionality through its CMS facing API.

Hi, Frode.  Thanks for putting this together, but it doesn't look
natural to me.  OVN, AFAIK, never touched physical devices or
interacted with the kernel directly.  This change introduces completely
new functionality inside OVN.  With the same effect we can run a fully
separate service on these smartnic CPUs that will do the plugging
and configuration job for the CMS.  You may even make it independent
from a particular CMS by creating a REST API for it or whatever.
This will additionally allow using the same service for non-OVN setups.

Interactions with physical devices also make OVN Linux-dependent,
at least for this use case, IIUC.

Maybe others have different opinions.

Another thought is that there is, obviously, a network connection
between the host and smartnic system.  Maybe it's possible to just
add an extra remote to the local ovsdb-server so CMS daemon on the
host system could just add interfaces over the network connection?

Best regards, Ilya Maximets.
Frode Nordahl May 13, 2021, 4:25 p.m. UTC | #2
On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > Introduce plugging module that adds and removes ports on the
> > integration bridge, as directed by Port_Binding options.
> >
> > Traditionally it has been the CMSs responsibility to create Virtual
> > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > Machine etc.) life cycle, and subsequently manage plug/unplug
> > operations on the Open vSwitch integration bridge.
> >
> > With the advent of NICs connected to multiple distinct CPUs we can
> > have a topology where the instance runs on one host and Open
> > vSwitch and OVN runs on a different host, the smartnic CPU.
> >
> > The act of plugging and unplugging the representor port in Open
> > vSwitch running on the smartnic host CPU would be the same for
> > every smartnic variant (thanks to the devlink-port[0][1]
> > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > As such it is natural to extend OVN to provide this common
> > functionality through its CMS facing API.
>
> Hi, Frode.  Thanks for putting this together, but it doesn't look
> natural to me.  OVN, AFAIK, never touched physical devices or
> interacted with the kernel directly.  This change introduces completely
> new functionality inside OVN.  With the same effect we can run a fully
> separate service on these smartnic CPUs that will do plugging
> and configuration job for CMS.  You may even make it independent
> from a particular CMS by creating a REST API for it or whatever.
> This will additionally allow using same service for non-OVN setups.

Ilya,

Thank you for taking the time to comment, much appreciated.

Yes, this is new functionality; NICs with separate control plane CPUs
and isolation from the host are also new, so this is one proposal for
how we could go about enabling the use of them.

The OVN controller does today get pretty close to the physical realm
by maintaining patch ports in Open vSwitch based on bridge mapping
configuration and the presence of bridges to physical interfaces. It also
reacts to events of physical interfaces being plugged into the
Open vSwitch instance it manages, albeit to date some other entity has
been doing the act of adding the port to the bridge.

The rationale for proposing to use the OVN database for coordinating
this is that the information about which ports to bind, and where to
bind them is already there. The timing of the information flow from
the CMS is also suitable for the task.

OVN relies on OVS library code, and all the necessary libraries for
interfacing with the kernel through netlink and friends are there or
would be easy to add. The rationale for using the netlink-devlink
interface is that it provides a generic infrastructure for these types
of NICs. So by using this interface we should be able to support most
if not all of the variants of these cards.


Providing a separate OVN service to do the task could work, but would
have the cost of an extra SB DB connection, IDL and monitors.

I fear it would be quite hard to build a whole separate project with
its own API; it feels like a lot of duplicated effort when the flow of
data and APIs in OVN already align so well with the CMSs interested in
using this.

> Interactions with physical devices also makes OVN linux-dependent
> at least for this use case, IIUC.

This specific bit would be Linux-specific in the first iteration, yes.
But the vendors manufacturing and distributing the hardware often
have drivers for other platforms, and I am sure the necessary
infrastructure will become available there too over time, if it is not
there already.

We do currently have platform-specific macros in the OVN build system,
so we could enable the functionality when built on a compatible
platform.

> Maybe, others has different opinions.

I appreciate your opinion, and enjoy discussing this topic.

> Another though is that there is, obviously, a network connection
> between the host and smartnic system.  Maybe it's possible to just
> add an extra remote to the local ovsdb-server so CMS daemon on the
> host system could just add interfaces over the network connection?

There are a few issues with such an approach. One of the main goals
of providing and using a NIC with control plane CPUs is having an
extra layer of security and isolation which is separate from the
hypervisor host the card happens to share a PCI complex with and draw
power from. Requiring a connection between the two for operation would
defeat this purpose.

In addition to that, this class of cards provides visibility into
kernel interfaces, enumeration of representor ports, etc. only from the
NIC control plane CPU side of the PCI complex; this information is not
provided to the host. So if a hypervisor host CMS agent were to do the
plugging through a remote ovsdb connection, it would have to
communicate with something else running on the NIC control plane CPU
to retrieve the information it needs before it can know what to relay
back over the ovsdb connection.

--
Frode Nordahl

> Best regards, Ilya Maximets.
Han Zhou June 10, 2021, 6:36 a.m. UTC | #3
On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com>
wrote:
>
> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > Introduce plugging module that adds and removes ports on the
> > > integration bridge, as directed by Port_Binding options.
> > >
> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > operations on the Open vSwitch integration bridge.
> > >
> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > have a topology where the instance runs on one host and Open
> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > >
> > > The act of plugging and unplugging the representor port in Open
> > > vSwitch running on the smartnic host CPU would be the same for
> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > As such it is natural to extend OVN to provide this common
> > > functionality through its CMS facing API.
> >
> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > natural to me.  OVN, AFAIK, never touched physical devices or
> > interacted with the kernel directly.  This change introduces completely
> > new functionality inside OVN.  With the same effect we can run a fully
> > separate service on these smartnic CPUs that will do plugging
> > and configuration job for CMS.  You may even make it independent
> > from a particular CMS by creating a REST API for it or whatever.
> > This will additionally allow using same service for non-OVN setups.
>
> Ilya,
>
> Thank you for taking the time to comment, much appreciated.
>
> Yes, this is new functionality, NICs with separate control plane CPUs
> and isolation from the host are also new, so this is one proposal for
> how we could go about to enable the use of them.
>
> The OVN controller does today get pretty close to the physical realm
> by maintaining patch ports in Open vSwitch based on bridge mapping
> configuration and presence of bridges to physical interfaces. It also
> does react to events of physical interfaces being plugged into the
> Open vSwitch instance it manages, albeit to date some other entity has
> been doing the act of adding the port into the bridge.
>
> The rationale for proposing to use the OVN database for coordinating
> this is that the information about which ports to bind, and where to
> bind them is already there. The timing of the information flow from
> the CMS is also suitable for the task.
>
> OVN relies on OVS library code, and all the necessary libraries for
> interfacing with the kernel through netlink and friends are there or
> would be easy to add. The rationale for using the netlink-devlink
> interface is that it provides a generic infrastructure for these types
> of NICs. So by using this interface we should be able to support most
> if not all of the variants of these cards.
>
>
> Providing a separate OVN service to do the task could work, but would
> have the cost of an extra SB DB connection, IDL and monitors.
>
> I fear it would be quite hard to build a whole separate project with
> its own API, feels like a lot of duplicated effort when the flow of
> data and APIs in OVN already align so well with CMSs interested in
> using this?
>
> > Interactions with physical devices also makes OVN linux-dependent
> > at least for this use case, IIUC.
>
> This specific bit would be linux-specific in the first iteration, yes.
> But the vendors manufacturing and distributing the hardware do often
> have drivers for other platforms, I am sure the necessary
> infrastructure will become available there too over time, if it is not
> there already.
>
> We do currently have platform specific macros in the OVN build system,
> so we could enable the functionality when built on a compatible
> platform.
>
> > Maybe, others has different opinions.
>
> I appreciate your opinion, and enjoy discussing this topic.
>
> > Another though is that there is, obviously, a network connection
> > between the host and smartnic system.  Maybe it's possible to just
> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > host system could just add interfaces over the network connection?
>
> There are a few issues with such an approach. One of the main goals
> with providing and using a NIC with control plane CPUs is having an
> extra layer of security and isolation which is separate from the
> hypervisor host the card happens to share a PCI complex with and draw
> power from. Requiring a connection between the two for operation would
> defy this purpose.
>
> In addition to that, this class of cards provide visibility into
> kernel interfaces, enumeration of representor ports etc. only from the
> NIC control plane CPU side of the PCI complex, this information is not
> provided to the host. So if a hypervisor host CMS agent were to do the
> plugging through a remote ovsdb connection, it would have to
> communicate with something else running on the NIC control plane CPU
> to retrieve the information it needs before it can know what to relay
> back over the ovsdb connection.
>
> --
> Frode Nordahl
>
> > Best regards, Ilya Maximets.

Here are my 2 cents.

Initially I had similar concerns to Ilya, and it seems OVN should stay away
from the physical interface plugging. As a reference, here is how
ovn-kubernetes is doing it without adding anything to OVN:
https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing

However, thinking more about it, the proposed approach in this patch just
expands the way OVN can bind ports, utilizing the communication channel
of OVN (OVSDB connections). If all the information regarding port binding
can be specified by the CMS from the NB, then it is not unnatural for
ovn-controller to perform interface binding directly (instead of passively
accepting what is attached by the CMS). This kind of information already
exists to some extent - the "requested_chassis" option in OpenStack. Now
it seems this idea is just expanding it to a specific interface. The
difference is that "requested_chassis" is used for validation only, but now
we want to directly apply it. So I think, at least, I don't have a strong
opinion on the idea.

There are some benefits:
1) The mechanism can be reused by different CMSes, which may simplify CMS
implementation.
2) Compared with the ovn-k8s approach, it reuses OVN's communication
channel, which avoids an extra CMS communication channel on the smart NIC
side. (of course this can be achieved by a connection between the BM and
smart NIC with *restricted* API just to convey the necessary information)

As to the negative side, it would increase OVN's complexity and, as
mentioned by Ilya, potentially break OVN's platform independence. To avoid
this, I think the *plugging* module itself needs to be independent and
pluggable. It could be extended via independent plugins. A plugin would need
to define what information is needed in the LSP's "options", and then implement
the corresponding driver. With this approach, even regular VIFs could be
attached by ovn-controller if the CMS can tell it the interface name. Anyway, this
is just my brief thinking.

Thanks,
Han
Ilya Maximets June 10, 2021, 11:46 a.m. UTC | #4
On 6/10/21 8:36 AM, Han Zhou wrote:
> 
> 
> On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
>>
>> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>> >
>> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
>> > > Introduce plugging module that adds and removes ports on the
>> > > integration bridge, as directed by Port_Binding options.
>> > >
>> > > Traditionally it has been the CMSs responsibility to create Virtual
>> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
>> > > Machine etc.) life cycle, and subsequently manage plug/unplug
>> > > operations on the Open vSwitch integration bridge.
>> > >
>> > > With the advent of NICs connected to multiple distinct CPUs we can
>> > > have a topology where the instance runs on one host and Open
>> > > vSwitch and OVN runs on a different host, the smartnic CPU.
>> > >
>> > > The act of plugging and unplugging the representor port in Open
>> > > vSwitch running on the smartnic host CPU would be the same for
>> > > every smartnic variant (thanks to the devlink-port[0][1]
>> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
>> > > As such it is natural to extend OVN to provide this common
>> > > functionality through its CMS facing API.
>> >
>> > Hi, Frode.  Thanks for putting this together, but it doesn't look
>> > natural to me.  OVN, AFAIK, never touched physical devices or
>> > interacted with the kernel directly.  This change introduces completely
>> > new functionality inside OVN.  With the same effect we can run a fully
>> > separate service on these smartnic CPUs that will do plugging
>> > and configuration job for CMS.  You may even make it independent
>> > from a particular CMS by creating a REST API for it or whatever.
>> > This will additionally allow using same service for non-OVN setups.
>>
>> Ilya,
>>
>> Thank you for taking the time to comment, much appreciated.
>>
>> Yes, this is new functionality, NICs with separate control plane CPUs
>> and isolation from the host are also new, so this is one proposal for
>> how we could go about to enable the use of them.
>>
>> The OVN controller does today get pretty close to the physical realm
>> by maintaining patch ports in Open vSwitch based on bridge mapping
>> configuration and presence of bridges to physical interfaces. It also
>> does react to events of physical interfaces being plugged into the
>> Open vSwitch instance it manages, albeit to date some other entity has
>> been doing the act of adding the port into the bridge.
>>
>> The rationale for proposing to use the OVN database for coordinating
>> this is that the information about which ports to bind, and where to
>> bind them is already there. The timing of the information flow from
>> the CMS is also suitable for the task.
>>
>> OVN relies on OVS library code, and all the necessary libraries for
>> interfacing with the kernel through netlink and friends are there or
>> would be easy to add. The rationale for using the netlink-devlink
>> interface is that it provides a generic infrastructure for these types
>> of NICs. So by using this interface we should be able to support most
>> if not all of the variants of these cards.
>>
>>
>> Providing a separate OVN service to do the task could work, but would
>> have the cost of an extra SB DB connection, IDL and monitors.

IMHO, the CMS should never connect to the Southbound DB.  The
Southbound DB is not meant to be a public interface; it just happened
to be available for connections.  I know that OpenStack has metadata
agents that connect to the Sb DB, but if it's really required for them, I
think there should be a different way to get/set the required information
without a connection to the Southbound.

>>
>> I fear it would be quite hard to build a whole separate project with
>> its own API, feels like a lot of duplicated effort when the flow of
>> data and APIs in OVN already align so well with CMSs interested in
>> using this?
>>
>> > Interactions with physical devices also makes OVN linux-dependent
>> > at least for this use case, IIUC.
>>
>> This specific bit would be linux-specific in the first iteration, yes.
>> But the vendors manufacturing and distributing the hardware do often
>> have drivers for other platforms, I am sure the necessary
>> infrastructure will become available there too over time, if it is not
>> there already.
>>
>> We do currently have platform specific macros in the OVN build system,
>> so we could enable the functionality when built on a compatible
>> platform.
>>
>> > Maybe, others has different opinions.
>>
>> I appreciate your opinion, and enjoy discussing this topic.
>>
>> > Another though is that there is, obviously, a network connection
>> > between the host and smartnic system.  Maybe it's possible to just
>> > add an extra remote to the local ovsdb-server so CMS daemon on the
>> > host system could just add interfaces over the network connection?
>>
>> There are a few issues with such an approach. One of the main goals
>> with providing and using a NIC with control plane CPUs is having an
>> extra layer of security and isolation which is separate from the
>> hypervisor host the card happens to share a PCI complex with and draw
>> power from. Requiring a connection between the two for operation would
>> defy this purpose.
>>
>> In addition to that, this class of cards provide visibility into
>> kernel interfaces, enumeration of representor ports etc. only from the
>> NIC control plane CPU side of the PCI complex, this information is not
>> provided to the host. So if a hypervisor host CMS agent were to do the
>> plugging through a remote ovsdb connection, it would have to
>> communicate with something else running on the NIC control plane CPU
>> to retrieve the information it needs before it can know what to relay
>> back over the ovsdb connection.
>>
>> --
>> Frode Nordahl
>>
>> > Best regards, Ilya Maximets.
> 
> Here are my 2 cents.
> 
> Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>

AFAICT, a big part of the work is already done on the ovn-k8s side:
  https://github.com/ovn-org/ovn-kubernetes/pull/2005
  https://github.com/ovn-org/ovn-kubernetes/pull/2042

> 
> However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.

While it's probably OK for OVN to add a port to the OVSDB, in many
cases these ports will require a lot of extra configuration which
is typically done by os-vif or CNI/device plugins.  Imagine that OVS
is running with the userspace datapath and you need to plug in some DPDK
ports, where you have to specify the port type, the DPDK port config,
and probably also the number of rx/tx queues and the number of descriptors
in these queues.  You may also want to configure affinity for these queues
per PMD thread in OVS.  For kernel interfaces it might be easier, but they
also might require some extra configuration that OVN will have to
think about now.  Configuring this kind of stuff is a typical job for the
CMS, and that is why projects like os-vif and the large variety
of CNI/device plugins exist.
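
For example, plugging a single userspace port today can involve
something like this (values purely illustrative):

  ovs-vsctl add-port br-int dpdk-p0 -- \
      set Interface dpdk-p0 type=dpdk \
          options:dpdk-devargs=0000:03:00.0 \
          options:n_rxq=2 \
          options:n_rxq_desc=2048 options:n_txq_desc=2048 \
          other_config:pmd-rxq-affinity="0:3,1:7"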

> 
> There are some benefits:
> 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)

The problem I see is that at least ovn-k8s, AFAIU, will require
the daemon on the Smart NIC anyway to monitor and configure
OVN components, e.g. to configure ovn-remote for ovn-controller
or run management appctl commands if required.
So, I don't see why it can't do the plugging if it's already
there.

> 
> As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.

Aside from plugging,
I also don't see the reason to have devlink code in OVN just
because it runs once at the init stage, IIUC.  So, I don't
understand why this information about the hardware cannot be
retrieved during host provisioning and stored somewhere
(Nova config?).  Isn't hardware inventory a job for tripleo
or something?

Best regards, Ilya Maximets.
Frode Nordahl June 10, 2021, 2:13 p.m. UTC | #5
On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/10/21 8:36 AM, Han Zhou wrote:
> >
> >
> > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> >>
> >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> >> >
> >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> >> > > Introduce plugging module that adds and removes ports on the
> >> > > integration bridge, as directed by Port_Binding options.
> >> > >
> >> > > Traditionally it has been the CMSs responsibility to create Virtual
> >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> >> > > operations on the Open vSwitch integration bridge.
> >> > >
> >> > > With the advent of NICs connected to multiple distinct CPUs we can
> >> > > have a topology where the instance runs on one host and Open
> >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> >> > >
> >> > > The act of plugging and unplugging the representor port in Open
> >> > > vSwitch running on the smartnic host CPU would be the same for
> >> > > every smartnic variant (thanks to the devlink-port[0][1]
> >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> >> > > As such it is natural to extend OVN to provide this common
> >> > > functionality through its CMS facing API.
> >> >
> >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> >> > natural to me.  OVN, AFAIK, never touched physical devices or
> >> > interacted with the kernel directly.  This change introduces completely
> >> > new functionality inside OVN.  With the same effect we can run a fully
> >> > separate service on these smartnic CPUs that will do plugging
> >> > and configuration job for CMS.  You may even make it independent
> >> > from a particular CMS by creating a REST API for it or whatever.
> >> > This will additionally allow using same service for non-OVN setups.
> >>
> >> Ilya,
> >>
> >> Thank you for taking the time to comment, much appreciated.
> >>
> >> Yes, this is new functionality, NICs with separate control plane CPUs
> >> and isolation from the host are also new, so this is one proposal for
> >> how we could go about to enable the use of them.
> >>
> >> The OVN controller does today get pretty close to the physical realm
> >> by maintaining patch ports in Open vSwitch based on bridge mapping
> >> configuration and presence of bridges to physical interfaces. It also
> >> does react to events of physical interfaces being plugged into the
> >> Open vSwitch instance it manages, albeit to date some other entity has
> >> been doing the act of adding the port into the bridge.
> >>
> >> The rationale for proposing to use the OVN database for coordinating
> >> this is that the information about which ports to bind, and where to
> >> bind them is already there. The timing of the information flow from
> >> the CMS is also suitable for the task.
> >>
> >> OVN relies on OVS library code, and all the necessary libraries for
> >> interfacing with the kernel through netlink and friends are there or
> >> would be easy to add. The rationale for using the netlink-devlink
> >> interface is that it provides a generic infrastructure for these types
> >> of NICs. So by using this interface we should be able to support most
> >> if not all of the variants of these cards.
> >>
> >>
> >> Providing a separate OVN service to do the task could work, but would
> >> have the cost of an extra SB DB connection, IDL and monitors.
>
> IMHO, CMS should never connect to Southbound DB.  It's just because the
> Southbound DB is not meant to be a public interface, it just happened
> to be available for connections.  I know that OpenStack has metadata
> agents that connects to Sb DB, but if it's really required for them, I
> think, there should be a different way to get/set required information
> without connection to the Southbound.

The CMS-facing API is the Northbound DB; I was not suggesting direct
use of the Southbound DB by services external to OVN. My suggestion
was to have a separate OVN process do this if your objection was to
handling it as part of the ovn-controller process.

> >>
> >> I fear it would be quite hard to build a whole separate project with
> >> its own API, feels like a lot of duplicated effort when the flow of
> >> data and APIs in OVN already align so well with CMSs interested in
> >> using this?
> >>
> >> > Interactions with physical devices also makes OVN linux-dependent
> >> > at least for this use case, IIUC.
> >>
> >> This specific bit would be linux-specific in the first iteration, yes.
> >> But the vendors manufacturing and distributing the hardware do often
> >> have drivers for other platforms, I am sure the necessary
> >> infrastructure will become available there too over time, if it is not
> >> there already.
> >>
> >> We do currently have platform specific macros in the OVN build system,
> >> so we could enable the functionality when built on a compatible
> >> platform.
> >>
> >> > Maybe, others has different opinions.
> >>
> >> I appreciate your opinion, and enjoy discussing this topic.
> >>
> >> > Another though is that there is, obviously, a network connection
> >> > between the host and smartnic system.  Maybe it's possible to just
> >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> >> > host system could just add interfaces over the network connection?
> >>
> >> There are a few issues with such an approach. One of the main goals
> >> with providing and using a NIC with control plane CPUs is having an
> >> extra layer of security and isolation which is separate from the
> >> hypervisor host the card happens to share a PCI complex with and draw
> >> power from. Requiring a connection between the two for operation would
> >> defy this purpose.
> >>
> >> In addition to that, this class of cards provide visibility into
> >> kernel interfaces, enumeration of representor ports etc. only from the
> >> NIC control plane CPU side of the PCI complex, this information is not
> >> provided to the host. So if a hypervisor host CMS agent were to do the
> >> plugging through a remote ovsdb connection, it would have to
> >> communicate with something else running on the NIC control plane CPU
> >> to retrieve the information it needs before it can know what to relay
> >> back over the ovsdb connection.
> >>
> >> --
> >> Frode Nordahl
> >>
> >> > Best regards, Ilya Maximets.
> >
> > Here are my 2 cents.
> >
> > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
>
> AFAICT, a big part of the work is already done on the ovn-k8s side:
>   https://github.com/ovn-org/ovn-kubernetes/pull/2005
>   https://github.com/ovn-org/ovn-kubernetes/pull/2042

I am aware of the on-going effort to implement support for this in
ovn-kubernetes directly. What we have identified is that there are
other CMSs that want this functionality, and with that we have an
opportunity to generalise and provide an abstraction in a common place
that all the consuming CMSes can benefit from.

> >
> > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
>
> While it's, probably, OK for OVN to add port to the OVSDB, in many
> cases these ports will require a lot of extra configuration which
> is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> is running with userspace datapath and you need to plug-in some DPDK
> ports, where you have to specify the port type, DPDK port config,
> porbably also number of rx/tx queues, number of descriptors in these
> queues.  You may also want to configure affinity for these queues per
> PMD thread in OVS.  For kernel interfaces it might be easier, but they
> also might require some extra configuration that OVN will have to
> think about now.  This is a typical job for CMS to configure this
> kind of stuff, and that is why projects like os-vif or large variety
> of CNI/device plugins exists.

CNI is Kubernetes-specific and os-vif is OpenStack-specific, and both of
them get their information from the CMS. Providing support for the
userspace datapath and DPDK would require more information, some of
which is available through devlink and some of which fits well in key/value
options. Our initial target would be to support the kernel representor
port workflow.

> >
> > There are some benefits:
> > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
>
> The problem I see is that at least ovn-k8s, AFAIU, will require
> the daemon on the Smart NIC anyways to monitor and configure
> OVN components, e.g. to configure ovn-remote for ovn-controller
> or run management appctl commands if required.
> So, I don't see the point why it can't do plugging if it's already
> there.

This pattern is Kubernetes-specific and is not the case for other
CMSes. The current proposal for enabling Smart NICs with control plane
CPUs for ovn-kubernetes could be simplified if the networking platform
provided means for coordinating more of the network-related bits.

> >
> > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
>
> Aside from plugging,
> I also don't see the reason to have devlink code in OVN just
> because it runs once on the init stage, IIUC.  So, I don't
> understand why this information about the hardware cannot be
> retrieved during the host provisioning and stored somewhere
> (Nova config?).  Isn't hardware inventory a job for tripleo
> or something?

As noted in one of the TODOs in the commit message of the RFC, one of
the work items to further develop this is to monitor devlink for
changes; the end product will not do one-time initialization.

While I agree with you that the current SR-IOV acceleration workflow
configuration is pretty static and can be done at deploy time, this
proposal prepares for the next-generation subfunction workflow, where
you will have much higher density and run-time configuration and
discovery of representor ports. There is a paper about it from netdev
conf[2], and all of this is part of the devlink infrastructure (Linux
kernel ABI).

2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
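
For illustration, with subfunctions a port is created and wired up at
run time along these lines (adapted from the kernel devlink-port
documentation; values are illustrative):

  $ devlink port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88
  $ devlink port function set pci/0000:03:00.0/32768 hw_addr 00:00:00:00:88:88 state active
  $ devlink port show pci/0000:03:00.0/32768
  pci/0000:03:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88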


--
Frode Nordahl

> Best regards, Ilya Maximets.
Numan Siddique June 29, 2021, 10:32 p.m. UTC | #6
On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
<frode.nordahl@canonical.com> wrote:
>
> On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > On 6/10/21 8:36 AM, Han Zhou wrote:
> > >
> > >
> > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > >>
> > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > >> >
> > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > >> > > Introduce plugging module that adds and removes ports on the
> > >> > > integration bridge, as directed by Port_Binding options.
> > >> > >
> > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > >> > > operations on the Open vSwitch integration bridge.
> > >> > >
> > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > >> > > have a topology where the instance runs on one host and Open
> > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > >> > >
> > >> > > The act of plugging and unplugging the representor port in Open
> > >> > > vSwitch running on the smartnic host CPU would be the same for
> > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > >> > > As such it is natural to extend OVN to provide this common
> > >> > > functionality through its CMS facing API.
> > >> >
> > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > >> > interacted with the kernel directly.  This change introduces completely
> > >> > new functionality inside OVN.  With the same effect we can run a fully
> > >> > separate service on these smartnic CPUs that will do plugging
> > >> > and configuration job for CMS.  You may even make it independent
> > >> > from a particular CMS by creating a REST API for it or whatever.
> > >> > This will additionally allow using same service for non-OVN setups.
> > >>
> > >> Ilya,
> > >>
> > >> Thank you for taking the time to comment, much appreciated.
> > >>
> > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > >> and isolation from the host are also new, so this is one proposal for
> > >> how we could go about to enable the use of them.
> > >>
> > >> The OVN controller does today get pretty close to the physical realm
> > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > >> configuration and presence of bridges to physical interfaces. It also
> > >> does react to events of physical interfaces being plugged into the
> > >> Open vSwitch instance it manages, albeit to date some other entity has
> > >> been doing the act of adding the port into the bridge.
> > >>
> > >> The rationale for proposing to use the OVN database for coordinating
> > >> this is that the information about which ports to bind, and where to
> > >> bind them is already there. The timing of the information flow from
> > >> the CMS is also suitable for the task.
> > >>
> > >> OVN relies on OVS library code, and all the necessary libraries for
> > >> interfacing with the kernel through netlink and friends are there or
> > >> would be easy to add. The rationale for using the netlink-devlink
> > >> interface is that it provides a generic infrastructure for these types
> > >> of NICs. So by using this interface we should be able to support most
> > >> if not all of the variants of these cards.
> > >>
> > >>
> > >> Providing a separate OVN service to do the task could work, but would
> > >> have the cost of an extra SB DB connection, IDL and monitors.
> >
> > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > Southbound DB is not meant to be a public interface, it just happened
> > to be available for connections.  I know that OpenStack has metadata
> > agents that connects to Sb DB, but if it's really required for them, I
> > think, there should be a different way to get/set required information
> > without connection to the Southbound.
>
> The CMS-facing API is the Northbound DB, I was not suggesting direct
> use of the Southbound DB by external to OVN services. My suggestion
> was to have a separate OVN process do this if your objection was to
> handle it as part of the ovn-controller process.
>
> > >>
> > >> I fear it would be quite hard to build a whole separate project with
> > >> its own API, feels like a lot of duplicated effort when the flow of
> > >> data and APIs in OVN already align so well with CMSs interested in
> > >> using this?
> > >>
> > >> > Interactions with physical devices also makes OVN linux-dependent
> > >> > at least for this use case, IIUC.
> > >>
> > >> This specific bit would be linux-specific in the first iteration, yes.
> > >> But the vendors manufacturing and distributing the hardware do often
> > >> have drivers for other platforms, I am sure the necessary
> > >> infrastructure will become available there too over time, if it is not
> > >> there already.
> > >>
> > >> We do currently have platform specific macros in the OVN build system,
> > >> so we could enable the functionality when built on a compatible
> > >> platform.
> > >>
> > >> > Maybe, others has different opinions.
> > >>
> > >> I appreciate your opinion, and enjoy discussing this topic.
> > >>
> > >> > Another though is that there is, obviously, a network connection
> > >> > between the host and smartnic system.  Maybe it's possible to just
> > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > >> > host system could just add interfaces over the network connection?
> > >>
> > >> There are a few issues with such an approach. One of the main goals
> > >> with providing and using a NIC with control plane CPUs is having an
> > >> extra layer of security and isolation which is separate from the
> > >> hypervisor host the card happens to share a PCI complex with and draw
> > >> power from. Requiring a connection between the two for operation would
> > >> defy this purpose.
> > >>
> > >> In addition to that, this class of cards provide visibility into
> > >> kernel interfaces, enumeration of representor ports etc. only from the
> > >> NIC control plane CPU side of the PCI complex, this information is not
> > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > >> plugging through a remote ovsdb connection, it would have to
> > >> communicate with something else running on the NIC control plane CPU
> > >> to retrieve the information it needs before it can know what to relay
> > >> back over the ovsdb connection.
> > >>
> > >> --
> > >> Frode Nordahl
> > >>
> > >> > Best regards, Ilya Maximets.
> > >
> > > Here are my 2 cents.
> > >
> > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> >
> > AFAICT, a big part of the work is already done on the ovn-k8s side:
> >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
>
> I am aware of the on-going effort to implement support for this in
> ovn-kubernetes directly. What we have identified is that there are
> other CMSs that want this functionality, and with that we have an
> opportunity to generalise and provide an abstraction in a common place
> that all the consuming CMSes can benefit from.
>
> > >
> > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> >
> > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > cases these ports will require a lot of extra configuration which
> > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > is running with userspace datapath and you need to plug-in some DPDK
> > ports, where you have to specify the port type, DPDK port config,
> > porbably also number of rx/tx queues, number of descriptors in these
> > queues.  You may also want to configure affinity for these queues per
> > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > also might require some extra configuration that OVN will have to
> > think about now.  This is a typical job for CMS to configure this
> > kind of stuff, and that is why projects like os-vif or large variety
> > of CNI/device plugins exists.
>
> CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> them get their information from the CMS. Providing support for
> userspace datapath and DPDK would require more information, some of
> which is available through devlink, some fit well in key/value
> options. Our initial target would be to support the kernel representor
> port workflow.
>
> > >
> > > There are some benefits:
> > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> >
> > The problem I see is that at least ovn-k8s, AFAIU, will require
> > the daemon on the Smart NIC anyways to monitor and configure
> > OVN components, e.g. to configure ovn-remote for ovn-controller
> > or run management appctl commands if required.
> > So, I don't see the point why it can't do plugging if it's already
> > there.
>
> This pattern is Kubernetes specific and is not the case for other
> CMSes. The current proposal for enabling Smart NIC with control plane
> CPUs for ovn-kubernetes could be simplified if the networking platform
> provided means for coordinating more of the network related bits.
>
> > >
> > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> >
> > Aside from plugging,
> > I also don't see the reason to have devlink code in OVN just
> > because it runs once on the init stage, IIUC.  So, I don't
> > understand why this information about the hardware cannot be
> > retrieved during the host provisioning and stored somewhere
> > (Nova config?).  Isn't hardware inventory a job for tripleo
> > or something?
>
> As noted in one of the TODOs in the commit message of the RFC one of
> the work items to further develop this is to monitor devlink for
> changes, the end product will not do one-time initialization.
>
> While I agree with you that the current SR-IOV acceleration workflow
> configuration is pretty static and can be done at deploy time, this
> proposal prepares for the next generation subfunction workflow where
> you will have a much higher density, and run-time configuration and
> discovery of representor ports. There is a paper about it from netdev
> conf[2], and all of this is part of the devlink infrastructure (Linux
> kernel ABI).
>
> 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf


My 2 cents.

I agree with Ilya.  It doesn't seem natural to me for OVN to create
OVS ports and also to use the devlink infrastructure.  I think an
external entity can do all of this.

The other downside I see with this RFC is that it will now be
monitoring all the Port_Binding rows without any condition, because
when a port binding is created the chassis column will be NULL.
Correct me if I'm wrong here.

Perhaps there can be a separate project called ovn-vif which does this?

Thanks
Numan

>
>
> --
> Frode Nordahl
>
> > Best regards, Ilya Maximets.
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Frode Nordahl July 5, 2021, 4:09 p.m. UTC | #7
On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
>
> On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> <frode.nordahl@canonical.com> wrote:
> >
> > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >
> > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > >
> > > >
> > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > >>
> > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > >> >
> > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > >> > > Introduce plugging module that adds and removes ports on the
> > > >> > > integration bridge, as directed by Port_Binding options.
> > > >> > >
> > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > >> > > operations on the Open vSwitch integration bridge.
> > > >> > >
> > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > >> > > have a topology where the instance runs on one host and Open
> > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > >> > >
> > > >> > > The act of plugging and unplugging the representor port in Open
> > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > >> > > As such it is natural to extend OVN to provide this common
> > > >> > > functionality through its CMS facing API.
> > > >> >
> > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > >> > interacted with the kernel directly.  This change introduces completely
> > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > >> > separate service on these smartnic CPUs that will do plugging
> > > >> > and configuration job for CMS.  You may even make it independent
> > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > >> > This will additionally allow using same service for non-OVN setups.
> > > >>
> > > >> Ilya,
> > > >>
> > > >> Thank you for taking the time to comment, much appreciated.
> > > >>
> > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > >> and isolation from the host are also new, so this is one proposal for
> > > >> how we could go about to enable the use of them.
> > > >>
> > > >> The OVN controller does today get pretty close to the physical realm
> > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > >> configuration and presence of bridges to physical interfaces. It also
> > > >> does react to events of physical interfaces being plugged into the
> > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > >> been doing the act of adding the port into the bridge.
> > > >>
> > > >> The rationale for proposing to use the OVN database for coordinating
> > > >> this is that the information about which ports to bind, and where to
> > > >> bind them is already there. The timing of the information flow from
> > > >> the CMS is also suitable for the task.
> > > >>
> > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > >> interfacing with the kernel through netlink and friends are there or
> > > >> would be easy to add. The rationale for using the netlink-devlink
> > > >> interface is that it provides a generic infrastructure for these types
> > > >> of NICs. So by using this interface we should be able to support most
> > > >> if not all of the variants of these cards.
> > > >>
> > > >>
> > > >> Providing a separate OVN service to do the task could work, but would
> > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > >
> > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > Southbound DB is not meant to be a public interface, it just happened
> > > to be available for connections.  I know that OpenStack has metadata
> > > agents that connects to Sb DB, but if it's really required for them, I
> > > think, there should be a different way to get/set required information
> > > without connection to the Southbound.
> >
> > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > use of the Southbound DB by external to OVN services. My suggestion
> > was to have a separate OVN process do this if your objection was to
> > handle it as part of the ovn-controller process.
> >
> > > >>
> > > >> I fear it would be quite hard to build a whole separate project with
> > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > >> data and APIs in OVN already align so well with CMSs interested in
> > > >> using this?
> > > >>
> > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > >> > at least for this use case, IIUC.
> > > >>
> > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > >> But the vendors manufacturing and distributing the hardware do often
> > > >> have drivers for other platforms, I am sure the necessary
> > > >> infrastructure will become available there too over time, if it is not
> > > >> there already.
> > > >>
> > > >> We do currently have platform specific macros in the OVN build system,
> > > >> so we could enable the functionality when built on a compatible
> > > >> platform.
> > > >>
> > > >> > Maybe, others has different opinions.
> > > >>
> > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > >>
> > > >> > Another though is that there is, obviously, a network connection
> > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > >> > host system could just add interfaces over the network connection?
> > > >>
> > > >> There are a few issues with such an approach. One of the main goals
> > > >> with providing and using a NIC with control plane CPUs is having an
> > > >> extra layer of security and isolation which is separate from the
> > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > >> power from. Requiring a connection between the two for operation would
> > > >> defy this purpose.
> > > >>
> > > >> In addition to that, this class of cards provide visibility into
> > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > >> plugging through a remote ovsdb connection, it would have to
> > > >> communicate with something else running on the NIC control plane CPU
> > > >> to retrieve the information it needs before it can know what to relay
> > > >> back over the ovsdb connection.
> > > >>
> > > >> --
> > > >> Frode Nordahl
> > > >>
> > > >> > Best regards, Ilya Maximets.
> > > >
> > > > Here are my 2 cents.
> > > >
> > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > >
> > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> >
> > I am aware of the on-going effort to implement support for this in
> > ovn-kubernetes directly. What we have identified is that there are
> > other CMSs that want this functionality, and with that we have an
> > opportunity to generalise and provide an abstraction in a common place
> > that all the consuming CMSes can benefit from.
> >
> > > >
> > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > >
> > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > cases these ports will require a lot of extra configuration which
> > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > is running with userspace datapath and you need to plug-in some DPDK
> > > ports, where you have to specify the port type, DPDK port config,
> > > porbably also number of rx/tx queues, number of descriptors in these
> > > queues.  You may also want to configure affinity for these queues per
> > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > also might require some extra configuration that OVN will have to
> > > think about now.  This is a typical job for CMS to configure this
> > > kind of stuff, and that is why projects like os-vif or large variety
> > > of CNI/device plugins exists.
> >
> > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > them get their information from the CMS. Providing support for
> > userspace datapath and DPDK would require more information, some of
> > which is available through devlink, some fit well in key/value
> > options. Our initial target would be to support the kernel representor
> > port workflow.
> >
> > > >
> > > > There are some benefits:
> > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > >
> > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > the daemon on the Smart NIC anyways to monitor and configure
> > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > or run management appctl commands if required.
> > > So, I don't see the point why it can't do plugging if it's already
> > > there.
> >
> > This pattern is Kubernetes specific and is not the case for other
> > CMSes. The current proposal for enabling Smart NIC with control plane
> > CPUs for ovn-kubernetes could be simplified if the networking platform
> > provided means for coordinating more of the network related bits.
> >
> > > >
> > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > >
> > > Aside from plugging,
> > > I also don't see the reason to have devlink code in OVN just
> > > because it runs once on the init stage, IIUC.  So, I don't
> > > understand why this information about the hardware cannot be
> > > retrieved during the host provisioning and stored somewhere
> > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > or something?
> >
> > As noted in one of the TODOs in the commit message of the RFC one of
> > the work items to further develop this is to monitor devlink for
> > changes, the end product will not do one-time initialization.
> >
> > While I agree with you that the current SR-IOV acceleration workflow
> > configuration is pretty static and can be done at deploy time, this
> > proposal prepares for the next generation subfunction workflow where
> > you will have a much higher density, and run-time configuration and
> > discovery of representor ports. There is a paper about it from netdev
> > conf[2], and all of this is part of the devlink infrastructure (Linux
> > kernel ABI).
> >
> > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf

> My 2 cents.

Thank you for weighing in with your opinion, Numan.

> I agree with Ilya.  It doesn't seem natural to me for OVN to create
> OVS ports and also
> to use devlink infrastructure.  I think an external entity can do all these.

No doubt an external entity could do this. The challenge is that,
due to how SmartNICs with separate control plane CPUs are laid out,
existing CMS tooling cannot do it, and we are working out a way to
efficiently and securely coordinate how and when this external entity
does the plugging. At the end of the day the data being passed around
is networking related, and I believe it is possible to create a
solution that would be consumable by all OVN-enabled CMSes. We'd be
happy to prove this out outside of ovn-controller first if that is
more palatable, see below.

> The other downside I see with this RFC is that it will be now
> monitoring all the port binding
> rows without any condition because when a port binding is created the
> chassis column will
> be NULL.  Correct me If I'm wrong here.

The intention of the condition is to consider unbound ports for
plugging; as soon as they have been plugged and bound there will be a
value in the chassis column, and from that point only the respective
chassis will monitor them. I guess this could be a problem if there
is a large number of unbound ports in a deployment; from your
experience, would that be the case? (When I think of it, I do
remember the load balancer VIP perhaps being a consumer of unbound
ports.)

We considered using a separate LSP type, but apart from the plugging
bit these LSPs will behave just like regular VIFs from the point of
view of the OVN Port Binding code, and if all your hypervisors have
SmartNICs with control plane CPUs there would be just as many of
those ports.

> Perhaps there can be a separate project called - ovn-vif which does this ?

Creating a separate ovn-org project called ovn-vif would allow us to
move forward; we would of course have to document the optional CMS
API in core OVN, pointing to an optional external dependency for
their use. Would this be an acceptable compromise?

--
Frode Nordahl

> Thanks
> Numan
>
> >
> >
> > --
> > Frode Nordahl
> >
> > > Best regards, Ilya Maximets.
Numan Siddique July 5, 2021, 4:57 p.m. UTC | #8
On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
<frode.nordahl@canonical.com> wrote:
>
> On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> >
> > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > <frode.nordahl@canonical.com> wrote:
> > >
> > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > >
> > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > >
> > > > >
> > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > >>
> > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > >> >
> > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > >> > >
> > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > >> > > operations on the Open vSwitch integration bridge.
> > > > >> > >
> > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > >> > > have a topology where the instance runs on one host and Open
> > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > >> > >
> > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > >> > > As such it is natural to extend OVN to provide this common
> > > > >> > > functionality through its CMS facing API.
> > > > >> >
> > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > >> > and configuration job for CMS.  You may even make it independent
> > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > >>
> > > > >> Ilya,
> > > > >>
> > > > >> Thank you for taking the time to comment, much appreciated.
> > > > >>
> > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > >> and isolation from the host are also new, so this is one proposal for
> > > > >> how we could go about to enable the use of them.
> > > > >>
> > > > >> The OVN controller does today get pretty close to the physical realm
> > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > >> does react to events of physical interfaces being plugged into the
> > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > >> been doing the act of adding the port into the bridge.
> > > > >>
> > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > >> this is that the information about which ports to bind, and where to
> > > > >> bind them is already there. The timing of the information flow from
> > > > >> the CMS is also suitable for the task.
> > > > >>
> > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > >> interfacing with the kernel through netlink and friends are there or
> > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > >> interface is that it provides a generic infrastructure for these types
> > > > >> of NICs. So by using this interface we should be able to support most
> > > > >> if not all of the variants of these cards.
> > > > >>
> > > > >>
> > > > >> Providing a separate OVN service to do the task could work, but would
> > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > >
> > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > Southbound DB is not meant to be a public interface, it just happened
> > > > to be available for connections.  I know that OpenStack has metadata
> > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > think, there should be a different way to get/set required information
> > > > without connection to the Southbound.
> > >
> > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > use of the Southbound DB by external to OVN services. My suggestion
> > > was to have a separate OVN process do this if your objection was to
> > > handle it as part of the ovn-controller process.
> > >
> > > > >>
> > > > >> I fear it would be quite hard to build a whole separate project with
> > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > >> using this?
> > > > >>
> > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > >> > at least for this use case, IIUC.
> > > > >>
> > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > >> have drivers for other platforms, I am sure the necessary
> > > > >> infrastructure will become available there too over time, if it is not
> > > > >> there already.
> > > > >>
> > > > >> We do currently have platform specific macros in the OVN build system,
> > > > >> so we could enable the functionality when built on a compatible
> > > > >> platform.
> > > > >>
> > > > >> > Maybe, others has different opinions.
> > > > >>
> > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > >>
> > > > >> > Another though is that there is, obviously, a network connection
> > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > >> > host system could just add interfaces over the network connection?
> > > > >>
> > > > >> There are a few issues with such an approach. One of the main goals
> > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > >> extra layer of security and isolation which is separate from the
> > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > >> power from. Requiring a connection between the two for operation would
> > > > >> defy this purpose.
> > > > >>
> > > > >> In addition to that, this class of cards provide visibility into
> > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > >> plugging through a remote ovsdb connection, it would have to
> > > > >> communicate with something else running on the NIC control plane CPU
> > > > >> to retrieve the information it needs before it can know what to relay
> > > > >> back over the ovsdb connection.
> > > > >>
> > > > >> --
> > > > >> Frode Nordahl
> > > > >>
> > > > >> > Best regards, Ilya Maximets.
> > > > >
> > > > > Here are my 2 cents.
> > > > >
> > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > >
> > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > >
> > > I am aware of the on-going effort to implement support for this in
> > > ovn-kubernetes directly. What we have identified is that there are
> > > other CMSs that want this functionality, and with that we have an
> > > opportunity to generalise and provide an abstraction in a common place
> > > that all the consuming CMSes can benefit from.
> > >
> > > > >
> > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > >
> > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > cases these ports will require a lot of extra configuration which
> > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > ports, where you have to specify the port type, DPDK port config,
> > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > queues.  You may also want to configure affinity for these queues per
> > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > also might require some extra configuration that OVN will have to
> > > > think about now.  This is a typical job for CMS to configure this
> > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > of CNI/device plugins exists.
> > >
> > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > them get their information from the CMS. Providing support for
> > > userspace datapath and DPDK would require more information, some of
> > > which is available through devlink, some fit well in key/value
> > > options. Our initial target would be to support the kernel representor
> > > port workflow.
> > >
> > > > >
> > > > > There are some benefits:
> > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > >
> > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > or run management appctl commands if required.
> > > > So, I don't see the point why it can't do plugging if it's already
> > > > there.
> > >
> > > This pattern is Kubernetes specific and is not the case for other
> > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > provided means for coordinating more of the network related bits.
> > >
> > > > >
> > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > >
> > > > Aside from plugging,
> > > > I also don't see the reason to have devlink code in OVN just
> > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > understand why this information about the hardware cannot be
> > > > retrieved during the host provisioning and stored somewhere
> > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > or something?
> > >
> > > As noted in one of the TODOs in the commit message of the RFC one of
> > > the work items to further develop this is to monitor devlink for
> > > changes, the end product will not do one-time initialization.
> > >
> > > While I agree with you that the current SR-IOV acceleration workflow
> > > configuration is pretty static and can be done at deploy time, this
> > > proposal prepares for the next generation subfunction workflow where
> > > you will have a much higher density, and run-time configuration and
> > > discovery of representor ports. There is a paper about it from netdev
> > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > kernel ABI).
> > >
> > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
>
> > My 2 cents.
>
> Thank you for weighing in with your opinion, Numan.
>
> > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > OVS ports and also
> > to use devlink infrastructure.  I think an external entity can do all these.
>
> No doubt an external entity could do this, the challenge is that due
> to how the SmartNIC with separate control plane CPUs are laid out
> existing CMS tooling cannot do it, and we are working out a way to
> efficiently and securely coordinate how and when this external entity
> does the plugging. At the end of the day the data being passed around
> is networking related, and I believe it is possible to create a
> solution that would be consumable by all OVN-enabled CMS's. We'd be
> happy to prove this out outside of the ovn-controller first if that is
> more palatable, see below.
>
> > The other downside I see with this RFC is that it will be now
> > monitoring all the port binding
> > rows without any condition because when a port binding is created the
> > chassis column will
> > be NULL.  Correct me If I'm wrong here.
>
> The intention of the condition is to consider unbound ports for
> plugging, as soon as they have been plugged and bound there will be a
> value in the chassis column and at that point only the respective
> chassis will monitor them. I guess this could be a problem if there is
> a large number of unbound ports in a deployment, from your experience
> would that be the case? (When I think of it I do remember the load
> balancer vip perhaps being a consumer of unbound ports).
>
> We considered using a separate LSP type, but apart from the plugging
> bit these LSP's will behave just like regular VIFs from OVN Port
> Binding code POV, and if all your hypervisors have SmartNICs with
> control plane CPUs there would be just as many of those ports.

In my opinion having a separate type would be better.  At the least it
will not impact
normal or existing deployments.

Let's say we call the type "representor" (just picking a name for
now), then each ovn-controller can add a conditional monitor to get
updates for this type (just like how we do now for patch or external
ports).
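
Something like the below, just a sketch, assuming the generated IDL
helpers for the Port_Binding type column follow the existing pattern
and that the new type ends up being spelled "representor":

    struct ovsdb_idl_condition pb = OVSDB_IDL_CONDITION_INIT(&pb);

    /* Keep whatever clauses exist today for patch, external, etc.;
     * this only adds a clause for the new type. */
    sbrec_port_binding_add_clause_type(&pb, OVSDB_F_EQ, "representor");

    sbrec_port_binding_set_condition(ovnsb_idl, &pb);
    ovsdb_idl_condition_destroy(&pb);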

What do you think of the following approach?

1.  A new LSP type - "representor" - is added.  The CMS, when
creating this lport, also sets the requested-chassis option.

2.  ovn-controller on the requested chassis sees this lport type,
creates an OVS port/interface and possibly sets some options or hints
in the external_ids column of the OVS interface.  At this point, it
will not do anything further.

3.  An external entity (similar to your proposed plugging.c) running
on the chassis would have built an shash of "devlink_ports" using the
devlink sockets.

4.  This entity will also be monitoring the OVS interfaces.  When it
sees a new interface with the proper hints, it will set
external_ids:iface-id=<lport_name> (see the sketch after this list).

5.  ovn-controller, when it sees the external_ids:iface-id, will
bind the port and add flows.
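
To make (3), (4) and (5) a bit more concrete, a rough sketch of the
loop the external entity could run, assuming a hint key named
"ovn-plug-lport" written by ovn-controller in (2), an open
transaction on the local Open vSwitch database, and the
"devlink_ports" shash from (3) (all of these names are placeholders):

    const struct ovsrec_interface *iface;

    OVSREC_INTERFACE_FOR_EACH (iface, ovs_idl) {
        const char *lport = smap_get(&iface->external_ids,
                                     "ovn-plug-lport");

        /* Skip interfaces without the hint, or already claimed. */
        if (!lport || smap_get(&iface->external_ids, "iface-id")) {
            continue;
        }

        /* Only claim interfaces that map to a known devlink
         * representor (the shash built in step 3). */
        if (!shash_find(devlink_ports, iface->name)) {
            continue;
        }

        /* Step 4: setting iface-id is what triggers step 5 in
         * ovn-controller. */
        struct smap ids;
        smap_clone(&ids, &iface->external_ids);
        smap_replace(&ids, "iface-id", lport);
        ovsrec_interface_set_external_ids(iface, &ids);
        smap_destroy(&ids);
    }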


With (2) we now add support in ovn-controller to create OVS ports,
but at the same time OVN will not have to bother about the internal
details of SmartNICs or about the devlink sockets.

I'm not sure if the above approach seems hacky.  The external entity
could be called "ovn-vif".

What do you think?


>
> > Perhaps there can be a separate project called - ovn-vif which does this ?
>
> Creating a separate ovn-org project called ovn-vif would allow us to
> move forward, we would of course have to document the optional CMS API
> in core OVN pointing to an optional external dependency for their use.
> Would this be an acceptable compromise?

I personally do not have any objections to that.  Maybe others have
different opinions.

Thanks
Numan

>
> --
> Frode Nordahl
>
> > Thanks
> > Numan
> >
> > >
> > >
> > > --
> > > Frode Nordahl
> > >
> > > > Best regards, Ilya Maximets.
Frode Nordahl July 6, 2021, 8:18 a.m. UTC | #9
On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
>
> On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> <frode.nordahl@canonical.com> wrote:
> >
> > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> > >
> > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > <frode.nordahl@canonical.com> wrote:
> > > >
> > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > >
> > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > >
> > > > > >
> > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > >>
> > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > >> >
> > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > >> > >
> > > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > >> > >
> > > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > > >> > > have a topology where the instance runs on one host and Open
> > > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > > >> > >
> > > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > > >> > > As such it is natural to extend OVN to provide this common
> > > > > >> > > functionality through its CMS facing API.
> > > > > >> >
> > > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > > >> > and configuration job for CMS.  You may even make it independent
> > > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > > >>
> > > > > >> Ilya,
> > > > > >>
> > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > >>
> > > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > > >> and isolation from the host are also new, so this is one proposal for
> > > > > >> how we could go about to enable the use of them.
> > > > > >>
> > > > > >> The OVN controller does today get pretty close to the physical realm
> > > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > > >> does react to events of physical interfaces being plugged into the
> > > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > > >> been doing the act of adding the port into the bridge.
> > > > > >>
> > > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > > >> this is that the information about which ports to bind, and where to
> > > > > >> bind them is already there. The timing of the information flow from
> > > > > >> the CMS is also suitable for the task.
> > > > > >>
> > > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > > >> interfacing with the kernel through netlink and friends are there or
> > > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > > >> interface is that it provides a generic infrastructure for these types
> > > > > >> of NICs. So by using this interface we should be able to support most
> > > > > >> if not all of the variants of these cards.
> > > > > >>
> > > > > >>
> > > > > >> Providing a separate OVN service to do the task could work, but would
> > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > >
> > > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > > Southbound DB is not meant to be a public interface, it just happened
> > > > > to be available for connections.  I know that OpenStack has metadata
> > > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > > think, there should be a different way to get/set required information
> > > > > without connection to the Southbound.
> > > >
> > > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > > use of the Southbound DB by external to OVN services. My suggestion
> > > > was to have a separate OVN process do this if your objection was to
> > > > handle it as part of the ovn-controller process.
> > > >
> > > > > >>
> > > > > >> I fear it would be quite hard to build a whole separate project with
> > > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > > >> using this?
> > > > > >>
> > > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > > >> > at least for this use case, IIUC.
> > > > > >>
> > > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > >> infrastructure will become available there too over time, if it is not
> > > > > >> there already.
> > > > > >>
> > > > > >> We do currently have platform specific macros in the OVN build system,
> > > > > >> so we could enable the functionality when built on a compatible
> > > > > >> platform.
> > > > > >>
> > > > > >> > Maybe, others has different opinions.
> > > > > >>
> > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > >>
> > > > > >> > Another though is that there is, obviously, a network connection
> > > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > > >> > host system could just add interfaces over the network connection?
> > > > > >>
> > > > > >> There are a few issues with such an approach. One of the main goals
> > > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > > >> extra layer of security and isolation which is separate from the
> > > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > > >> power from. Requiring a connection between the two for operation would
> > > > > >> defy this purpose.
> > > > > >>
> > > > > >> In addition to that, this class of cards provide visibility into
> > > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > >> communicate with something else running on the NIC control plane CPU
> > > > > >> to retrieve the information it needs before it can know what to relay
> > > > > >> back over the ovsdb connection.
> > > > > >>
> > > > > >> --
> > > > > >> Frode Nordahl
> > > > > >>
> > > > > >> > Best regards, Ilya Maximets.
> > > > > >
> > > > > > Here are my 2 cents.
> > > > > >
> > > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > > >
> > > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > >
> > > > I am aware of the on-going effort to implement support for this in
> > > > ovn-kubernetes directly. What we have identified is that there are
> > > > other CMSs that want this functionality, and with that we have an
> > > > opportunity to generalise and provide an abstraction in a common place
> > > > that all the consuming CMSes can benefit from.
> > > >
> > > > > >
> > > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > > >
> > > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > > cases these ports will require a lot of extra configuration which
> > > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > > ports, where you have to specify the port type, DPDK port config,
> > > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > > queues.  You may also want to configure affinity for these queues per
> > > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > > also might require some extra configuration that OVN will have to
> > > > > think about now.  This is a typical job for CMS to configure this
> > > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > > of CNI/device plugins exists.
> > > >
> > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > > them get their information from the CMS. Providing support for
> > > > userspace datapath and DPDK would require more information, some of
> > > > which is available through devlink, some fit well in key/value
> > > > options. Our initial target would be to support the kernel representor
> > > > port workflow.
> > > >
> > > > > >
> > > > > > There are some benefits:
> > > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > > >
> > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > or run management appctl commands if required.
> > > > > So, I don't see the point why it can't do plugging if it's already
> > > > > there.
> > > >
> > > > This pattern is Kubernetes specific and is not the case for other
> > > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > > provided means for coordinating more of the network related bits.
> > > >
> > > > > >
> > > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > > >
> > > > > Aside from plugging,
> > > > > I also don't see the reason to have devlink code in OVN just
> > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > understand why this information about the hardware cannot be
> > > > > retrieved during the host provisioning and stored somewhere
> > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > or something?
> > > >
> > > > As noted in one of the TODOs in the commit message of the RFC one of
> > > > the work items to further develop this is to monitor devlink for
> > > > changes, the end product will not do one-time initialization.
> > > >
> > > > While I agree with you that the current SR-IOV acceleration workflow
> > > > configuration is pretty static and can be done at deploy time, this
> > > > proposal prepares for the next generation subfunction workflow where
> > > > you will have a much higher density, and run-time configuration and
> > > > discovery of representor ports. There is a paper about it from netdev
> > > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > > kernel ABI).
> > > >
> > > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> >
> > > My 2 cents.
> >
> > Thank you for weighing in with your opinion, Numan.
> >
> > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > OVS ports and also
> > > to use devlink infrastructure.  I think an external entity can do all these.
> >
> > No doubt an external entity could do this, the challenge is that due
> > to how the SmartNIC with separate control plane CPUs are laid out
> > existing CMS tooling cannot do it, and we are working out a way to
> > efficiently and securely coordinate how and when this external entity
> > does the plugging. At the end of the day the data being passed around
> > is networking related, and I believe it is possible to create a
> > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > happy to prove this out outside of the ovn-controller first if that is
> > more palatable, see below.
> >
> > > The other downside I see with this RFC is that it will be now
> > > monitoring all the port binding
> > > rows without any condition because when a port binding is created the
> > > chassis column will
> > > be NULL.  Correct me If I'm wrong here.
> >
> > The intention of the condition is to consider unbound ports for
> > plugging, as soon as they have been plugged and bound there will be a
> > value in the chassis column and at that point only the respective
> > chassis will monitor them. I guess this could be a problem if there is
> > a large number of unbound ports in a deployment, from your experience
> > would that be the case? (When I think of it I do remember the load
> > balancer vip perhaps being a consumer of unbound ports).
> >
> > We considered using a separate LSP type, but apart from the plugging
> > bit these LSP's will behave just like regular VIFs from OVN Port
> > Binding code POV, and if all your hypervisors have SmartNICs with
> > control plane CPUs there would be just as many of those ports.
>
> In my opinion having a separate type would be better.  At the least it
> will not impact
> normal or existing deployments.
>
> Let's say we call a type as "representor"  (just picking a name for
> now), then each ovn-controller
> can add a conditional monitor to get updates for these types (just
> like how we do now for patch or external ports).
>
> What do you think of the below approach
>
> 1.  A new LSP type - representor is added.  CMS when creating this
> lport also sets the requested-chassis option.

Works for me.

> 2. ovn-controller on the requested chassis sees this lport type,
> creates an OVS port/interface and possibly sets some options or hints
> in the external_ids column
>  of ovs interface.  At this point, it will not do anything.

This could potentially help us avoid the issue of adding more
processes and IDLs talking to the SB DB.

However, at this point in time ovn-controller does not know the
details of the representor port, such as the name/ifindex of the
kernel representor port. The interface it creates in OVS has to match
the kernel representor port, and AFAICT it is not possible to change
these details after port creation. This information is provided by
the devlink lookup in the current proposal.
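
To illustrate what that lookup gives us (purely a sketch using OVS's
shash and xasprintf helpers; the struct, key format and helper name
are made up for this example), the plugging code ends up with
something like a hash of devlink ports that resolves the host-visible
function to the representor netdev the OVS interface record has to be
created with:

    struct devlink_port_entry {
        char *netdev_name;    /* Kernel representor netdev name. */
        uint32_t ifindex;     /* Kernel ifindex of that netdev. */
    };

    /* Resolve a PCI PF/VF pair to its representor, if known. */
    static const struct devlink_port_entry *
    representor_lookup(const struct shash *devlink_ports,
                       uint32_t pf_num, uint32_t vf_num)
    {
        char *key = xasprintf("%"PRIu32"-%"PRIu32, pf_num, vf_num);
        const struct devlink_port_entry *entry =
            shash_find_data(devlink_ports, key);

        free(key);
        return entry;
    }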

It might work if it were possible to create the port and interface
separately, but that would cause a referential integrity violation.
Another alternative could be that ovn-vif creates a new interface
record in 4), attaching it to the port record created by
ovn-controller and replacing a throw-away interface record, but then
we would be approaching hacky territory.
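
For completeness, that hacky alternative would amount to something
like the below in ovn-vif (ovs_txn, port and representor_netdev
assumed to be in scope), relying on ovsdb-server garbage collecting
the then-unreferenced placeholder Interface row:

    /* Create the Interface record matching the kernel representor. */
    struct ovsrec_interface *iface = ovsrec_interface_insert(ovs_txn);
    ovsrec_interface_set_name(iface, representor_netdev);

    /* Re-point the Port row created by ovn-controller at the new
     * Interface; the throw-away Interface it referenced before is no
     * longer referenced and goes away. */
    ovsrec_port_set_interfaces(port, &iface, 1);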

Could we do some sort of local IPC, perhaps through a Unix domain socket?
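
Just to sketch what I mean (socket path and message format entirely
made up): ovn-controller could ask a local ovn-vif daemon which
netdev backs a given logical port before creating the Port/Interface
records, using plain POSIX sockets:

    /* Ask ovn-vif which netdev backs "lport"; returns the number of
     * bytes received, or -1 on error. */
    static ssize_t
    query_representor(const char *lport, char *netdev, size_t size)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int sock = socket(AF_UNIX, SOCK_SEQPACKET, 0);
        ssize_t n = -1;

        strncpy(addr.sun_path, "/run/ovn/ovn-vif.sock",
                sizeof addr.sun_path - 1);
        if (sock >= 0
            && !connect(sock, (struct sockaddr *) &addr, sizeof addr)
            && send(sock, lport, strlen(lport), 0) >= 0) {
            n = recv(sock, netdev, size - 1, 0);
            if (n >= 0) {
                netdev[n] = '\0';
            }
        }
        if (sock >= 0) {
            close(sock);
        }
        return n;
    }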

> 3.  An external entity (similar to your proposed pluggable.c) running
> on the chassis, would have built an shash of  "devlink_ports" using
> the devlink sockets.
>
> 4.  This entity will also be monitoring the ovs interfaces.  When it
> sees a new interface with the proper hints, will set the
> external_ids:iface-id=<lport_name>
>
> 5. ovn-controller when it sees the external_ids:iface-id,  will bind
> the port and add flows.
>
>
> With (2) we now add the support in ovn-controller to create OVS ports,
>  but at the same time OVN will not bother about the internal details
> of SMART nics
> or bother about the devlink sockets.
>
> I'm not sure if the above approach seems hacky.  The external entity
> could be called "ovn-vif".
>
> What do you think ?

This is an interesting idea and it could work if we figure out a way
for ovn-controller and ovn-vif to exchange information before the
port/interface is created.

> >
> > > Perhaps there can be a separate project called - ovn-vif which does this ?
> >
> > Creating a separate ovn-org project called ovn-vif would allow us to
> > move forward, we would of course have to document the optional CMS API
> > in core OVN pointing to an optional external dependency for their use.
> > Would this be an acceptable compromise?
>
> I personally do not have any objections to that.  Maybe others have
> different opinions.

Great.

Cheers!

--
Frode Nordahl


>
> Thanks
> Numan
>
> >
> > --
> > Frode Nordahl
> >
> > > Thanks
> > > Numan
> > >
> > > >
> > > >
> > > > --
> > > > Frode Nordahl
> > > >
> > > > > Best regards, Ilya Maximets.
Han Zhou July 6, 2021, 6:11 p.m. UTC | #10
On Tue, Jul 6, 2021 at 1:19 AM Frode Nordahl <frode.nordahl@canonical.com>
wrote:
>
> On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
> >
> > On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> > <frode.nordahl@canonical.com> wrote:
> > >
> > > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org>
wrote:
> > > >
> > > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > > <frode.nordahl@canonical.com> wrote:
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org>
wrote:
> > > > > >
> > > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <
frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > > >>
> > > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <
i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > > >> >
> > > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > > >> > > Introduce plugging module that adds and removes ports on
the
> > > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > > >> > >
> > > > > > >> > > Traditionally it has been the CMSs responsibility to
create Virtual
> > > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod,
Virtual
> > > > > > >> > > Machine etc.) life cycle, and subsequently manage
plug/unplug
> > > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > > >> > >
> > > > > > >> > > With the advent of NICs connected to multiple distinct
CPUs we can
> > > > > > >> > > have a topology where the instance runs on one host and
Open
> > > > > > >> > > vSwitch and OVN runs on a different host, the smartnic
CPU.
> > > > > > >> > >
> > > > > > >> > > The act of plugging and unplugging the representor port
in Open
> > > > > > >> > > vSwitch running on the smartnic host CPU would be the
same for
> > > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD,
OpenStack, etc.).
> > > > > > >> > > As such it is natural to extend OVN to provide this
common
> > > > > > >> > > functionality through its CMS facing API.
> > > > > > >> >
> > > > > > >> > Hi, Frode.  Thanks for putting this together, but it
doesn't look
> > > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices
or
> > > > > > >> > interacted with the kernel directly.  This change
introduces completely
> > > > > > >> > new functionality inside OVN.  With the same effect we can
run a fully
> > > > > > >> > separate service on these smartnic CPUs that will do
plugging
> > > > > > >> > and configuration job for CMS.  You may even make it
independent
> > > > > > >> > from a particular CMS by creating a REST API for it or
whatever.
> > > > > > >> > This will additionally allow using same service for
non-OVN setups.
> > > > > > >>
> > > > > > >> Ilya,
> > > > > > >>
> > > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > > >>
> > > > > > >> Yes, this is new functionality, NICs with separate control
plane CPUs
> > > > > > >> and isolation from the host are also new, so this is one
proposal for
> > > > > > >> how we could go about to enable the use of them.
> > > > > > >>
> > > > > > >> The OVN controller does today get pretty close to the
physical realm
> > > > > > >> by maintaining patch ports in Open vSwitch based on bridge
mapping
> > > > > > >> configuration and presence of bridges to physical
interfaces. It also
> > > > > > >> does react to events of physical interfaces being plugged
into the
> > > > > > >> Open vSwitch instance it manages, albeit to date some other
entity has
> > > > > > >> been doing the act of adding the port into the bridge.
> > > > > > >>
> > > > > > >> The rationale for proposing to use the OVN database for
coordinating
> > > > > > >> this is that the information about which ports to bind, and
where to
> > > > > > >> bind them is already there. The timing of the information
flow from
> > > > > > >> the CMS is also suitable for the task.
> > > > > > >>
> > > > > > >> OVN relies on OVS library code, and all the necessary
libraries for
> > > > > > >> interfacing with the kernel through netlink and friends are
there or
> > > > > > >> would be easy to add. The rationale for using the
netlink-devlink
> > > > > > >> interface is that it provides a generic infrastructure for
these types
> > > > > > >> of NICs. So by using this interface we should be able to
support most
> > > > > > >> if not all of the variants of these cards.
> > > > > > >>
> > > > > > >>
> > > > > > >> Providing a separate OVN service to do the task could work,
but would
> > > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > > >
> > > > > > IMHO, CMS should never connect to Southbound DB.  It's just
because the
> > > > > > Southbound DB is not meant to be a public interface, it just
happened
> > > > > > to be available for connections.  I know that OpenStack has
metadata
> > > > > > agents that connects to Sb DB, but if it's really required for
them, I
> > > > > > think, there should be a different way to get/set required
information
> > > > > > without connection to the Southbound.
> > > > >
> > > > > The CMS-facing API is the Northbound DB, I was not suggesting
direct
> > > > > use of the Southbound DB by external to OVN services. My
suggestion
> > > > > was to have a separate OVN process do this if your objection was
to
> > > > > handle it as part of the ovn-controller process.
> > > > >
> > > > > > >>
> > > > > > >> I fear it would be quite hard to build a whole separate
project with
> > > > > > >> its own API, feels like a lot of duplicated effort when the
flow of
> > > > > > >> data and APIs in OVN already align so well with CMSs
interested in
> > > > > > >> using this?
> > > > > > >>
> > > > > > >> > Interactions with physical devices also makes OVN
linux-dependent
> > > > > > >> > at least for this use case, IIUC.
> > > > > > >>
> > > > > > >> This specific bit would be linux-specific in the first
iteration, yes.
> > > > > > >> But the vendors manufacturing and distributing the hardware
do often
> > > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > > >> infrastructure will become available there too over time, if
it is not
> > > > > > >> there already.
> > > > > > >>
> > > > > > >> We do currently have platform specific macros in the OVN
build system,
> > > > > > >> so we could enable the functionality when built on a
compatible
> > > > > > >> platform.
> > > > > > >>
> > > > > > >> > Maybe, others has different opinions.
> > > > > > >>
> > > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > > >>
> > > > > > >> > Another though is that there is, obviously, a network
connection
> > > > > > >> > between the host and smartnic system.  Maybe it's possible
to just
> > > > > > >> > add an extra remote to the local ovsdb-server so CMS
daemon on the
> > > > > > >> > host system could just add interfaces over the network
connection?
> > > > > > >>
> > > > > > >> There are a few issues with such an approach. One of the
main goals
> > > > > > >> with providing and using a NIC with control plane CPUs is
having an
> > > > > > >> extra layer of security and isolation which is separate from
the
> > > > > > >> hypervisor host the card happens to share a PCI complex with
and draw
> > > > > > >> power from. Requiring a connection between the two for
operation would
> > > > > > >> defy this purpose.
> > > > > > >>
> > > > > > >> In addition to that, this class of cards provide visibility
into
> > > > > > >> kernel interfaces, enumeration of representor ports etc.
only from the
> > > > > > >> NIC control plane CPU side of the PCI complex, this
information is not
> > > > > > >> provided to the host. So if a hypervisor host CMS agent were
to do the
> > > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > > >> communicate with something else running on the NIC control
plane CPU
> > > > > > >> to retrieve the information it needs before it can know what
to relay
> > > > > > >> back over the ovsdb connection.
> > > > > > >>
> > > > > > >> --
> > > > > > >> Frode Nordahl
> > > > > > >>
> > > > > > >> > Best regards, Ilya Maximets.
> > > > > > >
> > > > > > > Here are my 2 cents.
> > > > > > >
> > > > > > > Initially I had similar concerns to Ilya, and it seems OVN
should stay away from the physical interface plugging. As a reference, here
is how ovn-kubernetes is doing it without adding anything to OVN:
https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing
<
https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing
>
> > > > > >
> > > > > > AFAICT, a big part of the work is already done on the ovn-k8s
side:
> > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > > >
> > > > > I am aware of the on-going effort to implement support for this in
> > > > > ovn-kubernetes directly. What we have identified is that there are
> > > > > other CMSs that want this functionality, and with that we have an
> > > > > opportunity to generalise and provide an abstraction in a common
place
> > > > > that all the consuming CMSes can benefit from.
> > > > >
> > > > > > >
> > > > > > > However, thinking more about it, the proposed approach in
this patch just expands the way how OVN can bind ports, utilizing the
communication channel of OVN (OVSDB connections). If all the information
regarding port binding can be specified by the CMS from NB, then it is not
unnatural for ovn-controller to perform interface binding directly (instead
of passively accepting what is attached by CMS). This kind of information
already existed to some extent - the "requested_chassis" option in
OpenStack. Now it seems this idea is just expanding it to a specific
interface. The difference is that "requested_chassis" is used for
validation only, but now we want to directly apply it. So I think at least
I don't have a strong opinion on the idea.
> > > > > >
> > > > > > While it's, probably, OK for OVN to add port to the OVSDB, in
many
> > > > > > cases these ports will require a lot of extra configuration
which
> > > > > > is typically done by os-vif or CNI/device plugins.  Imagine
that OVS
> > > > > > is running with userspace datapath and you need to plug-in some
DPDK
> > > > > > ports, where you have to specify the port type, DPDK port
config,
> > > > > > porbably also number of rx/tx queues, number of descriptors in
these
> > > > > > queues.  You may also want to configure affinity for these
queues per
> > > > > > PMD thread in OVS.  For kernel interfaces it might be easier,
but they
> > > > > > also might require some extra configuration that OVN will have
to
> > > > > > think about now.  This is a typical job for CMS to configure
this
> > > > > > kind of stuff, and that is why projects like os-vif or large
variety
> > > > > > of CNI/device plugins exists.
> > > > >
> > > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And
both of
> > > > > them get their information from the CMS. Providing support for
> > > > > userspace datapath and DPDK would require more information, some
of
> > > > > which is available through devlink, some fit well in key/value
> > > > > options. Our initial target would be to support the kernel
representor
> > > > > port workflow.
> > > > >
> > > > > > >
> > > > > > > There are some benefits:
> > > > > > > 1) The mechanism can be reused by different CMSes, which may
simplify CMS implementation.
> > > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's
communication channel, which avoids an extra CMS communication channel on
the smart NIC side. (of course this can be achieved by a connection between
the BM and smart NIC with *restricted* API just to convey the necessary
information)
> > > > > >
> > > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > > or run management appctl commands if required.
> > > > > > So, I don't see the point why it can't do plugging if it's
already
> > > > > > there.
> > > > >
> > > > > This pattern is Kubernetes specific and is not the case for other
> > > > > CMSes. The current proposal for enabling Smart NIC with control
plane
> > > > > CPUs for ovn-kubernetes could be simplified if the networking
platform
> > > > > provided means for coordinating more of the network related bits.
> > > > >
> > > > > > >
> > > > > > > As to the negative side, it would increase OVN's complexity,
and as mentioned by Ilya potentially breaks OVN's platform independence. To
avoid this, I think the *plugging* module itself needs to be independent
and pluggable. It can be extended as independent plugins. The plugin would
need to define what information is needed in LSP's "options", and then
implement corresponding drivers. With this approach, even the regular VIFs
can be attached by ovn-controller if CMS can tell the interface name.
Anyway, this is just my brief thinking.
> > > > > >
> > > > > > Aside from plugging,
> > > > > > I also don't see the reason to have devlink code in OVN just
> > > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > > understand why this information about the hardware cannot be
> > > > > > retrieved during the host provisioning and stored somewhere
> > > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > > or something?
> > > > >
> > > > > As noted in one of the TODOs in the commit message of the RFC one
of
> > > > > the work items to further develop this is to monitor devlink for
> > > > > changes, the end product will not do one-time initialization.
> > > > >
> > > > > While I agree with you that the current SR-IOV acceleration
workflow
> > > > > configuration is pretty static and can be done at deploy time,
this
> > > > > proposal prepares for the next generation subfunction workflow
where
> > > > > you will have a much higher density, and run-time configuration
and
> > > > > discovery of representor ports. There is a paper about it from
netdev
> > > > > conf[2], and all of this is part of the devlink infrastructure
(Linux
> > > > > kernel ABI).
> > > > >
> > > > > 2:
https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> > >
> > > > My 2 cents.
> > >
> > > Thank you for weighing in with your opinion, Numan.
> > >
> > > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > > OVS ports and also
> > > > to use devlink infrastructure.  I think an external entity can do
all these.
> > >
> > > No doubt an external entity could do this, the challenge is that due
> > > to how the SmartNIC with separate control plane CPUs are laid out
> > > existing CMS tooling cannot do it, and we are working out a way to
> > > efficiently and securely coordinate how and when this external entity
> > > does the plugging. At the end of the day the data being passed around
> > > is networking related, and I believe it is possible to create a
> > > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > > happy to prove this out outside of the ovn-controller first if that is
> > > more palatable, see below.
> > >
> > > > The other downside I see with this RFC is that it will be now
> > > > monitoring all the port binding
> > > > rows without any condition because when a port binding is created
the
> > > > chassis column will
> > > > be NULL.  Correct me If I'm wrong here.
> > >
> > > The intention of the condition is to consider unbound ports for
> > > plugging, as soon as they have been plugged and bound there will be a
> > > value in the chassis column and at that point only the respective
> > > chassis will monitor them. I guess this could be a problem if there is
> > > a large number of unbound ports in a deployment, from your experience
> > > would that be the case? (When I think of it I do remember the load
> > > balancer vip perhaps being a consumer of unbound ports).
> > >
> > > We considered using a separate LSP type, but apart from the plugging
> > > bit these LSP's will behave just like regular VIFs from OVN Port
> > > Binding code POV, and if all your hypervisors have SmartNICs with
> > > control plane CPUs there would be just as many of those ports.
> >
> > In my opinion having a separate type would be better.  At the least it
> > will not impact
> > normal or existing deployments.
> >
> > Let's say we call a type as "representor"  (just picking a name for
> > now), then each ovn-controller
> > can add a conditional monitor to get updates for these types (just
> > like how we do now for patch or external ports).
> >
> > What do you think of the below approach
> >
> > 1.  A new LSP type - representor is added.  CMS when creating this
> > lport also sets the requested-chassis option.
>
> Works for me.
>
I think it is better not to add a new LSP type for this purpose. It would
be better to make this "interface plugging" feature generic, so that it
can work for other types as well. It could be supported in the future for
a regular VIF, or for a new LSP type, so this behavior is better not tied
to an LSP type.

We could add a new column in the Port_Binding table: plugged_by.
ovn-controller can monitor all port_bindings whose "plugged_by" equals the
current chassis. This way it should be backward compatible and not affect
the use cases that don't need "plugged_by".
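
As a rough sketch of the ovn-controller side (assuming the new column is
added to the SB schema; the plugged_by clause helper below is the one the
IDL would then generate, so its exact name and signature are an
assumption), something like this inside update_sb_monitors():

    /* Only receive Port_Binding rows this chassis is asked to plug. */
    struct ovsdb_idl_condition pb_cond = OVSDB_IDL_CONDITION_INIT(&pb_cond);

    /* ... the existing clauses for locally relevant rows stay as they are ... */

    sbrec_port_binding_add_clause_plugged_by(&pb_cond, OVSDB_F_EQ,
                                             &chassis->header_.uuid);
    ovsdb_idl_set_condition(ovnsb_idl, &sbrec_table_port_binding, &pb_cond);
    ovsdb_idl_condition_destroy(&pb_cond);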

The details of the plugging information can be stored in the options
column with new options defined, such as "plug_type". For a representor
port, "plug_type" can be set to "representor". For each plug_type there
can be type-specific information defined and also stored in the options
column. (Or we can define a new column "plug_options" just for plugging
information.)
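
For illustration, a minimal sketch of how ovn-controller could consume
this (the type-specific option names below are made up for the example,
they are not part of the RFC):

#include <string.h>
#include "smap.h"
#include "lib/ovn-sb-idl.h"

/* Dispatch on the Port_Binding "plug_type" option; everything the
 * type-specific module needs is read from the same options column. */
static void
consider_plug(const struct sbrec_port_binding *pb)
{
    const char *plug_type = smap_get(&pb->options, "plug_type");

    if (!plug_type) {
        return;                 /* Nothing to plug for this binding. */
    }

    if (!strcmp(plug_type, "representor")) {
        /* Hypothetical type-specific options. */
        const char *pf_mac = smap_get(&pb->options, "plug:representor:pf-mac");
        const char *vf_num = smap_get(&pb->options, "plug:representor:vf-num");

        /* Hand off to a (hypothetical) representor plug module. */
        plug_representor(pb, pf_mac, vf_num);
    }
}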

> > 2. ovn-controller on the requested chassis sees this lport type,
> > creates an OVS port/interface and possibly sets some options or hints
> > in the external_ids column
> >  of ovs interface.  At this point, it will not do anything.
>
> This could potentially help us avoid the issue of adding more
> processes and IDLs talking to the SB DB.
>
> However, at this point in time, the ovn-controller does not know the
> details of the representor port, such as name/ifindex for the kernel
> representor port, and the interface it creates in OVS has to match the
> kernel representor port, and AFAICT it is not possible to change these
> details after port creation. This information is provided by the
> devlink lookup in the current proposal.
>
> It might work if it was possible to create the port and interface
> separately, but that would cause a referential integrity violation.
> Another alternative could be that ovn-vif created a new interface
> record in 4), attaching that to the port record created by
> ovn-controller replacing a throw-away interface record, but we would
> be approaching hacky territory.
>
> Could we do some sort of local IPC perhaps through a Unix domain socket?

I think it would be good to have separate *modules* to handle interface
plugging (one module per plug_type), but there is no need for a separate
*process* for it, which would increase the complexity and the IPC cost.
The modules can be invoked as callbacks registered for each plug_type,
each responsible for interpreting the options defined for its type, which
are retrieved directly from the port_binding record in the SB DB. This way
the core ovn-controller components are kept untouched (except for calling
into the plugging modules at some point, without worrying about the
details), and no separate process is introduced. (A separate process is
needed only if there are performance concerns or blocking I/O.)

It is similar to how different dpif providers (or different dev types) are
supported in OVS.
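
Purely as a sketch of that idea (all names below are hypothetical, loosely
modelled on the dpif-provider pattern, not an existing OVN interface):

#include "smap.h"

/* Hypothetical per-plug_type provider interface. */
struct plug_class {
    const char *type;           /* Matched against the "plug_type" option. */

    /* One-time setup, e.g. opening a devlink socket and building the
     * lookup table of representor ports. */
    int (*init)(void);

    /* Translate the plug options of a Port_Binding into the name of the
     * local interface to add to the integration bridge, plus any
     * Interface options/external-ids to set on it. */
    int (*port_prepare)(const struct smap *plug_options,
                        char **ifname, struct smap *iface_options);

    void (*destroy)(void);
};

/* Providers register at startup; ovn-controller dispatches on plug_type. */
void plug_register_provider(const struct plug_class *);
const struct plug_class *plug_get_provider(const char *type);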

>
> > 3.  An external entity (similar to your proposed pluggable.c) running
> > on the chassis, would have built a shash of "devlink_ports" using
> > the devlink sockets.
> >
> > 4.  This entity will also be monitoring the ovs interfaces.  When it
> > sees a new interface with the proper hints, it will set the
> > external_ids:iface-id=<lport_name>
> >
I think setting iface-id (if still necessary) shall be done together with
the port creation, by the plugging module (through a callback).

> > 5. ovn-controller when it sees the external_ids:iface-id,  will bind
> > the port and add flows.
> >
> >
> > With (2) we now add the support in ovn-controller to create OVS ports,
> >  but at the same time OVN will not bother about the internal details
> > of SMART nics
> > or bother about the devlink sockets.
> >
> > I'm not sure if the above approach seems hacky.  The external entity
> > could be called "ovn-vif".
> >
> > What do you think ?
>
> This is an interesting idea and it could work if we figure out a way
> for ovn-controller and ovn-vif to exchange information before the
> port/interface is created.
>

I'd really prefer not to introduce such extra IPC. Would the plug_type
mechanism be generic enough to avoid the need for such external entities?

Thanks,
Han

> > >
> > > > Perhaps there can be a separate project called - ovn-vif which does
this ?
> > >
> > > Creating a separate ovn-org project called ovn-vif would allow us to
> > > move forward, we would of course have to document the optional CMS API
> > > in core OVN pointing to an optional external dependency for their use.
> > > Would this be an acceptable compromise?
> >
> > I personally do not have any objections to that.  Maybe others have
> > different opinions.
>
> Great.
>
> Cheers!
>
> --
> Frode Nordahl
>
>
> >
> > Thanks
> > Numan
> >
> > >
> > > --
> > > Frode Nordahl
> > >
> > > > Thanks
> > > > Numan
> > > >
> > > > >
> > > > >
> > > > > --
> > > > > Frode Nordahl
> > > > >
> > > > > > Best regards, Ilya Maximets.
> > > > > _______________________________________________
> > > > > dev mailing list
> > > > > dev@openvswitch.org
> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > > >
> > > _______________________________________________
> > > dev mailing list
> > > dev@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >
>
>
>
> --
> Frode Nordahl
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Frode Nordahl July 7, 2021, 6:30 a.m. UTC | #11
On Tue, Jul 6, 2021 at 8:11 PM Han Zhou <zhouhan@gmail.com> wrote:
>
>
>
> On Tue, Jul 6, 2021 at 1:19 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
> >
> > On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
> > >
> > > On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> > > <frode.nordahl@canonical.com> wrote:
> > > >
> > > > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > > >
> > > > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > > > <frode.nordahl@canonical.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > >
> > > > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > > > >>
> > > > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > > > >> >
> > > > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > > > >> > >
> > > > > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > > > >> > >
> > > > > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > > > > >> > > have a topology where the instance runs on one host and Open
> > > > > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > > > > >> > >
> > > > > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > > > > >> > > As such it is natural to extend OVN to provide this common
> > > > > > > >> > > functionality through its CMS facing API.
> > > > > > > >> >
> > > > > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > > > > >> > and configuration job for CMS.  You may even make it independent
> > > > > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > > > > >>
> > > > > > > >> Ilya,
> > > > > > > >>
> > > > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > > > >>
> > > > > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > > > > >> and isolation from the host are also new, so this is one proposal for
> > > > > > > >> how we could go about to enable the use of them.
> > > > > > > >>
> > > > > > > >> The OVN controller does today get pretty close to the physical realm
> > > > > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > > > > >> does react to events of physical interfaces being plugged into the
> > > > > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > > > > >> been doing the act of adding the port into the bridge.
> > > > > > > >>
> > > > > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > > > > >> this is that the information about which ports to bind, and where to
> > > > > > > >> bind them is already there. The timing of the information flow from
> > > > > > > >> the CMS is also suitable for the task.
> > > > > > > >>
> > > > > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > > > > >> interfacing with the kernel through netlink and friends are there or
> > > > > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > > > > >> interface is that it provides a generic infrastructure for these types
> > > > > > > >> of NICs. So by using this interface we should be able to support most
> > > > > > > >> if not all of the variants of these cards.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Providing a separate OVN service to do the task could work, but would
> > > > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > > > >
> > > > > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > > > > Southbound DB is not meant to be a public interface, it just happened
> > > > > > > to be available for connections.  I know that OpenStack has metadata
> > > > > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > > > > think, there should be a different way to get/set required information
> > > > > > > without connection to the Southbound.
> > > > > >
> > > > > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > > > > use of the Southbound DB by external to OVN services. My suggestion
> > > > > > was to have a separate OVN process do this if your objection was to
> > > > > > handle it as part of the ovn-controller process.
> > > > > >
> > > > > > > >>
> > > > > > > >> I fear it would be quite hard to build a whole separate project with
> > > > > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > > > > >> using this?
> > > > > > > >>
> > > > > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > > > > >> > at least for this use case, IIUC.
> > > > > > > >>
> > > > > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > > > >> infrastructure will become available there too over time, if it is not
> > > > > > > >> there already.
> > > > > > > >>
> > > > > > > >> We do currently have platform specific macros in the OVN build system,
> > > > > > > >> so we could enable the functionality when built on a compatible
> > > > > > > >> platform.
> > > > > > > >>
> > > > > > > >> > Maybe, others has different opinions.
> > > > > > > >>
> > > > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > > > >>
> > > > > > > >> > Another though is that there is, obviously, a network connection
> > > > > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > > > > >> > host system could just add interfaces over the network connection?
> > > > > > > >>
> > > > > > > >> There are a few issues with such an approach. One of the main goals
> > > > > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > > > > >> extra layer of security and isolation which is separate from the
> > > > > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > > > > >> power from. Requiring a connection between the two for operation would
> > > > > > > >> defy this purpose.
> > > > > > > >>
> > > > > > > >> In addition to that, this class of cards provide visibility into
> > > > > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > > > >> communicate with something else running on the NIC control plane CPU
> > > > > > > >> to retrieve the information it needs before it can know what to relay
> > > > > > > >> back over the ovsdb connection.
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Frode Nordahl
> > > > > > > >>
> > > > > > > >> > Best regards, Ilya Maximets.
> > > > > > > >
> > > > > > > > Here are my 2 cents.
> > > > > > > >
> > > > > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > > > > >
> > > > > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > > > >
> > > > > > I am aware of the on-going effort to implement support for this in
> > > > > > ovn-kubernetes directly. What we have identified is that there are
> > > > > > other CMSs that want this functionality, and with that we have an
> > > > > > opportunity to generalise and provide an abstraction in a common place
> > > > > > that all the consuming CMSes can benefit from.
> > > > > >
> > > > > > > >
> > > > > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > > > > >
> > > > > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > > > > cases these ports will require a lot of extra configuration which
> > > > > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > > > > ports, where you have to specify the port type, DPDK port config,
> > > > > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > > > > queues.  You may also want to configure affinity for these queues per
> > > > > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > > > > also might require some extra configuration that OVN will have to
> > > > > > > think about now.  This is a typical job for CMS to configure this
> > > > > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > > > > of CNI/device plugins exists.
> > > > > >
> > > > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > > > > them get their information from the CMS. Providing support for
> > > > > > userspace datapath and DPDK would require more information, some of
> > > > > > which is available through devlink, some fit well in key/value
> > > > > > options. Our initial target would be to support the kernel representor
> > > > > > port workflow.
> > > > > >
> > > > > > > >
> > > > > > > > There are some benefits:
> > > > > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > > > > >
> > > > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > > > or run management appctl commands if required.
> > > > > > > So, I don't see the point why it can't do plugging if it's already
> > > > > > > there.
> > > > > >
> > > > > > This pattern is Kubernetes specific and is not the case for other
> > > > > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > > > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > > > > provided means for coordinating more of the network related bits.
> > > > > >
> > > > > > > >
> > > > > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > > > > >
> > > > > > > Aside from plugging,
> > > > > > > I also don't see the reason to have devlink code in OVN just
> > > > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > > > understand why this information about the hardware cannot be
> > > > > > > retrieved during the host provisioning and stored somewhere
> > > > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > > > or something?
> > > > > >
> > > > > > As noted in one of the TODOs in the commit message of the RFC one of
> > > > > > the work items to further develop this is to monitor devlink for
> > > > > > changes, the end product will not do one-time initialization.
> > > > > >
> > > > > > While I agree with you that the current SR-IOV acceleration workflow
> > > > > > configuration is pretty static and can be done at deploy time, this
> > > > > > proposal prepares for the next generation subfunction workflow where
> > > > > > you will have a much higher density, and run-time configuration and
> > > > > > discovery of representor ports. There is a paper about it from netdev
> > > > > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > > > > kernel ABI).
> > > > > >
> > > > > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> > > >
> > > > > My 2 cents.
> > > >
> > > > Thank you for weighing in with your opinion, Numan.
> > > >
> > > > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > > > OVS ports and also
> > > > > to use devlink infrastructure.  I think an external entity can do all these.
> > > >
> > > > No doubt an external entity could do this, the challenge is that due
> > > > to how the SmartNIC with separate control plane CPUs are laid out
> > > > existing CMS tooling cannot do it, and we are working out a way to
> > > > efficiently and securely coordinate how and when this external entity
> > > > does the plugging. At the end of the day the data being passed around
> > > > is networking related, and I believe it is possible to create a
> > > > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > > > happy to prove this out outside of the ovn-controller first if that is
> > > > more palatable, see below.
> > > >
> > > > > The other downside I see with this RFC is that it will be now
> > > > > monitoring all the port binding
> > > > > rows without any condition because when a port binding is created the
> > > > > chassis column will
> > > > > be NULL.  Correct me If I'm wrong here.
> > > >
> > > > The intention of the condition is to consider unbound ports for
> > > > plugging, as soon as they have been plugged and bound there will be a
> > > > value in the chassis column and at that point only the respective
> > > > chassis will monitor them. I guess this could be a problem if there is
> > > > a large number of unbound ports in a deployment, from your experience
> > > > would that be the case? (When I think of it I do remember the load
> > > > balancer vip perhaps being a consumer of unbound ports).
> > > >
> > > > We considered using a separate LSP type, but apart from the plugging
> > > > bit these LSP's will behave just like regular VIFs from OVN Port
> > > > Binding code POV, and if all your hypervisors have SmartNICs with
> > > > control plane CPUs there would be just as many of those ports.


> > > In my opinion having a separate type would be better.  At the least it
> > > will not impact
> > > normal or existing deployments.
> > >
> > > Let's say we call a type as "representor"  (just picking a name for
> > > now), then each ovn-controller
> > > can add a conditional monitor to get updates for these types (just
> > > like how we do now for patch or external ports).
> > >
> > > What do you think of the below approach
> > >
> > > 1.  A new LSP type - representor is added.  CMS when creating this
> > > lport also sets the requested-chassis option.
> >
> > Works for me.
> >
> I think it is better not to add a new LSP type for this purpose. It would be better to make this "interface plugging" feature generic, so that it can work for other types as well. It could be supported in the future for a regular VIF, or for a new LSP type, so I think this behavior is better not to be tied to a LSP type.
>
> We could add a new column in the Port_Binding table: plugged_by. ovn-controller can monitor all port_bindings that have "plugged_by" equals to current chassis. This way it should be backward compatible and not affecting the use cases that doesn't need "plugged_by".

That is a good idea. During development of the RFC I did make an
attempt at creating an IDL index on options:requested-chassis, but
that does not work because it would only match rows with
requested-chassis as the *only* option.

Creating a new column would resolve this issue!

> The details of plugging information can be stored in the options column with new options defined, such as: "plug_type". For representor port the "plug_type" can be defined as "representor". For each plug_type there can be type-specific information defined and also stored in the options column. (Or we can define a new column "plug_options" just for plugging information).

A convention of prefixing all plugging-related options the same way,
let's say with 'plug_', would work as well as a separate column, but I
have nothing against a separate column if there is appetite for that.
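
A minimal sketch of what the prefix convention could look like on the
consuming side (the 'plug_' prefix is only a suggestion):

#include <string.h>
#include "smap.h"

/* Copy only the plugging-related keys out of a Port_Binding options smap,
 * assuming they all share a common "plug_" prefix. */
static void
collect_plug_options(const struct smap *pb_options, struct smap *plug_opts)
{
    struct smap_node *node;

    SMAP_FOR_EACH (node, pb_options) {
        if (!strncmp(node->key, "plug_", strlen("plug_"))) {
            smap_add(plug_opts, node->key, node->value);
        }
    }
}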

> > > 2. ovn-controller on the requested chassis sees this lport type,
> > > creates an OVS port/interface and possibly sets some options or hints
> > > in the external_ids column
> > >  of ovs interface.  At this point, it will not do anything.
> >
> > This could potentially help us avoid the issue of adding more
> > processes and IDLs talking to the SB DB.
> >
> > However, at this point in time, the ovn-controller does not know the
> > details of the representor port, such as name/ifindex for the kernel
> > representor port, and the interface it creates in OVS has to match the
> > kernel representor port, and AFAICT it is not possible to change these
> > details after port creation. This information is provided by the
> > devlink lookup in the current proposal.
> >
> > It might work if it was possible to create the port and interface
> > separately, but that would cause a referential integrity violation.
> > Another alternative could be that ovn-vif created a new interface
> > record in 4), attaching that to the port record created by
> > ovn-controller replacing a throw-away interface record, but we would
> > be approaching hacky territory.
> >
> > Could we do some sort of local IPC perhaps through a Unix domain socket?
>
> I think it would be good to have separate *modules* to handle interface plugging (one module per plug_type), but there is no need to have a separate *process* for it, which increases the complexity and the IPC cost. The modules can be invoked as callbacks registered for each plug_type, and responsible for explaining the options defined for its type, which is directly retrieved from the port_binding record from SB DB. This way the core ovn-controller components are kept untouched (except calling the plugging modules at some point, but not worrying about the details), without introducing a separate process. (A separate process is needed only if there are performance concerns or blocking I/Os).
>
> It is similar to how different dpif providers (or different dev types) are supported in OVS.

Yes, I like this. The currently proposed devlink-informed plugging
would have no issues with blocking I/O, and we could document a
"charter" for the ovn-vif project stating that anything going into it
would have to operate as if it were native OVN asynchronous code.

> >
> > > 3.  An external entity (similar to your proposed pluggable.c) running
> > > on the chassis, would have built a shash of "devlink_ports" using
> > > the devlink sockets.
> > >
> > > 4.  This entity will also be monitoring the ovs interfaces.  When it
> > > sees a new interface with the proper hints, it will set the
> > > external_ids:iface-id=<lport_name>
> > >
> I think setting iface-id (if still necessary) shall be done together with the port creation, by the plugging module (through a callback).

+1

> > > 5. ovn-controller when it sees the external_ids:iface-id,  will bind
> > > the port and add flows.
> > >
> > >
> > > With (2) we now add the support in ovn-controller to create OVS ports,
> > >  but at the same time OVN will not bother about the internal details
> > > of SMART nics
> > > or bother about the devlink sockets.
> > >
> > > I'm not sure if the above approach seems hacky.  The external entity
> > > could be called "ovn-vif".
> > >
> > > What do you think ?
> >
> > This is an interesting idea and it could work if we figure out a way
> > for ovn-controller and ovn-vif to exchange information before the
> > port/interface is created.
> >
>
> I'd really prefer not to introduce such extra IPC. Would the plug_type mechanism be generic enough and avoid such external entities?

I agree; optionally pulling the code in at compile time would make
things much simpler and more effective.
Frode Nordahl July 8, 2021, 9:58 a.m. UTC | #12
On Wed, Jul 7, 2021 at 8:30 AM Frode Nordahl
<frode.nordahl@canonical.com> wrote:
>
> On Tue, Jul 6, 2021 at 8:11 PM Han Zhou <zhouhan@gmail.com> wrote:
> >
> >
> >
> > On Tue, Jul 6, 2021 at 1:19 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
> > >
> > > On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
> > > >
> > > > On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> > > > <frode.nordahl@canonical.com> wrote:
> > > > >
> > > > > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > > > >
> > > > > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > > > > <frode.nordahl@canonical.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > >
> > > > > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > > > > >>
> > > > > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > > > > >> >
> > > > > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > > > > >> > >
> > > > > > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > > > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > > > > >> > >
> > > > > > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > > > > > >> > > have a topology where the instance runs on one host and Open
> > > > > > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > > > > > >> > >
> > > > > > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > > > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > > > > > >> > > As such it is natural to extend OVN to provide this common
> > > > > > > > >> > > functionality through its CMS facing API.
> > > > > > > > >> >
> > > > > > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > > > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > > > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > > > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > > > > > >> > and configuration job for CMS.  You may even make it independent
> > > > > > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > > > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > > > > > >>
> > > > > > > > >> Ilya,
> > > > > > > > >>
> > > > > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > > > > >>
> > > > > > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > > > > > >> and isolation from the host are also new, so this is one proposal for
> > > > > > > > >> how we could go about to enable the use of them.
> > > > > > > > >>
> > > > > > > > >> The OVN controller does today get pretty close to the physical realm
> > > > > > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > > > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > > > > > >> does react to events of physical interfaces being plugged into the
> > > > > > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > > > > > >> been doing the act of adding the port into the bridge.
> > > > > > > > >>
> > > > > > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > > > > > >> this is that the information about which ports to bind, and where to
> > > > > > > > >> bind them is already there. The timing of the information flow from
> > > > > > > > >> the CMS is also suitable for the task.
> > > > > > > > >>
> > > > > > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > > > > > >> interfacing with the kernel through netlink and friends are there or
> > > > > > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > > > > > >> interface is that it provides a generic infrastructure for these types
> > > > > > > > >> of NICs. So by using this interface we should be able to support most
> > > > > > > > >> if not all of the variants of these cards.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> Providing a separate OVN service to do the task could work, but would
> > > > > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > > > > >
> > > > > > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > > > > > Southbound DB is not meant to be a public interface, it just happened
> > > > > > > > to be available for connections.  I know that OpenStack has metadata
> > > > > > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > > > > > think, there should be a different way to get/set required information
> > > > > > > > without connection to the Southbound.
> > > > > > >
> > > > > > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > > > > > use of the Southbound DB by external to OVN services. My suggestion
> > > > > > > was to have a separate OVN process do this if your objection was to
> > > > > > > handle it as part of the ovn-controller process.
> > > > > > >
> > > > > > > > >>
> > > > > > > > >> I fear it would be quite hard to build a whole separate project with
> > > > > > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > > > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > > > > > >> using this?
> > > > > > > > >>
> > > > > > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > > > > > >> > at least for this use case, IIUC.
> > > > > > > > >>
> > > > > > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > > > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > > > > >> infrastructure will become available there too over time, if it is not
> > > > > > > > >> there already.
> > > > > > > > >>
> > > > > > > > >> We do currently have platform specific macros in the OVN build system,
> > > > > > > > >> so we could enable the functionality when built on a compatible
> > > > > > > > >> platform.
> > > > > > > > >>
> > > > > > > > >> > Maybe, others has different opinions.
> > > > > > > > >>
> > > > > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > > > > >>
> > > > > > > > >> > Another though is that there is, obviously, a network connection
> > > > > > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > > > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > > > > > >> > host system could just add interfaces over the network connection?
> > > > > > > > >>
> > > > > > > > >> There are a few issues with such an approach. One of the main goals
> > > > > > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > > > > > >> extra layer of security and isolation which is separate from the
> > > > > > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > > > > > >> power from. Requiring a connection between the two for operation would
> > > > > > > > >> defy this purpose.
> > > > > > > > >>
> > > > > > > > >> In addition to that, this class of cards provide visibility into
> > > > > > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > > > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > > > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > > > > >> communicate with something else running on the NIC control plane CPU
> > > > > > > > >> to retrieve the information it needs before it can know what to relay
> > > > > > > > >> back over the ovsdb connection.
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Frode Nordahl
> > > > > > > > >>
> > > > > > > > >> > Best regards, Ilya Maximets.
> > > > > > > > >
> > > > > > > > > Here are my 2 cents.
> > > > > > > > >
> > > > > > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > > > > > >
> > > > > > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > > > > >
> > > > > > > I am aware of the on-going effort to implement support for this in
> > > > > > > ovn-kubernetes directly. What we have identified is that there are
> > > > > > > other CMSs that want this functionality, and with that we have an
> > > > > > > opportunity to generalise and provide an abstraction in a common place
> > > > > > > that all the consuming CMSes can benefit from.
> > > > > > >
> > > > > > > > >
> > > > > > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > > > > > >
> > > > > > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > > > > > cases these ports will require a lot of extra configuration which
> > > > > > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > > > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > > > > > ports, where you have to specify the port type, DPDK port config,
> > > > > > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > > > > > queues.  You may also want to configure affinity for these queues per
> > > > > > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > > > > > also might require some extra configuration that OVN will have to
> > > > > > > > think about now.  This is a typical job for CMS to configure this
> > > > > > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > > > > > of CNI/device plugins exists.
> > > > > > >
> > > > > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > > > > > them get their information from the CMS. Providing support for
> > > > > > > userspace datapath and DPDK would require more information, some of
> > > > > > > which is available through devlink, some fit well in key/value
> > > > > > > options. Our initial target would be to support the kernel representor
> > > > > > > port workflow.
> > > > > > >
> > > > > > > > >
> > > > > > > > > There are some benefits:
> > > > > > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > > > > > >
> > > > > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > > > > or run management appctl commands if required.
> > > > > > > > So, I don't see the point why it can't do plugging if it's already
> > > > > > > > there.
> > > > > > >
> > > > > > > This pattern is Kubernetes specific and is not the case for other
> > > > > > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > > > > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > > > > > provided means for coordinating more of the network related bits.
> > > > > > >
> > > > > > > > >
> > > > > > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > > > > > >
> > > > > > > > Aside from plugging,
> > > > > > > > I also don't see the reason to have devlink code in OVN just
> > > > > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > > > > understand why this information about the hardware cannot be
> > > > > > > > retrieved during the host provisioning and stored somewhere
> > > > > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > > > > or something?
> > > > > > >
> > > > > > > As noted in one of the TODOs in the commit message of the RFC one of
> > > > > > > the work items to further develop this is to monitor devlink for
> > > > > > > changes, the end product will not do one-time initialization.
> > > > > > >
> > > > > > > While I agree with you that the current SR-IOV acceleration workflow
> > > > > > > configuration is pretty static and can be done at deploy time, this
> > > > > > > proposal prepares for the next generation subfunction workflow where
> > > > > > > you will have a much higher density, and run-time configuration and
> > > > > > > discovery of representor ports. There is a paper about it from netdev
> > > > > > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > > > > > kernel ABI).
> > > > > > >
> > > > > > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> > > > >
> > > > > > My 2 cents.
> > > > >
> > > > > Thank you for weighing in with your opinion, Numan.
> > > > >
> > > > > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > > > > OVS ports and also
> > > > > > to use devlink infrastructure.  I think an external entity can do all these.
> > > > >
> > > > > No doubt an external entity could do this, the challenge is that due
> > > > > to how the SmartNIC with separate control plane CPUs are laid out
> > > > > existing CMS tooling cannot do it, and we are working out a way to
> > > > > efficiently and securely coordinate how and when this external entity
> > > > > does the plugging. At the end of the day the data being passed around
> > > > > is networking related, and I believe it is possible to create a
> > > > > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > > > > happy to prove this out outside of the ovn-controller first if that is
> > > > > more palatable, see below.
> > > > >
> > > > > > The other downside I see with this RFC is that it will be now
> > > > > > monitoring all the port binding
> > > > > > rows without any condition because when a port binding is created the
> > > > > > chassis column will
> > > > > > be NULL.  Correct me If I'm wrong here.
> > > > >
> > > > > The intention of the condition is to consider unbound ports for
> > > > > plugging, as soon as they have been plugged and bound there will be a
> > > > > value in the chassis column and at that point only the respective
> > > > > chassis will monitor them. I guess this could be a problem if there is
> > > > > a large number of unbound ports in a deployment, from your experience
> > > > > would that be the case? (When I think of it I do remember the load
> > > > > balancer vip perhaps being a consumer of unbound ports).
> > > > >
> > > > > We considered using a separate LSP type, but apart from the plugging
> > > > > bit these LSP's will behave just like regular VIFs from OVN Port
> > > > > Binding code POV, and if all your hypervisors have SmartNICs with
> > > > > control plane CPUs there would be just as many of those ports.
>
>
> > > > In my opinion having a separate type would be better.  At the least it
> > > > will not impact
> > > > normal or existing deployments.
> > > >
> > > > Let's say we call a type as "representor"  (just picking a name for
> > > > now), then each ovn-controller
> > > > can add a conditional monitor to get updates for these types (just
> > > > like how we do now for patch or external ports).
> > > >


> > > > What do you think of the below approach
> > > >
> > > > 1.  A new LSP type - representor is added.  CMS when creating this
> > > > lport also sets the requested-chassis option.
> > >
> > > Works for me.
> > >
> > I think it is better not to add a new LSP type for this purpose. It would be better to make this "interface plugging" feature generic, so that it can work for other types as well. It could be supported in the future for a regular VIF, or for a new LSP type, so I think this behavior is better not to be tied to a LSP type.
> >
> > We could add a new column in the Port_Binding table: plugged_by. ovn-controller can monitor all port_bindings that have "plugged_by" equals to current chassis. This way it should be backward compatible and not affecting the use cases that doesn't need "plugged_by".
>
> That is a good idea, during development of the RFC I did make an
> attempt at creating an IDL index for the options:requested-chassis,
> but that does not work because it would only match rows with
> requested-chassis as the *only* option.
>
> Creating a new column would resolve this issue!

Looking into how this bit could be implemented, I see that northd
already performs lookups of chassis by name in several places, so I'm
thinking we could accomplish this without any change to the NB schema.

With the following change to the SB schema, we could have northd check
for the presence of the Logical_Switch_Port options:plug_type and
options:requested-chassis; if both are present, it would look up the
chassis UUID by name from options:requested-chassis and insert it into
the SB Port_Binding plugged_by column.

diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
index bbf60781d..0dee8e155 100644
--- a/ovn-sb.ovsschema
+++ b/ovn-sb.ovsschema
@@ -229,7 +229,11 @@
                 "external_ids": {"type": {"key": "string",
                                  "value": "string",
                                  "min": 0,
-                                 "max": "unlimited"}}},
+                                 "max": "unlimited"}},
+                "plugged_by": {"type": {"key": {"type": "uuid",
+                                                "refTable": "Chassis",
+                                                "refType": "weak"},
+                                        "min": 0, "max": 1}}},
             "indexes": [["datapath", "tunnel_key"], ["logical_port"]],
             "isRoot": true},
         "MAC_Binding": {

What do you think?

--
Frode Nordahl

> > The details of plugging information can be stored in the options column with new options defined, such as: "plug_type". For representor port the "plug_type" can be defined as "representor". For each plug_type there can be type-specific information defined and also stored in the options column. (Or we can define a new column "plug_options" just for plugging information).
>
> A convention to prefix all plugging related options the same way,
> let's say: 'plug_', would work as well as a separate column, but I
> have nothing against a separate column if there is appetite for that.
>
> > > > 2. ovn-controller on the requested chassis sees this lport type,
> > > > creates an OVS port/interface and possibly sets some options or hints
> > > > in the external_ids column
> > > >  of ovs interface.  At this point, it will not do anything.
> > >
> > > This could potentially help us avoid the issue of adding more
> > > processes and IDLs talking to the SB DB.
> > >
> > > However, at this point in time, the ovn-controller does not know the
> > > details of the representor port, such as name/ifindex for the kernel
> > > representor port, and the interface it creates in OVS has to match the
> > > kernel representor port, and afict it is not possible to change these
> > > details after port creation. This information is provided by the
> > > devlink lookup in the current proposal.
> > >
> > > It might work if it was possible to create the port and interface
> > > separately, but that would cause a referential integrity violation.
> > > Another alternative could be that ovn-vif created a new interface
> > > record in 4), attaching that to the port record created by
> > > ovn-controller replacing a throw-away interface record, but we would
> > > be approaching hacky territory.
> > >
> > > Could we do some sort of local IPC perhaps through a Unix domain socket?
> >
> > I think it would be good to have separate *modules* to handle interface plugging (one module per plug_type), but there is no need to have a separate *process* for it, which increases the complexity and the IPC cost. The modules can be invoked as callbacks registered for each plug_type, and responsible for explaining the options defined for its type, which is directly retrieved from the port_binding record from SB DB. This way the core ovn-controller components are kept untouched (except calling the plugging modules at some point, but not worrying about the details), without introducing a separate process. (A separate process is needed only if there are performance concerns or blocking I/Os).
> >
> > It is similar to how different dpif providers (or different dev types) are supported in OVS.
>
> Yes, I like this. The currently proposed devlink informed plugging
> would have no issues with blocking I/O, and we could document a
> "charter" for the ovn-vif project that anything going into it would
> have to operate as if it was native OVN asynchronous code.
>
> > >
> > > > 3.  An external entity (similar to your proposed pluggable.c) running
> > > > on the chassis, would have built an shash of  "devlink_ports" using
> > > > the devlink sockets.
> > > >
> > > > 4.  This entity will also be monitoring the ovs interfaces.  When it
> > > > sees a new interface with the proper hints, will set the
> > > > external_ids:iface-id=<lport_name>
> > > >
> > I think setting iface-id (if still necessary) shall be done together with the port creation, by the plugging module (through a callback).
>
> +1
>
> > > > 5. ovn-controller when it sees the external_ids:iface-id,  will bind
> > > > the port and add flows.
> > > >
> > > >
> > > > With (2) we now add the support in ovn-controller to create OVS ports,
> > > >  but at the same time OVN will not bother about the internal details
> > > > of SMART nics
> > > > or bother about the devlink sockets.
> > > >
> > > > I'm not sure if the above approach seems hacky.  The external entity
> > > > could be called "ovn-vif".
> > > >
> > > > What do you think ?
> > >
> > > This is an interesting idea and it could work if we figure out a way
> > > for ovn-controller and ovn-vif to exchange information before the
> > > port/interface is created.
> > >
> >
> > I'd really prefer not to introduce such extra IPC. Would the plug_type mechanism be generic enough and avoid such external entities?
>
> I agree, optionally pulling the code in at compile time would make
> things much simpler and more effective.
>
> --
> Frode Nordahl
>
> > Thanks,
> > Han
> >
> > > > >
> > > > > > Perhaps there can be a separate project called - ovn-vif which does this ?
> > > > >
> > > > > Creating a separate ovn-org project called ovn-vif would allow us to
> > > > > move forward, we would of course have to document the optional CMS API
> > > > > in core OVN pointing to an optional external dependency for their use.
> > > > > Would this be an acceptable compromise?
> > > >
> > > > I personally do not have any objections to that.  Maybe others have
> > > > different opinions.
> > >
> > > Great.
> > >
> > > Cheers!
> > >
> > > --
> > > Frode Nordahl
> > >
> > >
> > > >
> > > > Thanks
> > > > Numan
> > > >
> > > > >
> > > > > --
> > > > > Frode Nordahl
> > > > >
> > > > > > Thanks
> > > > > > Numan
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Frode Nordahl
> > > > > > >
> > > > > > > > Best regards, Ilya Maximets.
> > >
> > >
> > >
> > > --
> > > Frode Nordahl
Frode Nordahl July 9, 2021, 7:04 a.m. UTC | #13
On Thu, Jul 8, 2021 at 11:58 AM Frode Nordahl
<frode.nordahl@canonical.com> wrote:
>
> On Wed, Jul 7, 2021 at 8:30 AM Frode Nordahl
> <frode.nordahl@canonical.com> wrote:
> >
> > On Tue, Jul 6, 2021 at 8:11 PM Han Zhou <zhouhan@gmail.com> wrote:
> > >
> > >
> > >
> > > On Tue, Jul 6, 2021 at 1:19 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
> > > >
> > > > On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
> > > > >
> > > > > On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> > > > > <frode.nordahl@canonical.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > > > > > <frode.nordahl@canonical.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > >
> > > > > > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > > > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > > > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > > > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > > > > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > > > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > > > > > >> > >
> > > > > > > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > > > > > > >> > > have a topology where the instance runs on one host and Open
> > > > > > > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > > > > > > >> > >
> > > > > > > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > > > > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > > > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > > > > > > >> > > As such it is natural to extend OVN to provide this common
> > > > > > > > > >> > > functionality through its CMS facing API.
> > > > > > > > > >> >
> > > > > > > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > > > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > > > > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > > > > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > > > > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > > > > > > >> > and configuration job for CMS.  You may even make it independent
> > > > > > > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > > > > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > > > > > > >>
> > > > > > > > > >> Ilya,
> > > > > > > > > >>
> > > > > > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > > > > > >>
> > > > > > > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > > > > > > >> and isolation from the host are also new, so this is one proposal for
> > > > > > > > > >> how we could go about to enable the use of them.
> > > > > > > > > >>
> > > > > > > > > >> The OVN controller does today get pretty close to the physical realm
> > > > > > > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > > > > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > > > > > > >> does react to events of physical interfaces being plugged into the
> > > > > > > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > > > > > > >> been doing the act of adding the port into the bridge.
> > > > > > > > > >>
> > > > > > > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > > > > > > >> this is that the information about which ports to bind, and where to
> > > > > > > > > >> bind them is already there. The timing of the information flow from
> > > > > > > > > >> the CMS is also suitable for the task.
> > > > > > > > > >>
> > > > > > > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > > > > > > >> interfacing with the kernel through netlink and friends are there or
> > > > > > > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > > > > > > >> interface is that it provides a generic infrastructure for these types
> > > > > > > > > >> of NICs. So by using this interface we should be able to support most
> > > > > > > > > >> if not all of the variants of these cards.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> Providing a separate OVN service to do the task could work, but would
> > > > > > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > > > > > >
> > > > > > > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > > > > > > Southbound DB is not meant to be a public interface, it just happened
> > > > > > > > > to be available for connections.  I know that OpenStack has metadata
> > > > > > > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > > > > > > think, there should be a different way to get/set required information
> > > > > > > > > without connection to the Southbound.
> > > > > > > >
> > > > > > > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > > > > > > use of the Southbound DB by external to OVN services. My suggestion
> > > > > > > > was to have a separate OVN process do this if your objection was to
> > > > > > > > handle it as part of the ovn-controller process.
> > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> I fear it would be quite hard to build a whole separate project with
> > > > > > > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > > > > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > > > > > > >> using this?
> > > > > > > > > >>
> > > > > > > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > > > > > > >> > at least for this use case, IIUC.
> > > > > > > > > >>
> > > > > > > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > > > > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > > > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > > > > > >> infrastructure will become available there too over time, if it is not
> > > > > > > > > >> there already.
> > > > > > > > > >>
> > > > > > > > > >> We do currently have platform specific macros in the OVN build system,
> > > > > > > > > >> so we could enable the functionality when built on a compatible
> > > > > > > > > >> platform.
> > > > > > > > > >>
> > > > > > > > > >> > Maybe, others has different opinions.
> > > > > > > > > >>
> > > > > > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > > > > > >>
> > > > > > > > > >> > Another though is that there is, obviously, a network connection
> > > > > > > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > > > > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > > > > > > >> > host system could just add interfaces over the network connection?
> > > > > > > > > >>
> > > > > > > > > >> There are a few issues with such an approach. One of the main goals
> > > > > > > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > > > > > > >> extra layer of security and isolation which is separate from the
> > > > > > > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > > > > > > >> power from. Requiring a connection between the two for operation would
> > > > > > > > > >> defy this purpose.
> > > > > > > > > >>
> > > > > > > > > >> In addition to that, this class of cards provide visibility into
> > > > > > > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > > > > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > > > > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > > > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > > > > > >> communicate with something else running on the NIC control plane CPU
> > > > > > > > > >> to retrieve the information it needs before it can know what to relay
> > > > > > > > > >> back over the ovsdb connection.
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> Frode Nordahl
> > > > > > > > > >>
> > > > > > > > > >> > Best regards, Ilya Maximets.
> > > > > > > > > >
> > > > > > > > > > Here are my 2 cents.
> > > > > > > > > >
> > > > > > > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > > > > > > >
> > > > > > > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > > > > > >
> > > > > > > > I am aware of the on-going effort to implement support for this in
> > > > > > > > ovn-kubernetes directly. What we have identified is that there are
> > > > > > > > other CMSs that want this functionality, and with that we have an
> > > > > > > > opportunity to generalise and provide an abstraction in a common place
> > > > > > > > that all the consuming CMSes can benefit from.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > > > > > > >
> > > > > > > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > > > > > > cases these ports will require a lot of extra configuration which
> > > > > > > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > > > > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > > > > > > ports, where you have to specify the port type, DPDK port config,
> > > > > > > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > > > > > > queues.  You may also want to configure affinity for these queues per
> > > > > > > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > > > > > > also might require some extra configuration that OVN will have to
> > > > > > > > > think about now.  This is a typical job for CMS to configure this
> > > > > > > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > > > > > > of CNI/device plugins exists.
> > > > > > > >
> > > > > > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > > > > > > them get their information from the CMS. Providing support for
> > > > > > > > userspace datapath and DPDK would require more information, some of
> > > > > > > > which is available through devlink, some fit well in key/value
> > > > > > > > options. Our initial target would be to support the kernel representor
> > > > > > > > port workflow.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > There are some benefits:
> > > > > > > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > > > > > > >
> > > > > > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > > > > > or run management appctl commands if required.
> > > > > > > > > So, I don't see the point why it can't do plugging if it's already
> > > > > > > > > there.
> > > > > > > >
> > > > > > > > This pattern is Kubernetes specific and is not the case for other
> > > > > > > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > > > > > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > > > > > > provided means for coordinating more of the network related bits.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > > > > > > >
> > > > > > > > > Aside from plugging,
> > > > > > > > > I also don't see the reason to have devlink code in OVN just
> > > > > > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > > > > > understand why this information about the hardware cannot be
> > > > > > > > > retrieved during the host provisioning and stored somewhere
> > > > > > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > > > > > or something?
> > > > > > > >
> > > > > > > > As noted in one of the TODOs in the commit message of the RFC one of
> > > > > > > > the work items to further develop this is to monitor devlink for
> > > > > > > > changes, the end product will not do one-time initialization.
> > > > > > > >
> > > > > > > > While I agree with you that the current SR-IOV acceleration workflow
> > > > > > > > configuration is pretty static and can be done at deploy time, this
> > > > > > > > proposal prepares for the next generation subfunction workflow where
> > > > > > > > you will have a much higher density, and run-time configuration and
> > > > > > > > discovery of representor ports. There is a paper about it from netdev
> > > > > > > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > > > > > > kernel ABI).
> > > > > > > >
> > > > > > > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> > > > > >
> > > > > > > My 2 cents.
> > > > > >
> > > > > > Thank you for weighing in with your opinion, Numan.
> > > > > >
> > > > > > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > > > > > OVS ports and also
> > > > > > > to use devlink infrastructure.  I think an external entity can do all these.
> > > > > >
> > > > > > No doubt an external entity could do this, the challenge is that due
> > > > > > to how the SmartNIC with separate control plane CPUs are laid out
> > > > > > existing CMS tooling cannot do it, and we are working out a way to
> > > > > > efficiently and securely coordinate how and when this external entity
> > > > > > does the plugging. At the end of the day the data being passed around
> > > > > > is networking related, and I believe it is possible to create a
> > > > > > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > > > > > happy to prove this out outside of the ovn-controller first if that is
> > > > > > more palatable, see below.
> > > > > >
> > > > > > > The other downside I see with this RFC is that it will be now
> > > > > > > monitoring all the port binding
> > > > > > > rows without any condition because when a port binding is created the
> > > > > > > chassis column will
> > > > > > > be NULL.  Correct me If I'm wrong here.
> > > > > >
> > > > > > The intention of the condition is to consider unbound ports for
> > > > > > plugging, as soon as they have been plugged and bound there will be a
> > > > > > value in the chassis column and at that point only the respective
> > > > > > chassis will monitor them. I guess this could be a problem if there is
> > > > > > a large number of unbound ports in a deployment, from your experience
> > > > > > would that be the case? (When I think of it I do remember the load
> > > > > > balancer vip perhaps being a consumer of unbound ports).
> > > > > >
> > > > > > We considered using a separate LSP type, but apart from the plugging
> > > > > > bit these LSP's will behave just like regular VIFs from OVN Port
> > > > > > Binding code POV, and if all your hypervisors have SmartNICs with
> > > > > > control plane CPUs there would be just as many of those ports.
> >
> >
> > > > > In my opinion having a separate type would be better.  At the least it
> > > > > will not impact
> > > > > normal or existing deployments.
> > > > >
> > > > > Let's say we call a type as "representor"  (just picking a name for
> > > > > now), then each ovn-controller
> > > > > can add a conditional monitor to get updates for these types (just
> > > > > like how we do now for patch or external ports).
> > > > >
>
>
> > > > > What do you think of the below approach
> > > > >
> > > > > 1.  A new LSP type - representor is added.  CMS when creating this
> > > > > lport also sets the requested-chassis option.
> > > >
> > > > Works for me.
> > > >
> > > I think it is better not to add a new LSP type for this purpose. It would be better to make this "interface plugging" feature generic, so that it can work for other types as well. It could be supported in the future for a regular VIF, or for a new LSP type, so I think this behavior is better not to be tied to a LSP type.
> > >
> > > We could add a new column in the Port_Binding table: plugged_by. ovn-controller can monitor all port_bindings that have "plugged_by" equals to current chassis. This way it should be backward compatible and not affecting the use cases that doesn't need "plugged_by".
> >
> > That is a good idea, during development of the RFC I did make an
> > attempt at creating an IDL index for the options:requested-chassis,
> > but that does not work because it would only match rows with
> > requested-chassis as the *only* option.
> >
> > Creating a new column would resolve this issue!
>
> Looking into how this bit could be implemented, I see that northd is
> already performing lookup of chassis by name in several places, so I'm
> thinking we could accomplish this without any change to NB schema.
>
> With the following change to SB schema, we could have northd check for
> presence of Logical_Switch_Port options:plug_type and
> options:requested-chassis, if both are there look up chassis UUID by
> name in options:requested-chassis and insert into SB Port_Binding
> plugged_by column.
>
> diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
> index bbf60781d..0dee8e155 100644
> --- a/ovn-sb.ovsschema
> +++ b/ovn-sb.ovsschema
> @@ -229,7 +229,11 @@
>                  "external_ids": {"type": {"key": "string",
>                                   "value": "string",
>                                   "min": 0,
> -                                 "max": "unlimited"}}},
> +                                 "max": "unlimited"}},
> +                "plugged_by": {"type": {"key": {"type": "uuid",
> +                                                "refTable": "Chassis",
> +                                                "refType": "weak"},
> +                                        "min": 0, "max": 1}}},
>              "indexes": [["datapath", "tunnel_key"], ["logical_port"]],
>              "isRoot": true},
>          "MAC_Binding": {
>
> What do you think?
>
> --
> Frode Nordahl
>
> > > The details of plugging information can be stored in the options column with new options defined, such as: "plug_type". For representor port the "plug_type" can be defined as "representor". For each plug_type there can be type-specific information defined and also stored in the options column. (Or we can define a new column "plug_options" just for plugging information).
> >
> > A convention to prefix all plugging related options the same way,
> > let's say: 'plug_', would work as well as a separate column, but I
> > have nothing against a separate column if there is appetite for that.
> >
> > > > > 2. ovn-controller on the requested chassis sees this lport type,
> > > > > creates an OVS port/interface and possibly sets some options or hints
> > > > > in the external_ids column
> > > > >  of ovs interface.  At this point, it will not do anything.
> > > >
> > > > This could potentially help us avoid the issue of adding more
> > > > processes and IDLs talking to the SB DB.
> > > >
> > > > However, at this point in time, the ovn-controller does not know the
> > > > details of the representor port, such as name/ifindex for the kernel
> > > > representor port, and the interface it creates in OVS has to match the
> > > > kernel representor port, and afict it is not possible to change these
> > > > details after port creation. This information is provided by the
> > > > devlink lookup in the current proposal.
> > > >
> > > > It might work if it was possible to create the port and interface
> > > > separately, but that would cause a referential integrity violation.
> > > > Another alternative could be that ovn-vif created a new interface
> > > > record in 4), attaching that to the port record created by
> > > > ovn-controller replacing a throw-away interface record, but we would
> > > > be approaching hacky territory.
> > > >
> > > > Could we do some sort of local IPC perhaps through a Unix domain socket?
> > >
> > > I think it would be good to have separate *modules* to handle interface plugging (one module per plug_type), but there is no need to have a separate *process* for it, which increases the complexity and the IPC cost. The modules can be invoked as callbacks registered for each plug_type, and responsible for explaining the options defined for its type, which is directly retrieved from the port_binding record from SB DB. This way the core ovn-controller components are kept untouched (except calling the plugging modules at some point, but not worrying about the details), without introducing a separate process. (A separate process is needed only if there are performance concerns or blocking I/Os).
> > >
> > > It is similar to how different dpif providers (or different dev types) are supported in OVS.
> >
> > Yes, I like this. The currently proposed devlink informed plugging
> > would have no issues with blocking I/O, and we could document a
> > "charter" for the ovn-vif project that anything going into it would
> > have to operate as if it was native OVN asynchronous code.
> >

This bit could possibly be implemented with an interface like this
(note that this is draft/pseudocode at this point):
diff --git a/controller/plug-provider.h b/controller/plug-provider.h
new file mode 100644
index 000000000..9b7a2e722
--- /dev/null
+++ b/controller/plug-provider.h
@@ -0,0 +1,45 @@
+struct plug {
+    const struct plug_class *plug_class;
+};
+
+struct plug_iface {
+    char *iface_name;
+    struct smap iface_external_ids;
+    struct smap iface_options;
+    struct smap iface_other_config;
+};
+
+struct plug_class {
+    /* Type of plugger in this class. */
+    const char *type;
+
+    /* Called when the plug provider is registered, typically at program
+     * startup. */
+    int (*init)(void);
+
+    /* Performs periodic work needed by plugger, if any is necessary. */
+    bool (*run)(struct plug *plug);
+
+    /* Called when the plug provider is removed, typically at program exit. */
+    int (*destroy)(struct plug *plug);
+
+    /* Pass lport name and options to plugger to prepare for port creation.
+     *
+     * The plug implementation can perform lookups or any per-port
+     * initialization and should fill plug_iface with the data required
+     * for port/interface creation.
+     *
+     * OVN owns the contents of the plug_iface struct and is responsible
+     * for freeing any data allocated when done with it. */
+    bool (*plug_prepare)(const char *lport_name,
+                         const struct smap *lport_options,
+                         struct plug_iface *);
+
+    /* Notify plugger that port was created. */
+    bool (*plug_port_created)(const char *lport_name,
+                              const struct smap *lport_options);
+
+    /* Notify plugger that port was removed. */
+    bool (*plug_port_removed)(const char *lport_name,
+                              const struct smap *lport_options);
+};

With that, the callbacks could be enabled and the infrastructure tested
in core OVN, while the platform-dependent bits could be hosted
externally and enabled by the plug implementation calling a register
function, similar to the OVS dpif framework.
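
To make the register function idea concrete, it could be modeled on how
OVS registers dpif providers. A minimal sketch, with hypothetical names
(plug_register_provider() and plug_representor_class are made up for
the example):

#include <errno.h>
#include <string.h>

#include "openvswitch/util.h"
#include "plug-provider.h"

static const struct plug_class *plug_classes[8];
static size_t n_plug_classes;

int
plug_register_provider(const struct plug_class *new_class)
{
    /* Refuse duplicate registrations for the same plug type. */
    for (size_t i = 0; i < n_plug_classes; i++) {
        if (!strcmp(plug_classes[i]->type, new_class->type)) {
            return EEXIST;
        }
    }
    if (n_plug_classes >= ARRAY_SIZE(plug_classes)) {
        return ENOSPC;
    }

    int error = new_class->init ? new_class->init() : 0;
    if (!error) {
        plug_classes[n_plug_classes++] = new_class;
    }
    return error;
}

An externally hosted devlink informed implementation would then call
plug_register_provider(&plug_representor_class) at startup, and
ovn-controller would pick the class to invoke based on the Port_Binding
options:plug_type value.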
Frode Nordahl July 13, 2021, 8:47 a.m. UTC | #14
On Fri, Jul 9, 2021 at 9:04 AM Frode Nordahl
<frode.nordahl@canonical.com> wrote:
>
> On Thu, Jul 8, 2021 at 11:58 AM Frode Nordahl
> <frode.nordahl@canonical.com> wrote:
> >
> > On Wed, Jul 7, 2021 at 8:30 AM Frode Nordahl
> > <frode.nordahl@canonical.com> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 8:11 PM Han Zhou <zhouhan@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jul 6, 2021 at 1:19 AM Frode Nordahl <frode.nordahl@canonical.com> wrote:
> > > > >
> > > > > On Mon, Jul 5, 2021 at 6:57 PM Numan Siddique <numans@ovn.org> wrote:
> > > > > >
> > > > > > On Mon, Jul 5, 2021 at 12:12 PM Frode Nordahl
> > > > > > <frode.nordahl@canonical.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 30, 2021 at 12:32 AM Numan Siddique <numans@ovn.org> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jun 10, 2021 at 10:13 AM Frode Nordahl
> > > > > > > > <frode.nordahl@canonical.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jun 10, 2021 at 1:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On 6/10/21 8:36 AM, Han Zhou wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, May 13, 2021 at 9:25 AM Frode Nordahl <frode.nordahl@canonical.com <mailto:frode.nordahl@canonical.com>> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On Thu, May 13, 2021 at 5:12 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > On 5/9/21 4:03 PM, Frode Nordahl wrote:
> > > > > > > > > > >> > > Introduce plugging module that adds and removes ports on the
> > > > > > > > > > >> > > integration bridge, as directed by Port_Binding options.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Traditionally it has been the CMSs responsibility to create Virtual
> > > > > > > > > > >> > > Interfaces (VIFs) as part of instance (Container, Pod, Virtual
> > > > > > > > > > >> > > Machine etc.) life cycle, and subsequently manage plug/unplug
> > > > > > > > > > >> > > operations on the Open vSwitch integration bridge.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > With the advent of NICs connected to multiple distinct CPUs we can
> > > > > > > > > > >> > > have a topology where the instance runs on one host and Open
> > > > > > > > > > >> > > vSwitch and OVN runs on a different host, the smartnic CPU.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > The act of plugging and unplugging the representor port in Open
> > > > > > > > > > >> > > vSwitch running on the smartnic host CPU would be the same for
> > > > > > > > > > >> > > every smartnic variant (thanks to the devlink-port[0][1]
> > > > > > > > > > >> > > infrastructure) and every CMS (Kubernetes, LXD, OpenStack, etc.).
> > > > > > > > > > >> > > As such it is natural to extend OVN to provide this common
> > > > > > > > > > >> > > functionality through its CMS facing API.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Hi, Frode.  Thanks for putting this together, but it doesn't look
> > > > > > > > > > >> > natural to me.  OVN, AFAIK, never touched physical devices or
> > > > > > > > > > >> > interacted with the kernel directly.  This change introduces completely
> > > > > > > > > > >> > new functionality inside OVN.  With the same effect we can run a fully
> > > > > > > > > > >> > separate service on these smartnic CPUs that will do plugging
> > > > > > > > > > >> > and configuration job for CMS.  You may even make it independent
> > > > > > > > > > >> > from a particular CMS by creating a REST API for it or whatever.
> > > > > > > > > > >> > This will additionally allow using same service for non-OVN setups.
> > > > > > > > > > >>
> > > > > > > > > > >> Ilya,
> > > > > > > > > > >>
> > > > > > > > > > >> Thank you for taking the time to comment, much appreciated.
> > > > > > > > > > >>
> > > > > > > > > > >> Yes, this is new functionality, NICs with separate control plane CPUs
> > > > > > > > > > >> and isolation from the host are also new, so this is one proposal for
> > > > > > > > > > >> how we could go about to enable the use of them.
> > > > > > > > > > >>
> > > > > > > > > > >> The OVN controller does today get pretty close to the physical realm
> > > > > > > > > > >> by maintaining patch ports in Open vSwitch based on bridge mapping
> > > > > > > > > > >> configuration and presence of bridges to physical interfaces. It also
> > > > > > > > > > >> does react to events of physical interfaces being plugged into the
> > > > > > > > > > >> Open vSwitch instance it manages, albeit to date some other entity has
> > > > > > > > > > >> been doing the act of adding the port into the bridge.
> > > > > > > > > > >>
> > > > > > > > > > >> The rationale for proposing to use the OVN database for coordinating
> > > > > > > > > > >> this is that the information about which ports to bind, and where to
> > > > > > > > > > >> bind them is already there. The timing of the information flow from
> > > > > > > > > > >> the CMS is also suitable for the task.
> > > > > > > > > > >>
> > > > > > > > > > >> OVN relies on OVS library code, and all the necessary libraries for
> > > > > > > > > > >> interfacing with the kernel through netlink and friends are there or
> > > > > > > > > > >> would be easy to add. The rationale for using the netlink-devlink
> > > > > > > > > > >> interface is that it provides a generic infrastructure for these types
> > > > > > > > > > >> of NICs. So by using this interface we should be able to support most
> > > > > > > > > > >> if not all of the variants of these cards.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> Providing a separate OVN service to do the task could work, but would
> > > > > > > > > > >> have the cost of an extra SB DB connection, IDL and monitors.
> > > > > > > > > >
> > > > > > > > > > IMHO, CMS should never connect to Southbound DB.  It's just because the
> > > > > > > > > > Southbound DB is not meant to be a public interface, it just happened
> > > > > > > > > > to be available for connections.  I know that OpenStack has metadata
> > > > > > > > > > agents that connects to Sb DB, but if it's really required for them, I
> > > > > > > > > > think, there should be a different way to get/set required information
> > > > > > > > > > without connection to the Southbound.
> > > > > > > > >
> > > > > > > > > The CMS-facing API is the Northbound DB, I was not suggesting direct
> > > > > > > > > use of the Southbound DB by external to OVN services. My suggestion
> > > > > > > > > was to have a separate OVN process do this if your objection was to
> > > > > > > > > handle it as part of the ovn-controller process.
> > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >> I fear it would be quite hard to build a whole separate project with
> > > > > > > > > > >> its own API, feels like a lot of duplicated effort when the flow of
> > > > > > > > > > >> data and APIs in OVN already align so well with CMSs interested in
> > > > > > > > > > >> using this?
> > > > > > > > > > >>
> > > > > > > > > > >> > Interactions with physical devices also makes OVN linux-dependent
> > > > > > > > > > >> > at least for this use case, IIUC.
> > > > > > > > > > >>
> > > > > > > > > > >> This specific bit would be linux-specific in the first iteration, yes.
> > > > > > > > > > >> But the vendors manufacturing and distributing the hardware do often
> > > > > > > > > > >> have drivers for other platforms, I am sure the necessary
> > > > > > > > > > >> infrastructure will become available there too over time, if it is not
> > > > > > > > > > >> there already.
> > > > > > > > > > >>
> > > > > > > > > > >> We do currently have platform specific macros in the OVN build system,
> > > > > > > > > > >> so we could enable the functionality when built on a compatible
> > > > > > > > > > >> platform.
> > > > > > > > > > >>
> > > > > > > > > > >> > Maybe, others has different opinions.
> > > > > > > > > > >>
> > > > > > > > > > >> I appreciate your opinion, and enjoy discussing this topic.
> > > > > > > > > > >>
> > > > > > > > > > >> > Another though is that there is, obviously, a network connection
> > > > > > > > > > >> > between the host and smartnic system.  Maybe it's possible to just
> > > > > > > > > > >> > add an extra remote to the local ovsdb-server so CMS daemon on the
> > > > > > > > > > >> > host system could just add interfaces over the network connection?
> > > > > > > > > > >>
> > > > > > > > > > >> There are a few issues with such an approach. One of the main goals
> > > > > > > > > > >> with providing and using a NIC with control plane CPUs is having an
> > > > > > > > > > >> extra layer of security and isolation which is separate from the
> > > > > > > > > > >> hypervisor host the card happens to share a PCI complex with and draw
> > > > > > > > > > >> power from. Requiring a connection between the two for operation would
> > > > > > > > > > >> defy this purpose.
> > > > > > > > > > >>
> > > > > > > > > > >> In addition to that, this class of cards provide visibility into
> > > > > > > > > > >> kernel interfaces, enumeration of representor ports etc. only from the
> > > > > > > > > > >> NIC control plane CPU side of the PCI complex, this information is not
> > > > > > > > > > >> provided to the host. So if a hypervisor host CMS agent were to do the
> > > > > > > > > > >> plugging through a remote ovsdb connection, it would have to
> > > > > > > > > > >> communicate with something else running on the NIC control plane CPU
> > > > > > > > > > >> to retrieve the information it needs before it can know what to relay
> > > > > > > > > > >> back over the ovsdb connection.
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> Frode Nordahl
> > > > > > > > > > >>
> > > > > > > > > > >> > Best regards, Ilya Maximets.
> > > > > > > > > > >
> > > > > > > > > > > Here are my 2 cents.
> > > > > > > > > > >
> > > > > > > > > > > Initially I had similar concerns to Ilya, and it seems OVN should stay away from the physical interface plugging. As a reference, here is how ovn-kubernetes is doing it without adding anything to OVN: https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing <https://docs.google.com/document/d/11IoMKiohK7hIyIE36FJmwJv46DEBx52a4fqvrpCBBcg/edit?usp=sharing>
> > > > > > > > > >
> > > > > > > > > > AFAICT, a big part of the work is already done on the ovn-k8s side:
> > > > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2005
> > > > > > > > > >   https://github.com/ovn-org/ovn-kubernetes/pull/2042
> > > > > > > > >
> > > > > > > > > I am aware of the on-going effort to implement support for this in
> > > > > > > > > ovn-kubernetes directly. What we have identified is that there are
> > > > > > > > > other CMSs that want this functionality, and with that we have an
> > > > > > > > > opportunity to generalise and provide an abstraction in a common place
> > > > > > > > > that all the consuming CMSes can benefit from.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > However, thinking more about it, the proposed approach in this patch just expands the way how OVN can bind ports, utilizing the communication channel of OVN (OVSDB connections). If all the information regarding port binding can be specified by the CMS from NB, then it is not unnatural for ovn-controller to perform interface binding directly (instead of passively accepting what is attached by CMS). This kind of information already existed to some extent - the "requested_chassis" option in OpenStack. Now it seems this idea is just expanding it to a specific interface. The difference is that "requested_chassis" is used for validation only, but now we want to directly apply it. So I think at least I don't have a strong opinion on the idea.
> > > > > > > > > >

> > > > > > > > > > While it's, probably, OK for OVN to add port to the OVSDB, in many
> > > > > > > > > > cases these ports will require a lot of extra configuration which
> > > > > > > > > > is typically done by os-vif or CNI/device plugins.  Imagine that OVS
> > > > > > > > > > is running with userspace datapath and you need to plug-in some DPDK
> > > > > > > > > > ports, where you have to specify the port type, DPDK port config,
> > > > > > > > > > porbably also number of rx/tx queues, number of descriptors in these
> > > > > > > > > > queues.  You may also want to configure affinity for these queues per
> > > > > > > > > > PMD thread in OVS.  For kernel interfaces it might be easier, but they
> > > > > > > > > > also might require some extra configuration that OVN will have to
> > > > > > > > > > think about now.  This is a typical job for CMS to configure this
> > > > > > > > > > kind of stuff, and that is why projects like os-vif or large variety
> > > > > > > > > > of CNI/device plugins exists.
> > > > > > > > >
> > > > > > > > > CNI is Kubernetes specific, os-vif is OpenStack specific. And both of
> > > > > > > > > them get their information from the CMS. Providing support for
> > > > > > > > > userspace datapath and DPDK would require more information, some of
> > > > > > > > > which is available through devlink, some fit well in key/value
> > > > > > > > > options. Our initial target would be to support the kernel representor
> > > > > > > > > port workflow.

Circling back to the DPDK argument, I had a look at what this would
look like in practice. For the DPDK representor port workflow, the
information required from the CMS would be which PF to use and a
logical VF number. In the current proposal the CMS provides the MAC
address of the PF and a VF number; with that information the plugging
library can look up the PCI address and fill in `options:dpdk-devargs`,
for example '0000:03:00.0,representor=[0]'.
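
To illustrate, building the devargs string from the looked-up PCI
address and the CMS-provided logical VF number could be as simple as
the sketch below (the function name is made up, and pci_address is
whatever the devlink lookup returned for the PF identified by its MAC
address):

/* Sketch: produce a "0000:03:00.0,representor=[0]" style string. */
static char *
plug_representor_dpdk_devargs(const char *pci_address, int vf_num)
{
    return xasprintf("%s,representor=[%d]", pci_address, vf_num);
}

The result would go into the Interface record's options:dpdk-devargs,
together with type=dpdk, when the port is created.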

When OVN is running on a SmartNIC, there will be no vHost sockets, as
the workload would be running on a different host, at the other end of
the host-facing PF+VF pair.

As for other tuning parameters such as queues, descriptors, and queue
affinity, it appears to me that CMSs are generally unaware of this
tuning today and that blanket configuration is applied through other
means. When OVN is running on a SmartNIC, any queue affinity options
would be relative to the control plane CPU cores and would have no
relation to which host CPU cores the workload runs on at the other end
of the host-facing PF+VF pair. We could replicate this behavior by
giving the plugging library its own defaults, possibly influenced by
local per-chassis configuration, and keep this out of the CMS API.

Does this address your concerns regarding DPDK support, or do you have
other examples of configuration we would need to consider?

> > > > > > > > > > >
> > > > > > > > > > > There are some benefits:
> > > > > > > > > > > 1) The mechanism can be reused by different CMSes, which may simplify CMS implementation.
> > > > > > > > > > > 2) Compared with the ovn-k8s approach, it reuses OVN's communication channel, which avoids an extra CMS communication channel on the smart NIC side. (of course this can be achieved by a connection between the BM and smart NIC with *restricted* API just to convey the necessary information)
> > > > > > > > > >
> > > > > > > > > > The problem I see is that at least ovn-k8s, AFAIU, will require
> > > > > > > > > > the daemon on the Smart NIC anyways to monitor and configure
> > > > > > > > > > OVN components, e.g. to configure ovn-remote for ovn-controller
> > > > > > > > > > or run management appctl commands if required.
> > > > > > > > > > So, I don't see the point why it can't do plugging if it's already
> > > > > > > > > > there.
> > > > > > > > >
> > > > > > > > > This pattern is Kubernetes specific and is not the case for other
> > > > > > > > > CMSes. The current proposal for enabling Smart NIC with control plane
> > > > > > > > > CPUs for ovn-kubernetes could be simplified if the networking platform
> > > > > > > > > provided means for coordinating more of the network related bits.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > As to the negative side, it would increase OVN's complexity, and as mentioned by Ilya potentially breaks OVN's platform independence. To avoid this, I think the *plugging* module itself needs to be independent and pluggable. It can be extended as independent plugins. The plugin would need to define what information is needed in LSP's "options", and then implement corresponding drivers. With this approach, even the regular VIFs can be attached by ovn-controller if CMS can tell the interface name. Anyway, this is just my brief thinking.
> > > > > > > > > >
> > > > > > > > > > Aside from plugging,
> > > > > > > > > > I also don't see the reason to have devlink code in OVN just
> > > > > > > > > > because it runs once on the init stage, IIUC.  So, I don't
> > > > > > > > > > understand why this information about the hardware cannot be
> > > > > > > > > > retrieved during the host provisioning and stored somewhere
> > > > > > > > > > (Nova config?).  Isn't hardware inventory a job for tripleo
> > > > > > > > > > or something?
> > > > > > > > >
> > > > > > > > > As noted in one of the TODOs in the commit message of the RFC one of
> > > > > > > > > the work items to further develop this is to monitor devlink for
> > > > > > > > > changes, the end product will not do one-time initialization.
> > > > > > > > >
> > > > > > > > > While I agree with you that the current SR-IOV acceleration workflow
> > > > > > > > > configuration is pretty static and can be done at deploy time, this
> > > > > > > > > proposal prepares for the next generation subfunction workflow where
> > > > > > > > > you will have a much higher density, and run-time configuration and
> > > > > > > > > discovery of representor ports. There is a paper about it from netdev
> > > > > > > > > conf[2], and all of this is part of the devlink infrastructure (Linux
> > > > > > > > > kernel ABI).
> > > > > > > > >
> > > > > > > > > 2: https://netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
> > > > > > >
> > > > > > > > My 2 cents.
> > > > > > >
> > > > > > > Thank you for weighing in with your opinion, Numan.
> > > > > > >
> > > > > > > > I agree with Ilya.  It doesn't seem natural to me for OVN to create
> > > > > > > > OVS ports and also
> > > > > > > > to use devlink infrastructure.  I think an external entity can do all these.
> > > > > > >
> > > > > > > No doubt an external entity could do this, the challenge is that due
> > > > > > > to how the SmartNIC with separate control plane CPUs are laid out
> > > > > > > existing CMS tooling cannot do it, and we are working out a way to
> > > > > > > efficiently and securely coordinate how and when this external entity
> > > > > > > does the plugging. At the end of the day the data being passed around
> > > > > > > is networking related, and I believe it is possible to create a
> > > > > > > solution that would be consumable by all OVN-enabled CMS's. We'd be
> > > > > > > happy to prove this out outside of the ovn-controller first if that is
> > > > > > > more palatable, see below.
> > > > > > >
> > > > > > > > The other downside I see with this RFC is that it will be now
> > > > > > > > monitoring all the port binding
> > > > > > > > rows without any condition because when a port binding is created the
> > > > > > > > chassis column will
> > > > > > > > be NULL.  Correct me If I'm wrong here.
> > > > > > >
> > > > > > > The intention of the condition is to consider unbound ports for
> > > > > > > plugging, as soon as they have been plugged and bound there will be a
> > > > > > > value in the chassis column and at that point only the respective
> > > > > > > chassis will monitor them. I guess this could be a problem if there is
> > > > > > > a large number of unbound ports in a deployment, from your experience
> > > > > > > would that be the case? (When I think of it I do remember the load
> > > > > > > balancer vip perhaps being a consumer of unbound ports).
> > > > > > >
> > > > > > > We considered using a separate LSP type, but apart from the plugging
> > > > > > > bit these LSP's will behave just like regular VIFs from OVN Port
> > > > > > > Binding code POV, and if all your hypervisors have SmartNICs with
> > > > > > > control plane CPUs there would be just as many of those ports.
> > >
> > >
> > > > > > In my opinion having a separate type would be better.  At the least it
> > > > > > will not impact
> > > > > > normal or existing deployments.
> > > > > >
> > > > > > Let's say we call a type as "representor"  (just picking a name for
> > > > > > now), then each ovn-controller
> > > > > > can add a conditional monitor to get updates for these types (just
> > > > > > like how we do now for patch or external ports).
> > > > > >
> >
> >
> > > > > > What do you think of the below approach
> > > > > >
> > > > > > 1.  A new LSP type - representor is added.  CMS when creating this
> > > > > > lport also sets the requested-chassis option.
> > > > >
> > > > > Works for me.
> > > > >
> > > > I think it is better not to add a new LSP type for this purpose. It would be better to make this "interface plugging" feature generic, so that it can work for other types as well. It could be supported in the future for a regular VIF, or for a new LSP type, so I think this behavior is better not to be tied to a LSP type.
> > > >
> > > > We could add a new column in the Port_Binding table: plugged_by. ovn-controller can monitor all port_bindings that have "plugged_by" equals to current chassis. This way it should be backward compatible and not affecting the use cases that doesn't need "plugged_by".
> > >
> > > That is a good idea, during development of the RFC I did make an
> > > attempt at creating an IDL index for the options:requested-chassis,
> > > but that does not work because it would only match rows with
> > > requested-chassis as the *only* option.
> > >
> > > Creating a new column would resolve this issue!
> >
> > Looking into how this bit could be implemented, I see that northd is
> > already performing lookups of chassis by name in several places, so I'm
> > thinking we could accomplish this without any change to the NB schema.
> >
> > With the following change to the SB schema, we could have northd check
> > for the presence of Logical_Switch_Port options:plug_type and
> > options:requested-chassis; if both are present, look up the chassis UUID
> > by name in options:requested-chassis and insert it into the SB
> > Port_Binding plugged_by column.
> >
> > diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
> > index bbf60781d..0dee8e155 100644
> > --- a/ovn-sb.ovsschema
> > +++ b/ovn-sb.ovsschema
> > @@ -229,7 +229,11 @@
> >                  "external_ids": {"type": {"key": "string",
> >                                   "value": "string",
> >                                   "min": 0,
> > -                                 "max": "unlimited"}}},
> > +                                 "max": "unlimited"}},
> > +                "plugged_by": {"type": {"key": {"type": "uuid",
> > +                                                "refTable": "Chassis",
> > +                                                "refType": "weak"},
> > +                                        "min": 0, "max": 1}}},
> >              "indexes": [["datapath", "tunnel_key"], ["logical_port"]],
> >              "isRoot": true},
> >          "MAC_Binding": {
> >
> > What do you think?
> >
> > --
> > Frode Nordahl
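For illustration: if the plugged_by column sketched above were added,
ovn-controller could narrow its Port_Binding monitor with a condition clause
analogous to the existing chassis clause in update_sb_monitors().  This is a
sketch only; sbrec_port_binding_add_clause_plugged_by() is the helper name the
IDL would generate for the hypothetical column and is not part of any posted
patch.

/* Sketch: only monitor Port_Binding rows this chassis has been asked to
 * plug, mirroring how the chassis column is already used for conditional
 * monitoring. */
static void
add_plugged_by_monitor_clause(struct ovsdb_idl_condition *pb,
                              const struct sbrec_chassis *chassis)
{
    if (chassis) {
        sbrec_port_binding_add_clause_plugged_by(pb, OVSDB_F_EQ,
                                                 &chassis->header_.uuid);
    }
}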
> >
> > > > The details of the plugging information can be stored in the options column with new options defined, such as "plug_type". For a representor port the "plug_type" can be defined as "representor". For each plug_type there can be type-specific information defined and also stored in the options column. (Or we can define a new column "plug_options" just for plugging information.)
> > >
> > > A convention of prefixing all plugging-related options the same way,
> > > let's say with 'plug_', would work as well as a separate column, but I
> > > have nothing against a separate column if there is appetite for that.
> > >
> > > > > > 2. ovn-controller on the requested chassis sees this lport type,
> > > > > > creates an OVS port/interface and possibly sets some options or hints
> > > > > > in the external_ids column of the OVS interface.  At this point, it
> > > > > > will not do anything.
> > > > >
> > > > > This could potentially help us avoid the issue of adding more
> > > > > processes and IDLs talking to the SB DB.
> > > > >
> > > > > However, at this point in time ovn-controller does not know the
> > > > > details of the representor port, such as the name/ifindex of the kernel
> > > > > representor port. The interface it creates in OVS has to match the
> > > > > kernel representor port, and afaict it is not possible to change these
> > > > > details after port creation. This information is provided by the
> > > > > devlink lookup in the current proposal.
> > > > >
> > > > > It might work if it were possible to create the port and interface
> > > > > separately, but that would cause a referential integrity violation.
> > > > > Another alternative could be for ovn-vif to create a new interface
> > > > > record in 4), attaching it to the port record created by
> > > > > ovn-controller and replacing a throw-away interface record, but we
> > > > > would be approaching hacky territory.
> > > > >
> > > > > Could we do some sort of local IPC, perhaps through a Unix domain socket?
> > > >
> > > > I think it would be good to have separate *modules* to handle interface plugging (one module per plug_type), but there is no need to have a separate *process* for it, which would increase complexity and IPC cost. The modules can be invoked as callbacks registered for each plug_type, each responsible for interpreting the options defined for its type, which are retrieved directly from the Port_Binding record in the SB DB. This way the core ovn-controller components are kept untouched (except for calling the plugging modules at some point, without worrying about the details), and no separate process is introduced. (A separate process is needed only if there are performance concerns or blocking I/O.)
> > > >
> > > > It is similar to how different dpif providers (or different dev types) are supported in OVS.
> > >
> > > Yes, I like this. The currently proposed devlink-informed plugging
> > > would have no issues with blocking I/O, and we could document a
> > > "charter" for the ovn-vif project that anything going into it would
> > > have to operate as if it were native OVN asynchronous code.
> > >
>
> This bit could possibly be implemented with an interface like this
> (note that this is draft/pseudocode at this point):
> diff --git a/controller/plug-provider.h b/controller/plug-provider.h
> new file mode 100644
> index 000000000..9b7a2e722
> --- /dev/null
> +++ b/controller/plug-provider.h
> @@ -0,0 +1,45 @@
> +struct plug {
> +    const struct plug_class *plug_class;
> +};
> +
> +struct plug_iface {
> +    char *iface_name;
> +    struct smap iface_external_ids;
> +    struct smap iface_options;
> +    struct smap iface_other_config;
> +};
> +
> +struct plug_class {
> +    /* Type of plugger in this class. */
> +    const char *type;
> +
> +    /* Called when the plug provider is registered, typically at program
> +     * startup. */
> +    int (*init)(void);
> +
> +    /* Performs periodic work needed by plugger, if any is necessary. */
> +    bool (*run)(struct plug *plug);
> +
> +    /* Called when the plug provider is removed, typically at program exit. */
> +    int (*destroy)(struct plug *plug);
> +
> +    /* Pass lport name and options to plugger to prepare for port creation.
> +     *
> +     * The plug implementation can perform a lookup or any per-port
> +     * initialization and should fill plug_iface with data required for
> +     * port/interface creation.
> +     *
> +     * OVN owns the contents of the plug_iface struct and is responsible for
> +     * freeing any data allocated when done with it. */
> +    bool (*plug_prepare)(const char *lport_name,
> +                         const struct smap *lport_options,
> +                         struct plug_iface *);
> +
> +    /* Notify plugger that port was created. */
> +    bool (*plug_port_created)(const char *lport_name,
> +                              const struct smap *lport_options);
> +
> +    /* Notify plugger that port was removed. */
> +    bool (*plug_port_removed)(const char *lport_name,
> +                              const struct smap *lport_options);
> +};
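For illustration, the per-plug_type callback registration discussed above
could follow the same pattern OVS uses for dpif/netdev providers.  A rough
sketch, where plug_register_provider() and plug_lookup_provider() are
hypothetical names and not part of the draft interface:

/* Sketch: registry of plug classes keyed by plug_type. */
#include <errno.h>
#include "openvswitch/shash.h"
#include "openvswitch/util.h"

static struct shash plug_classes = SHASH_INITIALIZER(&plug_classes);

static int
plug_register_provider(const struct plug_class *pc)
{
    if (shash_find(&plug_classes, pc->type)) {
        return EEXIST;  /* Each plug_type may only be registered once. */
    }
    shash_add(&plug_classes, pc->type, CONST_CAST(struct plug_class *, pc));
    return pc->init ? pc->init() : 0;
}

static const struct plug_class *
plug_lookup_provider(const char *plug_type)
{
    return shash_find_data(&plug_classes, plug_type);
}

ovn-controller would then resolve the class with
plug_lookup_provider(smap_get(&pb->options, "plug_type")) while processing a
Port_Binding, keeping the core binding code unaware of type-specific details.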

We should probably have a struct to keep the values the plugging
library needs for its processing, instead of passing multiple arguments;
something like:

struct plug_port_ctx {
    bool dpdk_initialized; /* from local Open_vSwitch table */
    const char *datapath_type; /* from local integration bridge */
    /* alternatively the above two could be collapsed into a single
     * dpdk_enabled / use_dpdk bool */
    const char *lport_name;
    const struct smap *lport_options;
};
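For concreteness, a rough sketch of how a representor-type plug_prepare()
might look if its signature were revised to take the plug_port_ctx above.
devlink_ports_lookup() is a hypothetical stand-in for the devlink-port lookup
done in plugging.c; the option and external-ids names mirror those already
used there:

/* Sketch: fill a plug_iface for the representor matching the lport options.
 * The caller (ovn-controller) owns and frees the plug_iface contents. */
static bool
representor_plug_prepare(const struct plug_port_ctx *ctx,
                         struct plug_iface *iface)
{
    /* Hypothetical helper: map the pf-mac (and optional vf-num) options to
     * the representor netdev name, as plugging.c does today via devlink. */
    const char *rep_port = devlink_ports_lookup(
        smap_get(ctx->lport_options, "pf-mac"),
        smap_get(ctx->lport_options, "vf-num"));
    if (!rep_port) {
        return false;
    }

    iface->iface_name = xstrdup(rep_port);
    smap_init(&iface->iface_external_ids);
    smap_add(&iface->iface_external_ids, "ovn-plugged", "true");
    smap_add(&iface->iface_external_ids, "iface-id", ctx->lport_name);
    smap_init(&iface->iface_options);
    smap_init(&iface->iface_other_config);

    /* ctx->dpdk_initialized and ctx->datapath_type would let a plug class
     * select a userspace (DPDK) representor instead of the kernel netdev;
     * omitted in this sketch. */
    return true;
}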
Frode Nordahl July 22, 2021, 10:49 a.m. UTC | #15
As we discussed in the IRC meeting on Thu, July 15th, I'm working on a
new RFC patch series which incorporates the discussion so far, and I'll
propose it to the list when done.

For anyone following along I'll be staging it over here:
https://github.com/fnordahl/ovn/commits/pluggingv2
diff mbox series

Patch

diff --git a/controller/automake.mk b/controller/automake.mk
index e664f1980..04e5708ec 100644
--- a/controller/automake.mk
+++ b/controller/automake.mk
@@ -26,6 +26,8 @@  controller_ovn_controller_SOURCES = \
 	controller/pinctrl.h \
 	controller/patch.c \
 	controller/patch.h \
+	controller/plugging.c \
+	controller/plugging.h \
 	controller/ovn-controller.c \
 	controller/ovn-controller.h \
 	controller/physical.c \
diff --git a/controller/binding.c b/controller/binding.c
index 4ca2b4d9a..66fa65091 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1081,15 +1081,6 @@  is_binding_lport_this_chassis(struct binding_lport *b_lport,
             b_lport->pb->chassis == chassis);
 }
 
-static bool
-can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
-                         const char *requested_chassis)
-{
-    return !requested_chassis || !requested_chassis[0]
-           || !strcmp(requested_chassis, chassis_rec->name)
-           || !strcmp(requested_chassis, chassis_rec->hostname);
-}
-
 /* Returns 'true' if the 'lbinding' has binding lports of type LP_CONTAINER,
  * 'false' otherwise. */
 static bool
@@ -1186,8 +1177,8 @@  consider_vif_lport(const struct sbrec_port_binding *pb,
                    struct hmap *qos_map)
 {
     const char *vif_chassis = smap_get(&pb->options, "requested-chassis");
-    bool can_bind = can_bind_on_this_chassis(b_ctx_in->chassis_rec,
-                                             vif_chassis);
+    bool can_bind = lport_can_bind_on_this_chassis(b_ctx_in->chassis_rec,
+                                                   vif_chassis);
 
     if (!lbinding) {
         lbinding = local_binding_find(&b_ctx_out->lbinding_data->bindings,
@@ -1278,8 +1269,8 @@  consider_container_lport(const struct sbrec_port_binding *pb,
     ovs_assert(parent_b_lport && parent_b_lport->pb);
     const char *vif_chassis = smap_get(&parent_b_lport->pb->options,
                                        "requested-chassis");
-    bool can_bind = can_bind_on_this_chassis(b_ctx_in->chassis_rec,
-                                             vif_chassis);
+    bool can_bind = lport_can_bind_on_this_chassis(b_ctx_in->chassis_rec,
+                                                   vif_chassis);
 
     return consider_vif_lport_(pb, can_bind, vif_chassis, b_ctx_in, b_ctx_out,
                                container_b_lport, qos_map);
diff --git a/controller/lport.c b/controller/lport.c
index 478fcfd82..e97cc17c5 100644
--- a/controller/lport.c
+++ b/controller/lport.c
@@ -120,3 +120,12 @@  mcgroup_lookup_by_dp_name(
 
     return retval;
 }
+
+bool
+lport_can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
+                               const char *requested_chassis)
+{
+    return !requested_chassis || !requested_chassis[0]
+           || !strcmp(requested_chassis, chassis_rec->name)
+           || !strcmp(requested_chassis, chassis_rec->hostname);
+}
diff --git a/controller/lport.h b/controller/lport.h
index 345efc184..4d9b6891d 100644
--- a/controller/lport.h
+++ b/controller/lport.h
@@ -54,5 +54,7 @@  lport_is_chassis_resident(struct ovsdb_idl_index *sbrec_port_binding_by_name,
                           const struct sbrec_chassis *chassis,
                           const struct sset *active_tunnels,
                           const char *port_name);
+bool lport_can_bind_on_this_chassis(const struct sbrec_chassis *,
+                                    const char *);
 
 #endif /* controller/lport.h */
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 67c51a86f..93b6e3e11 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -54,6 +54,7 @@ 
 #include "patch.h"
 #include "physical.h"
 #include "pinctrl.h"
+#include "plugging.h"
 #include "openvswitch/poll-loop.h"
 #include "lib/bitmap.h"
 #include "lib/hash.h"
@@ -269,6 +270,11 @@  update_sb_monitors(struct ovsdb_idl *ovnsb_idl,
          */
         sbrec_logical_flow_add_clause_logical_dp_group(&lf, OVSDB_F_NE, NULL);
     }
+    /* Monitor unbound LP_VIF ports to consider representor port plugging */
+    struct uuid zero_uuid;
+    memset(&zero_uuid, 0, sizeof(zero_uuid));
+    sbrec_port_binding_add_clause_chassis(&pb, OVSDB_F_EQ, &zero_uuid);
+    sbrec_port_binding_add_clause_type(&pb, OVSDB_F_EQ, "\"\"");
 
 out:;
     unsigned int cond_seqnos[] = {
@@ -2572,6 +2578,7 @@  main(int argc, char *argv[])
     binding_init();
     patch_init();
     pinctrl_init();
+    plugging_init();
     lflow_init();
 
     /* Connect to OVS OVSDB instance. */
@@ -3050,6 +3057,10 @@  main(int argc, char *argv[])
                             ovsrec_open_vswitch_table_get(ovs_idl_loop.idl),
                             ovsrec_port_table_get(ovs_idl_loop.idl),
                             br_int, chassis, &runtime_data->local_datapaths);
+                        plugging_run(ovs_idl_txn,
+                            sbrec_port_binding_table_get(ovnsb_idl_loop.idl),
+                            ovsrec_port_table_get(ovs_idl_loop.idl),
+                            br_int, chassis);
                         pinctrl_run(ovnsb_idl_txn,
                                     sbrec_datapath_binding_by_key,
                                     sbrec_port_binding_by_datapath,
@@ -3268,6 +3279,7 @@  loop_done:
     ofctrl_destroy();
     pinctrl_destroy();
     patch_destroy();
+    plugging_destroy();
 
     ovsdb_idl_loop_destroy(&ovs_idl_loop);
     ovsdb_idl_loop_destroy(&ovnsb_idl_loop);
diff --git a/controller/plugging.c b/controller/plugging.c
new file mode 100644
index 000000000..8e62dfcb9
--- /dev/null
+++ b/controller/plugging.c
@@ -0,0 +1,422 @@ 
+/* Copyright (c) 2021 Canonical
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/devlink.h>
+#include <net/if.h>
+
+#include "plugging.h"
+
+#include "hash.h"
+#include "lflow.h"
+#include "lib/vswitch-idl.h"
+#include "lport.h"
+#include "openvswitch/hmap.h"
+#include "openvswitch/vlog.h"
+#include "lib/ovn-sb-idl.h"
+#include "netlink-devlink.h"
+#include "ovn-controller.h"
+#include "openvswitch/shash.h"
+#include "packets.h"
+
+VLOG_DEFINE_THIS_MODULE(plugging);
+
+/* Contains the netdev names of ports known to devlink, indexed by PF MAC
+ * address and logical function number (if applicable).
+ *
+ * Examples:
+ *     SR-IOV Physical Function: key "00:53:00:00:00:42"    value "pf0hpf"
+ *     SR-IOV Virtual Function:  key "00:53:00:00:00:42-42" value "pf0vf42"
+ */
+static struct shash devlink_ports;
+
+/* Max number of physical ports connected to a single NIC SoC. */
+#define MAX_NIC_PHY_PORTS 64
+/* string repr of eth MAC, '-', logical function number (uint32_t) */
+#define MAX_KEY_LEN (17+1+10+1)
+
+
+static bool compat_get_host_pf_mac(const char *, struct eth_addr *);
+
+static bool
+fill_devlink_ports_key_from_strs(char *buf, size_t bufsiz,
+                                const char *host_pf_mac,
+                                const char *function)
+{
+    return snprintf(buf, bufsiz,
+                    function != NULL ? "%s-%s": "%s",
+                    host_pf_mac, function) < bufsiz;
+}
+
+/* We deliberately pass the struct eth_addr by value as we would have to copy
+ * the data either way to make use of the ETH_ADDR_ARGS macro */
+static bool
+fill_devlink_ports_key_from_typed(char *buf, size_t bufsiz,
+                    struct eth_addr host_pf_mac,
+                    uint32_t function)
+{
+    return snprintf(
+        buf, bufsiz,
+        function < UINT32_MAX ? ETH_ADDR_FMT"-%"PRIu32 : ETH_ADDR_FMT,
+        ETH_ADDR_ARGS(host_pf_mac), function) < bufsiz;
+}
+
+static void
+devlink_port_add_function(struct dl_port *port_entry,
+                          struct eth_addr *host_pf_mac)
+{
+    char keybuf[MAX_KEY_LEN];
+    uint32_t function_number;
+
+    switch(port_entry->flavour) {
+    case DEVLINK_PORT_FLAVOUR_PCI_PF:
+        /* for Physical Function representor ports we only add the MAC address
+         * and no logical function number */
+        function_number = -1;
+        break;
+    case DEVLINK_PORT_FLAVOUR_PCI_VF:
+        function_number = port_entry->pci_vf_number;
+        break;
+    default:
+        VLOG_WARN("Unsupported flavour for port '%s': %s",
+            port_entry->netdev_name,
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PHYSICAL ? "PHYSICAL" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_CPU ? "CPU" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_DSA ? "DSA" :
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_PF ? "PCI_PF":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_VF ? "PCI_VF":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_VIRTUAL ? "VIRTUAL":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_UNUSED ? "UNUSED":
+            port_entry->flavour == DEVLINK_PORT_FLAVOUR_PCI_SF ? "PCI_SF":
+            "UNKNOWN");
+        return;
+    };
+    /* Failure to fill key from typed values means calculation of the max key
+     * length is wrong, i.e. a bug. */
+    ovs_assert(fill_devlink_ports_key_from_typed(
+                            keybuf, sizeof(keybuf),
+                            *host_pf_mac, function_number));
+    shash_add(&devlink_ports, keybuf, xstrdup(port_entry->netdev_name));
+}
+
+
+void
+plugging_init(void)
+{
+    struct nl_dl_dump_state *port_dump;
+    struct dl_port port_entry;
+    int error;
+    struct eth_addr host_pf_macs[MAX_NIC_PHY_PORTS];
+
+    shash_init(&devlink_ports);
+
+    port_dump = nl_dl_dump_init();
+    if ((error = nl_dl_dump_init_error(port_dump))) {
+        VLOG_WARN(
+            "unable to start dump of ports from devlink-port interface");
+        return;
+    }
+    /* The core devlink infrastructure in the kernel keeps a linked list of
+     * the devices and each of those has a linked list of ports. These are
+     * populated by each device driver as devices are enumerated, and as such
+     * we can rely on ports being dumped in a consistent order on a device
+     * by device basis with logical numbering for each port flavour starting
+     * on 0 for each new device.
+     */
+    nl_dl_dump_start(DEVLINK_CMD_PORT_GET, port_dump);
+    while (nl_dl_port_dump_next(port_dump, &port_entry)) {
+        switch (port_entry.flavour) {
+        case DEVLINK_PORT_FLAVOUR_PHYSICAL:
+            /* The PHYSICAL flavoured port represents a network facing port on
+             * the NIC.
+             *
+             * For kernel versions where the devlink-port infrastructure does
+             * not provide the MAC address for PCI_PF flavoured ports, there
+             * exists an interface in sysfs whose path is derived from the
+             * PHYSICAL port's netdev name.
+             *
+             * Since we at this point in the dump do not know if the MAC will
+             * be provided for the PCI_PF or not, proactively store the MAC
+             * address by looking up through the sysfs interface.
+             *
+             * If the MAC address is available once we get to the PCI_PF, we
+             * will overwrite the stored value.
+             */
+            if (port_entry.number >= MAX_NIC_PHY_PORTS) {
+                VLOG_WARN("physical port number out of range for port '%s': "
+                          "%"PRIu32,
+                          port_entry.netdev_name, port_entry.number);
+                continue;
+            }
+            compat_get_host_pf_mac(port_entry.netdev_name,
+                                   &host_pf_macs[port_entry.number]);
+            break;
+        case DEVLINK_PORT_FLAVOUR_PCI_PF: /* FALL THROUGH */
+            /* The PCI_PF flavoured port represents a host facing port.
+             *
+             * For function flavours other than PHYSICAL, pci_pf_number will
+             * be set to the logical number of the physical port to which the
+             * function belongs.
+             */
+            if (!eth_addr_is_zero(port_entry.function.eth_addr)) {
+                host_pf_macs[port_entry.pci_pf_number] =
+                    port_entry.function.eth_addr;
+            }
+            /* FALL THROUGH */
+        case DEVLINK_PORT_FLAVOUR_PCI_VF:
+            /* The PCI_VF flavoured port represents a host facing
+             * PCI Virtual Function.
+             *
+             * For function flavours other than PHYSICAL, pci_pf_number will
+             * be set to the logical number of the physical port to which the
+             * function belongs.
+             */
+            if (port_entry.pci_pf_number >= MAX_NIC_PHY_PORTS) {
+                VLOG_WARN("physical port number out of range for port '%s': "
+                          "%"PRIu32,
+                          port_entry.netdev_name, port_entry.pci_pf_number);
+                continue;
+            }
+            devlink_port_add_function(&port_entry,
+                                      &host_pf_macs[port_entry.pci_pf_number]);
+            break;
+        };
+    }
+    nl_dl_dump_finish(port_dump);
+    nl_dl_dump_destroy(port_dump);
+
+    struct shash_node *node;
+    SHASH_FOR_EACH (node, &devlink_ports) {
+        VLOG_INFO("HELLO %s -> %s", node->name, (char*)node->data);
+    }
+}
+
+void
+plugging_destroy(void)
+{
+    shash_destroy_free_data(&devlink_ports);
+}
+
+static bool
+match_port (const struct ovsrec_port *port, const char *name)
+{
+    return !name || !name[0]
+           || !strcmp(port->name, name);
+}
+
+/* Creates a port in bridge 'br_int' named 'name'.
+ *
+ * If such a port already exists, removes it from 'existing_ports'. */
+static void
+create_port(struct ovsdb_idl_txn *ovs_idl_txn,
+                  const char *iface_id,
+                  const struct ovsrec_bridge *br_int, const char *name,
+                  struct shash *existing_ports)
+{
+    for (size_t i = 0; i < br_int->n_ports; i++) {
+        if (match_port(br_int->ports[i], name)) {
+            VLOG_INFO("port already created: %s %s", iface_id, name);
+            shash_find_and_delete(existing_ports, br_int->ports[i]->name);
+            return;
+        }
+    }
+
+    ovsdb_idl_txn_add_comment(ovs_idl_txn,
+            "ovn-controller: plugging port '%s' into '%s'",
+            name, br_int->name);
+
+    struct ovsrec_interface *iface;
+    iface = ovsrec_interface_insert(ovs_idl_txn);
+    ovsrec_interface_set_name(iface, name);
+    const struct smap ids = SMAP_CONST2(
+        &ids,
+        "iface-id", iface_id,
+        "ovn-plugged", "true");
+    ovsrec_interface_set_external_ids(iface, &ids);
+
+    struct ovsrec_port *port;
+    port = ovsrec_port_insert(ovs_idl_txn);
+    ovsrec_port_set_name(port, name);
+    ovsrec_port_set_interfaces(port, &iface, 1);
+
+    struct ovsrec_port **ports;
+    ports = xmalloc(sizeof *ports * (br_int->n_ports + 1));
+    memcpy(ports, br_int->ports, sizeof *ports * br_int->n_ports);
+    ports[br_int->n_ports] = port;
+    ovsrec_bridge_verify_ports(br_int);
+    ovsrec_bridge_set_ports(br_int, ports, br_int->n_ports + 1);
+
+    free(ports);
+}
+
+static void
+remove_port(const struct ovsrec_bridge *br_int,
+            const struct ovsrec_port *port)
+{
+    for (size_t i = 0; i < br_int->n_ports; i++) {
+        if (br_int->ports[i] != port) {
+            continue;
+        }
+        struct ovsrec_port **new_ports;
+        new_ports = xmemdup(br_int->ports,
+                sizeof *new_ports * (br_int->n_ports - 1));
+        if (i != br_int->n_ports - 1) {
+            /* Removed port was not last */
+            new_ports[i] = br_int->ports[br_int->n_ports - 1];
+        }
+        ovsrec_bridge_verify_ports(br_int);
+        ovsrec_bridge_set_ports(br_int, new_ports, br_int->n_ports - 1);
+        free(new_ports);
+        ovsrec_port_delete(port);
+        return;
+    }
+}
+
+static bool
+can_plug(const char *vif_plugging)
+{
+    return !vif_plugging || !vif_plugging[0]
+           || !strcmp(vif_plugging, "true");
+}
+
+void
+plugging_run(struct ovsdb_idl_txn *ovs_idl_txn,
+             const struct sbrec_port_binding_table *port_binding_table,
+             const struct ovsrec_port_table *port_table,
+             const struct ovsrec_bridge *br_int,
+             const struct sbrec_chassis *chassis)
+{
+    if (!ovs_idl_txn) {
+        return;
+    }
+
+    /* Figure out what ports managed by OVN already exist. */
+    struct shash existing_ports = SHASH_INITIALIZER(&existing_ports);
+    const struct ovsrec_port *port;
+    OVSREC_PORT_TABLE_FOR_EACH (port, port_table) {
+        for (size_t i = 0; i < port->n_interfaces; i++) {
+            struct ovsrec_interface *iface = port->interfaces[i];
+            const char *port_iface_id;
+            if (can_plug(smap_get(&iface->external_ids, "ovn-plugged"))
+                && (port_iface_id = smap_get(&iface->external_ids,
+                                             "iface-id"))) {
+                shash_add(&existing_ports, port_iface_id, port);
+            }
+        }
+    }
+
+    /* Iterate over currently unbound ports destined for this chassis, or
+     * ports already bound to this chassis, and check if OVN management is
+     * requested.  Remove from 'existing_ports' the ports that do exist in
+     * the database and should stay there. */
+    const struct sbrec_port_binding *port_binding;
+    SBREC_PORT_BINDING_TABLE_FOR_EACH (port_binding,
+                                       port_binding_table)
+    {
+        VLOG_INFO("HELLO %s", port_binding->logical_port);
+        const char *vif_chassis = smap_get(&port_binding->options,
+                                           "requested-chassis");
+        const char *vif_plugging = smap_get_def(&port_binding->options,
+                                                "ovn-plugging",
+                                                "false");
+        VLOG_INFO("HELLO %s", port_binding->logical_port);
+        if (lport_can_bind_on_this_chassis(chassis, vif_chassis)
+            && can_plug(vif_plugging))
+        {
+            char keybuf[MAX_KEY_LEN];
+            const char *rep_port;
+            const char *pf_mac;
+            const char *vf_num;
+
+            if (!fill_devlink_ports_key_from_strs(
+                                    keybuf, sizeof(keybuf),
+                                    (pf_mac = smap_get(
+                                        &port_binding->options, "pf-mac")),
+                                    (vf_num = smap_get(
+                                        &port_binding->options, "vf-num"))))
+            {
+                /* Overflow, most likely incorrect input data from database */
+                VLOG_WARN("Southbound DB port plugging options out of range: "
+                          "pf-mac: '%s' vf-num: '%s'", pf_mac, vf_num);
+                continue;
+            }
+
+            shash_find_and_delete(&existing_ports, port_binding->logical_port);
+
+            rep_port = shash_find_data(&devlink_ports, keybuf);
+            VLOG_INFO("plug %s (%s) -> %s",
+                      port_binding->logical_port, rep_port, br_int->name);
+            create_port(ovs_idl_txn, port_binding->logical_port,
+                        br_int, rep_port, &existing_ports);
+        }
+    }
+
+    /* Now 'existing_ports' only contains ports that exist in the
+     * database but shouldn't.  Delete them from the database. */
+    struct shash_node *port_node, *port_next_node;
+    SHASH_FOR_EACH_SAFE (port_node, port_next_node, &existing_ports) {
+        port = port_node->data;
+        shash_delete(&existing_ports, port_node);
+        VLOG_INFO("remove port %s", port->name);
+        remove_port(br_int, port);
+    }
+    shash_destroy(&existing_ports);
+}
+
+/* The kernel devlink-port interface provides a vendor neutral and standard way
+ * of discovering host visible resources, such as the MAC address of interfaces,
+ * from a program running on the NIC SoC side.
+ *
+ * However, a fairly recent kernel version is required for it to work, so until
+ * it is widely available we provide this helper to retrieve the same
+ * information from the interim sysfs solution. */
+static bool
+compat_get_host_pf_mac(const char *netdev_name, struct eth_addr *ea)
+{
+    char file_name[IFNAMSIZ+35+1];
+    FILE *stream;
+    char line[128];
+    bool retval = false;
+
+    snprintf(file_name, sizeof(file_name),
+             "/sys/class/net/%s/smart_nic/pf/config", netdev_name);
+    stream = fopen(file_name, "r");
+    if (!stream) {
+        VLOG_WARN("%s: open failed (%s)",
+                  file_name, ovs_strerror(errno));
+        *ea = eth_addr_zero;
+        return false;
+    }
+    while (fgets(line, sizeof(line), stream)) {
+        char key[16];
+        char *cp;
+        if (ovs_scan(line, "%15[^:]: ", key)
+            && key[0] == 'M' && key[1] == 'A' && key[2] == 'C')
+        {
+            /* strip any newline character */
+            if ((cp = strchr(line, '\n')) != NULL) {
+                *cp = '\0';
+            }
+            /* point cp at end of key + ': ', i.e. start of MAC address */
+            cp = line + strnlen(key, sizeof(key)) + 2;
+            retval = eth_addr_from_string(cp, ea);
+            break;
+        }
+    }
+    fclose(stream);
+    return retval;
+}
diff --git a/controller/plugging.h b/controller/plugging.h
new file mode 100644
index 000000000..324e5bdc0
--- /dev/null
+++ b/controller/plugging.h
@@ -0,0 +1,81 @@ 
+/* Copyright (c) 2021 Canonical
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef OVN_PLUGGING_H
+#define OVN_PLUGGING_H 1
+
+/* Interface Plugging
+ * ==================
+ *
+ * This module adds and removes ports on the integration bridge, as directed by
+ * Port_Binding options.
+ *
+ * Traditionally it has been the CMS's responsibility to create Virtual
+ * Interfaces as part of instance (Container, Pod, Virtual Machine etc.) life
+ * cycle, and subsequently manage plug/unplug operations on the Open vSwitch
+ * integration bridge.
+ *
+ * With the advent of NICs connected to multiple distinct CPUs we can have a
+ * topology where the instance runs on one host and Open vSwitch and OVN runs
+ * on a different host, the smartnic CPU.
+ *
+ * The act of plugging and unplugging the representor port in Open vSwitch
+ * running on the smartnic host CPU would be the same for every smartnic
+ * variant (thanks to the devlink-port infrastructure), and every CMS. As
+ * such it is natural to extend OVN to provide this common functionality
+ * through its CMS facing API.
+ *
+ * The instance will be connected to an SR-IOV Virtual Function or an RDMA
+ * Mediated Device on the host system (the latter not currently addressed in
+ * this implementation). The NIC driver will maintain a representor port for
+ * each of the host visible devices on the smartnic side.
+ *
+ * It is the CMS's responsibility to maintain a mapping between instance host
+ * and smartnic host; OVN can help by optionally providing details such as
+ * board serial number of the smartnic system as part of Chassis registration.
+ *
+ * The CMS will use its knowledge of instance host <-> smartnic host mapping
+ * to add appropriate `requested-chassis` along with the information OVN needs
+ * to identify the representor port as options when creating Logical Switch
+ * Ports for instances. These options will be copied over to the Port_Binding
+ * table by ovn-northd.
+ *
+ * OVN will use the devlink interface to look up which representor port
+ * corresponds to the host visible resource and add this representor port to
+ * the integration bridge.
+ *
+ * Options API:
+ *   ovn-plugged: true
+ *   pf-mac: "00:53:00:00:00:42" // To distinguish between ports on NIC SoC
+ *   vf-num: 42 (optional)       // Refers to a logical PCI VF number
+ *                               // not specifying vf-num means plug PF
+ *                               // representor.
+ */
+
+struct ovsdb_idl_txn;
+struct sbrec_port_binding_table;
+struct ovsrec_port_table;
+struct ovsrec_bridge;
+struct sbrec_chassis;
+
+void plugging_run(struct ovsdb_idl_txn *,
+             const struct sbrec_port_binding_table *,
+             const struct ovsrec_port_table *,
+             const struct ovsrec_bridge *,
+             const struct sbrec_chassis *);
+void plugging_init(void);
+void plugging_destroy(void);
+
+#endif /* controller/plugging.h */